From Hello to Namaste: Building Real-Time, Multilingual Voice Bots with Indian AI (Powered by Sarvam)
Voice is a commonly used modality for human conversation. The channels businesses choose to support customers depend on a few factors:
- Phone: The default when urgency or anxiety is involved, or when back-and-forth coordination is required or when white-glove treatment is expected.
- Email: Ideal when you need accountability or must log a request as proof.
- Chat: Best for simple questions and quick actions—closing the loop conveniently. OpenAI is emerging as a new channel for straightforward queries and may even become a unified canvas for various interactions.
While talking to potential users, voice automation repeatedly came up as a high-value opportunity—particularly for an Indian customer who cited multilingual support as a deal breaker. With Sarvam gaining traction, we decided to build a small proof of concept to see if a voice bot could deliver real value.
Preview of Tamil voice bot:
Preview of English voice bot:
Building Scalable, Multilingual Voice Bots
Thanks to recent advances, creating a real-time, production-ready voice bot in Indian languages is no longer a distant dream. By combining the right tools, we built a bot that understands and speaks Hindi, Tamil, Kannada, Bengali, and more.
Stack Overview
- LiveKit
- Real-time media engine with SIP trunking for phone integration
- WebRTC backbone for low latency and high efficiency
- Multi-call handling in a single process
- Plugin-friendly, which lets us integrate non-native tools like Sarvam
- Sarvam STT (Speech-to-Text)
- High-accuracy support for many Indian languages
- Requires buffered chunks rather than true streaming
- Custom plugin captures LiveKit audio and sends it to Sarvam’s API in near real time
- OpenAI GPT-4.1
- Natural-language understanding and response generation
- Processes transcribed text and returns contextual replies
- Sarvam TTS (Text-to-Speech)
- Natural-sounding voices in Indian languages
- Accepts full text, returns audio, and streams it back via LiveKit
- Silero VAD (Voice Activity Detection)
- Detects when the caller speaks or pauses
- Enables turn-taking logic including support for barge-in (i.e. user interruptions)
- Implements smart buffering to balance speed and accuracy
End-to-End Flow
- Caller speaks → LiveKit captures audio
- Audio → Sarvam STT → Transcribed text
- Text → GPT-4.1 → Generated response
- Response → Sarvam TTS → Synthesized audio
- Audio → LiveKit → Played back to the caller
Key Learnings
- Feasibility: Real-time multilingual voice bots are achievable today.
- Local-Language Support: Sarvam fills a critical gap in Indian STT and TTS.
- Infrastructure: LiveKit simplifies media handling and scales efficiently.
- Integration: Custom glue code is still needed for real-time handling and API limitations.
Next Steps
- Quality Testing: Evaluate bot responses across diverse datasets.
- Brand Voice: Define name, tone, and personality for the bot.
- Localization: Enhance local-language fluency for a more natural feel.
- Scale Testing: Stress-test at production scale to validate performance.
Open Questions
- Do you think voice bots will become the future of communication?
- Would you feel comfortable speaking with a brand’s voice bot?
Feel free to share any feedback or ideas!