From Hello to Namaste: Building Real-Time, Multilingual Voice Bots with Indian AI (Powered by Sarvam)

Avaneesh

17 Oct 2025 — 2 min read

Voice is a commonly used modality for human conversation. The channels businesses choose to support customers depend on a few factors:

Phone: The default when urgency or anxiety is involved, or when back-and-forth coordination is required or when white-glove treatment is expected.
Email: Ideal when you need accountability or must log a request as proof.
Chat: Best for simple questions and quick actions—closing the loop conveniently. OpenAI is emerging as a new channel for straightforward queries and may even become a unified canvas for various interactions.

While talking to potential users, voice automation repeatedly came up as a high-value opportunity—particularly for an Indian customer who cited multilingual support as a deal breaker. With Sarvam gaining traction, we decided to build a small proof of concept to see if a voice bot could deliver real value.

Preview of Tamil voice bot:

Preview of English voice bot:

Building Scalable, Multilingual Voice Bots

Thanks to recent advances, creating a real-time, production-ready voice bot in Indian languages is no longer a distant dream. By combining the right tools, we built a bot that understands and speaks Hindi, Tamil, Kannada, Bengali, and more.

Stack Overview

LiveKit
- Real-time media engine with SIP trunking for phone integration
- WebRTC backbone for low latency and high efficiency
- Multi-call handling in a single process
- Plugin-friendly, which lets us integrate non-native tools like Sarvam
Sarvam STT (Speech-to-Text)
- High-accuracy support for many Indian languages
- Requires buffered chunks rather than true streaming
- Custom plugin captures LiveKit audio and sends it to Sarvam’s API in near real time
OpenAI GPT-4.1
- Natural-language understanding and response generation
- Processes transcribed text and returns contextual replies
Sarvam TTS (Text-to-Speech)
- Natural-sounding voices in Indian languages
- Accepts full text, returns audio, and streams it back via LiveKit
Silero VAD (Voice Activity Detection)
- Detects when the caller speaks or pauses
- Enables turn-taking logic including support for barge-in (i.e. user interruptions)
- Implements smart buffering to balance speed and accuracy

End-to-End Flow

Caller speaks → LiveKit captures audio
Audio → Sarvam STT → Transcribed text
Text → GPT-4.1 → Generated response
Response → Sarvam TTS → Synthesized audio
Audio → LiveKit → Played back to the caller

Key Learnings

Feasibility: Real-time multilingual voice bots are achievable today.
Local-Language Support: Sarvam fills a critical gap in Indian STT and TTS.
Infrastructure: LiveKit simplifies media handling and scales efficiently.
Integration: Custom glue code is still needed for real-time handling and API limitations.