From Hello to Namaste: Building Real-Time, Multilingual Voice Bots with Indian AI (Powered by Sarvam)

Voice is one of the most natural modalities for human conversation. Which channel a business uses to support customers depends on what the customer needs:

  • Phone: The default when urgency or anxiety is involved, when back-and-forth coordination is required, or when white-glove treatment is expected.
  • Email: Ideal when you need accountability or must log a request as proof.
  • Chat: Best for simple questions and quick actions that close the loop conveniently. OpenAI is emerging as a new channel for straightforward queries and may even become a unified canvas for various interactions.

In our conversations with potential users, voice automation repeatedly came up as a high-value opportunity, particularly for an Indian customer who cited multilingual support as a deal breaker. With Sarvam gaining traction, we decided to build a small proof of concept to see if a voice bot could deliver real value.

Preview of Tamil voice bot:

Preview of English voice bot:


Building Scalable, Multilingual Voice Bots

Thanks to recent advances, creating a real-time, production-ready voice bot in Indian languages is no longer a distant dream. By combining the right tools, we built a bot that understands and speaks Hindi, Tamil, Kannada, Bengali, and more.

Stack Overview

  1. LiveKit
    • Real-time media engine with SIP trunking for phone integration
    • WebRTC backbone for low latency and high efficiency
    • Multi-call handling in a single process
    • Plugin-friendly, which lets us integrate non-native tools like Sarvam
  2. Sarvam STT (Speech-to-Text)
    • High-accuracy support for many Indian languages
    • Requires buffered chunks rather than true streaming
    • Our custom plugin captures LiveKit audio, buffers it, and sends it to Sarvam’s API in near real time
  3. OpenAI GPT-4.1
    • Natural-language understanding and response generation
    • Processes transcribed text and returns contextual replies
  4. Sarvam TTS (Text-to-Speech)
    • Natural-sounding voices in Indian languages
    • Accepts the full reply text and returns audio, which we stream back to the caller via LiveKit
  5. Silero VAD (Voice Activity Detection)
    • Detects when the caller speaks or pauses
    • Enables turn-taking logic, including support for barge-in (i.e., user interruptions)
    • Implements smart buffering to balance speed and accuracy
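
The buffering described under items 2 and 5 can be sketched roughly as follows. This is a minimal illustration, not the plugin from our proof of concept: the class name `UtteranceBuffer`, the three thresholds, and the 16 kHz, 16-bit mono audio format are all assumptions, and the `is_speech` flag stands in for whatever the VAD reports for each frame.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed audio format: 16 kHz, 16-bit, mono PCM (32 bytes per millisecond).
BYTES_PER_MS = 16_000 * 2 // 1000


@dataclass
class UtteranceBuffer:
    """Collects PCM frames and decides when a chunk is ready for a batch STT call."""

    max_ms: int = 8000          # hypothetical cap: flush long monologues anyway
    min_speech_ms: int = 300    # hypothetical floor: drop clicks and coughs
    end_silence_ms: int = 500   # hypothetical pause length that ends a turn
    _frames: List[bytes] = field(default_factory=list, init=False, repr=False)
    _total: int = field(default=0, init=False)    # bytes buffered so far
    _speech: int = field(default=0, init=False)   # bytes flagged as speech
    _silence: int = field(default=0, init=False)  # trailing silent bytes

    def push(self, frame: bytes, is_speech: bool) -> Optional[bytes]:
        """Add one frame; return the buffered audio when the turn is complete."""
        if not is_speech and not self._frames:
            return None  # ignore silence before the caller starts talking
        self._frames.append(frame)
        self._total += len(frame)
        if is_speech:
            self._speech += len(frame)
            self._silence = 0
        else:
            self._silence += len(frame)
        turn_ended = self._silence >= self.end_silence_ms * BYTES_PER_MS
        too_long = self._total >= self.max_ms * BYTES_PER_MS
        return self._flush() if (turn_ended or too_long) else None

    def _flush(self) -> Optional[bytes]:
        """Reset the buffer; hand back audio only if it held enough speech."""
        audio = b"".join(self._frames)
        keep = self._speech >= self.min_speech_ms * BYTES_PER_MS
        self._frames, self._total, self._speech, self._silence = [], 0, 0, 0
        return audio if keep else None
```

A shorter `end_silence_ms` makes the bot answer sooner but risks cutting callers off mid-sentence, while a longer one gives the STT call more complete utterances at the cost of latency. That is the speed versus accuracy trade-off the buffering has to balance.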

End-to-End Flow

  1. Caller speaks → LiveKit captures audio
  2. Audio → Sarvam STT → Transcribed text
  3. Text → GPT-4.1 → Generated response
  4. Response → Sarvam TTS → Synthesized audio
  5. Audio → LiveKit → Played back to the caller
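
The five steps above can be sketched as a single turn handler. This is an illustrative skeleton under stated assumptions, not the code we shipped: `VoiceTurnPipeline` and the four callables are hypothetical adapters that you would back with the Sarvam, OpenAI, and LiveKit clients, and the default language code `hi-IN` is an assumed example value. The `interrupt` method shows one simple way to support barge-in, by cancelling the playback task when the caller starts speaking.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical adapter signatures; real clients would be wrapped to match them.
Transcribe = Callable[[bytes, str], Awaitable[str]]   # audio, language -> text
Generate = Callable[[str], Awaitable[str]]            # user text -> reply text
Synthesize = Callable[[str, str], Awaitable[bytes]]   # reply, language -> audio
Play = Callable[[bytes], Awaitable[None]]             # audio -> played to caller


class VoiceTurnPipeline:
    """Runs one caller turn: STT -> LLM -> TTS -> playback."""

    def __init__(self, transcribe: Transcribe, generate: Generate,
                 synthesize: Synthesize, play: Play, language: str = "hi-IN"):
        self.transcribe = transcribe
        self.generate = generate
        self.synthesize = synthesize
        self.play = play
        self.language = language
        self._playback: Optional[asyncio.Task] = None

    async def handle_utterance(self, audio: bytes) -> Optional[str]:
        """Process one buffered utterance and start playing the reply."""
        text = (await self.transcribe(audio, self.language)).strip()
        if not text:
            return None  # nothing intelligible was said; stay silent
        reply = await self.generate(text)
        speech = await self.synthesize(reply, self.language)
        self.interrupt()  # never overlap two replies
        self._playback = asyncio.create_task(self.play(speech))
        return reply

    def interrupt(self) -> bool:
        """Barge-in: cancel playback in progress. Returns True if it stopped one."""
        if self._playback is not None and not self._playback.done():
            self._playback.cancel()
            return True
        return False

    async def wait_for_playback(self) -> None:
        """Wait until the current reply has finished or been cancelled."""
        if self._playback is not None:
            await asyncio.wait({self._playback})
```

In this sketch, the VAD would call `interrupt()` as soon as it detects caller speech and `handle_utterance()` once the buffered utterance is complete. Keeping the four stages behind plain callables also makes it easy to swap a provider or to test the flow with stubs.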

Key Learnings

  • Feasibility: Real-time multilingual voice bots are achievable today.
  • Local-Language Support: Sarvam fills a critical gap in Indian STT and TTS.
  • Infrastructure: LiveKit simplifies media handling and scales efficiently.
  • Integration: Custom glue code is still needed for real-time handling and API limitations.

Next Steps

  1. Quality Testing: Evaluate bot responses across diverse datasets.
  2. Brand Voice: Define name, tone, and personality for the bot.
  3. Localization: Enhance local-language fluency for a more natural feel.
  4. Scale Testing: Stress-test at production scale to validate performance.

Open Questions

  • Do you think voice bots will become the future of communication?
  • Would you feel comfortable speaking with a brand’s voice bot?

Feel free to share any feedback or ideas!
