Overview
The Sarvam Conv AI SDK enables developers to create applications that can:
- Build real-time voice-to-voice conversational experiences in the browser
- Create text-based chat applications using Sarvam agents
- Handle audio capture from microphone and playback to speakers automatically
- Manage conversation lifecycle with robust event handling
- Support multiple languages for conversational AI
Installation
Browser Applications
Install the SDK package via npm or yarn.
Node.js Applications
The ws package is required as a peer dependency in Node.js environments.
Quick Start
Voice-to-Voice Conversation (Browser)
Here’s a complete React component example for voice interaction; a minimal browser sketch is shown under BrowserAudioInterface below.
Text-Based Conversation (Node.js)
A text-only conversation can run in Node.js without an audio interface, as sketched below.
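A minimal sketch, assuming the constructor accepts a single options object (as listed under ConversationAgent below); the module path is a placeholder for the published package name, and the `text` field on the streamed message is an assumption:

```typescript
// Sketch only — replace the module path with the SDK's published package name.
import { ConversationAgent, InteractionType } from '@sarvam/conv-ai-sdk';

async function main(): Promise<void> {
  const agent = new ConversationAgent({
    apiKey: process.env.SARVAM_API_KEY!,
    config: {
      user_identifier_type: 'custom',
      user_identifier: 'user-123',
      org_id: 'your-org-id',
      workspace_id: 'your-workspace-id',
      app_id: 'your-app-id',
      interaction_type: InteractionType.TEXT,
      sample_rate: 16000,
    },
    // Stream the agent's text response to stdout.
    // The `text` field name is assumed; check ServerTextMsgType in the SDK typings.
    textCallback: async (msg) => {
      process.stdout.write(msg.text ?? '');
    },
    endCallback: async () => {
      console.log('\nConversation ended');
    },
  });

  await agent.start();
  await agent.waitForConnect(10_000);
  await agent.sendText('Hello!');
  // Keep the process alive until the server or user ends the session.
  await agent.waitForDisconnect();
  await agent.stop();
}

main().catch(console.error);
```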
ConversationAgent
The main class for managing conversational AI sessions. It automatically selects between voice and text modes based on the interaction_type configuration.
Constructor Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| apiKey | string | Yes | API key for authentication |
| config | InteractionConfig | Yes | Interaction configuration (see below) |
| audioInterface | AsyncAudioInterface | No | Audio interface for mic/speaker (required for voice interactions) |
| textCallback | (msg: ServerTextMsgType) => Promise<void> | No | Receives streaming text chunks from the agent |
| audioCallback | (msg: ServerAudioChunkMsg) => Promise<void> | No | Receives audio chunks (if not using audioInterface) |
| eventCallback | (event: ServerEventBase) => Promise<void> | No | Receives events like user_interrupt, interaction_end |
| startCallback | () => Promise<void> | No | Called when conversation starts |
| endCallback | () => Promise<void> | No | Called when conversation ends |
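For reference, here is a hedged construction sketch for a voice agent. The parameter names come from the table above; the single options-object shape, the module path, and the zero-argument BrowserAudioInterface constructor are assumptions:

```typescript
// Sketch: constructing a voice agent in the browser.
import {
  ConversationAgent,
  BrowserAudioInterface,
  InteractionType,
} from '@sarvam/conv-ai-sdk'; // placeholder module path

const agent = new ConversationAgent({
  apiKey: 'YOUR_API_KEY',
  config: {
    user_identifier_type: 'email',
    user_identifier: 'user@example.com',
    org_id: 'your-org-id',
    workspace_id: 'your-workspace-id',
    app_id: 'your-app-id',
    interaction_type: InteractionType.CALL,
    sample_rate: 16000,
  },
  // Required for voice interactions: handles mic capture and speaker playback.
  audioInterface: new BrowserAudioInterface(),
  eventCallback: async (event) => {
    // Event payload shape is not documented here; log it for inspection.
    console.log('server event:', event);
  },
  startCallback: async () => console.log('conversation started'),
  endCallback: async () => console.log('conversation ended'),
});
```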
Methods
async start(): Promise<void>
Start the conversation session and establish WebSocket connection.
async stop(): Promise<void>
Stop the conversation session and cleanup resources.
async waitForConnect(timeout?: number): Promise<boolean>
Wait until the WebSocket connection is established. Returns true if connected, or false if the timeout elapses first.
async waitForDisconnect(): Promise<void>
Wait until the WebSocket disconnects or the agent is stopped.
isConnected(): boolean
Check if the WebSocket is currently connected.
getInteractionId(): string | undefined
Get the current interaction identifier.
async sendAudio(audioData: Uint8Array): Promise<void>
Send raw audio data (only available for voice interactions). Audio must be 16-bit PCM mono at the configured sample rate.
async sendText(text: string): Promise<void>
Send a text message (only available for text interactions).
getAgentType(): 'voice' | 'text'
Get the type of agent currently active.
reference_id: string
Get or set the reference ID (useful for telephony integrations, e.g. to store the Call SID).
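Putting the methods together, a typical session lifecycle might look like the following sketch (the module path is a placeholder and the timeout value is illustrative):

```typescript
// Sketch of a typical session lifecycle using the methods above.
import type { ConversationAgent } from '@sarvam/conv-ai-sdk'; // placeholder module path

async function runSession(agent: ConversationAgent): Promise<void> {
  await agent.start();                                  // open the WebSocket
  const connected = await agent.waitForConnect(10_000); // wait up to 10 s
  if (!connected) {
    throw new Error('Connection timed out');
  }

  console.log('interaction id:', agent.getInteractionId());

  // sendText is only valid for text interactions; sendAudio for voice.
  if (agent.getAgentType() === 'text') {
    await agent.sendText('What can you help me with?');
  }

  // Optionally tag the session, e.g. with a telephony Call SID.
  agent.reference_id = 'CA-example-call-sid';

  await agent.waitForDisconnect(); // resolves when the session ends
  await agent.stop();              // cleanup
}
```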
Configuration
InteractionConfig
The configuration object that defines the conversation parameters.
Required Fields
| Field | Type | Description |
|---|---|---|
| user_identifier_type | string | One of: 'custom', 'email', 'phone_number', 'unknown' |
| user_identifier | string | User identifier value (email, phone, or custom ID) |
| org_id | string | Your organization ID |
| workspace_id | string | Your workspace ID |
| app_id | string | The target application ID |
| interaction_type | InteractionType | InteractionType.CALL (voice) or InteractionType.TEXT (text) |
| sample_rate | number | Audio sample rate: 8000 or 16000 (16-bit PCM mono) |
Optional Fields
| Field | Type | Description |
|---|---|---|
| version | number | App version number. If not provided, uses latest committed version |
| agent_variables | Record<string, any> | Key-value pairs to seed the agent context |
| initial_language_name | SarvamToolLanguageName | Starting language (e.g., 'English', 'Hindi') |
| initial_state_name | string | Starting state name (if your app uses states) |
| initial_bot_message | string | First message from the agent |
Important
If version is not provided, the SDK uses the latest committed version of the app.
The connection will fail if the provided app_id has no committed version.
Example Configuration
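A sketch of a voice configuration using the fields above; all IDs are placeholders and the module path is assumed:

```typescript
import { InteractionType, SarvamToolLanguageName } from '@sarvam/conv-ai-sdk'; // placeholder path
import type { InteractionConfig } from '@sarvam/conv-ai-sdk';

const config: InteractionConfig = {
  // Required fields
  user_identifier_type: 'phone_number',
  user_identifier: '+919999999999',
  org_id: 'your-org-id',
  workspace_id: 'your-workspace-id',
  app_id: 'your-app-id',
  interaction_type: InteractionType.CALL,
  sample_rate: 16000, // 16-bit PCM mono, 8000 or 16000 Hz

  // Optional fields
  version: 3, // omit to use the latest committed version
  agent_variables: { customer_name: 'Asha', plan: 'premium' },
  initial_language_name: SarvamToolLanguageName.HINDI,
  initial_bot_message: 'Namaste! How can I help you today?',
};
```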
Audio Interfaces
BrowserAudioInterface
Handles microphone capture and speaker playback in browser environments.
Features:
- Automatic microphone access and audio capture
- Real-time audio streaming at 16kHz
- Automatic speaker playback of agent responses
- Handles user interruptions
- Manages audio permissions
Requirements:
- Modern browser with WebAudio API support
- HTTPS connection (required for microphone access)
- User permission for microphone access
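As referenced in the Quick Start, here is a hedged browser sketch that wires BrowserAudioInterface into a React component. The module path, the options-object constructor, and the zero-argument BrowserAudioInterface constructor are assumptions:

```tsx
import { useRef, useState } from 'react';
import {
  ConversationAgent,
  BrowserAudioInterface,
  InteractionType,
} from '@sarvam/conv-ai-sdk'; // placeholder module path

export function VoiceChat() {
  const agentRef = useRef<ConversationAgent | null>(null);
  const [active, setActive] = useState(false);

  const startCall = async () => {
    const agent = new ConversationAgent({
      apiKey: 'YOUR_API_KEY',
      config: {
        user_identifier_type: 'custom',
        user_identifier: 'web-user-1',
        org_id: 'your-org-id',
        workspace_id: 'your-workspace-id',
        app_id: 'your-app-id',
        interaction_type: InteractionType.CALL,
        sample_rate: 16000,
      },
      audioInterface: new BrowserAudioInterface(), // mic capture + speaker playback
      endCallback: async () => {
        setActive(false);
      },
    });
    agentRef.current = agent;
    await agent.start(); // triggers the microphone permission prompt (HTTPS only)
    setActive(true);
  };

  const endCall = async () => {
    await agentRef.current?.stop();
    setActive(false);
  };

  return (
    <div>
      <button onClick={startCall} disabled={active}>Start voice chat</button>
      <button onClick={endCall} disabled={!active}>End</button>
    </div>
  );
}
```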
Custom Audio Interface
You can provide your own audio interface by implementing the AsyncAudioInterface interface:
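The members of AsyncAudioInterface are not listed in this document, so the sketch below only illustrates the general shape of a custom implementation; the method names (startCapture, stopCapture, playAudio) are hypothetical placeholders to be replaced with the real interface members:

```typescript
// Illustrative shape only — the method names below are placeholders, not the
// SDK's real AsyncAudioInterface members. Declare `implements AsyncAudioInterface`
// once the methods match the actual interface definition.
class FileAudioInterface {
  // Hypothetical: called when the agent wants to start receiving user audio.
  async startCapture(onChunk: (pcm: Uint8Array) => Promise<void>): Promise<void> {
    // Read 16-bit PCM mono frames from your source (file, RTP stream, ...)
    // and forward each frame via onChunk(frame).
  }

  // Hypothetical: called when capture should stop (session ended or paused).
  async stopCapture(): Promise<void> {
    // Release the audio source.
  }

  // Hypothetical: called with agent audio to play back or forward downstream.
  async playAudio(chunk: Uint8Array): Promise<void> {
    // Write the chunk to your sink, e.g. a file or a telephony stream.
  }
}
```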
Event Handling
The SDK provides callbacks for different types of events during the conversation.
Text Callback
Receives streaming text chunks from the agent:
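For example (the `text` field name on ServerTextMsgType is an assumption; check the SDK typings):

```typescript
import type { ServerTextMsgType } from '@sarvam/conv-ai-sdk'; // placeholder module path

let transcript = '';

// Accumulate streamed chunks into a running transcript.
const textCallback = async (msg: ServerTextMsgType): Promise<void> => {
  transcript += msg.text ?? ''; // `text` field name is assumed
  console.log('agent (partial):', transcript);
};
```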
Audio Callback
Receives raw audio chunks (if not using BrowserAudioInterface):
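For example (the `audio` field name on ServerAudioChunkMsg is an assumption):

```typescript
import type { ServerAudioChunkMsg } from '@sarvam/conv-ai-sdk'; // placeholder module path

const pcmChunks: Uint8Array[] = [];

// Collect raw PCM for custom playback, bridging to telephony, or writing to disk.
const audioCallback = async (msg: ServerAudioChunkMsg): Promise<void> => {
  pcmChunks.push(msg.audio); // `audio` field name is assumed
};
```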
Start/End Callbacks
Track conversation lifecycle:
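A sketch matching the documented () => Promise<void> signatures:

```typescript
const startCallback = async (): Promise<void> => {
  console.log('conversation started');
};

const endCallback = async (): Promise<void> => {
  console.log('conversation ended');
  // Release any per-session resources here.
};
```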
Supported Languages
The SDK supports multilingual conversations using the SarvamToolLanguageName enum:
- SarvamToolLanguageName.BENGALI - Bengali
- SarvamToolLanguageName.GUJARATI - Gujarati
- SarvamToolLanguageName.KANNADA - Kannada
- SarvamToolLanguageName.MALAYALAM - Malayalam
- SarvamToolLanguageName.TAMIL - Tamil
- SarvamToolLanguageName.TELUGU - Telugu
- SarvamToolLanguageName.PUNJABI - Punjabi
- SarvamToolLanguageName.ODIA - Odia
- SarvamToolLanguageName.MARATHI - Marathi
- SarvamToolLanguageName.HINDI - Hindi
- SarvamToolLanguageName.ENGLISH - English
Node.js
Requirements:
- ws package for WebSocket support
- Node.js version 18 or higher
Message Types
Server Message Types
The SDK handles various message types from the server:
| Event Type | Description |
|---|---|
| server.media.text_chunk | Streaming text response from agent |
| server.media.audio_chunk | Streaming audio response from agent |
| server.action.interaction_connected | Conversation session established |
| server.action.interaction_end | Conversation session ended |
| server.event.user_speech_start | User started speaking |
| server.event.user_speech_end | User stopped speaking |
| server.event.user_interrupt | User interrupted the agent |
| server.event.variable_update | Agent variables updated |
| server.event.language_change | Conversation language changed |
| server.event.state_transition | Agent state transitioned |
| server.event.tool_call | Agent called a tool/function |
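As a sketch, an eventCallback can branch on the event type string; the `type` field name on ServerEventBase is an assumption, while the string values come from the table above:

```typescript
import type { ServerEventBase } from '@sarvam/conv-ai-sdk'; // placeholder module path

const eventCallback = async (event: ServerEventBase): Promise<void> => {
  switch (event.type) { // `type` field name is assumed
    case 'server.event.user_interrupt':
      console.log('user interrupted the agent');
      break;
    case 'server.event.language_change':
      console.log('conversation language changed:', event);
      break;
    case 'server.action.interaction_end':
      console.log('conversation ended');
      break;
    default:
      console.debug('unhandled server event:', event);
  }
};
```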
Client Message Types
Messages sent from the SDK to the server:
| Message Type | Description |
|---|---|
| client.action.interaction_start | Start conversation with configuration |
| client.media.audio_chunk | Send audio data to agent |
| client.media.text | Send text message to agent |
| client.action.interaction_end | End conversation session |