The Sarvam Conversational AI SDK is a JavaScript/TypeScript library that helps developers build real-time voice-to-voice and text-based conversational AI applications. It provides a unified interface for managing conversation flow, handling audio streams, and processing real-time messages in both browser and Node.js environments.

Overview

The Sarvam Conv AI SDK enables developers to:
  • Build real-time voice-to-voice conversational experiences in the browser
  • Create text-based chat applications using Sarvam agents
  • Handle audio capture from microphone and playback to speakers automatically
  • Manage conversation lifecycle with robust event handling
  • Support multiple languages for conversational AI

Installation

Browser Applications

Install via npm or yarn:
npm install sarvam-conv-ai-sdk
or
yarn add sarvam-conv-ai-sdk
For React applications, ensure you have React installed:
npm install react react-dom
npm install --save-dev @types/react @types/react-dom

Node.js Applications

npm install sarvam-conv-ai-sdk ws
Note: The ws package is required as a peer dependency for Node.js environments.

Quick Start

Voice-to-Voice Conversation (Browser)

Here’s a complete React component example for voice interaction:
import React, { useRef, useState } from 'react';
import {
  ConversationAgent,
  BrowserAudioInterface,
  InteractionType,
  ServerTextMsgType,
  ServerEventBase,
} from 'sarvam-conv-ai-sdk';

function VoiceChat() {
  const [isConnected, setIsConnected] = useState(false);
  const [transcript, setTranscript] = useState('');
  const agentRef = useRef<ConversationAgent | null>(null);

  const startConversation = async () => {
    try {
      const audioInterface = new BrowserAudioInterface();
      
      const agent = new ConversationAgent({
        apiKey: 'your_api_key',
        config: {
          user_identifier_type: 'custom',
          user_identifier: 'user123',
          org_id: 'your_org_id',
          workspace_id: 'your_workspace_id',
          app_id: 'your_app_id',
          interaction_type: InteractionType.CALL,
          sample_rate: 16000,
        },
        audioInterface,
        textCallback: async (msg: ServerTextMsgType) => {
          setTranscript(prev => prev + msg.text);
        },
        eventCallback: async (event: ServerEventBase) => {
          console.log('Event:', event.type);
        },
        startCallback: async () => {
          setIsConnected(true);
        },
        endCallback: async () => {
          setIsConnected(false);
        },
      });

      agentRef.current = agent;
      await agent.start();
      await agent.waitForConnect(10);
    } catch (error) {
      console.error('Error:', error);
    }
  };

  const stopConversation = async () => {
    if (agentRef.current) {
      await agentRef.current.stop();
      agentRef.current = null;
    }
  };

  return (
    <div>
      <h2>Voice Chat</h2>
      {!isConnected ? (
        <button onClick={startConversation}>Start Voice Chat</button>
      ) : (
        <button onClick={stopConversation}>Stop Voice Chat</button>
      )}
      <div>Transcript: {transcript}</div>
    </div>
  );
}

export default VoiceChat;

Text-Based Conversation (Node.js)

const { ConversationAgent, InteractionType } = require('sarvam-conv-ai-sdk');

async function main() {
  const agent = new ConversationAgent({
    apiKey: 'your_api_key',
    config: {
      org_id: 'your_org_id',
      workspace_id: 'your_workspace_id',
      app_id: 'your_app_id',
      user_identifier: 'user@example.com',
      user_identifier_type: 'email',
      interaction_type: InteractionType.TEXT,
      sample_rate: 16000, // Required but not used for text-only
    },
    textCallback: async (msg) => {
      console.log('Agent:', msg.text);
    },
    eventCallback: async (event) => {
      console.log('Event:', event.type);
    },
    startCallback: async () => {
      console.log('Conversation started!');
    },
    endCallback: async () => {
      console.log('Conversation ended!');
    },
  });

  // Start the conversation
  await agent.start();
  
  // Wait for connection
  const connected = await agent.waitForConnect(10);
  if (!connected) {
    console.error('Failed to connect');
    return;
  }

  // Send a text message
  await agent.sendText('Hello, how are you?');

  // Wait for conversation to complete
  await agent.waitForDisconnect();
}

main().catch(console.error);

ConversationAgent

The main class for managing conversational AI sessions. It automatically selects between voice and text modes based on the interaction_type configuration.

Constructor Parameters

  • apiKey (string, required): API key for authentication
  • config (InteractionConfig, required): Interaction configuration (see below)
  • audioInterface (AsyncAudioInterface, optional): Audio interface for mic/speaker; required for voice interactions
  • textCallback ((msg: ServerTextMsgType) => Promise<void>, optional): Receives streaming text chunks from the agent
  • audioCallback ((msg: ServerAudioChunkMsg) => Promise<void>, optional): Receives audio chunks (if not using audioInterface)
  • eventCallback ((event: ServerEventBase) => Promise<void>, optional): Receives events such as user_interrupt and interaction_end
  • startCallback (() => Promise<void>, optional): Called when the conversation starts
  • endCallback (() => Promise<void>, optional): Called when the conversation ends

Methods

async start(): Promise<void>

Start the conversation session and establish WebSocket connection.
await agent.start();

async stop(): Promise<void>

Stop the conversation session and cleanup resources.
await agent.stop();

async waitForConnect(timeout?: number): Promise<boolean>

Wait until the WebSocket connection is established. Returns true if connected, false on timeout.
const connected = await agent.waitForConnect(10); // 10 second timeout
if (!connected) {
  console.error('Connection timeout');
}

async waitForDisconnect(): Promise<void>

Wait until the WebSocket disconnects or the agent is stopped.
await agent.waitForDisconnect();

isConnected(): boolean

Check if the WebSocket is currently connected.
if (agent.isConnected()) {
  console.log('Agent is connected');
}

getInteractionId(): string | undefined

Get the current interaction identifier.
const id = agent.getInteractionId();
console.log('Interaction ID:', id);

async sendAudio(audioData: Uint8Array): Promise<void>

Send raw audio data (only available for voice interactions). Audio must be 16-bit PCM mono at the configured sample rate.
// Only for voice mode
await agent.sendAudio(audioBytes);
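Because sendAudio expects 16-bit PCM mono, audio captured as floating-point samples (for example, from the Web Audio API) must be converted first. The helper below is a sketch with a hypothetical name, not part of the SDK:

```typescript
// Hypothetical helper: convert Float32 samples in [-1, 1] (e.g. from the
// Web Audio API) into little-endian 16-bit PCM bytes for sendAudio().
function floatTo16BitPCM(samples: Float32Array): Uint8Array {
  const buffer = new ArrayBuffer(samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1] before scaling to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return new Uint8Array(buffer);
}

// Usage (voice mode only):
// await agent.sendAudio(floatTo16BitPCM(micSamples));
```

Remember that the sample rate of the captured audio must match the sample_rate in your configuration.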

async sendText(text: string): Promise<void>

Send a text message (only available for text interactions).
// Only for text mode
await agent.sendText('Hello, how can you help me?');

getAgentType(): 'voice' | 'text'

Get the type of agent currently active.
const type = agent.getAgentType();
console.log('Agent type:', type); // 'voice' or 'text'

reference_id: string

Get or set the reference ID (useful for telephony integrations to store the Call SID).
// Set reference ID
agent.reference_id = 'CA1234567890abcdef';

// Get reference ID
console.log('Reference ID:', agent.reference_id);

Configuration

InteractionConfig

The configuration object that defines the conversation parameters.

Required Fields

  • user_identifier_type (string): One of 'custom', 'email', 'phone_number', 'unknown'
  • user_identifier (string): User identifier value (email, phone number, or custom ID)
  • org_id (string): Your organization ID
  • workspace_id (string): Your workspace ID
  • app_id (string): The target application ID
  • interaction_type (InteractionType): InteractionType.CALL (voice) or InteractionType.TEXT (text)
  • sample_rate (number): Audio sample rate: 8000 or 16000 (16-bit PCM mono)

Optional Fields

  • version (number): App version number; if not provided, the latest committed version is used
  • agent_variables (Record<string, any>): Key-value pairs to seed the agent context
  • initial_language_name (SarvamToolLanguageName): Starting language (e.g., 'English', 'Hindi')
  • initial_state_name (string): Starting state name (if your app uses states)
  • initial_bot_message (string): First message from the agent

Important: If version is not provided, the SDK uses the latest committed version of the app. The connection will fail if the provided app_id has no committed version.

Example Configuration

import { InteractionType, SarvamToolLanguageName } from 'sarvam-conv-ai-sdk';

const config = {
  user_identifier_type: 'custom',
  user_identifier: 'user123',
  org_id: 'sarvamai',
  workspace_id: 'default',
  app_id: 'your_app_id',
  interaction_type: InteractionType.CALL,
  sample_rate: 16000,
  agent_variables: {
    user_language: 'Hindi',
    context: 'customer_support'
  },
  initial_language_name: SarvamToolLanguageName.HINDI,
  initial_state_name: 'greeting',
  initial_bot_message: 'Hello! How can I help you today?',
  version: 1,
};

Audio Interfaces

BrowserAudioInterface

Handles microphone capture and speaker playback in browser environments.
import { BrowserAudioInterface } from 'sarvam-conv-ai-sdk';

const audioInterface = new BrowserAudioInterface();
Features:
  • Automatic microphone access and audio capture
  • Real-time audio streaming at 16kHz
  • Automatic speaker playback of agent responses
  • Handles user interruptions
  • Manages audio permissions
Audio Format: LINEAR16 (16-bit PCM mono) at 16000 Hz
Browser Requirements:
  • Modern browser with WebAudio API support
  • HTTPS connection (required for microphone access)
  • User permission for microphone access

Custom Audio Interface

You can provide your own audio handling by implementing AsyncAudioInterface:
interface AsyncAudioInterface {
  start(inputCallback: (data: AudioData) => Promise<void>): Promise<void>;
  output(audio: Uint8Array, sampleRate?: number): Promise<void>;
  interrupt(): void;
  stop(): Promise<void>;
}

Event Handling

The SDK provides callbacks for different types of events during the conversation.

Text Callback

Receives streaming text chunks from the agent:
textCallback: async (msg: ServerTextMsgType) => {
  console.log('Agent says:', msg.text);
  // Update UI with agent's response
}

Audio Callback

Receives raw audio chunks (if not using BrowserAudioInterface):
audioCallback: async (msg: ServerAudioChunkMsg) => {
  // Handle raw audio data
  const audioData = msg.data; // Uint8Array
  const sampleRate = msg.sample_rate; // number
  // Process or play the audio
}

Start/End Callbacks

Track conversation lifecycle:
startCallback: async () => {
  console.log('Conversation started');
  // Update UI state
}

endCallback: async () => {
  console.log('Conversation ended');
  // Cleanup and update UI
}

Supported Languages

The SDK supports multilingual conversations using the SarvamToolLanguageName enum:
import { SarvamToolLanguageName } from 'sarvam-conv-ai-sdk';
Available languages:
  • SarvamToolLanguageName.BENGALI - Bengali
  • SarvamToolLanguageName.GUJARATI - Gujarati
  • SarvamToolLanguageName.KANNADA - Kannada
  • SarvamToolLanguageName.MALAYALAM - Malayalam
  • SarvamToolLanguageName.TAMIL - Tamil
  • SarvamToolLanguageName.TELUGU - Telugu
  • SarvamToolLanguageName.PUNJABI - Punjabi
  • SarvamToolLanguageName.ODIA - Odia
  • SarvamToolLanguageName.MARATHI - Marathi
  • SarvamToolLanguageName.HINDI - Hindi
  • SarvamToolLanguageName.ENGLISH - English
Note: The languages allowed at runtime are the subset preselected in the agent configuration on the platform.
Example usage:
const config = {
  // ... other config
  initial_language_name: SarvamToolLanguageName.HINDI,
};

Node.js

Requirements:
  • ws package for WebSocket support
  • Node.js version 18 or higher
Installation:
npm install sarvam-conv-ai-sdk ws
Usage:
const { ConversationAgent } = require('sarvam-conv-ai-sdk');
// ws is automatically used in Node.js environment
Note: Audio interfaces are not automatically available in Node.js. For voice conversations in Node.js, you’ll need to provide your own audio input/output handling.
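One simple way to handle agent audio in Node.js is to collect the raw chunks via audioCallback and wrap them in a WAV header so the result can be saved and played with any audio player. This is a sketch, not SDK functionality; it assumes 16-bit PCM mono as described above:

```typescript
// Sketch: wrap raw 16-bit PCM mono (as delivered via audioCallback) in a
// 44-byte WAV header so it can be written to disk and played back.
function pcmToWav(pcm: Uint8Array, sampleRate: number): Buffer {
  const header = Buffer.alloc(44);
  const byteRate = sampleRate * 2; // mono, 16-bit => 2 bytes per sample
  header.write('RIFF', 0);
  header.writeUInt32LE(36 + pcm.length, 4); // total size minus 8 bytes
  header.write('WAVE', 8);
  header.write('fmt ', 12);
  header.writeUInt32LE(16, 16);  // fmt chunk size for PCM
  header.writeUInt16LE(1, 20);   // audio format: PCM
  header.writeUInt16LE(1, 22);   // channels: mono
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(2, 32);   // block align (channels * bytes/sample)
  header.writeUInt16LE(16, 34);  // bits per sample
  header.write('data', 36);
  header.writeUInt32LE(pcm.length, 40);
  return Buffer.concat([header, Buffer.from(pcm)]);
}

// Usage: accumulate msg.data chunks in audioCallback, then e.g.
// fs.writeFileSync('reply.wav', pcmToWav(allBytes, 16000));
```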

Message Types

Server Message Types

The SDK handles various message types from the server:
  • server.media.text_chunk: Streaming text response from the agent
  • server.media.audio_chunk: Streaming audio response from the agent
  • server.action.interaction_connected: Conversation session established
  • server.action.interaction_end: Conversation session ended
  • server.event.user_speech_start: User started speaking
  • server.event.user_speech_end: User stopped speaking
  • server.event.user_interrupt: User interrupted the agent
  • server.event.variable_update: Agent variables updated
  • server.event.language_change: Conversation language changed
  • server.event.state_transition: Agent state transitioned
  • server.event.tool_call: Agent called a tool/function
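A typical eventCallback dispatches on the event type. The sketch below assumes event.type carries the dotted strings listed above verbatim:

```typescript
// Sketch: map server event types to human-readable descriptions.
// Assumes event.type is the dotted string from the table above.
function describeServerEvent(type: string): string {
  switch (type) {
    case 'server.event.user_interrupt':
      return 'user interrupted the agent';
    case 'server.event.language_change':
      return 'conversation language changed';
    case 'server.action.interaction_end':
      return 'conversation ended';
    default:
      return `unhandled event: ${type}`;
  }
}

// eventCallback: async (event) => {
//   console.log(describeServerEvent(event.type));
// }
```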

Client Message Types

Messages sent from the SDK to the server:
  • client.action.interaction_start: Start the conversation with the given configuration
  • client.media.audio_chunk: Send audio data to the agent
  • client.media.text: Send a text message to the agent
  • client.action.interaction_end: End the conversation session

Best Practices

1. Resource Cleanup

Always cleanup resources when done:
useEffect(() => {
  return () => {
    if (agentRef.current) {
      agentRef.current.stop().catch(console.error);
    }
  };
}, []);

2. Connection Timeout

Always specify a timeout when waiting for connection:
const connected = await agent.waitForConnect(10); // 10 seconds
if (!connected) {
  // Handle connection failure
}

3. Error Handling

Implement comprehensive error handling:
try {
  await agent.start();
} catch (error) {
  // In TypeScript, a caught value is unknown; narrow it before use.
  const message = error instanceof Error ? error.message : String(error);
  if (message.includes('permission')) {
    // Handle microphone permission error
  } else if (message.includes('network')) {
    // Handle network error
  } else {
    // Handle other errors
  }
}
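The SDK is not documented to reconnect automatically, so wrapping startup in a retry with exponential backoff is one way to make it more resilient to transient network failures. This is a generic sketch with a hypothetical helper name:

```typescript
// Sketch: retry an async operation (e.g. agent.start()) with exponential
// backoff between attempts. Rethrows the last error if all attempts fail.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Wait 500ms, 1000ms, 2000ms, ... before the next attempt.
      await new Promise(resolve => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  throw lastError;
}

// Usage: await withRetry(() => agent.start());
```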