The Sarvam Conversational AI SDK is a JavaScript/TypeScript library that helps developers build real-time voice-to-voice and text-based conversational AI applications. It provides a unified interface for managing conversation flow, handling audio streams, and processing real-time messages in both browser and Node.js environments.

Overview

The Sarvam Conv AI SDK enables developers to:
  • Build real-time voice-to-voice conversational experiences in the browser
  • Create text-based chat applications using Sarvam agents
  • Handle audio capture from microphone and playback to speakers automatically
  • Manage conversation lifecycle with robust event handling
  • Support multiple languages for conversational AI

Installation

Browser Applications

Install via npm or yarn:
npm install sarvam-conv-ai-sdk
or
yarn add sarvam-conv-ai-sdk
For React applications, ensure you have React installed:
npm install react react-dom
npm install --save-dev @types/react @types/react-dom

Node.js Applications

npm install sarvam-conv-ai-sdk ws
Note: The ws package is required as a peer dependency for Node.js environments.

Quick Start

Voice-to-Voice Conversation (Browser)

Here’s a complete React component example for voice interaction:
import React, { useRef, useState } from 'react';
import {
  ConversationAgent,
  BrowserAudioInterface,
  InteractionType,
  ServerTextMsgType,
  ServerEventBase,
} from 'sarvam-conv-ai-sdk';

function VoiceChat() {
  const [isConnected, setIsConnected] = useState(false);
  const [transcript, setTranscript] = useState('');
  const agentRef = useRef<ConversationAgent | null>(null);

  const startConversation = async () => {
    try {
      const audioInterface = new BrowserAudioInterface();
      
      const agent = new ConversationAgent({
        apiKey: 'your_api_key',
        config: {
          user_identifier_type: 'custom',
          user_identifier: 'user123',
          org_id: 'your_org_id',
          workspace_id: 'your_workspace_id',
          app_id: 'your_app_id',
          interaction_type: InteractionType.CALL,
          sample_rate: 16000,
        },
        audioInterface,
        textCallback: async (msg: ServerTextMsgType) => {
          setTranscript(prev => prev + msg.text);
        },
        eventCallback: async (event: ServerEventBase) => {
          console.log('Event:', event.type);
        },
        startCallback: async () => {
          setIsConnected(true);
        },
        endCallback: async () => {
          setIsConnected(false);
        },
      });

      agentRef.current = agent;
      await agent.start();
      await agent.waitForConnect(10);
    } catch (error) {
      console.error('Error:', error);
    }
  };

  const stopConversation = async () => {
    if (agentRef.current) {
      await agentRef.current.stop();
      agentRef.current = null;
    }
  };

  return (
    <div>
      <h2>Voice Chat</h2>
      {!isConnected ? (
        <button onClick={startConversation}>Start Voice Chat</button>
      ) : (
        <button onClick={stopConversation}>Stop Voice Chat</button>
      )}
      <div>Transcript: {transcript}</div>
    </div>
  );
}

export default VoiceChat;

Text-Based Conversation (Node.js)

const { ConversationAgent, InteractionType } = require('sarvam-conv-ai-sdk');

async function main() {
  const agent = new ConversationAgent({
    apiKey: 'your_api_key',
    config: {
      org_id: 'your_org_id',
      workspace_id: 'your_workspace_id',
      app_id: 'your_app_id',
      user_identifier: 'user@example.com',
      user_identifier_type: 'email',
      interaction_type: InteractionType.TEXT,
      sample_rate: 16000, // Required but not used for text-only
    },
    textCallback: async (msg) => {
      console.log('Agent:', msg.text);
    },
    eventCallback: async (event) => {
      console.log('Event:', event.type);
    },
    startCallback: async () => {
      console.log('Conversation started!');
    },
    endCallback: async () => {
      console.log('Conversation ended!');
    },
  });

  // Start the conversation
  await agent.start();
  
  // Wait for connection
  const connected = await agent.waitForConnect(10);
  if (!connected) {
    console.error('Failed to connect');
    return;
  }

  // Send a text message
  await agent.sendText('Hello, how are you?');

  // Wait for conversation to complete
  await agent.waitForDisconnect();
}

main().catch(console.error);

ConversationAgent

The main class for managing conversational AI sessions. It automatically selects between voice and text modes based on the interaction_type configuration.

Constructor Parameters

  • apiKey (string, required): API key for authentication
  • config (InteractionConfig, required): Interaction configuration (see below)
  • audioInterface (AsyncAudioInterface, optional): Audio interface for mic/speaker; required for voice interactions
  • textCallback ((msg: ServerTextMsgType) => Promise<void>, optional): Receives streaming text chunks from the agent
  • audioCallback ((msg: ServerAudioChunkMsg) => Promise<void>, optional): Receives audio chunks (if not using audioInterface)
  • eventCallback ((event: ServerEventBase) => Promise<void>, optional): Receives events such as user_interrupt and interaction_end
  • startCallback (() => Promise<void>, optional): Called when the conversation starts
  • endCallback (() => Promise<void>, optional): Called when the conversation ends

Methods

async start(): Promise<void>

Start the conversation session and establish WebSocket connection.
await agent.start();

async stop(): Promise<void>

Stop the conversation session and cleanup resources.
await agent.stop();

async waitForConnect(timeout?: number): Promise<boolean>

Wait until the WebSocket connection is established. Returns true if connected, false on timeout.
const connected = await agent.waitForConnect(10); // 10 second timeout
if (!connected) {
  console.error('Connection timeout');
}

async waitForDisconnect(): Promise<void>

Wait until the WebSocket disconnects or the agent is stopped.
await agent.waitForDisconnect();

isConnected(): boolean

Check if the WebSocket is currently connected.
if (agent.isConnected()) {
  console.log('Agent is connected');
}

getInteractionId(): string | undefined

Get the current interaction identifier.
const id = agent.getInteractionId();
console.log('Interaction ID:', id);

async sendAudio(audioData: Uint8Array): Promise<void>

Send raw audio data (only available for voice interactions). Audio must be 16-bit PCM mono at the configured sample rate.
// Only for voice mode
await agent.sendAudio(audioBytes);
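Because sendAudio expects 16-bit PCM mono, audio captured as floating-point samples (for example, from the Web Audio API) must be converted first. The helper below is a sketch with a hypothetical name, not part of the SDK:

```typescript
// Hypothetical helper: convert Float32 samples in [-1, 1] (e.g. from the
// Web Audio API) into little-endian 16-bit PCM bytes for sendAudio().
function floatTo16BitPCM(samples: Float32Array): Uint8Array {
  const buffer = new ArrayBuffer(samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1] before scaling to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return new Uint8Array(buffer);
}

// Usage (voice mode only):
// await agent.sendAudio(floatTo16BitPCM(micSamples));
```

Remember that the sample rate of the captured audio must match the sample_rate in your configuration.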

async sendText(text: string): Promise<void>

Send a text message (only available for text interactions).
// Only for text mode
await agent.sendText('Hello, how can you help me?');

getAgentType(): 'voice' | 'text'

Get the type of agent currently active.
const type = agent.getAgentType();
console.log('Agent type:', type); // 'voice' or 'text'

reference_id: string

Get or set the reference ID (useful for telephony integrations to store the Call SID).
// Set reference ID
agent.reference_id = 'CA1234567890abcdef';

// Get reference ID
console.log('Reference ID:', agent.reference_id);

Configuration

InteractionConfig

The configuration object that defines the conversation parameters.

Required Fields

  • user_identifier_type (string): One of 'custom', 'email', 'phone_number', 'unknown'
  • user_identifier (string): User identifier value (email, phone number, or custom ID)
  • org_id (string): Your organization ID
  • workspace_id (string): Your workspace ID
  • app_id (string): The target application ID
  • interaction_type (InteractionType): InteractionType.CALL (voice) or InteractionType.TEXT (text)
  • sample_rate (number): Audio sample rate: 8000 or 16000 (16-bit PCM mono)

Optional Fields

  • version (number): App version number; if not provided, the latest committed version is used
  • agent_variables (Record<string, any>): Key-value pairs to seed the agent context
  • initial_language_name (SarvamToolLanguageName): Starting language (e.g., 'English', 'Hindi')
  • initial_state_name (string): Starting state name (if your app uses states)
  • initial_bot_message (string): First message from the agent

Important: If version is not provided, the SDK uses the latest committed version of the app. The connection will fail if the provided app_id has no committed version.

Example Configuration

import { InteractionType, SarvamToolLanguageName } from 'sarvam-conv-ai-sdk';

const config = {
  user_identifier_type: 'custom',
  user_identifier: 'user123',
  org_id: 'sarvamai',
  workspace_id: 'default',
  app_id: 'your_app_id',
  interaction_type: InteractionType.CALL,
  sample_rate: 16000,
  agent_variables: {
    user_language: 'Hindi',
    context: 'customer_support'
  },
  initial_language_name: SarvamToolLanguageName.HINDI,
  initial_state_name: 'greeting',
  initial_bot_message: 'Hello! How can I help you today?',
  version: 1,
};

Audio Interfaces

BrowserAudioInterface

Handles microphone capture and speaker playback in browser environments.
import { BrowserAudioInterface } from 'sarvam-conv-ai-sdk';

const audioInterface = new BrowserAudioInterface();
Features:
  • Automatic microphone access and audio capture
  • Real-time audio streaming at 16kHz
  • Automatic speaker playback of agent responses
  • Handles user interruptions
  • Manages audio permissions
Audio Format: LINEAR16 (16-bit PCM mono) at 16000 Hz
Browser Requirements:
  • Modern browser with WebAudio API support
  • HTTPS connection (required for microphone access)
  • User permission for microphone access

Custom Audio Interface

You can provide your own audio handling by implementing AsyncAudioInterface:
interface AsyncAudioInterface {
  start(inputCallback: (data: AudioData) => Promise<void>): Promise<void>;
  output(audio: Uint8Array, sampleRate?: number): Promise<void>;
  interrupt(): void;
  stop(): Promise<void>;
}

Event Handling

The SDK provides callbacks for different types of events during the conversation.

Text Callback

Receives streaming text chunks from the agent:
textCallback: async (msg: ServerTextMsgType) => {
  console.log('Agent says:', msg.text);
  // Update UI with agent's response
}

Audio Callback

Receives raw audio chunks (if not using BrowserAudioInterface):
audioCallback: async (msg: ServerAudioChunkMsg) => {
  // Handle raw audio data
  const audioData = msg.data; // Uint8Array
  const sampleRate = msg.sample_rate; // number
  // Process or play the audio
}

Start/End Callbacks

Track conversation lifecycle:
startCallback: async () => {
  console.log('Conversation started');
  // Update UI state
}

endCallback: async () => {
  console.log('Conversation ended');
  // Cleanup and update UI
}

Supported Languages

The SDK supports multilingual conversations using the SarvamToolLanguageName enum:
import { SarvamToolLanguageName } from 'sarvam-conv-ai-sdk';
Available languages:
  • SarvamToolLanguageName.BENGALI - Bengali
  • SarvamToolLanguageName.GUJARATI - Gujarati
  • SarvamToolLanguageName.KANNADA - Kannada
  • SarvamToolLanguageName.MALAYALAM - Malayalam
  • SarvamToolLanguageName.TAMIL - Tamil
  • SarvamToolLanguageName.TELUGU - Telugu
  • SarvamToolLanguageName.PUNJABI - Punjabi
  • SarvamToolLanguageName.ODIA - Odia
  • SarvamToolLanguageName.MARATHI - Marathi
  • SarvamToolLanguageName.HINDI - Hindi
  • SarvamToolLanguageName.ENGLISH - English
Note: The languages allowed at runtime are the subset preselected in the agent configuration on the platform.
Example usage:
const config = {
  // ... other config
  initial_language_name: SarvamToolLanguageName.HINDI,
};

Node.js

Requirements:
  • ws package for WebSocket support
  • Node.js version 18 or higher
Installation:
npm install sarvam-conv-ai-sdk ws
Usage:
const { ConversationAgent } = require('sarvam-conv-ai-sdk');
// ws is automatically used in Node.js environment
Note: Audio interfaces are not automatically available in Node.js. For voice conversations in Node.js, you’ll need to provide your own audio input/output handling.
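One simple way to handle agent audio in Node.js is to collect the raw chunks via audioCallback and wrap them in a WAV header so the result can be saved and played with any audio player. This is a sketch, not SDK functionality; it assumes 16-bit PCM mono as described above:

```typescript
// Sketch: wrap raw 16-bit PCM mono (as delivered via audioCallback) in a
// 44-byte WAV header so it can be written to disk and played back.
function pcmToWav(pcm: Uint8Array, sampleRate: number): Buffer {
  const header = Buffer.alloc(44);
  const byteRate = sampleRate * 2; // mono, 16-bit => 2 bytes per sample
  header.write('RIFF', 0);
  header.writeUInt32LE(36 + pcm.length, 4); // total size minus 8 bytes
  header.write('WAVE', 8);
  header.write('fmt ', 12);
  header.writeUInt32LE(16, 16);  // fmt chunk size for PCM
  header.writeUInt16LE(1, 20);   // audio format: PCM
  header.writeUInt16LE(1, 22);   // channels: mono
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(2, 32);   // block align (channels * bytes/sample)
  header.writeUInt16LE(16, 34);  // bits per sample
  header.write('data', 36);
  header.writeUInt32LE(pcm.length, 40);
  return Buffer.concat([header, Buffer.from(pcm)]);
}

// Usage: accumulate msg.data chunks in audioCallback, then e.g.
// fs.writeFileSync('reply.wav', pcmToWav(allBytes, 16000));
```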

Message Types

Server Message Types

The SDK handles various message types from the server:
  • server.media.text_chunk: Streaming text response from the agent
  • server.media.audio_chunk: Streaming audio response from the agent
  • server.action.interaction_connected: Conversation session established
  • server.action.interaction_end: Conversation session ended
  • server.event.user_speech_start: User started speaking
  • server.event.user_speech_end: User stopped speaking
  • server.event.user_interrupt: User interrupted the agent
  • server.event.variable_update: Agent variables updated
  • server.event.language_change: Conversation language changed
  • server.event.state_transition: Agent state transitioned
  • server.event.tool_call: Agent called a tool/function
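A typical eventCallback dispatches on the event type. The sketch below assumes event.type carries the dotted strings listed above verbatim:

```typescript
// Sketch: map server event types to human-readable descriptions.
// Assumes event.type is the dotted string from the table above.
function describeServerEvent(type: string): string {
  switch (type) {
    case 'server.event.user_interrupt':
      return 'user interrupted the agent';
    case 'server.event.language_change':
      return 'conversation language changed';
    case 'server.action.interaction_end':
      return 'conversation ended';
    default:
      return `unhandled event: ${type}`;
  }
}

// eventCallback: async (event) => {
//   console.log(describeServerEvent(event.type));
// }
```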

Client Message Types

Messages sent from the SDK to the server:
  • client.action.interaction_start: Start the conversation with the given configuration
  • client.media.audio_chunk: Send audio data to the agent
  • client.media.text: Send a text message to the agent
  • client.action.interaction_end: End the conversation session

Best Practices

1. Resource Cleanup

Always cleanup resources when done:
useEffect(() => {
  return () => {
    if (agentRef.current) {
      agentRef.current.stop().catch(console.error);
    }
  };
}, []);

2. Connection Timeout

Always specify a timeout when waiting for connection:
const connected = await agent.waitForConnect(10); // 10 seconds
if (!connected) {
  // Handle connection failure
}

3. Error Handling

Implement comprehensive error handling:
try {
  await agent.start();
} catch (error) {
  // In TypeScript, a caught value is unknown; narrow it before use.
  const message = error instanceof Error ? error.message : String(error);
  if (message.includes('permission')) {
    // Handle microphone permission error
  } else if (message.includes('network')) {
    // Handle network error
  } else {
    // Handle other errors
  }
}
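The SDK is not documented to reconnect automatically, so wrapping startup in a retry with exponential backoff is one way to make it more resilient to transient network failures. This is a generic sketch with a hypothetical helper name:

```typescript
// Sketch: retry an async operation (e.g. agent.start()) with exponential
// backoff between attempts. Rethrows the last error if all attempts fail.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Wait 500ms, 1000ms, 2000ms, ... before the next attempt.
      await new Promise(resolve => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  throw lastError;
}

// Usage: await withRetry(() => agent.start());
```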