RealtimeVoiceConversation
Browser capture and playback for direct speech-to-speech WebSocket sessions.
RealtimeVoiceConversation is the browser-side helper for Speko speech-to-speech (S2S) sessions. It connects directly to the S2S WebSocket returned by POST /v1/sessions, captures the microphone as PCM16, plays streamed PCM16 responses, and forwards transcript and status callbacks.
Use it when you want the lowest-latency S2S path and do not need the browser media transport used by VoiceConversation.
import { RealtimeVoiceConversation } from '@spekoai/client';

Mint the session on your server
Create S2S sessions on your backend so SPEKO_API_KEY never reaches the browser. Return only the short-lived WebSocket credentials.
app.post('/api/realtime-session', async (_req, res) => {
  const response = await fetch('https://api.speko.dev/v1/sessions', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.SPEKO_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      mode: 's2s',
      s2s: {
        provider: 'openai',
        model: 'gpt-realtime',
        voice: 'alloy',
        systemPrompt: 'You are a concise voice assistant.',
      },
      ttlSeconds: 900,
    }),
  });

  if (!response.ok) {
    res.status(response.status).json({ error: 'Could not start realtime session' });
    return;
  }

  const session = await response.json();
  res.json({
    sessionId: session.sessionId,
    wsUrl: session.wsUrl,
    wsToken: session.wsToken,
    expiresAt: session.expiresAt,
    inputSampleRate: session.inputSampleRate,
    outputSampleRate: session.outputSampleRate,
  });
});

Connect from the browser
import { useEffect, useRef, useState } from 'react';
import { RealtimeVoiceConversation } from '@spekoai/client';

export function RealtimePanel() {
  const convRef = useRef<RealtimeVoiceConversation | null>(null);
  const [status, setStatus] = useState('idle');
  const [transcript, setTranscript] = useState<string[]>([]);

  async function start() {
    setStatus('connecting');
    try {
      const res = await fetch('/api/realtime-session', { method: 'POST' });
      if (!res.ok) throw new Error(`Session request failed: ${res.status}`);
      const session = await res.json();

      const conv = await RealtimeVoiceConversation.create({
        ...session,
        onConnect: ({ conversationId }) => {
          console.log('connected', conversationId);
        },
        onStatusChange: setStatus,
        onMessage: ({ source, text, isFinal }) => {
          // Only final transcript frames are appended to the list.
          if (isFinal) setTranscript((items) => [...items, `${source}: ${text}`]);
        },
        onError: (err) => console.error(err),
        onDisconnect: () => setStatus('idle'),
      });
      convRef.current = conv;
    } catch (err) {
      // Reset the status so Start is usable again after a failed attempt.
      console.error(err);
      setStatus('idle');
    }
  }

  async function stop() {
    await convRef.current?.endSession();
    convRef.current = null;
  }

  // End the session when the component unmounts.
  useEffect(() => () => { void convRef.current?.endSession(); }, []);

  return (
    <div>
      <button onClick={start} disabled={status !== 'idle'}>Start</button>
      <button onClick={stop} disabled={status === 'idle'}>Stop</button>
      <p>Status: {status}</p>
      <ul>{transcript.map((item, i) => <li key={i}>{item}</li>)}</ul>
    </div>
  );
}

RealtimeVoiceConversation.create(options)
static create(options: RealtimeConversationOptions): Promise<RealtimeVoiceConversation>

create() opens the WebSocket, waits for a ready frame, starts microphone capture, then resolves.
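The "wait for a ready frame" step can be sketched roughly as below. This is an illustration of the ordering only, not the SDK's internals: the `{ t: 'ready' }` frame shape, the `SocketLike` interface, the timeout, and the helper names `isReadyFrame` and `waitForReady` are all assumptions made for the example.

```typescript
// Hypothetical minimal socket surface for this sketch; a thin adapter around
// a browser WebSocket could provide it.
interface SocketLike {
  onmessage: ((ev: { data: unknown }) => void) | null;
  onclose: (() => void) | null;
}

// Assumption: the ready frame is a JSON text frame shaped like { "t": "ready" }.
function isReadyFrame(data: unknown): boolean {
  if (typeof data !== 'string') return false; // binary frames carry audio
  try {
    const frame: unknown = JSON.parse(data);
    return (
      typeof frame === 'object' &&
      frame !== null &&
      (frame as { t?: unknown }).t === 'ready'
    );
  } catch {
    return false; // not JSON, so not a ready frame
  }
}

// Resolves on the first ready frame; rejects if the socket closes first or
// the (hypothetical) timeout elapses.
function waitForReady(socket: SocketLike, timeoutMs = 5000): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    const cleanup = () => {
      clearTimeout(timer);
      socket.onmessage = null;
      socket.onclose = null;
    };
    const timer = setTimeout(() => {
      cleanup();
      reject(new Error('Timed out waiting for ready frame'));
    }, timeoutMs);

    socket.onmessage = (ev) => {
      if (isReadyFrame(ev.data)) {
        cleanup();
        resolve();
      }
    };
    socket.onclose = () => {
      cleanup();
      reject(new Error('Socket closed before ready frame'));
    };
  });
}
```

In this sketch the caller would install its long-lived message handlers and start microphone capture only after the promise resolves, which matches the order described above.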
RealtimeConversationOptions
| Field | Type | Required | Description |
|---|---|---|---|
| sessionId | string | yes | Server-assigned session id. Also returned by getId(). |
| wsUrl | string | yes | Short-lived S2S WebSocket URL returned by POST /v1/sessions. |
| wsToken | string | yes | Short-lived WebSocket token. Sent as the first WebSocket subprotocol. |
| expiresAt | string? | no | ISO-8601 expiry for the WebSocket token. |
| inputSampleRate | (16000 \| 24000)? | no | Requested capture rate. Defaults to 24000; the server can negotiate it. |
| outputSampleRate | (16000 \| 24000)? | no | Requested playback rate. Defaults to 24000; the server can negotiate it. |
| inputDeviceId | string? | no | Specific microphone deviceId. |
| audioConstraints | AudioConstraints? | no | echoCancellation, noiseSuppression, autoGainControl. |
| onConnect | (d: { conversationId }) => void | no | Fired after the socket is ready and microphone capture has started. |
| onDisconnect | (d: DisconnectionDetails) => void | no | Fired when the client or socket closes. |
| onMessage | (m: ConversationMessage) => void | no | Transcript frames mapped to { source, text, isFinal }. |
| onStatusChange | (s: ConversationStatus) => void | no | connecting, connected, disconnecting, or disconnected. |
| onModeChange | (m: ConversationMode) => void | no | speaking while response audio is queued, otherwise listening. |
| onError | (err: Error) => void | no | WebSocket transport errors and provider error frames. |
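The onMessage and onError rows above imply a dispatch step between incoming JSON frames and the callbacks. A minimal sketch of that step follows, under stated assumptions: transcript frames are assumed to carry `source`, `text`, and `isFinal` fields directly, error frames are assumed to carry a `message` field, and `routeJsonFrame` is a hypothetical helper. The `ConversationMessage` type here is a local stand-in, not the SDK's exported type.

```typescript
// Local stand-in for the message shape described in the table.
type ConversationMessage = { source: string; text: string; isFinal: boolean };

interface FrameCallbacks {
  onMessage?: (m: ConversationMessage) => void;
  onError?: (err: Error) => void;
}

// Parses one JSON text frame and invokes the matching callback.
// Returns the frame type that was handled, or null if the frame was ignored.
function routeJsonFrame(
  raw: string,
  cb: FrameCallbacks,
): 'transcript' | 'error' | null {
  let frame: Record<string, unknown>;
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== 'object' || parsed === null) return null;
    frame = parsed as Record<string, unknown>;
  } catch {
    return null; // malformed JSON is ignored in this sketch
  }

  if (frame.t === 'transcript') {
    // Field names below are assumptions about the wire format.
    cb.onMessage?.({
      source: String(frame.source ?? 'unknown'),
      text: String(frame.text ?? ''),
      isFinal: frame.isFinal === true,
    });
    return 'transcript';
  }
  if (frame.t === 'error') {
    // `message` is an assumed field name for provider error frames.
    cb.onError?.(new Error(String(frame.message ?? 'Provider error')));
    return 'error';
  }
  return null; // other frame types are not handled here
}
```

Keeping the dispatcher a pure function of the raw frame and the callbacks makes it easy to unit test without opening a socket.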
Instance methods
getId(): string
Returns the sessionId passed to create().
isOpen(): boolean
true while the SDK status is connected and the WebSocket is open.
setMicMuted(muted: boolean): Promise<void>
Mute or unmute local microphone capture. Muting disables the media track and stops PCM frames from being sent.
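The two effects of muting described above (disabling the track and withholding PCM frames) can be sketched as follows. `TrackLike`, `applyMicMuted`, and `shouldSendFrame` are hypothetical names for illustration; in a browser the tracks would typically come from `stream.getAudioTracks()`, and `enabled` is the standard `MediaStreamTrack` property.

```typescript
// Structural stand-in for the part of MediaStreamTrack this sketch touches.
interface TrackLike {
  enabled: boolean;
}

// Disabling a track makes it produce silence without releasing the device.
function applyMicMuted(tracks: TrackLike[], muted: boolean): void {
  for (const track of tracks) {
    track.enabled = !muted;
  }
}

// Gate for the capture callback: frames are dropped while muted or when the
// socket is not open (the second condition is an assumption of this sketch).
function shouldSendFrame(muted: boolean, socketOpen: boolean): boolean {
  return !muted && socketOpen;
}
```

Gating the send path as well as the track avoids uploading frames of silence while muted.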
setVolume(volume: number): void
Set response playback volume from 0 to 1. Values outside that range are clamped.
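For illustration, clamping to the 0 to 1 range might look like the sketch below. `clampVolume` is a hypothetical helper, and treating NaN as 0 is an assumption of this example rather than documented SDK behavior.

```typescript
// Clamps a requested volume into [0, 1].
function clampVolume(volume: number): number {
  if (Number.isNaN(volume)) return 0; // assumption: NaN is treated as silent
  return Math.min(1, Math.max(0, volume));
}
```

With this helper, `clampVolume(1.7)` yields 1 and `clampVolume(-0.2)` yields 0; a Web Audio implementation could then assign the result to a `GainNode`'s `gain.value`.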
endSession(): Promise<void>
Close the WebSocket, stop microphone tracks, clear queued playback, close the AudioContext, and transition to disconnected.
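The teardown steps listed above can be sketched against injected resources. Everything here is hypothetical scaffolding (`SessionResources`, `teardown`, the structural interfaces), and passing through `disconnecting` before `disconnected` is an assumption; the sketch only illustrates one plausible ordering.

```typescript
interface Closable {
  close(): void | Promise<void>;
}
interface Stoppable {
  stop(): void;
}

// Hypothetical bundle of the resources a session holds.
interface SessionResources {
  socket: Closable; // e.g. a WebSocket
  tracks: Stoppable[]; // e.g. microphone MediaStreamTracks
  playbackQueue: unknown[]; // queued response audio chunks
  audioContext: Closable; // e.g. an AudioContext
}

async function teardown(
  res: SessionResources,
  setStatus: (s: 'disconnecting' | 'disconnected') => void,
): Promise<void> {
  setStatus('disconnecting');
  res.socket.close(); // 1. close the WebSocket
  for (const track of res.tracks) track.stop(); // 2. release the microphone
  res.playbackQueue.length = 0; // 3. drop queued playback
  await res.audioContext.close(); // 4. close the audio graph
  setStatus('disconnected');
}
```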
Transport notes
- The SDK passes wsToken as the first WebSocket subprotocol because browsers cannot set custom headers on new WebSocket().
- Outbound microphone audio is sent as 20 ms PCM16 binary frames at the negotiated input sample rate.
- Inbound binary frames are PCM16 response audio at the negotiated output sample rate.
- JSON frames with t: 'transcript' are forwarded to onMessage. JSON frames with t: 'error' are forwarded to onError.
- AudioWorklet capture is used when available; the SDK falls back to ScriptProcessorNode for older browsers.
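The 20 ms PCM16 framing works out to 480 samples (960 bytes) per frame at 24000 Hz and 320 samples (640 bytes) at 16000 Hz, assuming mono audio. The sketch below shows that arithmetic together with a common float-to-PCM16 conversion. The helper names are hypothetical, mono capture is an assumption, and the byte order on the wire is not stated above (`Int16Array` uses platform order, which is little-endian on common hardware).

```typescript
const FRAME_MS = 20;

// Samples in one 20 ms mono frame at the given rate.
function samplesPerFrame(sampleRate: 16000 | 24000): number {
  return (sampleRate * FRAME_MS) / 1000;
}

// Converts Web Audio float samples in [-1, 1] to signed 16-bit PCM.
// Out-of-range input is clamped before scaling.
function floatToPcm16(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]));
    out[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  }
  return out;
}

// Splits a PCM16 buffer into whole 20 ms frames. The trailing partial frame
// is returned as `rest` so the caller can prepend it to the next chunk.
function splitFrames(
  pcm: Int16Array,
  sampleRate: 16000 | 24000,
): { frames: Int16Array[]; rest: Int16Array } {
  const size = samplesPerFrame(sampleRate);
  const frames: Int16Array[] = [];
  let offset = 0;
  for (; offset + size <= pcm.length; offset += size) {
    frames.push(pcm.subarray(offset, offset + size));
  }
  return { frames, rest: pcm.subarray(offset) };
}
```

Carrying the remainder forward matters because capture callbacks rarely deliver buffers whose length is a multiple of the frame size.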