Speko Docs

RealtimeVoiceConversation

Browser capture and playback for direct speech-to-speech WebSocket sessions.

RealtimeVoiceConversation is the browser-side helper for Speko speech-to-speech (S2S) sessions. It connects directly to the S2S WebSocket returned by POST /v1/sessions, captures the microphone as PCM16, plays streamed PCM16 responses, and forwards transcript and status callbacks.

Use it when you want the lowest-latency S2S path and do not need the browser media transport used by VoiceConversation.

import { RealtimeVoiceConversation } from '@spekoai/client';

Mint the session on your server

Create S2S sessions on your backend so SPEKO_API_KEY never reaches the browser. Return only the short-lived WebSocket credentials.

// Express-style route handler; assumes `app` is an existing Express application.
app.post('/api/realtime-session', async (_req, res) => {
  const response = await fetch('https://api.speko.dev/v1/sessions', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.SPEKO_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      mode: 's2s',
      s2s: {
        provider: 'openai',
        model: 'gpt-realtime',
        voice: 'alloy',
        systemPrompt: 'You are a concise voice assistant.',
      },
      ttlSeconds: 900,
    }),
  });

  if (!response.ok) {
    res.status(response.status).json({ error: 'Could not start realtime session' });
    return;
  }

  const session = await response.json();
  res.json({
    sessionId: session.sessionId,
    wsUrl: session.wsUrl,
    wsToken: session.wsToken,
    expiresAt: session.expiresAt,
    inputSampleRate: session.inputSampleRate,
    outputSampleRate: session.outputSampleRate,
  });
});

Connect from the browser

import { useEffect, useRef, useState } from 'react';
import { RealtimeVoiceConversation } from '@spekoai/client';

export function RealtimePanel() {
  const convRef = useRef<RealtimeVoiceConversation | null>(null);
  const [status, setStatus] = useState('idle');
  const [transcript, setTranscript] = useState<string[]>([]);

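  async function start() {
    setStatus('connecting');
    try {
      // Fetch short-lived S2S credentials from the backend route.
      const response = await fetch('/api/realtime-session', { method: 'POST' });
      if (!response.ok) throw new Error(`Session request failed (${response.status})`);
      const session = await response.json();

      // create() resolves once the socket is ready and the microphone is capturing.
      convRef.current = await RealtimeVoiceConversation.create({
        ...session,
        onConnect: ({ conversationId }) => {
          console.log('connected', conversationId);
        },
        onStatusChange: setStatus,
        onMessage: ({ source, text, isFinal }) => {
          // Keep only final transcripts; interim frames are ignored here.
          if (isFinal) setTranscript((items) => [...items, `${source}: ${text}`]);
        },
        onError: (err) => console.error(err),
        onDisconnect: () => setStatus('idle'),
      });
    } catch (err) {
      // Without this, a failed request or a rejected create() call leaves the UI stuck on "connecting".
      console.error(err);
      setStatus('idle');
    }
  }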

  async function stop() {
    await convRef.current?.endSession();
    convRef.current = null;
  }

  useEffect(() => () => { void convRef.current?.endSession(); }, []);

  return (
    <div>
      <button onClick={start} disabled={status !== 'idle'}>Start</button>
      <button onClick={stop} disabled={status === 'idle'}>Stop</button>
      <p>Status: {status}</p>
      <ul>{transcript.map((item, i) => <li key={i}>{item}</li>)}</ul>
    </div>
  );
}
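
The component above appends only final transcripts. If interim text should also be visible, one possible approach is a small pure reducer over the { source, text, isFinal } messages delivered to onMessage. This is a sketch, not part of the SDK: TranscriptState, emptyTranscript, and applyMessage are hypothetical names, and the sketch assumes each non-final message carries the full partial text so far rather than a delta, which this page does not confirm.

```typescript
// Shape taken from the onMessage description: { source, text, isFinal }.
interface ConversationMessage {
  source: string;
  text: string;
  isFinal: boolean;
}

// Hypothetical view state: committed lines plus the latest partial text per speaker.
interface TranscriptState {
  lines: string[];
  partial: Record<string, string>;
}

const emptyTranscript: TranscriptState = { lines: [], partial: {} };

// Pure reducer: a non-final message overwrites that speaker's pending text
// (assumption: partials are cumulative), a final message commits a line and
// clears the pending text for that speaker.
function applyMessage(state: TranscriptState, msg: ConversationMessage): TranscriptState {
  const partial = { ...state.partial };
  if (!msg.isFinal) {
    partial[msg.source] = msg.text;
    return { lines: state.lines, partial };
  }
  delete partial[msg.source];
  return { lines: [...state.lines, `${msg.source}: ${msg.text}`], partial };
}
```

In a React component this could be wired as onMessage: (m) => setTranscript((s) => applyMessage(s, m)), with the state initialised to emptyTranscript.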

RealtimeVoiceConversation.create(options)

static create(options: RealtimeConversationOptions): Promise<RealtimeVoiceConversation>

create() opens the WebSocket, waits for a ready frame, starts microphone capture, then resolves.

RealtimeConversationOptions

Field             Type                               Required  Description
sessionId         string                             yes       Server-assigned session id. Also returned by getId().
wsUrl             string                             yes       Short-lived S2S WebSocket URL returned by POST /v1/sessions.
wsToken           string                             yes       Short-lived WebSocket token. Sent as the first WebSocket subprotocol.
expiresAt         string                             no        ISO-8601 expiry for the WebSocket token.
inputSampleRate   16000 | 24000                      no        Requested capture rate. Defaults to 24000; the server can negotiate it.
outputSampleRate  16000 | 24000                      no        Requested playback rate. Defaults to 24000; the server can negotiate it.
inputDeviceId     string                             no        Specific microphone deviceId.
audioConstraints  AudioConstraints                   no        echoCancellation, noiseSuppression, autoGainControl.
onConnect         (d: { conversationId }) => void    no        Fired after the socket is ready and microphone capture has started.
onDisconnect      (d: DisconnectionDetails) => void  no        Fired when the client or socket closes.
onMessage         (m: ConversationMessage) => void   no        Transcript frames mapped to { source, text, isFinal }.
onStatusChange    (s: ConversationStatus) => void    no        connecting, connected, disconnecting, or disconnected.
onModeChange      (m: ConversationMode) => void      no        speaking while response audio is queued, otherwise listening.
onError           (err: Error) => void               no        WebSocket transport errors and provider error frames.
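
Because wsUrl and wsToken are short-lived, a client may want to check expiresAt before calling create() and mint a new session if the credentials are stale. The helper below is a minimal sketch under stated assumptions: isSessionFresh is a hypothetical name, the 5-second safety margin is arbitrary, and treating a missing expiresAt as "try anyway" is a choice made here, not documented SDK behaviour.

```typescript
// Returns true when the credentials look usable at time `nowMs`.
// expiresAt is optional in RealtimeConversationOptions, so undefined is
// treated as "unknown, try anyway" (an assumption of this sketch).
function isSessionFresh(
  expiresAt: string | undefined,
  nowMs: number = Date.now(),
  skewMs: number = 5_000, // hypothetical margin for clock skew and connect time
): boolean {
  if (expiresAt === undefined) return true;
  const expiryMs = Date.parse(expiresAt);
  // An unparseable timestamp is refused rather than guessed at.
  if (Number.isNaN(expiryMs)) return false;
  return expiryMs - nowMs > skewMs;
}
```

A caller could run this on the object returned by its session endpoint and request a fresh session when it returns false.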

Instance methods

getId(): string

Returns the sessionId passed to create().

isOpen(): boolean

Returns true while the SDK status is connected and the WebSocket is open.

setMicMuted(muted: boolean): Promise<void>

Mute or unmute local microphone capture. Muting disables the media track and stops PCM frames from being sent.
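
One use of setMicMuted() is push-to-talk, where the microphone stays muted except while a key or button is held. The following is a sketch only: createPushToTalk and the MicMutable interface are hypothetical and declare just the one method they need, which RealtimeVoiceConversation provides with the signature shown above.

```typescript
// Only the method this sketch relies on.
interface MicMutable {
  setMicMuted(muted: boolean): Promise<void>;
}

// Hypothetical helper: muted by default, unmuted while the talk control is held.
function createPushToTalk(conv: MicMutable) {
  let held = false;
  void conv.setMicMuted(true); // start muted

  return {
    press(): Promise<void> {
      if (held) return Promise.resolve(); // ignore key-repeat events
      held = true;
      return conv.setMicMuted(false);
    },
    release(): Promise<void> {
      if (!held) return Promise.resolve(); // release without a matching press
      held = false;
      return conv.setMicMuted(true);
    },
  };
}
```

In a page, press() and release() would typically be bound to keydown/keyup or pointerdown/pointerup handlers on the talk control.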

setVolume(volume: number): void

Set response playback volume from 0 to 1. Values outside that range are clamped.
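
The SDK clamps the value itself, so callers do not have to. UI code that displays the effective volume may still want the same clamping locally, for example when mapping a 0 to 100 slider. The helpers below are illustrative, not the SDK's implementation; clampVolume and sliderToVolume are hypothetical names, and mapping NaN to 0 is an assumption of this sketch.

```typescript
// Clamp to the documented 0..1 range.
function clampVolume(volume: number): number {
  if (Number.isNaN(volume)) return 0; // assumption: treat NaN as silence
  return Math.min(1, Math.max(0, volume));
}

// Map a 0..100 slider position to the 0..1 range expected by setVolume().
function sliderToVolume(percent: number): number {
  return clampVolume(percent / 100);
}
```

For example, a slider handler could call conv.setVolume(sliderToVolume(value)) and show the same number in the UI.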

endSession(): Promise<void>

Close the WebSocket, stop microphone tracks, clear queued playback, close the AudioContext, and transition to disconnected.

Transport notes

  • The SDK passes wsToken as the first WebSocket subprotocol because browsers cannot set custom headers on new WebSocket().
  • Outbound microphone audio is sent as 20 ms PCM16 binary frames at the negotiated input sample rate.
  • Inbound binary frames are PCM16 response audio at the negotiated output sample rate.
  • JSON frames with t: 'transcript' are forwarded to onMessage. JSON frames with t: 'error' are forwarded to onError.
  • AudioWorklet capture is used when available; the SDK falls back to ScriptProcessorNode for older browsers.
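
The framing described above can be sketched in a few lines. This is an illustration of the notes, not the SDK's source: frameSamples, frameBytes, floatToPcm16, and classifyFrame are hypothetical names, and mono audio and little-endian byte order are assumptions that this page does not state. Only the t field of JSON frames is taken from the notes; any other fields are left untyped.

```typescript
const BYTES_PER_SAMPLE = 2; // PCM16
const FRAME_MS = 20;

// Samples in one 20 ms frame (assumes mono).
function frameSamples(sampleRate: number): number {
  return (sampleRate * FRAME_MS) / 1000;
}

function frameBytes(sampleRate: number): number {
  return frameSamples(sampleRate) * BYTES_PER_SAMPLE;
}

// Convert Web Audio float samples (-1..1) to PCM16 (assumed little-endian).
function floatToPcm16(input: Float32Array): ArrayBuffer {
  const buffer = new ArrayBuffer(input.length * BYTES_PER_SAMPLE);
  const view = new DataView(buffer);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]));
    view.setInt16(i * BYTES_PER_SAMPLE, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}

type Classified =
  | { kind: 'audio'; pcm: ArrayBuffer }
  | { kind: 'transcript' | 'error'; frame: Record<string, unknown> }
  | { kind: 'other' };

// Binary frames are response audio; JSON frames are routed by their `t` field.
function classifyFrame(data: ArrayBuffer | string): Classified {
  if (typeof data !== 'string') return { kind: 'audio', pcm: data };
  try {
    const frame = JSON.parse(data) as Record<string, unknown>;
    const t = frame.t;
    if (t === 'transcript' || t === 'error') return { kind: t, frame };
  } catch {
    // Not JSON, or not an object: fall through.
  }
  return { kind: 'other' };
}
```

At the default 24000 Hz rate a 20 ms frame is 480 samples, or 960 bytes; at 16000 Hz it is 320 samples, or 640 bytes.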
