Build a voice agent

Real-time STT → LLM → TTS pipeline using @spekoai/adapter-livekit on a LiveKit Agents worker.

This guide walks through standing up a LiveKit Agents worker that uses Speko for every modality. The worker registers with LiveKit Cloud, joins rooms on demand, and runs a streaming voice pipeline backed by Speko's routing.

If you only want browser-side conversation logic and don't run your own worker, see the hosted session flow instead.

Architecture

Browser ⟷ LiveKit room ⟷ your agent worker
                              │
                              └─→ @spekoai/sdk → Speko gateway → providers

Three processes meet in a LiveKit room:

Browser uses @spekoai/client to join with a session token your server mints.
Your API server mints the token (POST /v1/sessions or your own livekit-server-sdk flow) and dispatches the agent worker.
Your agent worker (this guide) runs @livekit/agents with Speko-backed STT/LLM/TTS.

Audio flows browser ↔ LiveKit ↔ worker. Speko sits in the control path, not the audio path.

Install

npm install @spekoai/sdk @spekoai/adapter-livekit \
            @livekit/agents @livekit/agents-plugin-silero @livekit/rtc-node

@livekit/agents and @livekit/rtc-node are peer dependencies; pin the versions you actually run.

Worker entry

import {
  type JobContext,
  type JobProcess,
  ServerOptions,
  cli,
  defineAgent,
  voice,
} from '@livekit/agents';
import * as silero from '@livekit/agents-plugin-silero';
import { Speko } from '@spekoai/sdk';
import { createSpekoComponents } from '@spekoai/adapter-livekit';
import { fileURLToPath } from 'node:url';

const speko = new Speko({ apiKey: process.env.SPEKO_API_KEY! });

export default defineAgent({
  prewarm: async (proc: JobProcess) => {
    proc.userData.vad = await silero.VAD.load();
  },
  entry: async (ctx: JobContext) => {
    const vad = ctx.proc.userData.vad as silero.VAD;

    const { stt, llm, tts } = createSpekoComponents({
      speko,
      vad,
      intent: { language: 'en-US', optimizeFor: 'balanced' },
      // optional: pin providers
      // constraints: { allowedProviders: { tts: ['cartesia'] } },
    });

    const session = new voice.AgentSession({ vad, stt, llm, tts });

    await session.start({
      agent: new voice.Agent({
        instructions: 'You are a helpful voice assistant. Be concise.',
      }),
      room: ctx.room,
    });

    await ctx.connect();
    session.generateReply({ instructions: 'Greet the user and offer your assistance.' });
  },
});

cli.runApp(
  new ServerOptions({
    agent: fileURLToPath(import.meta.url),
    agentName: 'speko-demo',
  }),
);

Run it with node agent.js dev (after build; use start instead of dev for production) or your tsx setup of choice. The worker registers with LiveKit Cloud under agentName and waits for dispatches.
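
The worker entry above reads SPEKO_API_KEY with a non-null assertion, so a missing variable only surfaces on the first request. A minimal sketch of a fail-fast check follows. requireEnv is a hypothetical helper, not part of @spekoai/sdk or @livekit/agents, and the three LIVEKIT_* names are assumed to be the variables your LiveKit worker reads for its connection.

```typescript
// Hypothetical helper: report every missing variable in a single error
// instead of failing later on the first provider call.
type Env = Record<string, string | undefined>;

const REQUIRED_ENV = [
  'SPEKO_API_KEY',      // passed to `new Speko({ apiKey })` in the worker entry
  'LIVEKIT_URL',        // assumed: LiveKit connection settings for the worker
  'LIVEKIT_API_KEY',
  'LIVEKIT_API_SECRET',
] as const;

type RequiredEnv = Record<(typeof REQUIRED_ENV)[number], string>;

function requireEnv(env: Env): RequiredEnv {
  // Treat both undefined and empty strings as missing.
  const missing = REQUIRED_ENV.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  const out = {} as RequiredEnv;
  for (const name of REQUIRED_ENV) out[name] = env[name] as string;
  return out;
}

// Usage in the worker, before constructing the Speko client:
//   const env = requireEnv(process.env);
//   const speko = new Speko({ apiKey: env.SPEKO_API_KEY });
```

Taking the environment as a parameter rather than reading process.env inside the helper keeps it easy to test.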

Per-session config from dispatch metadata

When your server creates a session, the dispatcher passes JSON metadata to the worker. Read it in entry to build a pipeline per session:

import { z } from 'zod';

const dispatchSchema = z.object({
  sessionId: z.string(),
  intent: z.object({
    language: z.string(),
    optimizeFor: z.enum(['balanced', 'accuracy', 'latency', 'cost']).optional(),
  }),
  constraints: z.any().optional(),
  voice: z.string().optional(),
  systemPrompt: z.string().optional(),
});

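// Use || rather than ??: an empty metadata string must also fall back to '{}', or JSON.parse throws.
const meta = dispatchSchema.parse(JSON.parse(ctx.job.metadata || '{}'));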

const { stt, llm, tts } = createSpekoComponents({
  speko,
  vad,
  intent: meta.intent,
  constraints: meta.constraints,
  voice: meta.voice,
});
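
Both ends of the metadata handoff can be sketched without dependencies. The types mirror dispatchSchema above. buildDispatchMetadata and readDispatchMetadata are hypothetical names for illustration, and the reader checks only the required fields, so treat it as a sketch rather than a replacement for the zod schema.

```typescript
// Shape mirrors dispatchSchema above; helper names are hypothetical.
type OptimizeFor = 'balanced' | 'accuracy' | 'latency' | 'cost';

interface DispatchMetadata {
  sessionId: string;
  intent: { language: string; optimizeFor?: OptimizeFor };
  constraints?: unknown;
  voice?: string;
  systemPrompt?: string;
}

// Server side: serialize per-session settings into the dispatch metadata string.
function buildDispatchMetadata(meta: DispatchMetadata): string {
  return JSON.stringify(meta);
}

// Worker side: parse defensively. Returns null instead of throwing, so the
// caller decides whether to reject the job or fall back to defaults.
function readDispatchMetadata(raw: string | undefined): DispatchMetadata | null {
  if (!raw) return null; // undefined or empty string: no metadata was sent
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  if (typeof value !== 'object' || value === null) return null;
  const v = value as Record<string, unknown>;
  const intent = v.intent as Record<string, unknown> | null | undefined;
  // Only the required fields are checked here; optional fields pass through.
  if (typeof v.sessionId !== 'string') return null;
  if (typeof intent !== 'object' || intent === null) return null;
  if (typeof intent.language !== 'string') return null;
  return value as DispatchMetadata;
}
```

Returning null rather than throwing lets entry choose a default intent for jobs dispatched without metadata, for example during local testing.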

Limitations of v1

STT upload is utterance-bounded. /v1/transcribe streams transcript events back, but the LiveKit adapter still uploads one VAD-segmented WAV per utterance.
TTS is sentence-bounded in LiveKit. /v1/synthesize streams audio bytes, but the adapter calls it once per tokenized sentence.
Tool calls are supported. Inline tools return to the LiveKit runtime; registered webhook, builtin, and integration tools run server-side through /v1/complete.
TTS format constraints. Cartesia (PCM) and WAV TTS work. ElevenLabs MP3 currently throws; pin a PCM-capable provider via constraints.allowedProviders.tts or rely on the router's score-driven default.
STT input. Mono PCM16 frames only; multi-channel input throws.
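
To make the STT constraints above concrete, here is a minimal sketch of what one mono PCM16 WAV per utterance can look like, plus one way to downmix interleaved multi-channel audio before it reaches the adapter. Both helpers are hypothetical illustrations. The adapter's own encoder is not shown in this guide and may differ.

```typescript
// Wrap mono PCM16 samples in a minimal 44-byte RIFF/WAVE header.
function encodeWavMonoPcm16(samples: Int16Array, sampleRate: number): Uint8Array {
  const dataBytes = samples.length * 2;
  const out = new Uint8Array(44 + dataBytes);
  const view = new DataView(out.buffer);
  const ascii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  ascii(0, 'RIFF');
  view.setUint32(4, 36 + dataBytes, true);   // remaining file size after this field
  ascii(8, 'WAVE');
  ascii(12, 'fmt ');
  view.setUint32(16, 16, true);              // fmt chunk size for PCM
  view.setUint16(20, 1, true);               // audio format 1 = PCM
  view.setUint16(22, 1, true);               // channel count: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true);  // byte rate = rate * channels * 2 bytes
  view.setUint16(32, 2, true);               // block align = channels * 2 bytes
  view.setUint16(34, 16, true);              // bits per sample
  ascii(36, 'data');
  view.setUint32(40, dataBytes, true);
  samples.forEach((s, i) => view.setInt16(44 + i * 2, s, true)); // little-endian samples
  return out;
}

// Average interleaved channels into a single mono channel.
function downmixToMono(interleaved: Int16Array, channels: number): Int16Array {
  if (channels < 1 || interleaved.length % channels !== 0) {
    throw new Error('sample count must be a multiple of the channel count');
  }
  if (channels === 1) return interleaved;
  const frames = interleaved.length / channels;
  const mono = new Int16Array(frames);
  for (let f = 0; f < frames; f++) {
    const frame = interleaved.subarray(f * channels, (f + 1) * channels);
    mono[f] = Math.round(frame.reduce((sum, s) => sum + s, 0) / channels);
  }
  return mono;
}
```

Averaging is the simplest downmix. If one channel carries the speaker and the other carries noise, selecting a single channel may transcribe better.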

See @spekoai/adapter-livekit reference for the full surface.
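
The sentence-bounded TTS behaviour listed above can be pictured as a buffer that holds streamed LLM text until a sentence closes, then hands each complete sentence to a single synthesis call. The sketch below uses a naive splitter on ., ! and ? followed by whitespace. SentenceBuffer is a hypothetical class, and the tokenizer the adapter actually uses is not described in this guide.

```typescript
// Accumulate streamed text deltas and emit sentences as they complete.
// Naive by design: abbreviations such as "Dr. Smith" split early.
class SentenceBuffer {
  private pending = '';

  // Feed one text delta; returns the sentences it completed, if any.
  push(delta: string): string[] {
    this.pending += delta;
    const sentences: string[] = [];
    const boundary = /[.!?]+\s+/g; // terminal punctuation followed by whitespace
    let start = 0;
    while (boundary.exec(this.pending) !== null) {
      const end = boundary.lastIndex; // index just past the matched boundary
      sentences.push(this.pending.slice(start, end).trim());
      start = end;
    }
    this.pending = this.pending.slice(start); // keep the unfinished tail
    return sentences;
  }

  // Call once the LLM stream ends to release the trailing fragment.
  flush(): string[] {
    const rest = this.pending.trim();
    this.pending = '';
    return rest ? [rest] : [];
  }
}
```

Each string returned by push or flush would correspond to one synthesis request, which is why the first audio for a long sentence cannot start until that sentence is complete.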

Browser side

Wire @spekoai/client into your dashboard / web app.

Adapter API

Full adapter reference.