Sessions vs one-shot

Speko supports two integration shapes. Pick based on whether your call is real-time interactive or single-turn batch.

One-shot endpoints

POST /v1/transcribe, POST /v1/synthesize, POST /v1/complete. Each is a single round-trip:

Caller sends input + intent.
Speko picks a provider, runs the call (with failover), returns the result.

Use for:

Batch transcription of recorded audio.
Server-side TTS for notifications, IVR prompts, exports.
LLM completions in a non-voice flow.

No state is held between calls. There is nothing to clean up.

POST /v1/sessions. Returns a transportToken plus a transportUrl and room name. The browser uses @spekoai/client to join the media transport; Speko dispatches an agent worker into the same session to run the STT → LLM → TTS pipeline in real time over WebRTC.

Use for:

Live voice conversations between an end-user and an agent.
Anything that needs barge-in, partial transcripts, or sub-second latency.

The session has a TTL (ttlSeconds, default 900s, max 86400s). The agent worker leaves when the room empties or the token expires. The voiceSession row is retained for usage and audit.

Phone sessions

POST /v1/sessions/phone creates the same kind of voice session, then dials a PSTN destination over LiveKit SIP. Inbound calls follow the reverse path: a registered phone number receives the carrier webhook, Speko creates the voice session, hydrates the linked agent or dispatch metadata template, and bridges the caller into the room.

Use phone sessions for:

Outbound appointment reminders, sales calls, and scheduled callbacks.
Inbound receptionists and support lines.
Calls that need carrier lifecycle events, forwarded-number metadata, post-call reports, recordings, or live transfers.

See Build a phone agent for the end-to-end phone flow.

Choosing

Need	Use
Transcribe a file, return text	`/v1/transcribe`
Generate audio for a notification	`/v1/synthesize`
Single LLM reply, no voice	`/v1/complete`
Real-time voice agent in a browser	`/v1/sessions` + `@spekoai/client`
Outbound PSTN voice call	`/v1/sessions/phone` or `speko.voice.dial()`
Inbound PSTN receptionist	`/v1/phone-numbers` linked to an agent
Inspect reports, events, recordings, or transfers	`/v1/calls/{id}`
Real-time voice in a self-hosted framework worker	`@spekoai/adapter-livekit` directly, when using LiveKit

What sessions don't do

They aren't a chat history store. The dispatched agent's systemPrompt and per-turn context live in your worker.
They don't proxy audio through the REST API. Speko is in the control path (mint token, dispatch worker). Audio flows through the session's media transport.
They don't bill differently from one-shot calls. Usage is recorded per underlying STT/LLM/TTS call.

Recording

Every voice session is recorded by default. The agent worker captures both speakers — caller and agent — into a single mixed-mono Opus file, persisted to Google Cloud Storage at the end of the call. There is no separate "enable recording" call; producing a session produces a recording.

What gets captured:

Mixed mono Opus. Both sides of the conversation in one file, ~24 kbps. Stereo / per-speaker tracks are not produced — assume one combined audio stream per session.
The full call. Recording starts when the first participant joins the room and ends when the room empties or the session token expires.
Audio only. Tool-call payloads and transcripts are not in the audio file; those live on the session row and the per-turn entries.

How to fetch one:

curl -L \
  -H "Authorization: Bearer $SPEKO_API_KEY" \
  https://api.speko.dev/v1/sessions/$SESSION_ID/recording \
  --output session.opus

The endpoint 302-redirects to a short-lived (5 minute) signed GCS URL — pass -L so curl follows it. The signed URL is single-use within its TTL window; refetch the endpoint to get a fresh one rather than caching the URL itself.

The status field

Each session entry carries a recordingStatus that walks through:

pending — the call ended; the agent worker is assembling the file.
uploading — the file is being pushed to GCS.
ready — fully persisted and downloadable. recordingObjectPath and recordingDurationMs are populated.
failed — the upload errored. The recording is unrecoverable; the session itself is unaffected.
suppressed — the organization has recordingEnabled set to false, so no file was ever produced. This is the terminal state for opted-out orgs; it's not a transient one.

null on recordingStatus only appears on legacy session entries that predate this feature. Treat it as "unknown" and don't expect a download.

Retention

Recordings are kept for 30 days after the session ends, then deleted automatically by a GCS lifecycle policy. There is no in-product way to extend retention — if you need long-term storage, follow the redirect, download the file, and store it in your own bucket. If you need shorter retention, see the per-org opt-out below.

Per-org opt-out

Recording is governed at the organization level by organization.recordingEnabled (default true). Flipping it to false:

Stops new sessions from producing files. Their recordingStatus becomes suppressed and GET /v1/sessions/{id}/recording returns 404.
Does not retroactively delete prior recordings — those expire on the normal 30-day timer.

The flag is exposed in the dashboard under the Record voice sessions toggle on the Settings page. Per-call opt-out (skipping recording for a single session even when the org default is on) is out of scope for this round.

HIPAA mode

Organizations on a HIPAA-mode plan are forced into recordingEnabled: false regardless of the dashboard toggle, and the toggle is locked. This is the current bridge until an end-to-end customer-managed-key path lands; see the compliance issue tracker for the full story.

API endpoint

GET /v1/sessions/{id}/recording is the only supported way to retrieve a recording. It authenticates with the same bearer key as every other endpoint and 302-redirects to a 5-minute signed URL on success, or 404 on any of: unknown session, recording not yet ready, recording failed, recording suppressed. Inspect the parent session entry's recordingStatus to disambiguate. Never construct GCS URLs from recordingObjectPath directly — those URLs are not publicly addressable, and the field exists only as an internal handle.

Transcript

Every finalized STT and LLM turn during a cascade session is persisted server-side. The dashboard's session detail page surfaces them under a Transcript card; the API equivalent is GET /v1/sessions/{id}/transcript, which returns turns sorted by index.

The worker batches and debounces (~200ms) finalized turns and POSTs them to /v1/sessions/{id}/turns. The ingest endpoint is idempotent on (session_id, index) so retries on transient errors are safe.

Interim STT partials are not persisted — only finalized turns make it into the transcript.

S2S sessions don't currently produce transcripts (the audio path bypasses our STT/LLM components). That's a follow-up.