
How routing works

Intent + benchmark scores → ranked candidates → live failover. The model behind every Speko call.

Speko continuously benchmarks every supported STT, LLM, TTS, and S2S provider across language and region. Every API call carries a RoutingIntent. The router applies hard filters, normalizes the surviving candidates against each other, picks the top-ranked provider, and falls back through runners-up if the primary fails.

Intent

type RoutingIntent = {
  language: string;          // BCP-47, e.g. "en-US", "es-MX"
  region?: string;           // 'global' | 'us-east4' | 'europe-west3' | 'asia-southeast1' (default: 'global')
  optimizeFor?: 'balanced' | 'accuracy' | 'latency' | 'cost'; // default: balanced
};

language is required. region selects which streaming-latency measurements the router uses and defaults to 'global'. If a provider has only a global row (typical for batch endpoints), Speko uses that row when no per-region data matches your intent. The TypeScript SDK exposes region on RoutingIntent; raw HTTP callers send the same value in the X-Speko-Intent header.
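The region fallback can be sketched as a lookup that prefers the requested region's row and otherwise uses the provider's global row. MeasurementRow and pickLatencyRow are hypothetical names, not part of the Speko SDK, and the latency figures are made up for illustration.

```typescript
// Hypothetical row shape for one provider's latency measurement.
type MeasurementRow = {
  provider: string;
  language: string;
  region: string; // a specific region id, or "global"
  latencyP50Ms: number;
};

// Prefer the row for the requested region; otherwise fall back to "global".
function pickLatencyRow(
  rows: MeasurementRow[],
  provider: string,
  language: string,
  region = "global",
): MeasurementRow | undefined {
  const forProvider = rows.filter(
    (r) => r.provider === provider && r.language === language,
  );
  const exact = forProvider.find((r) => r.region === region);
  return exact ?? forProvider.find((r) => r.region === "global");
}

// Made-up measurements: "batch-vendor" only published a global row.
const rows: MeasurementRow[] = [
  { provider: "deepgram", language: "en-US", region: "us-east4", latencyP50Ms: 180 },
  { provider: "deepgram", language: "en-US", region: "global", latencyP50Ms: 240 },
  { provider: "batch-vendor", language: "en-US", region: "global", latencyP50Ms: 900 },
];

console.log(pickLatencyRow(rows, "deepgram", "en-US", "us-east4")?.latencyP50Ms); // 180
console.log(pickLatencyRow(rows, "batch-vendor", "en-US", "us-east4")?.latencyP50Ms); // 900 (global fallback)
```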

optimizeFor chooses a weight preset that biases each modality's composite score across the quality, latency, and cost axes below. The default preset, balanced, is tuned for production workloads.

| Modality | Quality axis | Latency axis | Cost axis |
| --- | --- | --- | --- |
| STT | WER (lower is better) | TTFP p50 by region | $/min |
| TTS | Round-trip CER | TTFB p50 by region | $/min (chars-billed providers converted via 900 chars/min) |
| LLM | Quality score | TTFT p50 | Blended $/1M tokens |
| S2S | Task-success % (higher is better) | Tool-call p50 by region | $/min |
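The TTS cost conversion is simple arithmetic. A minimal sketch, assuming the provider quotes a price per million characters; the input unit and the function name are illustrative, and only the 900 chars/min constant comes from the table.

```typescript
// Characters per minute of synthesized speech, as stated in the table.
const TTS_CHARS_PER_MINUTE = 900;

// Convert a per-character price (quoted per 1M chars, an assumed unit)
// into $/min so it can be compared with providers that bill per minute.
function ttsDollarsPerMinute(dollarsPerMillionChars: number): number {
  return (dollarsPerMillionChars / 1_000_000) * TTS_CHARS_PER_MINUTE;
}

// A hypothetical $15 per 1M characters works out to about $0.0135/min.
console.log(ttsDollarsPerMinute(15).toFixed(4)); // 0.0135
```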

The balanced preset weights STT, TTS, and LLM at 0.5 quality / 0.3 latency / 0.2 cost. S2S uses 0.4 / 0.4 / 0.2, so task success and turn latency carry equal weight. The other presets shift weight toward the axis they name.
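One way to hold the balanced weights in code, assuming each axis score has already been normalized to [0, 1] with 1 as best. Only the balanced preset is encoded because the other presets' weights aren't published here; BALANCED and composite are hypothetical names.

```typescript
type Modality = "stt" | "tts" | "llm" | "s2s";
type AxisWeights = { quality: number; latency: number; cost: number };
type AxisScores = AxisWeights; // same three axes, each normalized to [0, 1]

// Balanced preset per modality, taken from the paragraph above.
const BALANCED: Record<Modality, AxisWeights> = {
  stt: { quality: 0.5, latency: 0.3, cost: 0.2 },
  tts: { quality: 0.5, latency: 0.3, cost: 0.2 },
  llm: { quality: 0.5, latency: 0.3, cost: 0.2 },
  s2s: { quality: 0.4, latency: 0.4, cost: 0.2 },
};

// Weighted sum of one candidate's normalized axis scores.
function composite(scores: AxisScores, weights: AxisWeights): number {
  return (
    scores.quality * weights.quality +
    scores.latency * weights.latency +
    scores.cost * weights.cost
  );
}

// Best quality, worst latency, middling cost under the balanced STT preset:
console.log(composite({ quality: 1, latency: 0, cost: 0.5 }, BALANCED.stt)); // 0.6
```

Because every preset's weights sum to 1, the composite stays in [0, 1] and candidates within one modality can be compared directly.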

Selection

For each modality the selector:

  1. Filters to providers with measurements for (language, region). If no region-specific row exists, falls back to region='global' rows.
  2. Applies hard filters from the active routing policy: e.g. STT drops anything above max_ttfp_p50_ms = 3000; all modalities exclude providers with status='warned'.
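  3. Min-max normalizes each axis over the surviving candidate set, inverting lower-is-better metrics such as WER, latency, and cost so that a higher normalized score is always better. Scores are therefore relative to who's still in the running, not to a fixed scale.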
  4. Computes the weighted composite, sorts, and returns the top candidate plus an ordered runnersUp list.

Providers shipping with status='provisional' (scaffolded but not measured) and status='warned' (measured but flagged unsafe to route to) are visible in the admin UI but excluded from selector output.
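Steps 2 to 4, plus the status exclusions, fit in a short function. This is a sketch of the documented behavior, not Speko's implementation: the Candidate shape, the "active" status value, and treating a zero-range axis as a tie at 1 are all assumptions.

```typescript
type Candidate = {
  id: string;           // e.g. "deepgram:nova-3"
  status: string;       // "active" (assumed name), "warned", or "provisional"
  quality: number;      // raw quality metric, e.g. WER for STT
  latencyP50Ms: number; // p50 latency for the resolved region row
  cost: number;         // e.g. $/min
};
type Weights = { quality: number; latency: number; cost: number };

// Step 3: min-max normalize to [0, 1] where 1 is best.
function normalize(values: number[], lowerIsBetter: boolean): number[] {
  const min = Math.min(...values);
  const max = Math.max(...values);
  if (max === min) return values.map(() => 1); // no spread: everyone ties
  return values.map((v) => {
    const t = (v - min) / (max - min);
    return lowerIsBetter ? 1 - t : t;
  });
}

function select(
  candidates: Candidate[],
  w: Weights,
  opts: { qualityLowerIsBetter: boolean; maxLatencyP50Ms?: number },
): { top?: Candidate; runnersUp: Candidate[] } {
  // Step 2: status exclusions plus the policy's hard latency cutoff.
  const alive = candidates.filter(
    (c) =>
      c.status !== "warned" &&
      c.status !== "provisional" &&
      (opts.maxLatencyP50Ms === undefined || c.latencyP50Ms <= opts.maxLatencyP50Ms),
  );
  if (alive.length === 0) return { runnersUp: [] };

  // Step 3: scores are relative to the survivors only.
  const q = normalize(alive.map((c) => c.quality), opts.qualityLowerIsBetter);
  const l = normalize(alive.map((c) => c.latencyP50Ms), true);
  const k = normalize(alive.map((c) => c.cost), true);

  // Step 4: weighted composite, best first.
  const ranked = alive
    .map((c, i) => ({ c, score: q[i] * w.quality + l[i] * w.latency + k[i] * w.cost }))
    .sort((a, b) => b.score - a.score)
    .map((x) => x.c);
  return { top: ranked[0], runnersUp: ranked.slice(1) };
}

// Made-up STT numbers: "c" is warned and "d" misses the 3000 ms TTFP cutoff.
const stt = select(
  [
    { id: "a", status: "active", quality: 0.08, latencyP50Ms: 300, cost: 0.004 },
    { id: "b", status: "active", quality: 0.1, latencyP50Ms: 200, cost: 0.006 },
    { id: "c", status: "warned", quality: 0.05, latencyP50Ms: 150, cost: 0.002 },
    { id: "d", status: "active", quality: 0.06, latencyP50Ms: 3500, cost: 0.003 },
  ],
  { quality: 0.5, latency: 0.3, cost: 0.2 },
  { qualityLowerIsBetter: true, maxLatencyP50Ms: 3000 },
);
console.log(stt.top?.id, stt.runnersUp.map((c) => c.id)); // a [ 'b' ]
```

Note that "c" and "d" would have had the best WER, but neither reaches normalization, so "a" and "b" are scored only against each other.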

Each call also returns scoresRunId, the id of the benchmark snapshot the decision was based on. Record it for audits and bug reproduction.

Failover

The runners-up are the next-best providers for your exact intent. If the primary throws, Speko transparently retries the same request against the next candidate. The response includes failoverCount (how many providers it tried before one succeeded) and provider / model (what actually ran).

If every candidate fails, the call returns ALL_PROVIDERS_FAILED.
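Speko performs failover server-side, so you never write this loop yourself. The documented semantics are still easy to model, for example in a test double. withFailover, callProvider, and AllProvidersFailedError are hypothetical names; only the ALL_PROVIDERS_FAILED code and the meaning of failoverCount come from this page.

```typescript
type RankedProvider = { provider: string; model: string };

// Hypothetical error type carrying the documented failure code.
class AllProvidersFailedError extends Error {
  readonly code = "ALL_PROVIDERS_FAILED";
  readonly causes: unknown[];
  constructor(causes: unknown[]) {
    super("ALL_PROVIDERS_FAILED");
    this.causes = causes;
  }
}

async function withFailover<T>(
  ranked: RankedProvider[], // primary first, then runnersUp in order
  callProvider: (p: RankedProvider) => Promise<T>,
): Promise<{ result: T; provider: string; model: string; failoverCount: number }> {
  const causes: unknown[] = [];
  for (let i = 0; i < ranked.length; i++) {
    const p = ranked[i];
    try {
      const result = await callProvider(p);
      // failoverCount: how many providers were tried before this one succeeded.
      return { result, provider: p.provider, model: p.model, failoverCount: i };
    } catch (err) {
      causes.push(err); // remember why this candidate failed, then move on
    }
  }
  throw new AllProvidersFailedError(causes);
}

// Simulated outage of the primary: the runner-up handles the request.
const demo = withFailover(
  [
    { provider: "primary-vendor", model: "m1" },
    { provider: "backup-vendor", model: "m2" },
  ],
  async (p) => {
    if (p.provider === "primary-vendor") throw new Error("simulated outage");
    return `handled by ${p.provider}`;
  },
);
demo.then((r) => console.log(r.provider, r.failoverCount)); // backup-vendor 1
```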

Constraints

Pin or restrict the candidate pool per modality:

{
  "constraints": {
    "allowedProviders": {
      "stt": ["deepgram"],
      "tts": ["cartesia"],
      "s2s": ["openai"]
    }
  }
}

Speko still ranks candidates by composite score; it simply picks the highest-ranked candidate that appears in your allowlist. Use this to:

  • Pin a provider while debugging.
  • Honor compliance constraints (data residency, BAA coverage).
  • Cap costs by excluding premium providers.

Allowlists are model-aware. Each entry is either a vendor id ("deepgram": any Deepgram model) or "<vendor>:<model>" ("deepgram:nova-3": only Nova-3, with no fallback to other Deepgram models). Failover still runs across every entry in that modality's allowlist. GET /v1/providers/known lists the valid id strings.
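The allowlist rules reduce to a small predicate applied to the already-ranked list. A sketch, assuming candidates are identified by "<vendor>:<model>" strings; isAllowed is hypothetical, and every id except deepgram:nova-3 is a placeholder (GET /v1/providers/known returns the real ones).

```typescript
// Does a ranked candidate pass a modality's allowlist?
function isAllowed(candidateId: string, allowlist?: string[]): boolean {
  if (allowlist === undefined) return true; // no constraint for this modality
  const vendor = candidateId.split(":")[0];
  return allowlist.some((entry) =>
    entry.includes(":")
      ? entry === candidateId // "<vendor>:<model>" pins exactly one model
      : entry === vendor,     // bare vendor id allows any of its models
  );
}

// Ranking order is preserved; the allowlist only removes entries.
const ranked = ["othervendor:example-model", "deepgram:nova-3", "deepgram:example-model"];
console.log(ranked.filter((id) => isAllowed(id, ["deepgram"])));        // both deepgram ids
console.log(ranked.filter((id) => isAllowed(id, ["deepgram:nova-3"]))); // only deepgram:nova-3
```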

Preview before you ship

GET /v1/benchmarks/stack?language=en&region=us-east4&optimize_for=balanced returns the current pick per modality plus runners-up, scoresRunId, and a filtered_out[] list explaining why each excluded candidate was dropped (warned status, missed latency cutoff, missing region data, etc.). No usage is recorded.
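Building the preview URL from an intent is mostly query encoding. API_BASE is a placeholder host and stackPreviewUrl a hypothetical helper; authentication isn't covered on this page, so it's left out.

```typescript
type OptimizeFor = "balanced" | "accuracy" | "latency" | "cost";

const API_BASE = "https://speko.example.invalid"; // placeholder, not a real host

// Map the intent fields onto the preview endpoint's query parameters.
function stackPreviewUrl(language: string, region: string, optimizeFor: OptimizeFor): string {
  const pairs: Array<[string, string]> = [
    ["language", language],
    ["region", region],
    ["optimize_for", optimizeFor],
  ];
  const query = pairs.map(([key, value]) => `${key}=${encodeURIComponent(value)}`).join("&");
  return `${API_BASE}/v1/benchmarks/stack?${query}`;
}

console.log(stackPreviewUrl("en", "us-east4", "balanced"));
// https://speko.example.invalid/v1/benchmarks/stack?language=en&region=us-east4&optimize_for=balanced
```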

Headers on every response

Every /v1/transcribe, /v1/synthesize, and /v1/complete response carries:

  • X-Speko-Provider: the provider that handled the request
  • X-Speko-Model: the specific model that ran
  • X-Speko-Failover-Count: how many providers were tried before one succeeded (the header form of failoverCount)
  • X-Speko-Scores-Run-Id: the benchmark snapshot id (scoresRunId)

Log these. They're how you correlate prod behavior with the routing decision.
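A small helper can lift these headers into a structured log record. HeaderSource matches the get() method of the standard fetch Headers object, so res.headers can be passed in directly; routingLogFields is a hypothetical name.

```typescript
// Anything with a Headers-style get(); fetch's Headers satisfies this.
interface HeaderSource {
  get(name: string): string | null;
}

type RoutingLogFields = {
  provider: string | null;
  model: string | null;
  failoverCount: number | null;
  scoresRunId: string | null;
};

// Collect the four routing headers; missing headers become null.
function routingLogFields(headers: HeaderSource): RoutingLogFields {
  const failover = headers.get("X-Speko-Failover-Count");
  return {
    provider: headers.get("X-Speko-Provider"),
    model: headers.get("X-Speko-Model"),
    failoverCount: failover === null ? null : Number(failover),
    scoresRunId: headers.get("X-Speko-Scores-Run-Id"),
  };
}

// Usage after a fetch call: console.log(JSON.stringify(routingLogFields(res.headers)));
```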
