Ensemble is an inference gateway that coordinates multiple LLM providers and API keys to optimize routing, cache affinity, and cost. Written in Go, it is the single point of contact for all LLM inference requests across the platform.

What It Does

  • Routes requests to the optimal provider endpoint based on cache value, rate limits, and cost
  • Pools multiple API keys per provider, distributing load across capacity pools
  • Caches prompt prefix fingerprints to route follow-up requests to endpoints that already have cached context
  • Streams responses via SSE or WebSocket with unified event blocks
  • Persists completed responses to S3 so disconnected clients never lose paid-for inference
  • Tracks costs, tokens, and performance metrics across all providers
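
The cache-affinity routing described above can be sketched roughly as follows. This is a minimal illustration, not Ensemble's actual implementation: the Endpoint and Router types, the byte-based prefix length, and the "cheapest available endpoint" fallback are all assumptions made for the example. Real routing would also weigh cache value and rate-limit headroom rather than a single boolean.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// Endpoint is a hypothetical provider endpoint (one API key on one provider).
type Endpoint struct {
	Name        string
	CostPerMTok float64 // assumed input price, USD per million tokens
	RateLimited bool    // simplified stand-in for real rate-limit state
}

// Router remembers which endpoint last served a given prompt prefix.
type Router struct {
	endpoints []Endpoint
	prefixLen int      // bytes of the prompt to fingerprint (assumption: bytes, not tokens)
	affinity  sync.Map // fingerprint (string) -> index into endpoints (int)
}

// fingerprint hashes the first prefixLen bytes of the prompt.
func fingerprint(prompt string, prefixLen int) string {
	if len(prompt) > prefixLen {
		prompt = prompt[:prefixLen]
	}
	sum := sha256.Sum256([]byte(prompt))
	return hex.EncodeToString(sum[:8])
}

// Route prefers the endpoint that already saw this prefix; otherwise it
// falls back to the cheapest endpoint that is not rate limited.
func (r *Router) Route(prompt string) (Endpoint, bool) {
	fp := fingerprint(prompt, r.prefixLen)
	if v, ok := r.affinity.Load(fp); ok {
		if ep := r.endpoints[v.(int)]; !ep.RateLimited {
			return ep, true // cache hit: reuse the warm endpoint
		}
	}
	best := -1
	for i, ep := range r.endpoints {
		if ep.RateLimited {
			continue
		}
		if best == -1 || ep.CostPerMTok < r.endpoints[best].CostPerMTok {
			best = i
		}
	}
	if best == -1 {
		return Endpoint{}, false // every endpoint is rate limited
	}
	r.affinity.Store(fp, best) // remember where this prefix is now cached
	return r.endpoints[best], true
}

func main() {
	r := &Router{
		endpoints: []Endpoint{
			{Name: "provider-a/key-1", CostPerMTok: 3.0},
			{Name: "provider-b/key-1", CostPerMTok: 1.0},
		},
		prefixLen: 16,
	}
	first, _ := r.Route("system: you are helpful. user: hi")
	next, _ := r.Route("system: you are helpful. user: and then?")
	fmt.Println(first.Name, next.Name)
}
```

Both requests share the same 16-byte prefix, so the follow-up is sent to whichever endpoint served the first one, even if another endpoint later becomes cheaper.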

Performance

Metric | Value
------ | -----
Peak throughput | 1,789 req/s (single instance)
Routing overhead | 1.9ms at 100 concurrency
Scaling coefficient | 4.3x (1.9ms → 8.2ms from 100 → 1,000 concurrency)
Concurrent streams | 10,000 (HTTP/2 h2c)

Supported Providers

Provider | Models | Cache Support | Multimodal
-------- | ------ | ------------- | ----------
Anthropic | Claude Opus 4.6/4.5/4.1/4.0, Sonnet 4.6/4.5/4.0, Haiku 4.5 | Multi-breakpoint prompt caching (90% discount, 5m–1h TTL, 1024 min tokens) | Images, PDFs
AWS Bedrock | Same Claude models via Bedrock API | Same caching (90% discount) | Images, PDFs
GCP Vertex AI | Same Claude models via Vertex API | Same caching (90% discount) | Images, PDFs
OpenAI | GPT-5/5.1/5.2/5.3/5.4, GPT-5 Pro/Mini/Nano, Codex variants | Prefix caching (90% discount, 24h TTL) | Images, PDFs
Google Gemini | Gemini 2.5 Pro/Flash, 3.0 Pro/Flash, 3.1 Pro/Flash, Veo 3.1 (video) | Implicit ephemeral caching (75% discount, 60m TTL, 2048 min tokens) | Images, PDFs, Audio, Video
xAI | Grok 4, 4.1, 4.20 (beta), Grok Code Fast, Grok 4 Fast | Prefix caching (75% discount, 10m TTL) | Images
Fireworks | Kimi K2/K2.5, MiniMax M2.5, GLM-5, Nemotron 3 Super | Varies by model (50–80% discount) | Images, Video (K2.5 only)
Self-hosted | MiniMax M2.1 (GCP), Qwen3-Coder-Next (SGLang) | Prefix caching (90% discount) | Text only
OpenRouter | 100+ models (Claude, GPT, Gemini, Grok, DeepSeek, Llama, etc.) | Varies by underlying provider (75% discount) | Varies
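
To make the cache discounts in the table concrete, the sketch below estimates the input cost of a request when part of the prompt is served from a provider's cache. The function name, the $3.00 per million token price, and the token counts are illustrative assumptions, not Ensemble's pricing data. The sketch also ignores cache-write surcharges and minimum-token thresholds, which some providers apply.

```go
package main

import "fmt"

// effectiveInputCost returns the input cost in USD when cachedTokens out of
// totalTokens are billed at a discount. cacheDiscount is a fraction, so a
// "90% discount" is 0.90. pricePerMTok is USD per million input tokens.
func effectiveInputCost(totalTokens, cachedTokens int, pricePerMTok, cacheDiscount float64) float64 {
	if cachedTokens > totalTokens {
		cachedTokens = totalTokens // cannot cache more than was sent
	}
	uncached := float64(totalTokens - cachedTokens)
	cached := float64(cachedTokens) * (1 - cacheDiscount)
	return (uncached + cached) * pricePerMTok / 1e6
}

func main() {
	const price = 3.00 // hypothetical USD per million input tokens
	for _, d := range []float64{0.0, 0.75, 0.90} {
		cost := effectiveInputCost(100_000, 90_000, price, d)
		fmt.Printf("discount %.0f%%: $%.4f\n", d*100, cost)
	}
}
```

With these illustrative numbers (100,000 input tokens, 90,000 of them cached), the input cost falls from $0.30 uncached to about $0.0975 at a 75% discount and about $0.057 at a 90% discount, which is why routing a follow-up request to an endpoint that already holds the prefix matters.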

Key Design Decisions

  • SQLite + Redis rather than Postgres: SQLite for embedded local state, Redis only for cross-instance rate limit coordination
  • Lock-free hot path: sync.Map and atomic operations for request-path data structures — no mutex contention
  • Async batched logging: 65,536 message ring buffer, non-blocking writes
  • Configuration-driven: All provider configs, model pricing, routing thresholds, and timeouts in YAML
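
The lock-free hot path and non-blocking logging decisions can be illustrated with a small sketch. Everything here is an assumption made for the example: the Stats and AsyncLogger types are hypothetical, and a buffered channel stands in for the ring buffer (Go channels use an internal lock, so this shows the non-blocking write behavior, not a truly lock-free queue). A production logger would also flush on a timer rather than only when a batch fills.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Stats keeps per-endpoint request counters without a mutex.
type Stats struct {
	counters sync.Map // endpoint name (string) -> *atomic.Int64
}

// Inc atomically increments the counter for endpoint and returns the new value.
func (s *Stats) Inc(endpoint string) int64 {
	v, ok := s.counters.Load(endpoint)
	if !ok {
		v, _ = s.counters.LoadOrStore(endpoint, new(atomic.Int64))
	}
	return v.(*atomic.Int64).Add(1)
}

// AsyncLogger queues messages and hands them to a sink in batches.
type AsyncLogger struct {
	queue   chan string
	dropped atomic.Int64
	done    chan struct{}
}

// NewAsyncLogger starts a background goroutine that drains the queue.
// The sink must not retain the batch slice; it is reused between calls.
func NewAsyncLogger(size, batch int, sink func([]string)) *AsyncLogger {
	l := &AsyncLogger{queue: make(chan string, size), done: make(chan struct{})}
	go func() {
		defer close(l.done)
		buf := make([]string, 0, batch)
		for msg := range l.queue {
			buf = append(buf, msg)
			if len(buf) == batch {
				sink(buf)
				buf = buf[:0]
			}
		}
		if len(buf) > 0 {
			sink(buf) // flush the final partial batch on Close
		}
	}()
	return l
}

// Log never blocks the caller: if the queue is full the message is dropped
// and counted, and Log reports false.
func (l *AsyncLogger) Log(msg string) bool {
	select {
	case l.queue <- msg:
		return true
	default:
		l.dropped.Add(1)
		return false
	}
}

// Close stops accepting messages and waits for the final flush.
func (l *AsyncLogger) Close() {
	close(l.queue)
	<-l.done
}

func main() {
	var stats Stats
	stats.Inc("provider-a/key-1")
	stats.Inc("provider-a/key-1")

	logger := NewAsyncLogger(65536, 256, func(batch []string) {
		fmt.Printf("flushed %d messages\n", len(batch))
	})
	logger.Log("request routed")
	logger.Close()
}
```

The design choice illustrated is that the request path only performs an atomic add and a non-blocking enqueue; a full queue costs a dropped log line rather than added request latency.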