Ensemble Overview - iGent Concert

Ensemble is an intelligent inference gateway that coordinates multiple LLM providers and API keys to optimize routing, cache affinity, and cost. Written in Go, it is the single point of contact for all LLM inference requests across the platform.

What It Does

Routes requests to the optimal provider endpoint based on cache value, rate limits, and cost
Pools multiple API keys per provider, distributing load across capacity pools
Caches prompt prefix fingerprints to route follow-up requests to endpoints that already have cached context
Streams responses via SSE or WebSocket with unified event blocks
Persists completed responses to S3 so disconnected clients never lose paid-for inference
Tracks costs, tokens, and performance metrics across all providers
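
The cache-affinity routing described above can be illustrated with a minimal sketch. It assumes a fingerprint is a SHA-256 digest over the first bytes of the prompt and that the gateway remembers which endpoint last served each fingerprint. The names `Fingerprint`, `AffinityTable`, `prefixLen`, and the endpoint labels are hypothetical and are not taken from Ensemble's code; a real gateway would more likely fingerprint at message or token boundaries.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// prefixLen is a hypothetical cutoff: only the first prefixLen bytes of the
// prompt contribute to the fingerprint.
const prefixLen = 64

// Fingerprint returns a hex SHA-256 digest of the prompt's leading bytes.
func Fingerprint(prompt string) string {
	p := prompt
	if len(p) > prefixLen {
		p = p[:prefixLen]
	}
	sum := sha256.Sum256([]byte(p))
	return hex.EncodeToString(sum[:])
}

// AffinityTable remembers which endpoint last served each prefix fingerprint.
type AffinityTable struct {
	m sync.Map // fingerprint (string) -> endpoint name (string)
}

// Record notes that endpoint now holds cached context for this prompt's prefix.
func (t *AffinityTable) Record(prompt, endpoint string) {
	t.m.Store(Fingerprint(prompt), endpoint)
}

// Route returns the endpoint recorded for the prompt's prefix, or fallback
// when no endpoint has seen that prefix yet.
func (t *AffinityTable) Route(prompt, fallback string) string {
	if v, ok := t.m.Load(Fingerprint(prompt)); ok {
		return v.(string)
	}
	return fallback
}

func main() {
	var tbl AffinityTable
	// A shared system prompt longer than prefixLen bytes (illustrative text).
	system := "You are a helpful assistant for the billing team. Answer briefly and cite policy."
	tbl.Record(system+" Q1: refund?", "anthropic-key-2")

	// The follow-up shares the prefix, so it is routed to the warm endpoint.
	fmt.Println(tbl.Route(system+" Q2: invoice?", "anthropic-key-1"))
	// An unrelated prompt has no recorded affinity and uses the fallback.
	fmt.Println(tbl.Route("Unrelated prompt", "anthropic-key-1"))
}
```

In this sketch the table only answers "who has this prefix cached"; weighing that against rate limits and cost, as the first item in the list describes, would be a separate scoring step.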

Performance

Metric	Value
Peak throughput	1,789 req/s (single instance)
Routing overhead	1.9ms at 100 concurrency
Scaling coefficient	4.3x routing overhead for 10x concurrency (1.9ms → 8.2ms from 100 → 1,000 concurrent requests)
Concurrent streams	10,000 (HTTP/2 h2c)

Supported Providers

Provider	Models	Cache Support	Multimodal
Anthropic	Claude Opus 4.6/4.5/4.1/4.0, Sonnet 4.6/4.5/4.0, Haiku 4.5	Multi-breakpoint prompt caching (90% discount, 5m–1h TTL, 1024 min tokens)	Images, PDFs
AWS Bedrock	Same Claude models via Bedrock API	Same caching (90% discount)	Images, PDFs
GCP Vertex AI	Same Claude models via Vertex API	Same caching (90% discount)	Images, PDFs
OpenAI	GPT-5/5.1/5.2/5.3/5.4, GPT-5 Pro/Mini/Nano, Codex variants	Prefix caching (90% discount, 24h TTL)	Images, PDFs
Google Gemini	Gemini 2.5 Pro/Flash, 3.0 Pro/Flash, 3.1 Pro/Flash, Veo 3.1 (video)	Implicit ephemeral caching (75% discount, 60m TTL, 2048 min tokens)	Images, PDFs, Audio, Video
xAI	Grok 4, 4.1, 4.20 (beta), Grok Code Fast, Grok 4 Fast	Prefix caching (75% discount, 10m TTL)	Images
Fireworks	Kimi K2/K2.5, MiniMax M2.5, GLM-5, Nemotron 3 Super	Varies by model (50–80% discount)	Images, Video (K2.5 only)
Self-hosted	MiniMax M2.1 (GCP), Qwen3-Coder-Next (SGLang)	Prefix caching (90% discount)	Text only
OpenRouter	100+ models (Claude, GPT, Gemini, Grok, DeepSeek, Llama, etc.)	Varies by underlying provider (75% discount)	Varies
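
To make the discount column concrete, here is a rough sketch of how a cached-read discount changes the input cost of a request. It assumes the discount applies only to the tokens served from cache and ignores any cache-write surcharge or minimum-token threshold a provider may apply. The function name `EffectiveInputCost` and the $3.00 per million tokens price are illustrative assumptions, not figures from Ensemble's configuration.

```go
package main

import "fmt"

// EffectiveInputCost returns the input cost in USD for one request.
// pricePerMTok is the list price per million input tokens.
// discount is the cached-read discount as a fraction (0.90 for "90% discount").
func EffectiveInputCost(totalTokens, cachedTokens int, pricePerMTok, discount float64) float64 {
	if cachedTokens > totalTokens {
		cachedTokens = totalTokens // cannot read more from cache than was sent
	}
	uncached := float64(totalTokens - cachedTokens)
	cached := float64(cachedTokens) * (1 - discount)
	return (uncached + cached) * pricePerMTok / 1_000_000
}

func main() {
	const price = 3.00 // hypothetical USD per million input tokens

	// Same 100k-token prompt: once with a cold cache, once with 90k tokens cached.
	cold := EffectiveInputCost(100_000, 0, price, 0.90)
	warm := EffectiveInputCost(100_000, 90_000, price, 0.90)
	fmt.Printf("cold=$%.4f warm=$%.4f\n", cold, warm)
}
```

Under these assumptions the warm request costs roughly a fifth of the cold one, which is why routing follow-up requests to an endpoint with cached context matters for cost.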

Key Design Decisions

SQLite + Redis rather than Postgres: SQLite for embedded local state, Redis only for cross-instance rate limit coordination
Lock-free hot path: sync.Map and atomic operations for request-path data structures, avoiding mutex contention
Async batched logging: 65,536 message ring buffer, non-blocking writes
Configuration-driven: All provider configs, model pricing, routing thresholds, and timeouts in YAML
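
Two of these decisions lend themselves to a short sketch: per-key counters built on `sync.Map` and atomics, and a logger whose writes never block the request path. This is only an illustration under stated assumptions. A buffered channel stands in for the ring buffer, and dropping messages when the buffer is full is an assumed policy that the source does not specify. The names `Stats`, `AsyncLogger`, `Log`, and `Drain` are hypothetical. `atomic.Int64` requires Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Stats keeps per-key request counters without an explicit mutex in caller code.
type Stats struct {
	counters sync.Map // key name (string) -> *atomic.Int64
}

// Inc bumps the counter for key and returns the new value.
func (s *Stats) Inc(key string) int64 {
	v, _ := s.counters.LoadOrStore(key, new(atomic.Int64))
	return v.(*atomic.Int64).Add(1)
}

// AsyncLogger queues messages in a bounded buffer. When the buffer is full,
// the message is dropped (assumed policy) instead of blocking the caller.
type AsyncLogger struct {
	ch      chan string
	dropped atomic.Int64
}

// NewAsyncLogger creates a logger that can hold up to size pending messages.
func NewAsyncLogger(size int) *AsyncLogger {
	return &AsyncLogger{ch: make(chan string, size)}
}

// Log never blocks: it reports false when the buffer is full.
func (l *AsyncLogger) Log(msg string) bool {
	select {
	case l.ch <- msg:
		return true
	default:
		l.dropped.Add(1)
		return false
	}
}

// Drain collects up to max queued messages, as a batching writer might.
func (l *AsyncLogger) Drain(max int) []string {
	batch := make([]string, 0, max)
	for len(batch) < max {
		select {
		case m := <-l.ch:
			batch = append(batch, m)
		default:
			return batch // queue is empty
		}
	}
	return batch
}

func main() {
	var s Stats
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.Inc("openai-key-1") // hypothetical key label
		}()
	}
	wg.Wait()
	fmt.Println("requests counted:", s.Inc("openai-key-1")-1)

	logger := NewAsyncLogger(65536) // buffer size taken from the list above
	logger.Log("request routed")
	fmt.Println("batched messages:", len(logger.Drain(100)))
}
```

A production version would likely run `Drain` in a background goroutine on a timer, and would avoid allocating a fresh counter on every `Inc` call by trying `Load` before `LoadOrStore`.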
