What It Does
- Routes requests to the optimal provider endpoint based on cache value, rate limits, and cost
- Pools multiple API keys per provider, distributing load across capacity pools
- Caches prompt prefix fingerprints to route follow-up requests to endpoints that already have cached context
- Streams responses via SSE or WebSocket with unified event blocks
- Persists completed responses to S3 so disconnected clients never lose paid-for inference
- Tracks costs, tokens, and performance metrics across all providers
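The prefix-fingerprint routing described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: `Fingerprint`, `PrefixRouter`, `Route`, and the endpoint names are hypothetical, and hashing a fixed number of leading bytes (rather than tokens) is an assumption made to keep the example short.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// Fingerprint hashes the first prefixLen bytes of a prompt.
// Hashing bytes rather than tokens is a simplification for this sketch.
func Fingerprint(prompt string, prefixLen int) string {
	if len(prompt) > prefixLen {
		prompt = prompt[:prefixLen]
	}
	sum := sha256.Sum256([]byte(prompt))
	return hex.EncodeToString(sum[:])
}

// PrefixRouter remembers which endpoint first served a given prefix.
type PrefixRouter struct {
	prefixLen int
	seen      sync.Map // fingerprint (string) -> endpoint name (string)
}

// Route returns the endpoint already associated with the prompt's prefix.
// If the prefix is new, it records and returns the fallback endpoint.
func (r *PrefixRouter) Route(prompt, fallback string) (endpoint string, cacheHit bool) {
	fp := Fingerprint(prompt, r.prefixLen)
	actual, loaded := r.seen.LoadOrStore(fp, fallback)
	return actual.(string), loaded
}

func main() {
	r := &PrefixRouter{prefixLen: 32}
	system := "You are a helpful assistant for the billing team. "

	// First request: prefix is unknown, so the fallback endpoint is used.
	ep, hit := r.Route(system+"Question one", "anthropic-key-a")
	fmt.Println(ep, hit)

	// Follow-up shares the same 32-byte prefix, so it goes to the same endpoint.
	ep, hit = r.Route(system+"Question two", "anthropic-key-b")
	fmt.Println(ep, hit)
}
```

A real router would presumably also expire entries in line with each provider's cache TTL and weigh the cache hit against rate limits and cost; none of that is modeled here.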
Performance
| Metric | Value |
|---|---|
| Peak throughput | 1,789 req/s (single instance) |
| Routing overhead | 1.9ms at 100 concurrency |
| Scaling coefficient | 4.3x overhead growth for 10x concurrency (1.9ms → 8.2ms, 100 → 1,000) |
| Concurrent streams | 10,000 (HTTP/2 h2c) |
Supported Providers
| Provider | Models | Cache Support | Multimodal |
|---|---|---|---|
| Anthropic | Claude Opus 4.6/4.5/4.1/4.0, Sonnet 4.6/4.5/4.0, Haiku 4.5 | Multi-breakpoint prompt caching (90% discount, 5m–1h TTL, 1024 min tokens) | Images, PDFs |
| AWS Bedrock | Same Claude models via Bedrock API | Same caching (90% discount) | Images, PDFs |
| GCP Vertex AI | Same Claude models via Vertex API | Same caching (90% discount) | Images, PDFs |
| OpenAI | GPT-5/5.1/5.2/5.3/5.4, GPT-5 Pro/Mini/Nano, Codex variants | Prefix caching (90% discount, 24h TTL) | Images, PDFs |
| Google Gemini | Gemini 2.5 Pro/Flash, 3.0 Pro/Flash, 3.1 Pro/Flash, Veo 3.1 (video) | Implicit ephemeral caching (75% discount, 60m TTL, 2048 min tokens) | Images, PDFs, Audio, Video |
| xAI | Grok 4, 4.1, 4.20 (beta), Grok Code Fast, Grok 4 Fast | Prefix caching (75% discount, 10m TTL) | Images |
| Fireworks | Kimi K2/K2.5, MiniMax M2.5, GLM-5, Nemotron 3 Super | Varies by model (50–80% discount) | Images, Video (K2.5 only) |
| Self-hosted | MiniMax M2.1 (GCP), Qwen3-Coder-Next (SGLang) | Prefix caching (90% discount) | Text only |
| OpenRouter | 100+ models (Claude, GPT, Gemini, Grok, DeepSeek, Llama, etc.) | Varies by underlying provider (75% discount) | Varies |
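To show how the cache discounts in the table translate into cost, here is a small worked sketch. `EffectiveInputCost` is a hypothetical helper, and the $3 per million tokens price in the example is a made-up figure for illustration only. The sketch also ignores any cache-write surcharge a provider may apply.

```go
package main

import "fmt"

// EffectiveInputCost returns the input cost in USD for one request.
// pricePerMTok is the list price per million input tokens (caller-supplied).
// discount is the fraction taken off cached tokens, e.g. 0.90 for 90%.
func EffectiveInputCost(totalTokens, cachedTokens int, pricePerMTok, discount float64) float64 {
	// Clamp cachedTokens into [0, totalTokens] so bad input cannot go negative.
	if cachedTokens > totalTokens {
		cachedTokens = totalTokens
	}
	if cachedTokens < 0 {
		cachedTokens = 0
	}
	uncached := float64(totalTokens - cachedTokens)
	cached := float64(cachedTokens) * (1 - discount)
	return (uncached + cached) * pricePerMTok / 1e6
}

func main() {
	// 10,000 input tokens, 8,000 of them served from cache, hypothetical $3/MTok.
	fmt.Printf("90%% discount: $%.4f\n", EffectiveInputCost(10000, 8000, 3.0, 0.90))
	fmt.Printf("75%% discount: $%.4f\n", EffectiveInputCost(10000, 8000, 3.0, 0.75))
	fmt.Printf("no cache hit: $%.4f\n", EffectiveInputCost(10000, 0, 3.0, 0.90))
}
```

With these inputs the 90% case bills the equivalent of 2,800 full-price tokens (2,000 uncached plus 8,000 × 0.10), versus 4,000 at a 75% discount and 10,000 with no cache hit. That gap is what makes routing a follow-up request to an endpoint with a warm cache worthwhile.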
Key Design Decisions
- SQLite + Redis rather than Postgres: SQLite for embedded local state, Redis only for cross-instance rate limit coordination
- Lock-free hot path: `sync.Map` and atomic operations for request-path data structures, with no mutex contention
- Async batched logging: 65,536-message ring buffer, non-blocking writes
- Configuration-driven: All provider configs, model pricing, routing thresholds, and timeouts in YAML
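The lock-free hot path and non-blocking logging decisions above might look something like the sketch below. It is an illustration under assumptions, not the project's code: `Counters`, `AsyncLog`, and their methods are hypothetical names, and a buffered channel stands in for the ring buffer (it gives the same "never block the caller" behaviour but drops new messages when full, which may differ from the real policy). It needs Go 1.19 or later for `atomic.Int64`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Counters keeps one atomic counter per endpoint without a mutex.
type Counters struct {
	m sync.Map // endpoint name (string) -> *atomic.Int64
}

// Inc atomically increments the endpoint's counter and returns the new value.
func (c *Counters) Inc(endpoint string) int64 {
	v, ok := c.m.Load(endpoint)
	if !ok {
		// LoadOrStore guarantees a single winner if several goroutines race here.
		v, _ = c.m.LoadOrStore(endpoint, new(atomic.Int64))
	}
	return v.(*atomic.Int64).Add(1)
}

// AsyncLog buffers messages for a background writer and never blocks the caller.
type AsyncLog struct {
	buf     chan string
	dropped atomic.Int64
}

func NewAsyncLog(capacity int) *AsyncLog {
	return &AsyncLog{buf: make(chan string, capacity)}
}

// Write enqueues msg, or counts it as dropped when the buffer is full.
func (l *AsyncLog) Write(msg string) bool {
	select {
	case l.buf <- msg:
		return true
	default:
		l.dropped.Add(1)
		return false
	}
}

// Drain removes up to limit buffered messages as one batch without blocking.
func (l *AsyncLog) Drain(limit int) []string {
	batch := make([]string, 0, limit)
	for len(batch) < limit {
		select {
		case m := <-l.buf:
			batch = append(batch, m)
		default:
			return batch
		}
	}
	return batch
}

func main() {
	var c Counters
	log := NewAsyncLog(65536) // capacity taken from the bullet above

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			n := c.Inc("openai-key-1")
			log.Write(fmt.Sprintf("request %d routed", n))
		}()
	}
	wg.Wait()

	batch := log.Drain(1000)
	fmt.Println("batched messages:", len(batch), "dropped:", log.dropped.Load())
}
```

In a long-running service, a background goroutine would call something like `Drain` on a timer and write each batch in one operation, so the request path only ever pays for a channel send.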