## How It Works

### CRC-Based Prefix Fingerprinting
Instead of comparing full message content (expensive at 150k-800k token contexts), Ensemble computes lightweight CRC fingerprints of message prefixes:

- On each request: Compute the CRC of the conversation prefix (system prompt + first N messages)
- Session affinity: Route follow-up messages in the same session to the same endpoint
- Cross-session sharing: Detect when different sessions share the same system prompt prefix
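The fingerprinting step can be sketched as follows. This is a hypothetical illustration, assuming a CRC32 over a simple serialization of the system prompt plus the first N messages; Ensemble's actual CRC width and serialization may differ, but any cheap, stable hash over the prefix serves the same purpose.

```python
import zlib

def prefix_fingerprint(system_prompt: str, messages: list[dict], n: int = 4) -> int:
    """Fingerprint the conversation prefix (system prompt + first n messages).

    Hypothetical sketch: folds each message's role and content into a
    running CRC32 instead of hashing the full conversation body.
    """
    crc = zlib.crc32(system_prompt.encode("utf-8"))
    for msg in messages[:n]:
        crc = zlib.crc32(f"{msg['role']}:{msg['content']}".encode("utf-8"), crc)
    return crc
```

Because only the prefix is hashed, two sessions that share a system prompt and opening messages yield the same fingerprint, which is what makes cross-session sharing detectable without comparing full contexts.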
### Cache Value Estimation
For each routing decision, Ensemble estimates the cache value. The `cache_savings_per_token` varies by provider (from config):
- Anthropic: 90% discount — cached reads at $3.00/MTok input (multi-breakpoint, 5m–1h TTL, 1024-token minimum)
- OpenAI: 90% discount — cached reads at $1.25/MTok input (prefix-based, 24h TTL, 1024-token minimum)
- Gemini: 75% discount — cached reads at $1.25/MTok input (implicit ephemeral, 60m TTL, 2048-token minimum)
- xAI: 75% discount — cached reads at $3.00/MTok input (prefix-based, 10m TTL)
- Fireworks: 50–90% discount depending on model
- OpenRouter: 75% discount (varies by underlying provider)
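The estimate itself reduces to multiplying the cached prefix length by the provider's per-token savings. The sketch below is an assumption-laden illustration: it interprets the listed $/MTok figure as the base input rate and the discount as the fraction saved on cached reads, and the `SAVINGS_PER_MTOK` table is hypothetical, standing in for values Ensemble loads from config.

```python
# Hypothetical savings rates ($/MTok saved on cached reads), derived from the
# discounts listed above; in Ensemble these come from provider config.
SAVINGS_PER_MTOK = {
    "anthropic": 3.00 * 0.90,  # 90% off the $3.00/MTok input rate
    "openai":    1.25 * 0.90,
    "gemini":    1.25 * 0.75,
    "xai":       3.00 * 0.75,
}

def estimate_cache_value(provider: str, cached_prefix_tokens: int) -> float:
    """Estimated dollars saved per request by routing to this provider's warm cache."""
    rate = SAVINGS_PER_MTOK.get(provider, 0.0)
    return cached_prefix_tokens * rate / 1_000_000
```

At a 100k-token cached prefix on Anthropic this comes to about $0.27 per request, which is why large contexts land in the high-affinity band below.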
### Routing Thresholds
| Cache Value | Routing Decision |
|---|---|
| High (>$0.25) | Strong affinity — route to cached endpoint even if busier |
| Medium ($0.05–$0.25) | Moderate affinity — prefer cached endpoint if available |
| Low (<$0.05) | Cost/load balance — route to least-utilized endpoint |
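The table above maps directly onto a small decision function. A minimal sketch, assuming the two dollar thresholds are the only inputs (the `Affinity` enum and its names are illustrative, not Ensemble's actual types):

```python
from enum import Enum

class Affinity(Enum):
    STRONG = "strong"      # route to cached endpoint even if busier
    MODERATE = "moderate"  # prefer cached endpoint if available
    BALANCE = "balance"    # ignore cache; route to least-utilized endpoint

# Thresholds from the routing table (dollars of estimated cache value).
HIGH_THRESHOLD = 0.25
LOW_THRESHOLD = 0.05

def routing_decision(cache_value: float) -> Affinity:
    if cache_value > HIGH_THRESHOLD:
        return Affinity.STRONG
    if cache_value >= LOW_THRESHOLD:
        return Affinity.MODERATE
    return Affinity.BALANCE
```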
### Multi-Endpoint Cache Sharing
Capacity pools (multiple endpoints for the same provider) share cached content for failover. When endpoint A is rate-limited, endpoint B in the same pool often has overlapping cache from shared system prompts.

### Session Affinity
The `X-Session-ID` header controls cache affinity:
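As a rough illustration of attaching the header, a client might do the following (the gateway URL and session ID value here are hypothetical; only the `X-Session-ID` header name comes from the docs above):

```python
import urllib.request

# Hypothetical Ensemble gateway URL; X-Session-ID pins this session's
# follow-up requests to the same endpoint's warm cache.
req = urllib.request.Request(
    "https://ensemble.example/v1/chat/completions",
    headers={"X-Session-ID": "sess-42"},
)
```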