Ensemble’s cache optimization is the primary driver of cost savings. Anthropic’s prompt caching offers 90% cost reduction on cached tokens — routing a request to an endpoint that already has your prompt prefix cached saves real money.

How It Works

CRC-Based Prefix Fingerprinting

Instead of comparing full message content (expensive at 150k-800k token contexts), Ensemble computes lightweight CRC fingerprints of message prefixes:
  1. On each request: Compute CRC of the conversation prefix (system prompt + first N messages)
  2. Session affinity: Route follow-up messages in the same session to the same endpoint
  3. Cross-session sharing: Detect when different sessions share the same system prompt prefix
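The fingerprinting step above can be sketched in a few lines of Python. This is a minimal illustration using `zlib.crc32`; the exact CRC variant, prefix boundaries, and message serialization are Ensemble internals not specified here:

```python
import zlib

def prefix_fingerprint(system_prompt: str, messages: list[dict], n: int = 2) -> int:
    """CRC fingerprint of the conversation prefix: system prompt + first n messages."""
    crc = zlib.crc32(system_prompt.encode("utf-8"))
    for msg in messages[:n]:
        # Chain the CRC across each prefix message -- no need to buffer
        # or compare full 150k-800k token contexts.
        crc = zlib.crc32(f"{msg['role']}:{msg['content']}".encode("utf-8"), crc)
    return crc
```

Two sessions that share the same system prompt and opening messages produce the same fingerprint, which is what makes cross-session cache sharing detectable without content comparison.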

Cache Value Estimation

For each routing decision, Ensemble estimates the cache value:
cache_value = cached_tokens × cache_savings_per_token × confidence
Where cache_savings_per_token varies by provider (from config):
  • Anthropic: 90% discount — cached reads at $0.30/MTok vs $3.00/MTok input (multi-breakpoint, 5m–1h TTL, 1024 min tokens)
  • OpenAI: 90% discount — cached reads at $0.125/MTok vs $1.25/MTok input (prefix-based, 24h TTL, 1024 min tokens)
  • Gemini: 75% discount — cached reads at $0.125/MTok vs $1.25/MTok input (implicit ephemeral, 60m TTL, 2048 min tokens)
  • xAI: 75% discount — cached reads at $0.75/MTok vs $3.00/MTok input (prefix-based, 10m TTL)
  • Fireworks: 50–90% discount depending on model
  • OpenRouter: 75% discount (varies by underlying provider)
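Plugging the per-provider numbers into the formula looks roughly like this. The savings table below is derived from the discounts listed above for illustration; in practice the values come from Ensemble's config:

```python
# Per-provider savings in $/token on a cached read (illustrative values,
# computed from the list prices above; real values come from config).
CACHE_SAVINGS_PER_TOKEN = {
    "anthropic": (3.00 - 0.30) / 1_000_000,   # 90% discount
    "openai":    (1.25 - 0.125) / 1_000_000,  # 90% discount
    "xai":       (3.00 - 0.75) / 1_000_000,   # 75% discount
}

def cache_value(cached_tokens: int, provider: str, confidence: float) -> float:
    """cache_value = cached_tokens x cache_savings_per_token x confidence"""
    return cached_tokens * CACHE_SAVINGS_PER_TOKEN[provider] * confidence
```

For example, a 100k-token cached prefix on Anthropic at 0.9 confidence is worth about $0.24 per request, just under the default high-value threshold.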

Routing Thresholds

| Cache Value | Routing Decision |
|---|---|
| High (>$0.25) | Strong affinity — route to cached endpoint even if busier |
| Medium ($0.05–$0.25) | Moderate affinity — prefer cached endpoint if available |
| Low (<$0.05) | Cost/load balance — route to least-utilized endpoint |
These thresholds are configurable:
cache:
  high_value_threshold: "0.25"
  low_value_threshold: "0.05"
  stickiness_factor: 0.8
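The threshold logic maps directly to a small decision function; a sketch assuming the default thresholds above (the mode names here are descriptive, not Ensemble's internal identifiers):

```python
def routing_mode(value: float, high: float = 0.25, low: float = 0.05) -> str:
    """Map an estimated cache value to a routing decision per the table above."""
    if value > high:
        return "strong-affinity"    # route to cached endpoint even if busier
    if value >= low:
        return "moderate-affinity"  # prefer cached endpoint if available
    return "load-balance"           # route to least-utilized endpoint
```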

Multi-Endpoint Cache Sharing

Capacity pools (multiple endpoints for the same provider) share cached content for failover. When endpoint A is rate-limited, endpoint B in the same pool often has overlapping cache from shared system prompts.
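A failover pick within a pool can be sketched as follows. The `Endpoint` fields are illustrative, not Ensemble's actual data model:

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    rate_limited: bool
    has_prefix: bool  # endpoint has this prefix fingerprint cached

def pick_from_pool(pool: list[Endpoint]) -> Endpoint:
    """Prefer an available pool member with overlapping cache, else any available one."""
    available = [e for e in pool if not e.rate_limited]
    cached = [e for e in available if e.has_prefix]
    return (cached or available)[0]
```

Because pool siblings often serve the same system prompts, the fallback endpoint frequently still gets partial cache hits.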

Session Affinity

The X-Session-ID header controls cache affinity:
# First request creates cache on endpoint A
curl -H "X-Session-ID: sess_123" -d '{"model":"claude-sonnet-4",...}'

# Follow-up routes to endpoint A (cache hit)
curl -H "X-Session-ID: sess_123" -d '{"model":"claude-sonnet-4",...}'
Affinity is weighted, not absolute — rate limits and errors can override the cache preference.
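One way to picture the weighted trade-off is a per-endpoint score where the cached endpoint earns a stickiness bonus that load and errors can still outweigh. The scoring formula here is illustrative, with only `stickiness_factor` taken from the config above:

```python
def endpoint_score(utilization: float, error_rate: float,
                   has_cache: bool, stickiness_factor: float = 0.8) -> float:
    """Higher is better; affinity is a bonus term, not a hard override."""
    score = 1.0 - utilization - error_rate
    if has_cache:
        score += stickiness_factor  # cache preference, weighted not absolute
    return score
```

Under this sketch, a cached endpoint at 60% utilization still outscores an idle cold one, but a cached endpoint that is erroring heavily does not.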