Ensemble’s cache optimization is the primary driver of cost savings. Anthropic’s prompt caching offers 90% cost reduction on cached tokens — routing a request to an endpoint that already has your prompt prefix cached saves real money.

How It Works

CRC-Based Prefix Fingerprinting

Instead of comparing full message content (expensive at 150k-800k token contexts), Ensemble computes lightweight CRC fingerprints of message prefixes:
  1. On each request: Compute CRC of the conversation prefix (system prompt + first N messages)
  2. Session affinity: Route follow-up messages in the same session to the same endpoint
  3. Cross-session sharing: Detect when different sessions share the same system prompt prefix
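The fingerprinting step above can be sketched in a few lines of Python. This is a minimal illustration using `zlib.crc32`; the exact CRC variant, prefix boundaries, and message serialization are Ensemble internals not specified here:

```python
import zlib

def prefix_fingerprint(system_prompt: str, messages: list[dict], n: int = 2) -> int:
    """CRC fingerprint of the conversation prefix: system prompt + first n messages."""
    crc = zlib.crc32(system_prompt.encode("utf-8"))
    for msg in messages[:n]:
        # Chain the CRC across each prefix message -- no need to buffer
        # or compare full 150k-800k token contexts.
        crc = zlib.crc32(f"{msg['role']}:{msg['content']}".encode("utf-8"), crc)
    return crc
```

Two sessions that share the same system prompt and opening messages produce the same fingerprint, which is what makes cross-session cache sharing detectable without content comparison.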

Cache Value Estimation

For each routing decision, Ensemble estimates the cache value:
cache_value = cached_tokens × cache_savings_per_token × confidence
Where cache_savings_per_token varies by provider (from config):
  • Anthropic: 90% discount — cached reads at $0.30/MTok vs $3.00/MTok input (multi-breakpoint, 5m–1h TTL, 1024 min tokens)
  • OpenAI: 90% discount — cached reads at $0.125/MTok vs $1.25/MTok input (prefix-based, 24h TTL, 1024 min tokens)
  • Gemini: 75% discount — cached reads at $0.125/MTok vs $1.25/MTok input (implicit ephemeral, 60m TTL, 2048 min tokens)
  • xAI: 75% discount — cached reads at $0.75/MTok vs $3.00/MTok input (prefix-based, 10m TTL)
  • Fireworks: 50–90% discount depending on model
  • OpenRouter: 75% discount (varies by underlying provider)
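Plugging the per-provider numbers into the formula looks roughly like this. The savings table below is derived from the discounts listed above for illustration; in practice the values come from Ensemble's config:

```python
# Per-provider savings in $/token on a cached read (illustrative values,
# computed from the list prices above; real values come from config).
CACHE_SAVINGS_PER_TOKEN = {
    "anthropic": (3.00 - 0.30) / 1_000_000,   # 90% discount
    "openai":    (1.25 - 0.125) / 1_000_000,  # 90% discount
    "xai":       (3.00 - 0.75) / 1_000_000,   # 75% discount
}

def cache_value(cached_tokens: int, provider: str, confidence: float) -> float:
    """cache_value = cached_tokens x cache_savings_per_token x confidence"""
    return cached_tokens * CACHE_SAVINGS_PER_TOKEN[provider] * confidence
```

For example, a 100k-token cached prefix on Anthropic at 0.9 confidence is worth about $0.24 per request, just under the default high-value threshold.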

Routing Thresholds

| Cache Value | Routing Decision |
|---|---|
| High (>$0.25) | Strong affinity — route to cached endpoint even if busier |
| Medium ($0.05–$0.25) | Moderate affinity — prefer cached endpoint if available |
| Low (<$0.05) | Cost/load balance — route to least-utilized endpoint |
These thresholds are configurable:
cache:
  high_value_threshold: "0.25"
  low_value_threshold: "0.05"
  stickiness_factor: 0.8
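The threshold logic maps directly to a small decision function; a sketch assuming the default thresholds above (the mode names here are descriptive, not Ensemble's internal identifiers):

```python
def routing_mode(value: float, high: float = 0.25, low: float = 0.05) -> str:
    """Map an estimated cache value to a routing decision per the table above."""
    if value > high:
        return "strong-affinity"    # route to cached endpoint even if busier
    if value >= low:
        return "moderate-affinity"  # prefer cached endpoint if available
    return "load-balance"           # route to least-utilized endpoint
```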

Multi-Endpoint Cache Sharing

Capacity pools (multiple endpoints for the same provider) share cached content for failover. When endpoint A is rate-limited, endpoint B in the same pool often has overlapping cache from shared system prompts.
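A failover pick within a pool can be sketched as follows. The `Endpoint` fields are illustrative, not Ensemble's actual data model:

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    rate_limited: bool
    has_prefix: bool  # endpoint has this prefix fingerprint cached

def pick_from_pool(pool: list[Endpoint]) -> Endpoint:
    """Prefer an available pool member with overlapping cache, else any available one."""
    available = [e for e in pool if not e.rate_limited]
    cached = [e for e in available if e.has_prefix]
    return (cached or available)[0]
```

Because pool siblings often serve the same system prompts, the fallback endpoint frequently still gets partial cache hits.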

Session Affinity

The X-Session-ID header controls cache affinity:
# First request creates cache on endpoint A
curl -H "X-Session-ID: sess_123" -d '{"model":"claude-sonnet-4",...}'

# Follow-up routes to endpoint A (cache hit)
curl -H "X-Session-ID: sess_123" -d '{"model":"claude-sonnet-4",...}'
Affinity is weighted, not absolute — rate limits and errors can override the cache preference.
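One way to picture the weighted trade-off is a per-endpoint score where the cached endpoint earns a stickiness bonus that load and errors can still outweigh. The scoring formula here is illustrative, with only `stickiness_factor` taken from the config above:

```python
def endpoint_score(utilization: float, error_rate: float,
                   has_cache: bool, stickiness_factor: float = 0.8) -> float:
    """Higher is better; affinity is a bonus term, not a hard override."""
    score = 1.0 - utilization - error_rate
    if has_cache:
        score += stickiness_factor  # cache preference, weighted not absolute
    return score
```

Under this sketch, a cached endpoint at 60% utilization still outscores an idle cold one, but a cached endpoint that is erroring heavily does not.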