Routing Decision Flow

Request arrives
          │
          ▼
┌────────────────────┐
│ Model lookup       │ → Which providers support this model?
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│ Cache affinity     │ → Does any endpoint have cached context for this session?
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│ Rate limit check   │ → Filter out endpoints at capacity
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│ Cost/load balance  │ → Among the remaining endpoints, pick the optimal one
└─────────┬──────────┘
          │
          ▼
  Selected endpoint
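The four stages can be sketched in Go as successive filters over candidate endpoints. This is a minimal illustration, not Ensemble's implementation: the `Endpoint` type, its fields (`Models`, `AtCapacity`, `Cost`, `InFlight`) and the `route` function are hypothetical. The sketch assumes cache affinity only expresses a preference that the rate-limit check can override, and that cost is compared before load.

```go
package main

import (
	"errors"
	"fmt"
)

// Endpoint is a hypothetical, simplified endpoint record for this sketch only.
type Endpoint struct {
	ID         string
	Models     map[string]bool // models this endpoint can serve
	AtCapacity bool            // outcome of the rate-limit check
	Cost       float64         // illustrative relative cost
	InFlight   int             // illustrative count of in-flight requests
}

// ErrNoEndpoint is returned when no endpoint survives the filters.
var ErrNoEndpoint = errors.New("no endpoint available for model")

// route applies the four stages in order. cachedID names the endpoint
// holding this session's cached context, or "" if there is none.
func route(model, cachedID string, eps []Endpoint) (Endpoint, error) {
	// Stage 1, model lookup: keep endpoints that serve the requested model.
	var pool []Endpoint
	for _, e := range eps {
		if e.Models[model] {
			pool = append(pool, e)
		}
	}

	// Stage 2, cache affinity: remember the candidate holding the session cache.
	preferred := ""
	for _, e := range pool {
		if cachedID != "" && e.ID == cachedID {
			preferred = e.ID
		}
	}

	// Stage 3, rate limit check: drop endpoints that are at capacity.
	var open []Endpoint
	for _, e := range pool {
		if !e.AtCapacity {
			open = append(open, e)
		}
	}
	if len(open) == 0 {
		return Endpoint{}, ErrNoEndpoint
	}

	// The cached endpoint wins if it survived the capacity filter.
	for _, e := range open {
		if preferred != "" && e.ID == preferred {
			return e, nil
		}
	}

	// Stage 4, cost/load balance: lowest cost first, lower load breaks ties.
	best := open[0]
	for _, e := range open[1:] {
		if e.Cost < best.Cost || (e.Cost == best.Cost && e.InFlight < best.InFlight) {
			best = e
		}
	}
	return best, nil
}

func main() {
	claude := map[string]bool{"claude": true}
	eps := []Endpoint{
		{ID: "anthropic-1", Models: claude, AtCapacity: true, Cost: 1.0},
		{ID: "anthropic-2", Models: claude, Cost: 1.0, InFlight: 3},
		{ID: "openai-1", Models: map[string]bool{"gpt": true}, Cost: 0.5},
	}
	// anthropic-1 holds the cache but is at capacity, so stage 4 picks anthropic-2.
	e, err := route("claude", "anthropic-1", eps)
	fmt.Println(e.ID, err) // anthropic-2 <nil>
}
```

Ordering the capacity filter after the affinity lookup means a session only loses its cache when its endpoint is actually saturated.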

RoutingDecision

Every request produces a RoutingDecision:
type RoutingDecision struct {
    Provider       string          // "anthropic", "openai", etc.
    Endpoint       string          // Endpoint display name
    EndpointID     string          // Internal endpoint ID
    Reason         string          // Human-readable reason
    EstimatedValue decimal.Decimal // Estimated cache value
    CacheOptimized bool            // Whether cache influenced the decision
    CostPenalty    decimal.Decimal // Cost delta vs cheapest option
}
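One plausible way to fill in `CacheOptimized`, `EstimatedValue` and `CostPenalty` is sketched below. It assumes each candidate carries an estimated request cost and an estimated saving from cached context, and treats a decision as cache-optimized when the chosen endpoint is not the cheapest by cost alone. `Option` and `decide` are hypothetical names, and `float64` stands in for `decimal.Decimal` only to keep the example on the standard library; monetary code would normally keep a decimal type.

```go
package main

import "fmt"

// RoutingDecision mirrors the struct above; float64 replaces decimal.Decimal
// here only so the sketch depends on the standard library alone.
type RoutingDecision struct {
	Provider       string
	Endpoint       string
	EndpointID     string
	Reason         string
	EstimatedValue float64
	CacheOptimized bool
	CostPenalty    float64
}

// Option is a hypothetical candidate endpoint with estimated economics.
type Option struct {
	Provider, Name, ID string
	Cost               float64 // estimated cost of serving the request here
	CacheValue         float64 // estimated saving from cached context (0 if none)
}

// decide picks the option with the lowest effective cost (Cost - CacheValue).
// opts must not be empty.
func decide(opts []Option) RoutingDecision {
	cheapest, best := opts[0], opts[0]
	for _, o := range opts[1:] {
		if o.Cost < cheapest.Cost {
			cheapest = o
		}
		if o.Cost-o.CacheValue < best.Cost-best.CacheValue {
			best = o
		}
	}
	d := RoutingDecision{
		Provider:       best.Provider,
		Endpoint:       best.Name,
		EndpointID:     best.ID,
		EstimatedValue: best.CacheValue,
		// Cache influenced the choice if it overrode the cheapest option.
		CacheOptimized: best.ID != cheapest.ID,
		CostPenalty:    best.Cost - cheapest.Cost,
	}
	if d.CacheOptimized {
		d.Reason = fmt.Sprintf("cached context on %s outweighs a %.2f cost penalty", best.Name, d.CostPenalty)
	} else {
		d.Reason = fmt.Sprintf("lowest-cost endpoint %s", best.Name)
	}
	return d
}

func main() {
	d := decide([]Option{
		{Provider: "anthropic", Name: "primary", ID: "anthropic-1", Cost: 1.0},
		{Provider: "anthropic", Name: "secondary", ID: "anthropic-2", Cost: 1.5, CacheValue: 0.75},
	})
	fmt.Println(d.EndpointID, d.CacheOptimized, d.CostPenalty) // anthropic-2 true 0.5
	fmt.Println(d.Reason)
}
```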

Routing Strategies

Configurable per provider:

Strategy           Behavior
session_affinity   Prefer the endpoint that holds this session's cache (default)
round_robin        Distribute requests evenly across endpoints, in turn
least_used         Route to the endpoint with the lowest utilization

Example configuration:
providers:
  anthropic:
    strategy: session_affinity
  openai:
    strategy: least_used
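The three strategies differ mainly in the state they keep. The sketch below is an assumption about how they could work, not Ensemble's code: the hypothetical `Selector` type keeps a round-robin cursor and a session-to-endpoint map, and for a session with no cached endpoint yet, `session_affinity` falls back to the least-utilized endpoint (a choice the table does not specify).

```go
package main

import "fmt"

// Strategy values match the configuration keys in the table above.
type Strategy string

const (
	SessionAffinity Strategy = "session_affinity"
	RoundRobin      Strategy = "round_robin"
	LeastUsed       Strategy = "least_used"
)

// Selector holds the state each strategy needs (hypothetical type).
type Selector struct {
	strategy Strategy
	next     int            // round-robin cursor
	sessions map[string]int // session ID -> index of the endpoint holding its cache
}

func NewSelector(st Strategy) *Selector {
	return &Selector{strategy: st, sessions: map[string]int{}}
}

// Pick returns the index of the endpoint to use. utilization[i] is the current
// utilization of endpoint i in [0, 1]; the slice must not be empty.
func (s *Selector) Pick(sessionID string, utilization []float64) int {
	switch s.strategy {
	case RoundRobin:
		i := s.next % len(utilization)
		s.next++
		return i
	case LeastUsed:
		return leastUsed(utilization)
	default: // session_affinity, the default strategy
		if i, ok := s.sessions[sessionID]; ok {
			return i
		}
		// Assumption: a session with no cache yet starts on the idlest endpoint.
		i := leastUsed(utilization)
		s.sessions[sessionID] = i
		return i
	}
}

// leastUsed returns the index with the lowest utilization (first one on ties).
func leastUsed(u []float64) int {
	best := 0
	for i := range u {
		if u[i] < u[best] {
			best = i
		}
	}
	return best
}

func main() {
	util := []float64{0.9, 0.1, 0.5}
	rr := NewSelector(RoundRobin)
	fmt.Println(rr.Pick("", util), rr.Pick("", util), rr.Pick("", util), rr.Pick("", util)) // 0 1 2 0
}
```

Note that `session_affinity` keeps returning a session's endpoint even after its utilization rises; the rate-limit stage of the routing flow is what would move the session elsewhere.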

Error-Aware Routing

When a request fails on the selected endpoint:
  1. Rate limit (429): automatically retry on the next available endpoint in the capacity pool.
  2. Server error (5xx): retry on a different provider, if one is available.
  3. Permanent error (other 4xx): return the error immediately, with no retry.
Error classification is provider-specific: for example, Ensemble distinguishes Anthropic’s overloaded error (retryable) from invalid_api_key (permanent).
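The classification rules above can be expressed as a small decision function. This is a sketch under stated assumptions: `Action`, `classify` and the per-provider error-type tables are hypothetical, and the Anthropic entries reuse the two labels named above, which may not match the exact strings a provider returns.

```go
package main

import "fmt"

// Action is what the router does after a failed attempt (hypothetical names).
type Action int

const (
	RetrySamePool      Action = iota // try the next endpoint in the capacity pool
	RetryOtherProvider               // retry on a different provider
	Fail                             // return the error to the caller immediately
)

// Provider-specific error types; the Anthropic entries reuse the labels from
// the text above and are illustrative, not an exhaustive or verified list.
var retryableTypes = map[string]map[string]bool{
	"anthropic": {"overloaded": true},
}

var permanentTypes = map[string]map[string]bool{
	"anthropic": {"invalid_api_key": true},
}

// classify maps a provider, HTTP status and provider error type to an action.
// Lookups for unknown providers hit a nil inner map, which safely yields false.
func classify(provider string, status int, errType string) Action {
	switch {
	case permanentTypes[provider][errType]:
		return Fail
	case status == 429:
		return RetrySamePool
	case status >= 500 || retryableTypes[provider][errType]:
		return RetryOtherProvider
	default: // remaining 4xx errors are permanent
		return Fail
	}
}

func main() {
	fmt.Println(classify("anthropic", 429, "") == RetrySamePool)       // true
	fmt.Println(classify("anthropic", 401, "invalid_api_key") == Fail) // true
}
```

Checking the permanent table first keeps a known-fatal error from being retried even if a provider pairs it with an unexpected status code.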

Capacity Pools

Multiple API keys for the same provider form a capacity pool:
providers:
  anthropic:
    keys:
      - name: primary
        endpoints:
          - id: anthropic-1
            rpm_limit: 1000
            tpm_limit: 100000
      - name: secondary
        endpoints:
          - id: anthropic-2
            rpm_limit: 500
            tpm_limit: 50000
The router spreads load across every endpoint in the pool and fails over to another endpoint in the pool when one reaches its limits.