Architecture

Ensemble uses a local-first rate management design that adds no network latency to the hot path:
Request path (microseconds):
  Local atomic counter check → Allow/Deny

Background (every 1 second):
  Local counters → Redis → Global view update
The request path never queries Redis. Local counters use sync/atomic types for lock-free operation.
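The split between the request path and the background sync can be sketched as follows. This is a minimal illustration, not Ensemble's actual code: the names localLimiter, Allow, Snapshot, and runSync are hypothetical, and the publish callback stands in for the Redis write.

```go
package main

import (
	"sync/atomic"
	"time"
)

// localLimiter is a hypothetical per-endpoint counter for one rate window.
type localLimiter struct {
	count atomic.Int64 // requests admitted in the current window
	limit int64        // assumed RPM limit for the endpoint
}

// Allow is the request path: one atomic add, no locks, no network calls.
func (l *localLimiter) Allow() bool {
	if l.count.Add(1) > l.limit {
		l.count.Add(-1) // undo the reservation so denied requests are not counted
		return false
	}
	return true
}

// Snapshot returns the current local count for the background sync.
func (l *localLimiter) Snapshot() int64 { return l.count.Load() }

// runSync calls publish with a snapshot once per interval until stop is closed.
// publish stands in for the Redis write, which is not shown here.
func runSync(l *localLimiter, interval time.Duration, publish func(int64), stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			publish(l.Snapshot())
		case <-stop:
			return
		}
	}
}
```

In this sketch a request handler would call Allow directly, while a single goroutine started with `go runSync(l, time.Second, publish, stop)` handles the one-second sync.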

Rate Limit Tracking

Per-Endpoint Limits

Each endpoint has RPM (requests per minute) and TPM (tokens per minute) limits:
endpoints:
  - id: anthropic-primary
    rpm_limit: 1000
    tpm_limit: 100000

Local Counter Structure

Counters are packed into a single atomic.Uint64 for cache-line efficiency:
// Upper 32 bits: TPM count
// Lower 32 bits: RPM count
var packed atomic.Uint64 // zero value is ready to use
Window rollover uses CompareAndSwap, so no mutex is needed and goroutines never block on a lock.
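A sketch of how the packed layout and a CompareAndSwap rollover might fit together is shown below. The window type, its start field, and the helpers pack, unpack, record, and rollover are assumptions made for illustration. The sketch also assumes fixed one-minute windows aligned to Unix time, and it ignores uint32 overflow.

```go
package main

import "sync/atomic"

// pack places the TPM count in the upper 32 bits and the RPM count in the lower 32.
func pack(tpm, rpm uint32) uint64 { return uint64(tpm)<<32 | uint64(rpm) }

// unpack reverses pack.
func unpack(p uint64) (tpm, rpm uint32) { return uint32(p >> 32), uint32(p) }

// window is a hypothetical holder for one endpoint's counters.
type window struct {
	packed atomic.Uint64 // TPM and RPM counts, packed as above
	start  atomic.Int64  // Unix seconds at which the current window began
}

// record adds one request and its tokens using a CAS retry loop.
// A plain Add would also work; the loop leaves room for limit or overflow checks.
func (w *window) record(tokens uint32) (tpm, rpm uint32) {
	for {
		old := w.packed.Load()
		t, r := unpack(old)
		if w.packed.CompareAndSwap(old, pack(t+tokens, r+1)) {
			return t + tokens, r + 1
		}
	}
}

// rollover resets the counters when nowUnix falls in a later minute.
// Only the goroutine that wins the CAS on start performs the reset.
// Simplification: increments that race with the reset may be dropped.
func (w *window) rollover(nowUnix int64) bool {
	cur := w.start.Load()
	next := nowUnix - nowUnix%60
	if next <= cur || !w.start.CompareAndSwap(cur, next) {
		return false
	}
	w.packed.Store(0)
	return true
}
```

Keeping both counts in one word means a single atomic operation updates them together, so a reader never sees a token count without its matching request count.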

Global View

Background sync publishes local counters to Redis and reads global aggregates:
Redis key: {namespace}:ratelimit:{endpoint_id}:{model_id}
The global view is used for routing decisions (avoid sending traffic to endpoints that other instances have already saturated) but is never on the request path.
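The key layout and the routing use of the global view might look like the sketch below. It assumes the braces in the key pattern are placeholders rather than literal characters (literal braces would act as hash tags in Redis Cluster). The globalUsage type and the routableEndpoints helper are hypothetical, as is the rule that an endpoint is skipped once its global TPM reaches its limit.

```go
package main

import (
	"fmt"
	"sort"
)

// rateLimitKey builds the Redis key for one endpoint and model.
func rateLimitKey(namespace, endpointID, modelID string) string {
	return fmt.Sprintf("%s:ratelimit:%s:%s", namespace, endpointID, modelID)
}

// globalUsage is a hypothetical entry in the global view for one endpoint.
type globalUsage struct {
	TPM      int64 // tokens used this window, summed across all instances
	LimitTPM int64 // configured tpm_limit for the endpoint
}

// routableEndpoints returns the IDs of endpoints that still have global
// headroom, sorted for deterministic output.
func routableEndpoints(view map[string]globalUsage) []string {
	ids := make([]string, 0, len(view))
	for id, u := range view {
		if u.TPM < u.LimitTPM {
			ids = append(ids, id)
		}
	}
	sort.Strings(ids)
	return ids
}
```

Because the view is refreshed by the background sync, it can lag by up to one sync interval. It can steer routing, but the local counter remains the admission check.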

Mock Endpoint Detection

Endpoints with TPM limits above a configurable threshold (MockEndpointTPMThreshold) are treated as unlimited locally — useful for testing and development.
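The threshold check could be as simple as the sketch below. Only the name MockEndpointTPMThreshold comes from the text above; the limiterConfig type, the helper names, and the threshold value used in the comments are assumptions.

```go
package main

// limiterConfig is a hypothetical config struct holding the threshold.
type limiterConfig struct {
	MockEndpointTPMThreshold int64 // e.g. 1_000_000; the real default is not stated
}

// isMockEndpoint reports whether an endpoint's TPM limit is above the threshold.
func (c limiterConfig) isMockEndpoint(tpmLimit int64) bool {
	return tpmLimit > c.MockEndpointTPMThreshold
}

// allowLocally skips the local TPM check entirely for mock endpoints.
func (c limiterConfig) allowLocally(currentTPM, tpmLimit int64) bool {
	if c.isMockEndpoint(tpmLimit) {
		return true // treated as unlimited
	}
	return currentTPM < tpmLimit
}
```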

RateDecision

Every rate limit check produces a RateDecision:
type RateDecision struct {
    Allowed     bool
    CurrentTPM  int64
    LimitTPM    int64
    Utilization float64        // 0.0-1.0
    WindowStart time.Time
    RetryAfter  *time.Duration // Set when rate limited
}
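As an illustration of how the fields relate, the sketch below fills in a RateDecision from a counter reading. The decide function is hypothetical. It assumes a fixed one-minute window, denies once CurrentTPM reaches LimitTPM, and sets RetryAfter to the time remaining in the window. None of these rules is confirmed by the text above.

```go
package main

import "time"

// RateDecision mirrors the struct shown above.
type RateDecision struct {
	Allowed     bool
	CurrentTPM  int64
	LimitTPM    int64
	Utilization float64 // 0.0-1.0
	WindowStart time.Time
	RetryAfter  *time.Duration // Set when rate limited
}

// decide builds a decision for a fixed one-minute window (an assumption).
func decide(currentTPM, limitTPM int64, windowStart, now time.Time) RateDecision {
	d := RateDecision{
		Allowed:     currentTPM < limitTPM,
		CurrentTPM:  currentTPM,
		LimitTPM:    limitTPM,
		WindowStart: windowStart,
	}
	if limitTPM > 0 {
		d.Utilization = float64(currentTPM) / float64(limitTPM)
		if d.Utilization > 1.0 {
			d.Utilization = 1.0 // clamp to the documented 0.0-1.0 range
		}
	}
	if !d.Allowed {
		// Time left until the window resets, never negative.
		wait := windowStart.Add(time.Minute).Sub(now)
		if wait < 0 {
			wait = 0
		}
		d.RetryAfter = &wait
	}
	return d
}
```

A caller could turn a non-nil RetryAfter into a Retry-After response header, rounding up to whole seconds.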

Per-Key Rate Limits

API keys can have their own rate limits (in addition to endpoint limits):
{
  "rate_limit_rpm": 100,
  "rate_limit_tpm": 50000
}
Key-level limits are checked before endpoint-level limits.
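The check order can be sketched as below. The usage and scope types and the deniedBy function are hypothetical, and treating a limit of 0 as "no limit configured" is an assumption rather than documented behavior.

```go
package main

// usage pairs a counter reading with its limit.
type usage struct {
	Used  int64
	Limit int64 // 0 means no limit configured (an assumption)
}

// exceeded reports whether a configured limit has been reached.
func (u usage) exceeded() bool { return u.Limit > 0 && u.Used >= u.Limit }

// scope groups the RPM and TPM usage for one API key or one endpoint.
type scope struct {
	RPM usage
	TPM usage
}

// exceeded reports whether either the RPM or the TPM limit has been reached.
func (s scope) exceeded() bool { return s.RPM.exceeded() || s.TPM.exceeded() }

// deniedBy checks the key first, then the endpoint. It returns "key",
// "endpoint", or "" when the request is allowed.
func deniedBy(key, endpoint scope) string {
	if key.exceeded() {
		return "key"
	}
	if endpoint.exceeded() {
		return "endpoint"
	}
	return ""
}
```

Checking the key first means a client that has exhausted its own quota is rejected with a key-level reason even when the endpoint is also saturated.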

Redis Namespace Isolation

Rate limit keys are namespaced by environment. The namespace is taken from the first of these sources that is set:
Priority 1: REDIS_NAMESPACE env var
Priority 2: ENSEMBLE_ENVIRONMENT env var (dev/staging/production)
Priority 3: Test mode detection (unique per test run)
Default: "default"
This prevents development instances from interfering with production rate limit state.
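The priority order above can be expressed as a short resolver. This is a sketch under assumptions: the function names are hypothetical, the value of ENSEMBLE_ENVIRONMENT is used directly as the namespace, and test mode is represented by a caller-supplied run ID because the detection mechanism is not described.

```go
package main

import "os"

// resolveNamespace applies the documented priority order. getenv is injected
// so the logic can be exercised without touching the process environment.
func resolveNamespace(getenv func(string) string, testRunID string) string {
	if ns := getenv("REDIS_NAMESPACE"); ns != "" {
		return ns // Priority 1: explicit override
	}
	if env := getenv("ENSEMBLE_ENVIRONMENT"); env != "" {
		return env // Priority 2: dev, staging, or production
	}
	if testRunID != "" {
		return "test-" + testRunID // Priority 3: unique per test run (format assumed)
	}
	return "default"
}

// namespaceFromEnv resolves the namespace from the real environment
// outside of test mode.
func namespaceFromEnv() string {
	return resolveNamespace(os.Getenv, "")
}
```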