Component Overview

                    Client Request
                         │
                    ┌────▼────┐
                    │  Auth   │  API key validation (SHA256 hash lookup in sync.Map)
                    └────┬────┘
                         │
                    ┌────▼────────────┐
                    │ Request Parser  │  Unified InferenceRequest from any provider format
                    └────┬────────────┘
                         │
                    ┌────▼────────────┐
                    │ Parameter       │  YAML-driven validation, provider-specific constraints
                    │ Validator       │
                    └────┬────────────┘
                         │
                    ┌────▼────────────┐
                    │    Router       │  Cache affinity → Rate limits → Cost optimization
                    └────┬────────────┘
                         │
                    ┌────▼────────────┐
                    │ Provider Client │  HTTP/2 connection pool (100 conns/host)
                    └────┬────────────┘
                         │
                    ┌────▼────────────┐
                    │ Response        │  Stream parsing → Unified EventBlocks → Cost calc
                    │ Handler         │
                    └────┬────────────┘
                         │
                    ┌────▼──────────────────┐
                    │ Persistence / Logging │  S3 response storage, async batched logs
                    └───────────────────────┘

Package Layout

ensemble/
├── cmd/ensemble/         # Main entry point (CLI: serve, migrate, etc.)
├── internal/
│   ├── config/           # YAML config structures, hot-reload, validation cache
│   ├── server/           # HTTP server, route handlers, middleware
│   ├── proxy/            # Provider request/response proxying
│   ├── router/           # Routing engine (cache, rate, cost)
│   ├── ratelimit/        # Local-first rate management with Redis sync
│   ├── storage/          # EmbeddedStore (SQLite + Redis), API keys, customers
│   ├── providers/        # Provider adapters (Anthropic, OpenAI, Gemini, xAI, OpenRouter)
│   ├── streaming/        # SSE/WebSocket stream handling, stall detection
│   ├── batch/            # Batch processing (Anthropic batch API)
│   ├── otel/             # OpenTelemetry tracing and metrics
│   └── async_inference/  # Response persistence (S3), status tracking
├── pkg/
│   ├── types/            # Shared types (InferenceRequest, EventBlock, ToolDefinition, etc.)
│   └── logging/          # Async batched logger (65K ring buffer)
├── client-go/            # Go client library
├── client-python/        # Python client library
├── client-typescript/    # TypeScript client library
└── config/               # Example YAML configurations

Core Data Types

InferenceRequest

The unified request format that every provider-specific request is parsed into:
type InferenceRequest struct {
    Model          string                 // Target model name
    Messages       []Message              // Conversation history
    SessionID      string                 // For cache affinity routing
    Tools          []ToolDefinition       // Function calling definitions
    MaxTokens      int                    // Upper bound on generated tokens
    Temperature    *float64               // Optional sampling temperature; nil when unset
    TopP           *float64               // Optional nucleus sampling cutoff; nil when unset
    StopSequences  []string               // Strings that end generation when produced
    Stream         bool                   // SSE streaming mode
    ProviderConfig map[string]interface{} // Per-request provider overrides
}
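
Because Temperature and TopP are pointers, a validator can tell "not set" apart from an explicit zero. The sketch below shows one way such a check might look. The Limits struct, the Validate function, and the trimmed request type are hypothetical and used only for illustration; in Ensemble the constraints come from YAML, and the real rules may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// InferenceRequest is trimmed to the fields this sketch needs.
type InferenceRequest struct {
	Model       string
	MaxTokens   int
	Temperature *float64 // nil means the caller did not set it
}

// Limits is a hypothetical stand-in for constraints loaded from YAML.
type Limits struct {
	MaxOutputTokens int
	MinTemperature  float64
	MaxTemperature  float64
}

// Validate reports the first constraint the request violates, or nil.
func Validate(req InferenceRequest, lim Limits) error {
	if req.Model == "" {
		return errors.New("model is required")
	}
	if req.MaxTokens < 0 || req.MaxTokens > lim.MaxOutputTokens {
		return fmt.Errorf("max tokens %d outside 0..%d", req.MaxTokens, lim.MaxOutputTokens)
	}
	// Only check the temperature when the caller actually supplied one.
	if t := req.Temperature; t != nil && (*t < lim.MinTemperature || *t > lim.MaxTemperature) {
		return fmt.Errorf("temperature %g outside %g..%g", *t, lim.MinTemperature, lim.MaxTemperature)
	}
	return nil
}

func main() {
	lim := Limits{MaxOutputTokens: 4096, MinTemperature: 0, MaxTemperature: 1}
	hot := 1.5
	fmt.Println(Validate(InferenceRequest{Model: "example-model", MaxTokens: 256}, lim))
	fmt.Println(Validate(InferenceRequest{Model: "example-model", MaxTokens: 256, Temperature: &hot}, lim))
}
```

The first call passes because an unset temperature is never range-checked; the second is rejected because 1.5 exceeds the assumed maximum.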

EventBlock

A response is represented canonically as a sequence of typed blocks:
type EventBlock struct {
    Type            string    // "text", "tool_call", "thinking", "error"
    Text            string    // For text blocks
    ToolCall        *ToolCall // For tool_call blocks
    ThinkingContent string    // For thinking blocks
}
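
A consumer typically walks the block sequence and switches on Type. The following is a minimal sketch of that pattern, not Ensemble's own code. The ToolCall fields and the Summarize helper are hypothetical, and the sketch assumes that an "error" block carries its message in Text, which the source does not state.

```go
package main

import (
	"fmt"
	"strings"
)

// ToolCall is a hypothetical shape; the source does not list its fields.
type ToolCall struct {
	Name      string
	Arguments string // raw JSON arguments
}

// EventBlock mirrors the struct shown above.
type EventBlock struct {
	Type            string
	Text            string
	ToolCall        *ToolCall
	ThinkingContent string
}

// Summarize joins the text blocks and collects the names of requested tools.
// It stops at the first error block and returns what it gathered so far.
func Summarize(blocks []EventBlock) (text string, tools []string, err error) {
	var sb strings.Builder
	for _, b := range blocks {
		switch b.Type {
		case "text":
			sb.WriteString(b.Text)
		case "tool_call":
			if b.ToolCall != nil {
				tools = append(tools, b.ToolCall.Name)
			}
		case "thinking":
			// Reasoning content is left out of the user-visible text.
		case "error":
			// Assumption: the error message travels in Text.
			return sb.String(), tools, fmt.Errorf("provider error: %s", b.Text)
		}
	}
	return sb.String(), tools, nil
}

func main() {
	text, tools, err := Summarize([]EventBlock{
		{Type: "thinking", ThinkingContent: "plan the answer"},
		{Type: "text", Text: "Checking the forecast. "},
		{Type: "tool_call", ToolCall: &ToolCall{Name: "get_weather", Arguments: `{"city":"Oslo"}`}},
	})
	fmt.Println(text, tools, err)
}
```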

InferenceResponse

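type InferenceResponse struct {
    ID                  string
    Model               string
    Provider            string              // Provider that served the request
    Blocks              []EventBlock        // The canonical representation
    InputTokens         int64
    OutputTokens        int64
    CachedPromptTokens  int64               // Prompt tokens read from the provider's cache
    CacheCreationTokens int64               // Prompt tokens written to the provider's cache
    ReasoningTokens     int64               // Tokens spent on model reasoning
    Cost                decimal.Decimal     // Computed from token counts and per-model pricing
    ProcessingTime      time.Duration
    FinishReason        string
    PerformanceMetrics  *PerformanceMetrics
    RateLimitInfo       *RateLimitInfo
}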
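
The Cost field is derived from the token counters and the per-model pricing. A rough sketch of that arithmetic follows. Everything in it is an assumption made for illustration: the Pricing and Usage types, the usd helper, the per-million-token unit, the example prices, and the premise that the three counters do not overlap. The source stores Cost as a decimal.Decimal; math/big.Rat is swapped in here only to keep the sketch within the standard library while still avoiding floating-point rounding.

```go
package main

import (
	"fmt"
	"math/big"
)

// Pricing holds hypothetical per-million-token prices in USD.
type Pricing struct {
	InputPerMTok  *big.Rat
	OutputPerMTok *big.Rat
	CachedPerMTok *big.Rat
}

// Usage mirrors three of the token counters on InferenceResponse.
type Usage struct {
	InputTokens        int64
	OutputTokens       int64
	CachedPromptTokens int64
}

// usd parses a decimal string such as "0.30" into an exact rational.
func usd(s string) *big.Rat {
	r, ok := new(big.Rat).SetString(s)
	if !ok {
		panic("invalid price: " + s)
	}
	return r
}

// Cost returns the exact USD cost, assuming the three counters do not overlap.
func Cost(u Usage, p Pricing) *big.Rat {
	perMillion := big.NewRat(1, 1000000)
	total := new(big.Rat)
	add := func(tokens int64, price *big.Rat) {
		line := new(big.Rat).Mul(big.NewRat(tokens, 1), price)
		total.Add(total, line.Mul(line, perMillion))
	}
	add(u.InputTokens, p.InputPerMTok)
	add(u.OutputTokens, p.OutputPerMTok)
	add(u.CachedPromptTokens, p.CachedPerMTok)
	return total
}

func main() {
	p := Pricing{InputPerMTok: usd("3"), OutputPerMTok: usd("15"), CachedPerMTok: usd("0.30")}
	u := Usage{InputTokens: 1000, OutputTokens: 500, CachedPromptTokens: 2000}
	fmt.Println(Cost(u, p).FloatString(6)) // prints 0.011100
}
```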

Storage Architecture

The EmbeddedStore uses SQLite as the primary data store with in-memory caches for hot-path lookups:
  • API Keys: sync.Map keyed by SHA256 hash — lock-free lookup on every request
  • Customers: In-memory map loaded at startup
  • Endpoints: In-memory map with encrypted provider API keys
  • Model Support: Configuration-driven model → providers mapping
  • Pricing: Per-model input/output/cache pricing from YAML
Redis (optional) provides:
  • Cross-instance API key synchronization via pub/sub
  • Rate limit coordination across multiple Ensemble instances
  • Namespaced keys for environment isolation (dev/staging/production)
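
The API key lookup described above can be sketched with the standard library alone. This is an illustrative outline under stated assumptions, not the EmbeddedStore implementation: KeyRecord, KeyCache, and their methods are hypothetical names, and the hex encoding of the digest is a choice made for the example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// KeyRecord is a hypothetical value stored for each API key.
type KeyRecord struct {
	CustomerID string
}

// KeyCache maps hex-encoded SHA-256 digests of API keys to their records.
// Only the digest is kept in memory, never the raw key.
type KeyCache struct {
	m sync.Map
}

// hashKey returns the hex-encoded SHA-256 digest of a raw API key.
func hashKey(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

// Store registers a key, for example while loading rows at startup.
func (c *KeyCache) Store(rawKey string, rec KeyRecord) {
	c.m.Store(hashKey(rawKey), rec)
}

// Lookup hashes the presented key and reads the map without taking a mutex
// in the caller; sync.Map is optimized for this read-mostly access pattern.
func (c *KeyCache) Lookup(rawKey string) (KeyRecord, bool) {
	v, ok := c.m.Load(hashKey(rawKey))
	if !ok {
		return KeyRecord{}, false
	}
	return v.(KeyRecord), true
}

func main() {
	var cache KeyCache
	cache.Store("sk-example-123", KeyRecord{CustomerID: "cust-1"})

	rec, ok := cache.Lookup("sk-example-123")
	fmt.Println(rec.CustomerID, ok)

	_, ok = cache.Lookup("sk-unknown")
	fmt.Println(ok)
}
```

With this layout, a cross-instance update received over Redis pub/sub would only need to call Store (or a matching delete) on each instance's local map.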

Routing Algorithm

The router applies the following steps in order:
  1. Cache affinity: Route to the endpoint with the highest estimated cache value for this session
  2. Rate limit check: Skip endpoints at capacity (local counters + cached global view)
  3. Cost optimization: For low-cache-value requests, prefer the least-utilized endpoint
  4. Failover: On a rate limit or error, try the next capacity pool automatically
  5. Exhaustion: Return HTTP 429 to the client only when all capacity pools are exhausted
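
The steps above can be sketched as a ranking function whose output the caller tries in order. This is a simplified illustration, not the router's actual code. The Endpoint fields, the cacheThreshold cutoff, the Rank function, and the ErrExhausted sentinel are all hypothetical, and a single utilization number stands in for the local counters and cached global view.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Endpoint is a hypothetical view of one provider endpoint at routing time.
type Endpoint struct {
	Name        string
	CacheValue  float64 // estimated cache value for this session
	Utilization float64 // fraction of the rate limit in use, 0..1
}

// ErrExhausted is what a handler would translate into HTTP 429.
var ErrExhausted = errors.New("all capacity pools exhausted")

// cacheThreshold is an assumed cutoff between cache-warm and cache-cold endpoints.
const cacheThreshold = 0.5

// Rank returns the usable endpoints in the order they should be tried.
// Trying the next element after a failure gives the failover behavior.
func Rank(eps []Endpoint) ([]Endpoint, error) {
	var usable []Endpoint
	for _, e := range eps {
		if e.Utilization < 1.0 { // rate limit check: skip endpoints at capacity
			usable = append(usable, e)
		}
	}
	if len(usable) == 0 {
		return nil, ErrExhausted // nothing left to try
	}
	sort.SliceStable(usable, func(i, j int) bool {
		a, b := usable[i], usable[j]
		aWarm, bWarm := a.CacheValue >= cacheThreshold, b.CacheValue >= cacheThreshold
		if aWarm != bWarm {
			return aWarm // cache affinity: warm endpoints come first
		}
		if aWarm {
			return a.CacheValue > b.CacheValue // highest cache value first
		}
		return a.Utilization < b.Utilization // low cache value: least utilized first
	})
	return usable, nil
}

func main() {
	ranked, err := Rank([]Endpoint{
		{Name: "a", CacheValue: 0.9, Utilization: 1.0},
		{Name: "b", CacheValue: 0.7, Utilization: 0.4},
		{Name: "c", CacheValue: 0.1, Utilization: 0.2},
		{Name: "d", CacheValue: 0.0, Utilization: 0.1},
	})
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, e := range ranked {
		fmt.Println(e.Name)
	}
}
```

In this example endpoint a is skipped because it is at capacity, b is tried first for its cache value, and d and c follow in order of utilization.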