The Problem

LLM inference is expensive and non-deterministic. If a client disconnects mid-stream:
  • The provider has already charged for the request
  • The generated output is lost
  • Re-running the request incurs the cost again and, because sampling is non-deterministic, produces different output

The Solution

Ensemble persists every completed response to S3:
  1. Provider continues: Even after client disconnection, the provider call runs to completion
  2. S3 storage: The complete response (blocks, tokens, cost) is stored in S3
  3. Client recovery: The client can retrieve the response later via GET /api/v1/retrieve/{request_id}
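
As a minimal sketch of step 3 in Python (the base URL is a placeholder; only the /api/v1/retrieve/{request_id} path comes from this document):

import requests

ENSEMBLE_URL = "http://localhost:8080"  # placeholder; use your deployment's URL

def retrieve_response(request_id: str) -> dict:
    """Fetch a response that Ensemble persisted to S3 after completion."""
    resp = requests.get(f"{ENSEMBLE_URL}/api/v1/retrieve/{request_id}")
    resp.raise_for_status()
    return resp.json()  # assumed shape: the persisted blocks, tokens, and cost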

Flow

Client ──── Ensemble ──── Provider
  │              │             │
  │  request     │  forward    │
  │─────────────>│────────────>│
  │              │             │
  │  streaming   │  streaming  │
  │<─────────────│<────────────│
  │              │             │
  ╳ disconnect   │  continues  │
                 │<────────────│
                 │             │
                 │  complete   │
                 │  ──> S3     │
                 │             │
  │  retrieve    │             │
  │─────────────>│             │
  │  response    │             │
  │<─────────────│             │
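
Because the provider call keeps running after a disconnect, the response only lands in S3 once generation completes, so an early retrieve may miss it. A hedged polling sketch (the 404-until-persisted behavior and the base URL are assumptions):

import time
import requests

ENSEMBLE_URL = "http://localhost:8080"  # placeholder

def wait_for_response(request_id: str, poll_seconds: float = 2.0) -> dict:
    """Poll the retrieve endpoint until the provider call completes and
    the response has been written to S3."""
    while True:
        resp = requests.get(f"{ENSEMBLE_URL}/api/v1/retrieve/{request_id}")
        if resp.status_code == 200:
            return resp.json()
        if resp.status_code != 404:  # assumption: 404 means "not persisted yet"
            resp.raise_for_status()
        time.sleep(poll_seconds)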

Configuration

Response persistence is enabled by configuring S3:
# S3 configuration (MinIO is also supported)
# Set via environment variables:
# AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
# AWS_ENDPOINT_URL (for MinIO)
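
As a rough illustration of how these variables map onto an S3 client, here is a Python sketch using boto3; the bucket name and object layout are assumptions, not Ensemble's actual internals:

import os
import boto3

# Build an S3 client from the environment variables listed above.
# AWS_ENDPOINT_URL points at MinIO; leaving it unset falls back to AWS S3.
s3 = boto3.client(
    "s3",
    region_name=os.environ.get("AWS_REGION"),
    aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY"),
    endpoint_url=os.environ.get("AWS_ENDPOINT_URL"),
)

# Hypothetical: list persisted responses (bucket name is an assumption).
for obj in s3.list_objects_v2(Bucket="ensemble-responses").get("Contents", []):
    print(obj["Key"])  # keys are request IDs (see "Request ID Tracking" below)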

Request ID Tracking

Every request gets a unique X-Request-ID (auto-generated or client-provided). This ID is:
  • Returned in response headers
  • Used as the S3 storage key
  • Required for status checks and retrieval
  • Included in logs and traces for correlation
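
For illustration, here is how a client might supply its own X-Request-ID in Python; the submission endpoint and payload are hypothetical, while the header name and retrieval path come from this document:

import uuid
import requests

ENSEMBLE_URL = "http://localhost:8080"  # placeholder
request_id = str(uuid.uuid4())  # client-provided; otherwise Ensemble generates one

try:
    resp = requests.post(
        f"{ENSEMBLE_URL}/api/v1/completions",  # hypothetical endpoint
        json={"prompt": "..."},                # hypothetical payload
        headers={"X-Request-ID": request_id},
        stream=True,
        timeout=60,
    )
    resp.raise_for_status()
    assert resp.headers["X-Request-ID"] == request_id  # echoed back in headers
    for chunk in resp.iter_content(chunk_size=None):
        pass  # consume the stream
except requests.RequestException:
    # Disconnected mid-stream: the same ID recovers the persisted response.
    recovered = requests.get(f"{ENSEMBLE_URL}/api/v1/retrieve/{request_id}")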