The Problem
LLM inference is expensive and non-deterministic. If a client disconnects mid-stream:
- The provider has already charged for the request
- The generated output is lost
- Re-running produces different (and costly) results
The Solution
Ensemble persists every completed response to S3:
- Provider continues: Even after a client disconnection, the provider call runs to completion
- S3 storage: The complete response (blocks, tokens, cost) is stored in S3
- Client recovery: The client can retrieve the response later via `GET /api/v1/retrieve/{request_id}`
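The recovery path above can be sketched as a small client helper. The endpoint path comes from this page; the base URL, the JSON response shape, and the use of plain `urllib` are assumptions:

```python
import json
import urllib.request

def retrieval_url(base_url: str, request_id: str) -> str:
    """Build the retrieval endpoint URL for a given X-Request-ID."""
    return f"{base_url.rstrip('/')}/api/v1/retrieve/{request_id}"

def retrieve_response(base_url: str, request_id: str) -> dict:
    """Fetch the persisted response (blocks, tokens, cost) by request ID.

    Assumes the endpoint returns JSON; auth headers, if any, are omitted.
    """
    with urllib.request.urlopen(retrieval_url(base_url, request_id)) as resp:
        return json.load(resp)

# Hypothetical host and request ID, for illustration only:
print(retrieval_url("https://ensemble.example.com", "req-123"))
```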
Flow
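The flow can be approximated in a short sketch: the provider call is shielded from client cancellation so it runs to completion, and the finished response is written to storage under the request ID. An in-memory dict stands in for S3 here, and all names are assumptions, not Ensemble internals:

```python
import asyncio
import uuid

async def call_provider(prompt: str) -> dict:
    # Stands in for the (expensive, non-deterministic) LLM provider call.
    await asyncio.sleep(0)
    return {"blocks": [f"echo: {prompt}"], "tokens": 5, "cost_usd": 0.0001}

async def handle_request(prompt: str, store: dict, request_id=None) -> str:
    # Auto-generate the ID if the client did not provide one.
    request_id = request_id or str(uuid.uuid4())
    # asyncio.shield keeps the provider call running even if the caller
    # (the client connection handler) is cancelled mid-stream.
    response = await asyncio.shield(call_provider(prompt))
    store[request_id] = response  # the "S3" write, keyed by request ID
    return request_id

store: dict = {}
rid = asyncio.run(handle_request("hello", store, request_id="req-123"))
print(rid, store[rid])
```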
Configuration
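The original configuration snippet is not reproduced here; as an illustration only, S3-backed persistence might be wired up with settings like the following. Every key name and value below is an assumption, not Ensemble's actual schema:

```yaml
# Hypothetical sketch -- key names are assumptions, not Ensemble's real config
persistence:
  s3:
    bucket: my-ensemble-responses   # placeholder bucket name
    region: us-east-1
    prefix: responses/              # e.g. objects stored as responses/{request_id}
```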
Response persistence is enabled by configuring S3 storage.
Request ID Tracking
Every request gets a unique X-Request-ID (auto-generated or client-provided). This ID is:
- Returned in response headers
- Used as the S3 storage key
- Required for status checks and retrieval
- Included in logs and traces for correlation
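The ID rules above can be sketched as two small helpers: one that fills in a generated X-Request-ID when the client did not supply one, and one that derives the storage key from it. The key format and helper names are assumptions for illustration:

```python
import uuid

def ensure_request_id(headers: dict) -> dict:
    """Return a copy of the headers with an X-Request-ID present,
    generating one if the client did not supply it."""
    out = dict(headers)
    out.setdefault("X-Request-ID", str(uuid.uuid4()))
    return out

def s3_key(request_id: str) -> str:
    # The request ID doubles as the S3 storage key (the exact
    # prefix/suffix format here is an assumption).
    return f"responses/{request_id}.json"

h = ensure_request_id({"X-Request-ID": "req-abc"})  # client-provided ID wins
print(h["X-Request-ID"], s3_key(h["X-Request-ID"]))
```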