Ensemble supports two streaming protocols. Both deliver the same unified EventBlock format.

POST /api/v1/stream (SSE)

Server-Sent Events streaming. The request body is identical to /api/v1/generate.

Headers

Same as /generate, plus:
Header    Value               Description
Accept    text/event-stream   Required for SSE

Event Stream Format

event: block
data: {"type":"text","text":"Hello"}

event: block
data: {"type":"text","text":" world!"}

event: done
data: {"id":"req_abc","model":"claude-sonnet-4","blocks":[...],"input_tokens":25,"output_tokens":12,"cost":"0.000111"}
Event types:
  • block — Incremental content block (text delta, tool call, thinking)
  • done — Complete InferenceResponse with token counts and cost
  • error — Error event with classification
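The event framing above can be consumed with a small parser. This is a minimal sketch (not a full SSE implementation): it assumes one `data:` line per event, as in the examples, and ignores comments and retry fields.

```python
import json

def parse_sse(raw: str):
    """Parse SSE text into (event_type, payload) tuples.

    Minimal sketch: assumes a single `data:` line per event, as in the
    /api/v1/stream examples above.
    """
    events = []
    event_type = None
    for line in raw.splitlines():
        if line.startswith("event:"):
            event_type = line[len("event:"):].strip()
        elif line.startswith("data:"):
            payload = json.loads(line[len("data:"):].strip())
            events.append((event_type, payload))
    return events

stream = (
    "event: block\n"
    'data: {"type":"text","text":"Hello"}\n'
    "\n"
    "event: block\n"
    'data: {"type":"text","text":" world!"}\n'
)
# Accumulate text deltas from `block` events:
text = "".join(
    b["text"] for kind, b in parse_sse(stream)
    if kind == "block" and b["type"] == "text"
)
```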

Preflight Validation

Before streaming begins, Ensemble performs preflight checks:
  • API key validation
  • Rate limit check
  • Parameter validation
  • Provider health check
If preflight fails, an error response is returned as a normal HTTP error (not a stream event), enabling clean client retry.
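Because preflight failures arrive as plain HTTP errors rather than stream events, a client can decide whether to retry before any stream state exists. The policy below is a hypothetical client-side sketch (the status-to-check mapping is an assumption, not part of the Ensemble API):

```python
def should_retry(status: int) -> bool:
    """Hypothetical retry policy for a failed preflight.

    Assumes conventional status codes: 429 for rate limiting and 5xx for
    provider health are transient; 400/401 (bad parameters / bad API key)
    will not succeed on retry.
    """
    if status == 429 or 500 <= status < 600:
        return True
    return False
```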

Stall Detection

During streaming, dual timeouts protect against stuck connections:
  • Inter-token timeout: Detects stalled streams (no data for N seconds)
  • Total timeout: Per-model maximum (10-30 minutes for reasoning models like o1, o3, GPT-5)
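The same dual-timeout scheme can be applied client-side. A minimal sketch, assuming tokens arrive on a queue-like source with a `None` end-of-stream sentinel (both assumptions, not part of the wire protocol):

```python
import queue
import time

def read_stream(tokens: "queue.Queue",
                inter_token_timeout: float,
                total_timeout: float):
    """Consume tokens under dual timeouts, as described above.

    Raises TimeoutError either when no token arrives within the
    inter-token timeout (stall) or when the total deadline is exceeded.
    """
    deadline = time.monotonic() + total_timeout
    out = []
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("total timeout exceeded")
        try:
            # Wait at most the inter-token timeout, bounded by the deadline.
            tok = tokens.get(timeout=min(inter_token_timeout, remaining))
        except queue.Empty:
            raise TimeoutError("stream stalled: no data within inter-token timeout")
        if tok is None:  # end-of-stream sentinel
            return out
        out.append(tok)
```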

WebSocket /api/v1/ws

Persistent WebSocket connection for streaming. Multiple concurrent requests can be multiplexed over a single connection; responses are correlated by request_id.

Connection

ws://ensemble:8080/api/v1/ws

Authentication

Send API key in first message:
{"type": "auth", "api_key": "ens_your_key"}

Request Message

{
  "type": "request",
  "request_id": "req_123",
  "model": "claude-sonnet-4-20250514",
  "messages": [...],
  "max_tokens": 4096,
  "session_id": "sess_abc"
}

Response Messages

{"type": "block", "request_id": "req_123", "block": {"type": "text", "text": "Hello"}}
{"type": "done", "request_id": "req_123", "response": {...}}
{"type": "error", "request_id": "req_123", "error": {"code": 429, "message": "Rate limited"}}

Streaming Coalescence

For high-frequency token streams, Ensemble supports optional coalescence to reduce network overhead:
server:
  coalescence_window: 100ms  # Buffer tokens for 100ms before sending
This batches multiple tokens into single SSE events, reducing the event rate from ~100/s to ~10/s while maintaining the same total throughput.
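The batching effect can be illustrated offline. This sketch (timestamps in integer milliseconds to keep the arithmetic exact; not the server's implementation) merges tokens whose arrival times fall in the same coalescence window:

```python
def coalesce(events, window_ms: int):
    """Batch (timestamp_ms, token) events into windows of window_ms.

    Illustrative sketch of the coalescence described above: tokens
    arriving within one window are merged into a single event.
    """
    batches = []
    current, window_end = [], None
    for ts, tok in events:
        if window_end is None or ts >= window_end:
            if current:
                batches.append("".join(current))
            current, window_end = [], ts + window_ms
        current.append(tok)
    if current:
        batches.append("".join(current))
    return batches

# Ten tokens arriving 20 ms apart, coalesced with a 100 ms window,
# collapse into two batched events:
ticks = [(i * 20, f"t{i} ") for i in range(10)]
batches = coalesce(ticks, 100)
```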