Ensemble supports two streaming protocols. Both deliver the same unified EventBlock format.

POST /api/v1/stream (SSE)

Server-Sent Events streaming. The request body is identical to /api/v1/generate.

Headers

Same as /generate, plus:
Header    Value               Description
Accept    text/event-stream   Required for SSE

Event Stream Format

event: block
data: {"type":"text","text":"Hello"}

event: block
data: {"type":"text","text":" world!"}

event: done
data: {"id":"req_abc","model":"claude-sonnet-4","blocks":[...],"input_tokens":25,"output_tokens":12,"cost":"0.000111"}
Event types:
  • block — Incremental content block (text delta, tool call, thinking)
  • done — Complete InferenceResponse with token counts and cost
  • error — Error event with classification
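The event framing above can be consumed with a small parser. This is a minimal sketch (not a full SSE implementation): it assumes one `data:` line per event, as in the examples, and ignores comments and retry fields.

```python
import json

def parse_sse(raw: str):
    """Parse SSE text into (event_type, payload) tuples.

    Minimal sketch: assumes a single `data:` line per event, as in the
    /api/v1/stream examples above.
    """
    events = []
    event_type = None
    for line in raw.splitlines():
        if line.startswith("event:"):
            event_type = line[len("event:"):].strip()
        elif line.startswith("data:"):
            payload = json.loads(line[len("data:"):].strip())
            events.append((event_type, payload))
    return events

stream = (
    "event: block\n"
    'data: {"type":"text","text":"Hello"}\n'
    "\n"
    "event: block\n"
    'data: {"type":"text","text":" world!"}\n'
)
# Accumulate text deltas from `block` events:
text = "".join(
    b["text"] for kind, b in parse_sse(stream)
    if kind == "block" and b["type"] == "text"
)
```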

Preflight Validation

Before streaming begins, Ensemble performs preflight checks:
  • API key validation
  • Rate limit check
  • Parameter validation
  • Provider health check
If preflight fails, an error response is returned as a normal HTTP error (not a stream event), enabling clean client retry.
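Because preflight failures arrive as plain HTTP errors rather than stream events, a client can decide whether to retry before any stream state exists. The policy below is a hypothetical client-side sketch (the status-to-check mapping is an assumption, not part of the Ensemble API):

```python
def should_retry(status: int) -> bool:
    """Hypothetical retry policy for a failed preflight.

    Assumes conventional status codes: 429 for rate limiting and 5xx for
    provider health are transient; 400/401 (bad parameters / bad API key)
    will not succeed on retry.
    """
    if status == 429 or 500 <= status < 600:
        return True
    return False
```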

Stall Detection

During streaming, dual timeouts protect against stuck connections:
  • Inter-token timeout: Detects stalled streams (no data for N seconds)
  • Total timeout: Per-model maximum (10-30 minutes for reasoning models like o1, o3, GPT-5)
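The same dual-timeout scheme can be applied client-side. A minimal sketch, assuming tokens arrive on a queue-like source with a `None` end-of-stream sentinel (both assumptions, not part of the wire protocol):

```python
import queue
import time

def read_stream(tokens: "queue.Queue",
                inter_token_timeout: float,
                total_timeout: float):
    """Consume tokens under dual timeouts, as described above.

    Raises TimeoutError either when no token arrives within the
    inter-token timeout (stall) or when the total deadline is exceeded.
    """
    deadline = time.monotonic() + total_timeout
    out = []
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("total timeout exceeded")
        try:
            # Wait at most the inter-token timeout, bounded by the deadline.
            tok = tokens.get(timeout=min(inter_token_timeout, remaining))
        except queue.Empty:
            raise TimeoutError("stream stalled: no data within inter-token timeout")
        if tok is None:  # end-of-stream sentinel
            return out
        out.append(tok)
```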

WebSocket /api/v1/ws

Persistent WebSocket connection for streaming. Multiple concurrent requests can be multiplexed over a single connection; responses are correlated by request_id.

Connection

ws://ensemble:8080/api/v1/ws

Authentication

Send API key in first message:
{"type": "auth", "api_key": "ens_your_key"}

Request Message

{
  "type": "request",
  "request_id": "req_123",
  "model": "claude-sonnet-4-20250514",
  "messages": [...],
  "max_tokens": 4096,
  "session_id": "sess_abc"
}

Response Messages

{"type": "block", "request_id": "req_123", "block": {"type": "text", "text": "Hello"}}
{"type": "done", "request_id": "req_123", "response": {...}}
{"type": "error", "request_id": "req_123", "error": {"code": 429, "message": "Rate limited"}}

Streaming Coalescence

For high-frequency token streams, Ensemble supports optional coalescence to reduce network overhead:
server:
  coalescence_window: 100ms  # Buffer tokens for 100ms before sending
This batches multiple tokens into single SSE events, reducing the event rate from ~100/s to ~10/s while maintaining the same total throughput.
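The batching effect can be illustrated offline. This sketch (timestamps in integer milliseconds to keep the arithmetic exact; not the server's implementation) merges tokens whose arrival times fall in the same coalescence window:

```python
def coalesce(events, window_ms: int):
    """Batch (timestamp_ms, token) events into windows of window_ms.

    Illustrative sketch of the coalescence described above: tokens
    arriving within one window are merged into a single event.
    """
    batches = []
    current, window_end = [], None
    for ts, tok in events:
        if window_end is None or ts >= window_end:
            if current:
                batches.append("".join(current))
            current, window_end = [], ts + window_ms
        current.append(tok)
    if current:
        batches.append("".join(current))
    return batches

# Ten tokens arriving 20 ms apart, coalesced with a 100 ms window,
# collapse into two batched events:
ticks = [(i * 20, f"t{i} ") for i in range(10)]
batches = coalesce(ticks, 100)
```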