Multimodal Support - iGent Concert

Ensemble normalizes multimodal content across providers, handling format differences transparently.

Provider Capabilities

Provider	Images	PDFs	Audio	Video
Anthropic (Claude Opus/Sonnet/Haiku)	Yes	Yes (document blocks)	No	No
OpenAI (GPT-5 series)	Yes	Yes (Responses API)	No	No
Gemini (2.5/3.0/3.1 Pro/Flash)	Yes	Yes	Yes (mp3, wav, etc.)	Yes (mp4, 300s max)
Gemini (image models)	Image generation (1024x1024)	No	No	No
Gemini (Veo 3.1)	No	No	No	Video generation (8–60s, up to 4K)
xAI (Grok 4/4.1/4.20)	Yes	No	No	No
Fireworks (Kimi K2.5)	Yes	No	No	Yes (input only)
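
The table can be read as a lookup from provider to accepted input types. The following is a minimal sketch of such a pre-flight check, not Ensemble's actual implementation: the `CAPABILITIES` map, the provider keys, the `unsupported_blocks` helper, and the `"video"` and `"text"` block type names are all assumptions made for illustration. Generation-only models (the Gemini image models and Veo) are left out because they do not accept these inputs.

```python
# Hypothetical capability map transcribed from the table above.
# Only input modalities are listed; generation-only models are omitted.
CAPABILITIES = {
    "anthropic": {"image", "document"},
    "openai": {"image", "document"},
    "gemini": {"image", "document", "audio", "video"},
    "xai": {"image"},
    "fireworks": {"image", "video"},
}


def unsupported_blocks(provider: str, blocks: list[dict]) -> list[dict]:
    """Return the blocks whose type the provider cannot accept as input."""
    supported = CAPABILITIES[provider]
    # Assumption: plain text blocks are accepted by every provider.
    return [
        block
        for block in blocks
        if block["type"] != "text" and block["type"] not in supported
    ]
```

A caller could run this before dispatching a request and either drop the rejected blocks or route the message to a provider that supports them.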

Multimodal content is sent as content_blocks in messages:

{
  "type": "image",
  "source": {
    "type": "base64",
    "media_type": "image/png",
    "data": "iVBORw0KGgo..."
  }
}

{
  "type": "document",
  "source": {
    "type": "base64",
    "media_type": "application/pdf",
    "data": "JVBERi0xLjQ..."
  }
}

{
  "type": "audio",
  "source": {
    "type": "base64",
    "media_type": "audio/mp3",
    "data": "..."
  }
}
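
Blocks of this shape can be built from raw bytes with the standard library. The sketch below assumes the block type follows from the MIME type family; the `make_block` helper is hypothetical and not part of any documented Ensemble API.

```python
import base64


def make_block(data: bytes, media_type: str) -> dict:
    """Wrap raw bytes in a base64 content block of the shape shown above."""
    # Assumed mapping from MIME type to block type, based on the three samples.
    if media_type.startswith("image/"):
        block_type = "image"
    elif media_type == "application/pdf":
        block_type = "document"
    elif media_type.startswith("audio/"):
        block_type = "audio"
    else:
        raise ValueError(f"no block type known for {media_type}")

    return {
        "type": block_type,
        "source": {
            "type": "base64",
            "media_type": media_type,
            # b64encode returns bytes; JSON needs a str.
            "data": base64.b64encode(data).decode("ascii"),
        },
    }
```

For example, `make_block(open("report.pdf", "rb").read(), "application/pdf")` would yield a document block, where `report.pdf` is a placeholder file name.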

Ensemble translates these content blocks to each provider’s native format.
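
As a rough illustration of what such a translation might involve for an image block, the sketch below maps the normalized shape to two provider-style payloads. The target shapes reflect a reading of the providers' public APIs (Gemini `inline_data` parts, OpenAI Responses `input_image` items with a data URL) and are assumptions here; the source does not show Ensemble's actual output, and the function names and provider keys are hypothetical.

```python
def to_gemini_part(block: dict) -> dict:
    """Assumed Gemini-style part: base64 payload under inline_data."""
    source = block["source"]
    return {
        "inline_data": {
            "mime_type": source["media_type"],
            "data": source["data"],
        }
    }


def to_openai_image(block: dict) -> dict:
    """Assumed OpenAI Responses-style item: image passed as a data URL."""
    source = block["source"]
    data_url = f"data:{source['media_type']};base64,{source['data']}"
    return {"type": "input_image", "image_url": data_url}


def translate_image(provider: str, block: dict) -> dict:
    """Dispatch an image block to a provider-specific shape."""
    if provider == "gemini":
        return to_gemini_part(block)
    if provider == "openai":
        return to_openai_image(block)
    if provider == "anthropic":
        # Assumption: the normalized block already matches Anthropic's shape.
        return block
    raise ValueError(f"no translation defined for {provider}")
```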

Anthropic supports prompt caching for document blocks. Ensemble automatically sets cache_control on large documents to enable caching:

{
  "type": "document",
  "source": {"type": "base64", "media_type": "application/pdf", "data": "..."},
  "cache_control": {"type": "ephemeral"}
}
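
The source does not say what counts as a large document, so the following sketch uses a made-up size threshold to show how the marker could be attached. `CACHE_THRESHOLD_CHARS` and `with_cache_control` are hypothetical names, and measuring the length of the base64 string is an assumption about how size might be judged.

```python
# Hypothetical cutoff; the real threshold is not stated in the source.
CACHE_THRESHOLD_CHARS = 100_000


def with_cache_control(block: dict, threshold: int = CACHE_THRESHOLD_CHARS) -> dict:
    """Return a copy of a large document block marked for ephemeral caching."""
    if block["type"] != "document":
        return block
    # Size is approximated by the length of the base64-encoded payload.
    if len(block["source"]["data"]) < threshold:
        return block
    # Build a new dict so the caller's block is not mutated.
    return {**block, "cache_control": {"type": "ephemeral"}}
```

Returning a copy keeps the original block reusable for providers that would not recognize a `cache_control` field.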