Ensemble normalizes multimodal content across providers, handling format differences transparently.

Provider Capabilities

| Provider | Images | PDFs | Audio | Video |
|---|---|---|---|---|
| Anthropic (Claude Opus/Sonnet/Haiku) | Yes | Yes (document blocks) | No | No |
| OpenAI (GPT-5 series) | Yes | Yes (Responses API) | No | No |
| Gemini (2.5/3.0/3.1 Pro/Flash) | Yes | Yes | Yes (mp3, wav, etc.) | Yes (mp4, 300s max) |
| Gemini (image models) | Image generation (1024x1024) | No | No | No |
| Gemini (Veo 3.1) | No | No | No | Video generation (8–60s, up to 4K) |
| xAI (Grok 4/4.1/4.20) | Yes | No | No | No |
| Fireworks (Kimi K2.5) | Yes | No | No | Yes (input only) |
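
A capability matrix like the one above can be checked before a request is sent, so that unsupported media is caught early. The following is a minimal sketch, not part of Ensemble's API: the provider keys, the "video" block type, and the unsupported_blocks helper are hypothetical names chosen for illustration, and only input support from the table is encoded.

```python
# Input capabilities transcribed from the table above. The generation-only
# rows (Gemini image models, Veo 3.1) are omitted. Provider keys are
# hypothetical identifiers, not names defined by Ensemble.
CAPABILITIES = {
    "anthropic": {"image", "document"},
    "openai": {"image", "document"},
    "gemini": {"image", "document", "audio", "video"},
    "xai": {"image"},
    "fireworks": {"image", "video"},
}

# Only these block types are checked; any other block type is left alone.
MEDIA_TYPES = {"image", "document", "audio", "video"}


def unsupported_blocks(provider, content_blocks):
    """Return the media blocks that the given provider cannot accept."""
    supported = CAPABILITIES.get(provider, set())
    return [
        block
        for block in content_blocks
        if block["type"] in MEDIA_TYPES and block["type"] not in supported
    ]


blocks = [{"type": "image"}, {"type": "audio"}]
print(unsupported_blocks("xai", blocks))     # the audio block is rejected
print(unsupported_blocks("gemini", blocks))  # nothing is rejected
```

Keeping the matrix as plain data makes it easy to update when a provider adds a modality.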

Content Blocks

Multimodal content is sent as content_blocks in messages:

Image

{
  "type": "image",
  "source": {
    "type": "base64",
    "media_type": "image/png",
    "data": "iVBORw0KGgo..."
  }
}

Document (PDF)

{
  "type": "document",
  "source": {
    "type": "base64",
    "media_type": "application/pdf",
    "data": "JVBERi0xLjQ..."
  }
}

Audio (Gemini only)

{
  "type": "audio",
  "source": {
    "type": "base64",
    "media_type": "audio/mp3",
    "data": "..."
  }
}
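
The three block shapes above differ only in their type and media_type, so they can be built by one helper. This is a sketch using only the Python standard library; make_content_block is a hypothetical helper name, not a function provided by Ensemble.

```python
import base64

# Maps a media-type prefix to the block type used in the examples above.
_BLOCK_TYPES = {
    "image/": "image",
    "application/pdf": "document",
    "audio/": "audio",
}


def make_content_block(data: bytes, media_type: str) -> dict:
    """Wrap raw bytes in a base64 content block of the matching type."""
    for prefix, block_type in _BLOCK_TYPES.items():
        if media_type.startswith(prefix):
            return {
                "type": block_type,
                "source": {
                    "type": "base64",
                    "media_type": media_type,
                    "data": base64.b64encode(data).decode("ascii"),
                },
            }
    raise ValueError(f"unsupported media type: {media_type}")


# A PDF file begins with the bytes "%PDF-", which is why the base64 data in
# the document example starts with "JVBERi0x".
block = make_content_block(b"%PDF-1.4", "application/pdf")
print(block["type"], block["source"]["data"])  # document JVBERi0xLjQ=
```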

Provider-Specific Handling

Ensemble translates content blocks to each provider’s native format:
  • Anthropic: Uses document blocks with cache control for PDFs
  • OpenAI: Converts to Responses API format for PDFs
  • Gemini: Native multimodal parts with media resolution options
  • xAI (Grok): Images only, via the standard image_url format
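
To illustrate what this kind of translation involves, the sketch below converts an image content block into two provider-style shapes: an image_url part carrying a data URL, and a Gemini-style inline_data part. It is an assumption about the general approach, not Ensemble's actual implementation, and both function names are hypothetical.

```python
def to_image_url_part(block: dict) -> dict:
    """Convert an image content block to an image_url part with a data URL."""
    src = block["source"]
    url = f"data:{src['media_type']};base64,{src['data']}"
    return {"type": "image_url", "image_url": {"url": url}}


def to_gemini_part(block: dict) -> dict:
    """Convert a content block to a Gemini-style inline_data part."""
    src = block["source"]
    return {"inline_data": {"mime_type": src["media_type"], "data": src["data"]}}


image = {
    "type": "image",
    "source": {"type": "base64", "media_type": "image/png", "data": "iVBORw0KGgo="},
}
print(to_image_url_part(image)["image_url"]["url"])
print(to_gemini_part(image)["inline_data"]["mime_type"])
```

The base64 payload is passed through unchanged in both cases; only the envelope around it differs.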

Prompt Caching with Multimodal Content

Anthropic supports prompt caching for document blocks. Ensemble automatically sets cache_control on large documents to enable caching:

{
  "type": "document",
  "source": {"type": "base64", "media_type": "application/pdf", "data": "..."},
  "cache_control": {"type": "ephemeral"}
}
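
The size check behind this behavior might look like the following sketch. The threshold value and the with_cache_control helper are hypothetical; the cutoff Ensemble actually uses for a "large" document is not stated here.

```python
# Hypothetical threshold; the real cutoff used by Ensemble is not documented here.
LARGE_DOCUMENT_BYTES = 100_000


def with_cache_control(block: dict, threshold: int = LARGE_DOCUMENT_BYTES) -> dict:
    """Return a copy of a large document block with cache_control set."""
    if block.get("type") != "document":
        return block
    # Base64 encodes every 3 bytes as 4 characters, so estimate the raw size.
    approx_bytes = len(block["source"]["data"]) * 3 // 4
    if approx_bytes < threshold:
        return block
    # Build a new dict rather than mutating the caller's block.
    return {**block, "cache_control": {"type": "ephemeral"}}


small = {"type": "document", "source": {"type": "base64", "data": "A" * 400}}
large = {"type": "document", "source": {"type": "base64", "data": "A" * 400_000}}
print("cache_control" in with_cache_control(small))  # False
print("cache_control" in with_cache_control(large))  # True
```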