One brief. One plan. Text, image, video and audio delivered together.
An orchestrator that turns a brief into a multimodal production plan and runs each piece on the right provider, with live progress streamed in real time.
An orchestrator, not a pipeline you assemble by hand.
The user doesn't assemble a pipeline — they pick the platform, mark the desired modalities, and the agent decides what to produce in each slot. When they hit generate, the platform returns a complete production plan before spending a single credit.
Two-step flow: Plan → Execute.
Two SSE routes: the first returns a plan for the user to approve; the second runs each asset on the right provider and emits granular progress events.
Plan
POST /api/v1/multimodal/plan (SSE)
Returns an assets list describing each piece (type, prompt, suggested provider, order). The user sees the plan before executing.
Execute
POST /api/v1/multimodal/execute-plan (SSE)
- asset_start: asset N started generating
- asset_complete: done, returns asset_id
- asset_failed: failed, returns reason
- done: whole plan finished
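To show how a client might consume the execute-plan stream, here is a minimal sketch. It assumes standard SSE framing (an `event:` line, a JSON `data:` line, then a blank line) and hypothetical payload fields `index`, `asset_id` and `reason`. The page names the events but does not document the wire format, so treat these field names as illustrative only.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[tuple[str, dict]]:
    """Yield (event_name, payload) pairs from raw SSE lines.

    Assumption: each event carries a JSON object in its `data:` field and is
    terminated by a blank line, as in standard SSE framing.
    """
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line closes the current event.
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())


def track_progress(events: Iterable[tuple[str, dict]]) -> dict:
    """Fold execute-plan events into a per-asset status map."""
    status: dict[int, dict] = {}
    finished = False
    for name, payload in events:
        if name == "asset_start":
            status[payload["index"]] = {"status": "generating"}
        elif name == "asset_complete":
            # `asset_id` is named in the source; `index` is a hypothetical key.
            status[payload["index"]] = {
                "status": "completed",
                "asset_id": payload["asset_id"],
            }
        elif name == "asset_failed":
            status[payload["index"]] = {
                "status": "failed",
                "reason": payload["reason"],
            }
        elif name == "done":
            finished = True
    return {"assets": status, "finished": finished}
```

In a real client the lines would come from a streaming HTTP response rather than a list. The fold stays the same, which makes it easy to drive a progress UI or an automation from one function.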
Platform ↔ modality: coupling that kills invalid configurations.
In Create Content the user marks modalities as chips — text always on, image, video and audio optional. The filter only shows modalities valid for the selected platform.
| Platform | Allowed modalities |
|---|---|
| Copy | text |
| Creative Brief | text |
| | text + image (no video) |
| Video | text + image + video + audio |
| Design | text + image |
The UI also reads the enabled_providers list from the backend and disables modalities without a configured provider, with a tooltip "Configure a <modality> provider in Settings first". No more click-and-break.
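The coupling described above can be sketched as a lookup plus a provider check. This is a sketch under assumptions, not the product's code: the shape of `enabled_providers` (a mapping from modality to provider names) and the function name `selectable_modalities` are hypothetical, and only the platform rows whose names appear in the table are included.

```python
# Platform -> modalities, copied from the rows of the table that name a platform.
ALLOWED_MODALITIES = {
    "Copy": ["text"],
    "Creative Brief": ["text"],
    "Video": ["text", "image", "video", "audio"],
    "Design": ["text", "image"],
}


def selectable_modalities(
    platform: str, enabled_providers: dict[str, list[str]]
) -> dict[str, bool]:
    """Return the chips to show for a platform and whether each is enabled.

    Modalities not valid for the platform are omitted entirely. A valid
    modality is disabled when no provider is configured for it. Text is
    treated as always on, as the source states.
    """
    chips = {}
    for modality in ALLOWED_MODALITIES[platform]:
        has_provider = bool(enabled_providers.get(modality))
        chips[modality] = modality == "text" or has_provider
    return chips


def tooltip(modality: str) -> str:
    """Tooltip text for a disabled chip, following the wording in the source."""
    return f"Configure a {modality} provider in Settings first"
```

For example, `selectable_modalities("Copy", {...})` never contains a video key, so a video request on a Copy brief cannot be expressed at all.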
The real providers already wired into the code.
| Modality | Providers |
|---|---|
| Image | DALL-E 3, DALL-E 2, GPT Image, Gemini Imagen 3, Gemini Imagen 3 Fast |
| Video | Gemini Veo |
| Audio | Audio/speech providers plugged in via enabled_providers (Whisper for voice transcription in chat) |
| Text | Claude (Code / Anthropic API), OpenAI, configurable fallback |
Every asset is a record. Every record can be regenerated without redoing the rest.
Each asset becomes a ContentAsset record with content_id, asset_type, provider, file_url/base64, prompt, position and status (pending / generating / completed / failed).
The final content reads assets ordered by position. If an asset fails or the result isn't good, the user regenerates that specific asset via a dedicated route — no need to redo the entire content.
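As an illustration of the record model, the sketch below uses a dataclass with the fields listed above. The field types, the `ordered` helper and the `mark_for_regeneration` helper are assumptions made for the example. The source mentions a dedicated regeneration route but does not describe how it works.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ContentAsset:
    """One generated piece of a content item (field types are assumed)."""
    content_id: str
    asset_type: str          # e.g. "text", "image", "video", "audio"
    provider: str
    prompt: str
    position: int
    status: str = "pending"  # pending / generating / completed / failed
    file_url: Optional[str] = None
    base64: Optional[str] = None


def ordered(assets: list[ContentAsset]) -> list[ContentAsset]:
    """Assemble the final content by reading assets in position order."""
    return sorted(assets, key=lambda a: a.position)


def mark_for_regeneration(
    assets: list[ContentAsset], position: int
) -> list[ContentAsset]:
    """Reset only the asset at `position`; every other record is untouched."""
    return [
        replace(a, status="pending", file_url=None, base64=None)
        if a.position == position
        else a
        for a in assets
    ]
```

Because each asset is its own record, a failed image can be retried while the completed text and video records keep their outputs.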
Three things that make a difference day-to-day.
Plan → Approve → Execute
The user sees what will be produced before spending credits. Human approval lives inside the flow, not outside it.
Real-time progress via SSE
Not fire-and-forget. Each asset emits start, complete or failed events, and the plan emits done when it finishes, so the UI and automations follow the plan live.
Platform ↔ modality coupling
Invalid combinations are killed before submit. The user can't ask for video on a Copy brief.
Want to see this running on your own pipeline?
We'll show you in a quick demo, using data you already work with.