Multimodal Assets

One brief. One plan. Text, image, video and audio delivered together.

An orchestrator that turns a brief into a multimodal production plan, runs each piece on the right provider, and streams progress in real time.

What it is

An orchestrator, not a pipeline you assemble by hand.

The user picks the platform and marks the desired modalities, and the agent decides what to produce in each slot. When they hit generate, the platform returns a complete production plan before a single credit is spent.

How it works

Two-step flow: Plan → Execute.

Two SSE routes: the first returns a plan for the user to approve; the second runs each asset on the right provider and emits granular progress events.

Plan

POST /api/v1/multimodal/plan (SSE)

The agent reads the brief, the platform and the marked modalities, then returns a production plan as JSON: the text body plus an assets list describing each piece (type, prompt, suggested provider, order). The user reviews the plan before executing.
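As a rough illustration, a plan could be modeled like the sketch below. The page only names what the plan contains (a text body, plus type, prompt, suggested provider and order for each asset); the JSON key names `text`, `assets`, `type`, `prompt`, `provider` and `order`, and the sample values, are assumptions and not the documented schema.

```python
import json
from dataclasses import dataclass


@dataclass
class PlannedAsset:
    type: str       # modality of the piece, e.g. "image", "video", "audio"
    prompt: str     # generation prompt the agent wrote for this piece
    provider: str   # provider suggested by the agent
    order: int      # where the piece sits in the final content


@dataclass
class ProductionPlan:
    text: str
    assets: list[PlannedAsset]


def parse_plan(raw: str) -> ProductionPlan:
    """Parse a plan JSON document (hypothetical key names) into typed objects."""
    data = json.loads(raw)
    assets = [PlannedAsset(**item) for item in data["assets"]]
    assets.sort(key=lambda a: a.order)  # present assets in plan order
    return ProductionPlan(text=data["text"], assets=assets)


# Hypothetical plan payload, as it might arrive from the plan route.
sample = json.dumps({
    "text": "Launch copy for the spring campaign.",
    "assets": [
        {"type": "video", "prompt": "10-second product teaser",
         "provider": "Gemini Veo", "order": 2},
        {"type": "image", "prompt": "Hero shot on a white background",
         "provider": "DALL-E 3", "order": 1},
    ],
})

plan = parse_plan(sample)
for asset in plan.assets:
    print(asset.order, asset.type, asset.provider)
```

Typing the plan on the client side makes the approval step easy to render: the UI can list each asset in order and let the user confirm before anything is executed.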

Execute

POST /api/v1/multimodal/execute-plan (SSE)

The backend fires each asset at the right provider and emits granular progress events. Nothing is fire-and-forget: the user sees exactly where the plan stands at every moment.

SSE events

asset_start — asset N started generating
asset_complete — done, returns asset_id
asset_failed — failed, returns reason
done — whole plan finished
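A client consuming these events might look like the minimal sketch below. It assumes the stream uses the standard SSE `event:` and `data:` fields with a JSON payload; the payload key `index` is hypothetical, while `asset_id` and `reason` follow the event descriptions above.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (event_name, data) pairs from the lines of an SSE stream."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":                      # a blank line dispatches the event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):          # comment or keep-alive line
            continue
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].removeprefix(" "))


def summarize(lines: Iterable[str]) -> tuple[dict[int, str], bool]:
    """Track per-asset status and whether the whole plan has finished."""
    status: dict[int, str] = {}
    finished = False
    for event, data in iter_sse_events(lines):
        payload = json.loads(data)
        if event == "asset_start":
            status[payload["index"]] = "generating"
        elif event == "asset_complete":
            status[payload["index"]] = "completed"
        elif event == "asset_failed":
            status[payload["index"]] = "failed"
        elif event == "done":
            finished = True
    return status, finished


# Hypothetical stream: asset 0 succeeds, asset 1 fails, then the plan closes.
stream = [
    "event: asset_start", 'data: {"index": 0}', "",
    "event: asset_complete", 'data: {"index": 0, "asset_id": "a1"}', "",
    "event: asset_start", 'data: {"index": 1}', "",
    "event: asset_failed", 'data: {"index": 1, "reason": "provider timeout"}', "",
    "event: done", "data: {}", "",
]

status, finished = summarize(stream)
```

In a real client the `lines` iterable would come from the open HTTP response rather than a list, but the event handling stays the same.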

In the UI

Platform ↔ modality: coupling that kills invalid configurations.

In Create Content, the user marks modalities as chips: text is always on, while image, video and audio are optional. The filter only shows modalities valid for the selected platform.

Platform	Allowed modalities
Copy	text
Creative Brief	text
Email	text + image (no video)
Video	text + image + video + audio
Design	text + image

The UI also reads the enabled_providers list from the backend and disables modalities without a configured provider, with a tooltip "Configure a <modality> provider in Settings first". No more click-and-break.
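The two rules above (platform filter, then provider check) can be sketched as a single function. The platform table is copied from this page; the shape of `enabled_providers` (a mapping from modality to a list of provider names) and the chip dictionary keys are assumptions made for illustration.

```python
from typing import Optional

# Allowed modalities per platform, as listed in the table above.
PLATFORM_MODALITIES = {
    "Copy": ["text"],
    "Creative Brief": ["text"],
    "Email": ["text", "image"],
    "Video": ["text", "image", "video", "audio"],
    "Design": ["text", "image"],
}


def modality_chips(platform: str, enabled_providers: dict[str, list[str]]) -> list[dict]:
    """Return one chip per modality valid for the platform.

    A chip is disabled, with a tooltip, when no provider is configured
    for its modality. Text is treated as always on.
    """
    chips = []
    for modality in PLATFORM_MODALITIES[platform]:
        has_provider = modality == "text" or bool(enabled_providers.get(modality))
        tooltip: Optional[str] = None
        if not has_provider:
            tooltip = f"Configure a {modality} provider in Settings first"
        chips.append({"modality": modality, "enabled": has_provider, "tooltip": tooltip})
    return chips


# Hypothetical backend response: only an image provider is configured.
chips = modality_chips("Video", {"image": ["DALL-E 3"], "video": []})
```

Modalities that are invalid for the platform never appear at all, while valid ones without a provider appear disabled, which matches the two behaviors described above.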

Providers

The real providers already wired into the code.

Modality	Providers
Image	DALL-E 3, DALL-E 2, GPT Image, Gemini Imagen 3, Gemini Imagen 3 Fast
Video	Gemini Veo
Audio	Audio/speech providers plugged in via `enabled_providers` (Whisper for voice transcription in chat)
Text	Claude (Code / Anthropic API), OpenAI, configurable fallback

Persistence

Every asset is a record. Every record can be regenerated without redoing the rest.

Each asset becomes a ContentAsset record with content_id, asset_type, provider, file_url/base64, prompt, position and status (pending / generating / completed / failed).

The final content reads assets ordered by position. If an asset fails or the result isn't good, the user regenerates that specific asset via a dedicated route, with no need to redo the rest of the content.
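The record and the single-asset regeneration idea can be sketched as follows. The field names come from the description above; modeling `file_url` and `base64` as two optional fields, and the helper names `ordered_assets` and `mark_for_regeneration`, are assumptions, not the actual backend code.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ContentAsset:
    content_id: str
    asset_type: str
    provider: str
    prompt: str
    position: int
    status: str = "pending"          # pending / generating / completed / failed
    file_url: Optional[str] = None
    base64: Optional[str] = None


def ordered_assets(assets: list[ContentAsset], content_id: str) -> list[ContentAsset]:
    """Assets of one content item, in the order the final content reads them."""
    return sorted(
        (a for a in assets if a.content_id == content_id),
        key=lambda a: a.position,
    )


def mark_for_regeneration(
    assets: list[ContentAsset], content_id: str, position: int
) -> list[ContentAsset]:
    """Reset only the asset at `position`; every other record is left untouched."""
    return [
        replace(a, status="pending", file_url=None, base64=None)
        if a.content_id == content_id and a.position == position
        else a
        for a in assets
    ]


# Hypothetical records: the image completed, the video failed.
assets = [
    ContentAsset("c1", "video", "Gemini Veo", "10-second teaser", 2, status="failed"),
    ContentAsset("c1", "image", "DALL-E 3", "Hero shot", 1,
                 status="completed", file_url="https://example.com/hero.png"),
]

retried = mark_for_regeneration(assets, "c1", position=2)
```

Because each asset carries its own prompt, provider and status, retrying the failed video leaves the completed image and its file untouched.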

Highlights

Three things that make a difference day-to-day.

Plan → Approve → Execute

The user sees what will be produced before spending credits. Human approval lives inside the flow, not outside it.

Real-time progress via SSE

Not fire-and-forget. Every asset emits start, then complete or failed, and a final done event closes the plan, so the UI and automations see the plan live.

Platform ↔ modality coupling

Invalid combinations are killed before submit. The user can't ask for video on a Copy brief.

Want to see this running on your own pipeline?

We'll show you in a quick demo, using data you already work with.
