
Grounding references in multi-image prompts

A one-line prompt rewrite that moves identity from 90% to 98% across every image provider — what we learned running 50 scenarios.

Blooper Team June 23, 2026 · 9 min read

Multi-image generation pipelines all run into the same problem at some point. We hit it building Blooper, where users @-mention characters, settings, and props inside a single chat prompt — the system has to weave the names back into a single string and hope the model lines them up with the right images. With one or two refs, it does. With three or more, it stops: the character’s outfit ends up on the background, left and right swap, a fourth person walks in who wasn’t anywhere in the inputs.

Is that a model-capability ceiling or a prompt-shape problem? We ran a controlled test, and it was overwhelmingly the second.

The setup

50 scenarios, 1–5 reference images each (characters, settings, props), across three image providers: NANO_BANANA (Gemini’s image model), OPENAI_IMAGE (gpt-image-2), and FLUX2_MAX. To keep the test honest, every reference got a nonsense label — Pib, Tverg, Krell, Wamo, Suda. The label gives nothing away about appearance, and the order doesn’t either, so the only way the model can tie a name to a picture is through the attachment itself.

Pib: blue robot (right)
Krell: brown teddy bear (left)
Fendle: green forest (setting)
Quib: yellow heart balloon (prop)
One scenario’s four references — nonsense labels, mixed roles, positional constraints embedded in the prompt.

We compared four ways of writing the prompt:

method | prompt shape
V1, raw @ | Draw @Tverg on the right, @Pib on the left, at @Suda.
V2, strip | Draw Tverg on the right, Pib on the left, at Suda.
V3, role + image N | Draw Tverg (the character in image 1) on the right, Pib (the character in image 2) on the left, at Suda (the setting in image 3).
V4, multipart | Interleaved [Character — Tverg:] caption directly before each image, then the instruction. A structural change to the API request: only providers that natively read multipart prompts can use it (today: Gemini).

A separate Gemini judge at temperature 0 scored each output on identity (the right things showing up) and positions (in the right places).
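A minimal sketch of how per-output judge verdicts could be rolled up into the two headline metrics. The post does not spell out the aggregation, so two things here are assumptions: that "identity" means both characters and objects pass, and that a dimension which does not apply to a scenario counts as a pass. The field names and the sample verdicts are illustrative, not the experiment's data.

```python
from typing import Optional

# One judge verdict per output. True = pass, False = fail,
# None = dimension not applicable (e.g. a scenario with no prop).
Verdict = dict[str, Optional[bool]]


def ok(mark: Optional[bool]) -> bool:
    """A dimension counts against the output only if it explicitly failed."""
    return mark is not False


def identity_pass(v: Verdict) -> bool:
    # Assumption: "identity" means characters and objects are both right.
    return ok(v["characters"]) and ok(v["objects"])


def full_pass(v: Verdict) -> bool:
    # Assumption: "identity + positions" requires all three dimensions.
    return identity_pass(v) and ok(v["positions"])


def pass_rate(verdicts: list[Verdict], check) -> float:
    """Percentage of verdicts for which check(verdict) is true."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return 100.0 * sum(check(v) for v in verdicts) / len(verdicts)


# Illustrative verdicts only, not the experiment's data.
sample = [
    {"characters": True, "objects": True, "positions": False},
    {"characters": True, "objects": None, "positions": True},
    {"characters": True, "objects": False, "positions": False},
    {"characters": True, "objects": True, "positions": True},
]
print(pass_rate(sample, identity_pass))  # 75.0
print(pass_rate(sample, full_pass))      # 50.0
```

Treating "not applicable" as a pass keeps scenarios without a prop from dragging the identity score down; a stricter variant could drop those scenarios from the denominator instead.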

Headline numbers

metric | V1 raw | V2 strip | V3 role+image | V4 multipart
Identity | 86% | 90% | 98% | 100%
Identity + positions | 66% | 68% | 78% | 100%

Each comparison tile below carries the judge’s pass marks: P = characters, O = objects/props, ⇆ = positions. ✓ = pass, ✗ = fail. A dash means the dimension isn’t applicable to that scenario.

V1 raw @ (OPENAI): P✓ O✓ ⇆✗
Prompt sent: Draw @Pib on the right, @Krell on the left, at @Fendle, with @Quib.

V2 strip (OPENAI): P✓ O✗ ⇆✗
Prompt sent: Draw Pib on the right, Krell on the left, at Fendle, with Quib.

V3 role+image N (OPENAI): P✓ O✓ ⇆✓
Prompt sent: Draw Pib (the character in image 1) on the right, Krell (the character in image 2) on the left, at Fendle (the setting in image 3), with Quib (the object in image 4). Use each reference image for the thing it depicts; keep each one’s exact appearance.

V4 multipart (Gemini multipart): P✓ O✓ ⇆✓
Multipart request: [Character — Pib:] [img] [Character — Krell:] [img] [Setting — Fendle:] [img] [Object — Quib:] [img] ⟶ Draw Pib on the right, Krell on the left, at Fendle, with Quib. Keep each one’s exact appearance; do not add extra characters.

Same four refs across four phrasings. V1, V2, V3 run on OPENAI_IMAGE — the scenario’s assigned provider. V4 switches to Gemini because the interleaved caption structure is a multipart request, not a text rewrite, and gpt-image-2 doesn’t read multipart. V1 puts the robot on the wrong side. V2 anthropomorphizes the balloon prop into a fourth character with arms. V3 lands clean. V4 also lands clean, on a different model.
A single phrase — “(the character in image N)” — moves identity from 90% to 98% across every provider, with no model change, no fine-tune, no architecture trick.

What surprised us

Stripping the @ isn’t a free win. On NANO_BANANA, raw @Tverg scored 100% on identity; bare Tverg dropped to 82%. On FLUX2_MAX, the opposite — raw was 75%, bare jumped to 100%. Same prompt change, opposite reactions from two different providers. V3 stops the argument: every provider lands at 94–100%.

Vuno: white cat (right)
Olwen: blue robot (left)
Pib: snowy mountain (setting)
Marn: yellow heart balloon (prop)

V1 raw @ (NANO_BANANA): P✓ O✓ ⇆✓
Prompt sent: Draw @Vuno on the right, @Olwen on the left, at @Pib, with @Marn.

V2 strip (NANO_BANANA): P✓ O✓ ⇆✓
Prompt sent: Draw Vuno on the right, Olwen on the left, at Pib, with Marn.

V3 role+image N (NANO_BANANA): P✓ O✓ ⇆✗
Prompt sent: Draw Vuno (the character in image 1) on the right, Olwen (the character in image 2) on the left, at Pib (the setting in image 3), with Marn (the object in image 4). Use each reference image for the thing it depicts; keep each one’s exact appearance.

V4 multipart (Gemini multipart): P✓ O✓ ⇆✓
Multipart request: [Character — Vuno:] [img] [Character — Olwen:] [img] [Setting — Pib:] [img] [Object — Marn:] [img] ⟶ Draw Vuno on the right, Olwen on the left, at Pib, with Marn. Keep each one’s exact appearance; do not add extra characters.

NANO_BANANA (which is Gemini under the hood), scenario #45 — single re-run. V1 nails everything and even paints the nonsense labels onto the canvas. V2 also lands the layout this time; across all 50 scenarios V2 dropped identity to 82% on NANO_BANANA, but a single sample isn’t going to show that. V3 keeps the identities but flips left and right. V4 runs on the same model as V1–V3 — the only thing that changes is the prompt structure, which is enough to fix the positions.

The breakage is multi-reference. With one or two refs, every method scored 100%. The gap opens at three refs (V2 91% vs. V3 100%) and widens at four (V2 86% vs. V3 100%). Single-character prompts don’t need the rewrite. Ensembles do.

Identity is the easy half. V3 hits 98% on identity but only 78% once positions also have to be right. “Pib on the left” still throws the model off even when it knows who Pib is. Knowing who doesn’t tell it where, and that’s a separate problem we haven’t cracked.

Drovo: white cat (center)
Marn: pink pig (right)
Fendle: green dinosaur (left)
Bompf: city street (setting)

V1 raw @ (FLUX2_MAX): P✓ O— ⇆✗
Prompt sent: Draw @Drovo on the center, @Marn on the right, @Fendle on the left, at @Bompf.

V2 strip (FLUX2_MAX): P✓ O— ⇆✗
Prompt sent: Draw Drovo on the center, Marn on the right, Fendle on the left, at Bompf.

V3 role+image N (FLUX2_MAX): P✓ O— ⇆✓
Prompt sent: Draw Drovo (the character in image 1) on the center, Marn (the character in image 2) on the right, Fendle (the character in image 3) on the left, at Bompf (the setting in image 4). Use each reference image for the thing it depicts; keep each one’s exact appearance.

V4 multipart (Gemini multipart): P✓ O— ⇆✓
Multipart request: [Character — Drovo:] [img] [Character — Marn:] [img] [Character — Fendle:] [img] [Setting — Bompf:] [img] ⟶ Draw Drovo on the center, Marn on the right, Fendle on the left, at Bompf. Keep each one’s exact appearance; do not add extra characters.

Same ladder on FLUX2_MAX with a different ref bundle (no object slot in this scenario, so O is N/A). V1 and V2 both produce all three subjects in the city, but they come out cat-left / dino-center, swapping the expected cat-center / dino-left. V3 puts the dino on the left, cat in the middle, and pig on the right. V4 (Gemini, same refs) also lands clean.

Why this works

A multi-image model sees a flat list of attachments and a flat string of text. With a nonsense label and no other structure, it has to guess which name maps to which picture. “(The character in image N)” hands it two things at once: a role, so it knows the image is a character and not a background, and an index, so it knows which attachment to look at. That’s the whole trick.
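The role-plus-index rewrite can be sketched as a small text transform. This is an illustration under assumptions, not Blooper's actual code: the `Ref` type, the function names, and the `@(\w+)` mention pattern are hypothetical, and image numbers are taken from the order in which the references are attached. The trailing instruction is the sentence that appears in the V3 prompts shown above.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    name: str  # nonsense label, e.g. "Pib"
    role: str  # "character", "setting", or "object"


# Assumption: a mention is "@" followed by word characters.
MENTION = re.compile(r"@(\w+)")

V3_SUFFIX = (
    " Use each reference image for the thing it depicts;"
    " keep each one's exact appearance."
)


def strip_mentions(prompt: str) -> str:
    """V2: drop the @ and keep the bare name."""
    return MENTION.sub(r"\1", prompt)


def ground_mentions(prompt: str, refs: list[Ref]) -> str:
    """V3: replace each @Name with 'Name (the <role> in image N)'."""
    # Image numbers are 1-based and follow attachment order.
    index = {ref.name: (n, ref.role) for n, ref in enumerate(refs, start=1)}

    def expand(match: re.Match) -> str:
        name = match.group(1)
        if name not in index:
            return name  # unknown mention: fall back to the bare name
        n, role = index[name]
        return f"{name} (the {role} in image {n})"

    return MENTION.sub(expand, prompt) + V3_SUFFIX


refs = [Ref("Tverg", "character"), Ref("Pib", "character"), Ref("Suda", "setting")]
raw = "Draw @Tverg on the right, @Pib on the left, at @Suda."
print(strip_mentions(raw))
print(ground_mentions(raw, refs))
```

The one property worth preserving in any real implementation is that the number in "image N" matches the position of the attachment in the request; if the two drift apart, the index points the model at the wrong picture.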

V4 goes one step further. Instead of mentioning “image N” inside a flat prompt, it interleaves each ref’s caption directly with the image, in a multipart format. That’s a structural change to the API request, not a rewording, so it only works on providers that natively read multipart prompts. Today that means Gemini, which is what NANO_BANANA is under the hood. OpenAI’s gpt-image-2 edits endpoint and BFL’s FLUX both flatten the image array before reading the prompt, so the interleaving never reaches them. On Gemini, V4 scores a perfect 100% / 100% (identity / identity + positions), up from V3’s headline 98% / 78%.
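A provider-neutral sketch of the interleaving, assuming the request can be modeled as an ordered list of text and image parts. `TextPart`, `ImagePart`, `ImageRef`, and `build_multipart` are hypothetical names, not any SDK's types; mapping the list onto a real provider's request object is deliberately left out. The caption format and the closing instruction follow the V4 requests shown above.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class ImageRef:
    name: str    # nonsense label, e.g. "Pib"
    role: str    # "character", "setting", or "object"
    data: bytes  # encoded image bytes


@dataclass(frozen=True)
class TextPart:
    text: str


@dataclass(frozen=True)
class ImagePart:
    data: bytes


Part = Union[TextPart, ImagePart]

V4_SUFFIX = " Keep each one's exact appearance; do not add extra characters."


def build_multipart(instruction: str, refs: list[ImageRef]) -> list[Part]:
    """Interleave a caption before each image, then append the instruction."""
    parts: list[Part] = []
    for ref in refs:
        # \u2014 is the dash used in the post's caption format.
        parts.append(TextPart(f"[{ref.role.capitalize()} \u2014 {ref.name}:]"))
        parts.append(ImagePart(ref.data))
    # The instruction uses bare names; the captions carry the grounding.
    parts.append(TextPart(instruction + V4_SUFFIX))
    return parts


parts = build_multipart(
    "Draw Pib on the right, Krell on the left.",
    [ImageRef("Pib", "character", b"img-a"), ImageRef("Krell", "character", b"img-b")],
)
for part in parts:
    print(type(part).__name__, getattr(part, "text", "<image bytes>"))
```

Because each caption sits immediately before its image, the name-to-picture link no longer depends on the model counting attachments, which is the step the text-only rewrite still relies on.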

Takeaway

Among the plain-text variants, neither V1 nor V2 is right for every image provider: V1 wins on NANO_BANANA, V2 wins on FLUX. Anyone shipping multi-image generation across more than one backend either branches per provider or finds a phrasing every backend can read. V3 is that phrasing: eight extra points of identity over V2 for one short parenthetical per reference, no architecture work involved.

The broader thing we learned: image models reward structure more than clever wording. The whole gap closed once we handed the model a role and an index — worth trying anywhere you’re stitching multiple references into a single prompt. V3 is what Blooper runs on every @-mention today; V4 layers on automatically when the provider is Gemini. Identity is mostly solved. Positions still aren’t, except where the API gives us multipart.
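The routing rule in that last paragraph is small enough to sketch. The capability table below is a hypothetical structure keyed by the provider labels used in this post, and `pick_method` is an illustrative name; the fallback to V3 for unknown providers is an assumption.

```python
# Hypothetical capability table, keyed by the provider labels used above.
SUPPORTS_MULTIPART = {
    "NANO_BANANA": True,    # Gemini under the hood, reads interleaved parts
    "OPENAI_IMAGE": False,
    "FLUX2_MAX": False,
}


def pick_method(provider: str) -> str:
    """Return 'V4' where interleaved captions are readable, else 'V3'."""
    # Assumption: unknown providers fall back to V3, the text-only rewrite.
    return "V4" if SUPPORTS_MULTIPART.get(provider, False) else "V3"


for name in ("NANO_BANANA", "OPENAI_IMAGE", "FLUX2_MAX"):
    print(name, pick_method(name))
```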