Grounding references in multi-image prompts
A one-line prompt rewrite that lifts identity accuracy from 90% to 98% across the three image providers we tested. Here is what we learned running 50 scenarios.
Multi-image generation pipelines all run into the same problem at some point. We hit it building Blooper, where users @-mention characters, settings, and props inside a single chat prompt — the system has to weave the names back into a single string and hope the model lines them up with the right images. With one or two refs, it does. With three or more, it stops: the character’s outfit ends up on the background, left and right swap, a fourth person walks in who wasn’t anywhere in the inputs.
Is that a model-capability ceiling or a prompt-shape problem? We ran a controlled test, and it was overwhelmingly the second.
The setup
50 scenarios, 1–5 reference images each (characters, settings, props), across three image providers: NANO_BANANA (Gemini’s image model), OPENAI_IMAGE (gpt-image-2), and FLUX2_MAX. To keep the test honest, every reference got a nonsense label — Pib, Tverg, Krell, Wamo, Suda. The label gives nothing away about appearance, and the order doesn’t either, so the only way the model can tie a name to a picture is through the attachment itself.




We compared four ways of writing the prompt:
| method | prompt shape |
|---|---|
| V1 — raw @ | Draw @Tverg on the right, @Pib on the left, at @Suda. |
| V2 — strip | Draw Tverg on the right, Pib on the left, at Suda. |
| V3 — role + image N | Draw Tverg (the character in image 1) on the right, Pib (the character in image 2) on the left, at Suda (the setting in image 3). |
| V4 — multipart | Interleaved [Character — Tverg:] caption directly before each image, then the instruction. A structural change to the API request — only providers that natively read multipart prompts can use it (today: Gemini). |
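The three flat-string variants (V1 to V3) can be sketched as one small rewrite function. This is a minimal illustration, not Blooper's actual implementation: the `Ref` type, the `render` function, and the `"v1"`/`"v2"`/`"v3"` method names are hypothetical, and it assumes image numbers follow attachment order starting at 1.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """One reference image (hypothetical type, for illustration only)."""
    label: str  # nonsense label, e.g. "Tverg"
    role: str   # "character", "setting", or "object"


MENTION = re.compile(r"@(\w+)")


def render(prompt: str, refs: list[Ref], method: str) -> str:
    """Rewrite @-mentions as V1 (raw), V2 (strip), or V3 (role + image N)."""
    if method not in ("v1", "v2", "v3"):
        raise ValueError(f"unknown method: {method!r}")
    # Assumption: image N is the position of the ref in the attachment list.
    index = {ref.label: (n, ref) for n, ref in enumerate(refs, start=1)}

    def replace(match: re.Match) -> str:
        label = match.group(1)
        if method == "v1" or label not in index:
            return match.group(0)  # leave the mention untouched
        if method == "v2":
            return label           # drop only the "@"
        n, ref = index[label]
        return f"{label} (the {ref.role} in image {n})"

    return MENTION.sub(replace, prompt)


refs = [Ref("Tverg", "character"), Ref("Pib", "character"), Ref("Suda", "setting")]
prompt = "Draw @Tverg on the right, @Pib on the left, at @Suda."
for method in ("v1", "v2", "v3"):
    print(method, "->", render(prompt, refs, method))
```

Mentions that do not match any attached reference are left as written, so a stray `@` in user text is not silently rewritten.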
A separate Gemini judge at temperature 0 scored each output on identity (the right things showing up) and positions (in the right places).
Headline numbers
| metric | V1 raw | V2 strip | V3 role+image | V4 multipart |
|---|---|---|---|---|
| Identity | 86% | 90% | 98% | 100% |
| Identity + positions | 66% | 68% | 78% | 100% |
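As a rough sketch of how per-scenario judge verdicts could roll up into the two rows above: the `Verdict` type and `rates` function are hypothetical, and the sketch assumes that "identity" means both the character and object checks pass and that a not-applicable dimension counts as a pass. The verdicts in the example are made-up toy values, not our data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Verdict:
    """Judge marks for one output; None means the dimension is not applicable."""
    characters: Optional[bool]
    objects: Optional[bool]
    positions: Optional[bool]


def passed(*marks: Optional[bool]) -> bool:
    # Assumption: a not-applicable mark (None) does not count as a failure.
    return all(mark is not False for mark in marks)


def rates(verdicts: list[Verdict]) -> dict[str, float]:
    """Percentage of outputs passing identity, and identity plus positions."""
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    n = len(verdicts)
    identity = sum(passed(v.characters, v.objects) for v in verdicts)
    both = sum(passed(v.characters, v.objects, v.positions) for v in verdicts)
    return {
        "identity": 100 * identity / n,
        "identity_positions": 100 * both / n,
    }


# Toy verdicts for illustration only.
toy = [
    Verdict(True, True, True),
    Verdict(True, True, False),
    Verdict(True, False, False),
    Verdict(True, None, True),
]
print(rates(toy))  # {'identity': 75.0, 'identity_positions': 50.0}
```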
Each comparison tile below carries the judge’s pass marks: P = characters, O = objects/props, ⇆ = positions. ✓ = pass, ✗ = fail. A dash — means the dimension isn’t applicable to that scenario.

| provider | method | P | O | ⇆ | prompt sent |
|---|---|---|---|---|---|
| OPENAI | V1 raw | ✓ | ✓ | ✗ | Draw @Pib on the right, @Krell on the left, at @Fendle, with @Quib. |
| OPENAI | V2 strip | ✓ | ✗ | ✗ | Draw Pib on the right, Krell on the left, at Fendle, with Quib. |
| OPENAI | V3 role+image | ✓ | ✓ | ✓ | Draw Pib (the character in image 1) on the right, Krell (the character in image 2) on the left, at Fendle (the setting in image 3), with Quib (the object in image 4). Use each reference image for the thing it depicts; keep each one’s exact appearance. |
| Gemini multipart | V4 multipart | ✓ | ✓ | ✓ | [Character — Pib:] [img] [Character — Krell:] [img] [Setting — Fendle:] [img] [Object — Quib:] [img] ⟶ Draw Pib on the right, Krell on the left, at Fendle, with Quib. Keep each one’s exact appearance; do not add extra characters. |

What surprised us
Stripping the @ isn’t a free win. On NANO_BANANA, raw @Tverg scored 100% on identity; bare Tverg dropped to 82%. On FLUX2_MAX, the opposite — raw was 75%, bare jumped to 100%. Same prompt change, opposite reactions from two different providers. V3 stops the argument: every provider lands at 94–100%.





| provider | method | P | O | ⇆ | prompt sent |
|---|---|---|---|---|---|
| NANO_BANANA | V1 raw | ✓ | ✓ | ✓ | Draw @Vuno on the right, @Olwen on the left, at @Pib, with @Marn. |
| NANO_BANANA | V2 strip | ✓ | ✓ | ✓ | Draw Vuno on the right, Olwen on the left, at Pib, with Marn. |
| NANO_BANANA | V3 role+image | ✓ | ✓ | ✗ | Draw Vuno (the character in image 1) on the right, Olwen (the character in image 2) on the left, at Pib (the setting in image 3), with Marn (the object in image 4). Use each reference image for the thing it depicts; keep each one’s exact appearance. |
| Gemini multipart | V4 multipart | ✓ | ✓ | ✓ | [Character — Vuno:] [img] [Character — Olwen:] [img] [Setting — Pib:] [img] [Object — Marn:] [img] ⟶ Draw Vuno on the right, Olwen on the left, at Pib, with Marn. Keep each one’s exact appearance; do not add extra characters. |

The breakage is multi-reference. With one or two refs, every method scored 100%. The drop shows up at three refs (V2 91% → V3 100%) and four (V2 86% → V3 100%). Single-character prompts don’t need the rewrite. Ensembles do.
Identity is the easy half. V3 hits 98% on identity but only 78% once positions are scored as well. “Pib on the left” still throws the model off even when it knows who Pib is. Knowing who doesn’t tell it where, and that’s a separate problem we haven’t cracked.





| provider | method | P | O | ⇆ | prompt sent |
|---|---|---|---|---|---|
| FLUX2_MAX | V1 raw | ✓ | — | ✗ | Draw @Drovo on the center, @Marn on the right, @Fendle on the left, at @Bompf. |
| FLUX2_MAX | V2 strip | ✓ | — | ✗ | Draw Drovo on the center, Marn on the right, Fendle on the left, at Bompf. |
| FLUX2_MAX | V3 role+image | ✓ | — | ✓ | Draw Drovo (the character in image 1) on the center, Marn (the character in image 2) on the right, Fendle (the character in image 3) on the left, at Bompf (the setting in image 4). Use each reference image for the thing it depicts; keep each one’s exact appearance. |
| Gemini multipart | V4 multipart | ✓ | — | ✓ | [Character — Drovo:] [img] [Character — Marn:] [img] [Character — Fendle:] [img] [Setting — Bompf:] [img] ⟶ Draw Drovo on the center, Marn on the right, Fendle on the left, at Bompf. Keep each one’s exact appearance; do not add extra characters. |

Why this works
A multi-image model sees a flat list of attachments and a flat string of text. With a nonsense label and no other structure, it has to guess which name maps to which picture. “(The character in image N)” hands it two things at once: a role, so it knows the image is a character and not a background, and an index, so it knows which attachment to look at. That’s the whole trick.
V4 goes one step further. Instead of mentioning “image N” inside a flat prompt, it interleaves each ref’s caption directly with the image, in a multipart format. That’s a structural change to the API request, not a rewording, so it only works on providers that natively read multipart prompts. Today that means Gemini, which is what NANO_BANANA is under the hood. OpenAI’s gpt-image-2 edits endpoint and BFL’s FLUX both flatten the image array before reading the prompt, so the interleaving never reaches them. On Gemini, V4 climbs from V3’s 98% / 78% (identity / identity + positions) to a perfect 100% / 100%.
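The interleaving can be sketched as a provider-neutral list of parts: caption, image, caption, image, then the instruction. This is an illustration under stated assumptions rather than a real SDK call. `Ref`, `TextPart`, `ImagePart`, and `build_multipart` are hypothetical names, and mapping the parts onto an actual multipart request is left to whatever client library you use. The caption format is copied from the examples above.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Ref:
    """One reference image (hypothetical type, for illustration only)."""
    label: str    # e.g. "Pib"
    role: str     # "character", "setting", or "object"
    image: bytes  # raw image bytes


@dataclass(frozen=True)
class TextPart:
    text: str


@dataclass(frozen=True)
class ImagePart:
    data: bytes


Part = Union[TextPart, ImagePart]


def build_multipart(refs: list[Ref], instruction: str) -> list[Part]:
    """Place each caption directly before its image, then the instruction."""
    parts: list[Part] = []
    for ref in refs:
        # Caption format as shown in the V4 examples: "[Character — Pib:]"
        parts.append(TextPart(f"[{ref.role.capitalize()} — {ref.label}:]"))
        parts.append(ImagePart(ref.image))
    parts.append(TextPart(instruction))
    return parts


refs = [
    Ref("Pib", "character", b"<png bytes>"),
    Ref("Fendle", "setting", b"<png bytes>"),
]
for part in build_multipart(refs, "Draw Pib on the right, at Fendle."):
    print(type(part).__name__, getattr(part, "text", "[img]"))
```

The point of the structure is that the model never has to count attachments: each label sits next to the picture it names.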
Takeaway
There is no single prompt format that’s right for every image provider. V1 wins on NANO_BANANA, V2 wins on FLUX. Anyone shipping multi-image generation across more than one backend either branches per provider or finds phrasing both can read. V3 is that phrasing — eight extra points of identity for two extra parentheticals, no architecture work involved.
The broader thing we learned: image models reward structure more than clever wording. The whole gap closed once we handed the model a role and an index — worth trying anywhere you’re stitching multiple references into a single prompt. V3 is what Blooper runs on every @-mention today; V4 layers on automatically when the provider is Gemini. Identity is mostly solved. Positions still aren’t, except where the API gives us multipart.
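The per-provider rule described here reduces to a small dispatch. The sketch below is hypothetical (the `MULTIPART_PROVIDERS` set and `choose_method` function are illustrative names, not Blooper's code) and assumes the provider identifiers used in this post.

```python
# Providers assumed to read interleaved multipart prompts natively.
# Per the results above, that is only the Gemini-backed one today.
MULTIPART_PROVIDERS = frozenset({"NANO_BANANA"})


def choose_method(provider: str) -> str:
    """Pick V4 where multipart is supported, otherwise fall back to V3."""
    return "v4" if provider in MULTIPART_PROVIDERS else "v3"


for provider in ("NANO_BANANA", "OPENAI_IMAGE", "FLUX2_MAX"):
    print(provider, "->", choose_method(provider))
```

Defaulting to V3 means a newly added backend gets the portable phrasing until it is explicitly listed as multipart-capable.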