Grounding references in multi-image prompts

A one-line prompt rewrite that moves identity from 90% to 98% across every image provider — what we learned running 50 scenarios.

Blooper Team · June 23, 2026 · 9 min read

Multi-image generation pipelines all run into the same problem at some point. We hit it building Blooper, where users @-mention characters, settings, and props inside a single chat prompt — the system has to weave the names back into a single string and hope the model lines them up with the right images. With one or two refs, it does. With three or more, it stops: the character’s outfit ends up on the background, left and right swap, a fourth person walks in who wasn’t anywhere in the inputs.

Is that a model-capability ceiling or a prompt-shape problem? We ran a controlled test, and it was overwhelmingly the second.

The setup

50 scenarios, 1–5 reference images each (characters, settings, props), across three image providers: NANO_BANANA (Gemini’s image model), OPENAI_IMAGE (gpt-image-2), and FLUX2_MAX. To keep the test honest, every reference got a nonsense label — Pib, Tverg, Krell, Wamo, Suda. The label gives nothing away about appearance, and the order doesn’t either, so the only way the model can tie a name to a picture is through the attachment itself.

**Pib**: *blue robot*, right

**Krell**: *brown teddy bear*, left

**Fendle**: *green forest*, setting

**Quib**: *yellow heart balloon*, prop

One scenario’s four references — nonsense labels, mixed roles, positional constraints embedded in the prompt.

We compared four ways of writing the prompt:

| method | prompt shape |
| --- | --- |
| V1 (raw `@`) | `Draw @Tverg on the right, @Pib on the left, at @Suda.` |
| V2 (strip) | `Draw Tverg on the right, Pib on the left, at Suda.` |
| V3 (role + image N) | `Draw Tverg (the character in image 1) on the right, Pib (the character in image 2) on the left, at Suda (the setting in image 3).` |
| V4 (multipart) | Interleaved `[Character — Tverg:]` caption directly before each image, then the instruction. A structural change to the API request; only providers that natively read multipart prompts can use it (today: Gemini). |
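
The three text-only shapes can be produced mechanically from the raw `@` prompt. Here is a minimal sketch of that rewrite, assuming each reference carries a label and a role and that "image N" is its 1-based position in the attachment list. The names `Ref`, `to_v2`, and `to_v3` are hypothetical, chosen for illustration; this is not Blooper's actual code.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    label: str  # nonsense name used in the @-mention, e.g. "Tverg"
    role: str   # "character", "setting", or "object"


# An @-mention is "@" followed by word characters (assumed label syntax).
MENTION = re.compile(r"@(\w+)")


def to_v2(prompt: str) -> str:
    """V2: strip the @ and keep the bare label."""
    return MENTION.sub(r"\1", prompt)


def to_v3(prompt: str, refs: list[Ref]) -> str:
    """V3: follow each mention with '(the <role> in image N)'."""
    index = {ref.label: (n, ref.role) for n, ref in enumerate(refs, start=1)}

    def expand(match: re.Match) -> str:
        label = match.group(1)
        if label not in index:
            return label  # unknown mention: fall back to the bare label
        n, role = index[label]
        return f"{label} (the {role} in image {n})"

    return MENTION.sub(expand, prompt)


refs = [Ref("Tverg", "character"), Ref("Pib", "character"), Ref("Suda", "setting")]
v1 = "Draw @Tverg on the right, @Pib on the left, at @Suda."
print(to_v2(v1))
# Draw Tverg on the right, Pib on the left, at Suda.
print(to_v3(v1, refs))
# Draw Tverg (the character in image 1) on the right, Pib (the character in image 2) on the left, at Suda (the setting in image 3).
```

The image index comes from the order of `refs`, so the attachments have to be sent in that same order for the parenthetical to point at the right picture.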

A separate Gemini judge at temperature 0 scored each output on identity (the right things showing up) and positions (in the right places).

Headline numbers

| metric | V1 raw | V2 strip | V3 role+image | V4 multipart |
| --- | --- | --- | --- | --- |
| Identity | 86% | 90% | 98% | 100% |
| Identity + positions | 66% | 68% | 78% | 100% |

Each comparison tile below carries the judge’s pass marks: P = characters, O = objects/props, ⇆ = positions. ✓ = pass, ✗ = fail. A dash — means the dimension isn’t applicable to that scenario.
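
For readers who want to reproduce this kind of tally, the sketch below shows one way pass marks could roll up into the two headline metrics. It rests on an assumption the article does not spell out: that identity passes when neither P nor O failed, with N/A counting as a pass. The `Verdict` type and function names are hypothetical, and the sample marks are illustrative rather than the real dataset.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Verdict:
    """One tile's pass marks. None stands for the N/A dash."""
    characters: Optional[bool]  # P
    objects: Optional[bool]     # O
    positions: Optional[bool]   # positions mark


def identity_ok(v: Verdict) -> bool:
    # Assumption: identity passes when neither P nor O is an explicit fail.
    return v.characters is not False and v.objects is not False


def identity_and_positions_ok(v: Verdict) -> bool:
    return identity_ok(v) and v.positions is not False


def pass_rate(verdicts: Sequence[Verdict],
              check: Callable[[Verdict], bool]) -> float:
    """Percentage of verdicts for which `check` returns True."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return 100.0 * sum(check(v) for v in verdicts) / len(verdicts)


# Illustrative marks only, not the article's 50-scenario data.
sample = [
    Verdict(True, True, False),   # identity ok, positions failed
    Verdict(True, False, False),  # object failed, positions failed
    Verdict(True, True, True),    # everything passed
    Verdict(True, None, True),    # no object slot in this scenario
]
print(pass_rate(sample, identity_ok))                # 75.0
print(pass_rate(sample, identity_and_positions_ok))  # 50.0
```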

V1, raw @ (OPENAI): P✓ O✓ ⇆✗
Prompt sent: Draw `@Pib` on the right, `@Krell` on the left, at `@Fendle`, with `@Quib`.

V2, strip (OPENAI): P✓ O✗ ⇆✗
Prompt sent: Draw Pib on the right, Krell on the left, at Fendle, with Quib.

V3, role + image N (OPENAI): P✓ O✓ ⇆✓
Prompt sent: Draw Pib *(the character in image 1)* on the right, Krell *(the character in image 2)* on the left, at Fendle *(the setting in image 3)*, with Quib *(the object in image 4)*. Use each reference image for the thing it depicts; keep each one’s exact appearance.

V4, multipart captions (Gemini multipart): P✓ O✓ ⇆✓
Multipart request: [Character — Pib:] [img] [Character — Krell:] [img] [Setting — Fendle:] [img] [Object — Quib:] [img] ⟶ Draw Pib on the right, Krell on the left, at Fendle, with Quib. Keep each one’s exact appearance; do not add extra characters.

Same four refs across four phrasings. V1, V2, V3 run on OPENAI_IMAGE — the scenario’s assigned provider. V4 switches to Gemini because the interleaved caption structure is a multipart request, not a text rewrite, and gpt-image-2 doesn’t read multipart. V1 puts the robot on the wrong side. V2 anthropomorphizes the balloon prop into a fourth character with arms. V3 lands clean. V4 also lands clean, on a different model.

A single phrase — “(the character in image N)” — moves identity from 90% to 98% across every provider, with no model change, no fine-tune, no architecture trick.

What surprised us

Stripping the @ isn’t a free win. On NANO_BANANA, raw @Tverg scored 100% on identity; bare Tverg dropped to 82%. On FLUX2_MAX, the opposite — raw was 75%, bare jumped to 100%. Same prompt change, opposite reactions from two different providers. V3 stops the argument: every provider lands at 94–100%.

**Vuno**: *white cat*, right

**Olwen**: *blue robot*, left

**Marn**: *yellow heart balloon*, prop

V1, raw @ (NANO_BANANA): P✓ O✓ ⇆✓
Prompt sent: Draw `@Vuno` on the right, `@Olwen` on the left, at `@Pib`, with `@Marn`.

V2, strip (NANO_BANANA): P✓ O✓ ⇆✓
Prompt sent: Draw Vuno on the right, Olwen on the left, at Pib, with Marn.

V3, role + image N (NANO_BANANA): P✓ O✓ ⇆✗
Prompt sent: Draw Vuno *(the character in image 1)* on the right, Olwen *(the character in image 2)* on the left, at Pib *(the setting in image 3)*, with Marn *(the object in image 4)*. Use each reference image for the thing it depicts; keep each one’s exact appearance.

NANO_BANANA (which is Gemini under the hood), scenario #45 — single re-run. V1 nails everything and even paints the nonsense labels onto the canvas. V2 also lands the layout this time; across all 50 scenarios V2 dropped identity to 82% on NANO_BANANA, but a single sample isn’t going to show that. V3 keeps the identities but flips left and right. V4 runs on the same model as V1–V3 — the only thing that changes is the prompt structure, which is enough to fix the positions.

The breakage is multi-reference. With one or two refs, every method scored 100%. The drop shows up at three refs (V2 91% → V3 100%) and four (V2 86% → V3 100%). Single-character prompts don’t need the rewrite. Ensembles do.

Identity is the easy half. V3 hits 98% on identity but only 78% once positions are counted too. “Pib on the left” still throws the model off even when it knows who Pib is. Knowing who doesn’t tell it where, and that’s a separate problem we haven’t cracked.

**Drovo**: *white cat*, center

**Bompf**: *city street*, setting

V1, raw @ (FLUX2_MAX): P✓ O— ⇆✗
Prompt sent: Draw `@Drovo` on the center, `@Marn` on the right, `@Fendle` on the left, at `@Bompf`.

V2, strip (FLUX2_MAX): P✓ O— ⇆✗
Prompt sent: Draw Drovo on the center, Marn on the right, Fendle on the left, at Bompf.

V3, role + image N (FLUX2_MAX): P✓ O— ⇆✓
Prompt sent: Draw Drovo *(the character in image 1)* on the center, Marn *(the character in image 2)* on the right, Fendle *(the character in image 3)* on the left, at Bompf *(the setting in image 4)*. Use each reference image for the thing it depicts; keep each one’s exact appearance.

Same ladder on FLUX2_MAX with a different ref bundle (no object slot in this scenario — O is N/A). V1 and V2 both produce all three subjects in the city, but cat-left / dino-center is the opposite of the expected cat-center / dino-left. V3 puts the dino on the left, cat in the middle, and pig on the right. V4 — Gemini, same refs — also lands clean.

Why this works

A multi-image model sees a flat list of attachments and a flat string of text. With a nonsense label and no other structure, it has to guess which name maps to which picture. “(The character in image N)” hands it two things at once: a role, so it knows the image is a character and not a background, and an index, so it knows which attachment to look at. That’s the whole trick.

V4 goes one step further. Instead of mentioning “image N” inside a flat prompt, it interleaves each ref’s caption directly with the image, in a multipart format. That’s a structural change to the API request, not a rewording, so it only works on providers that natively read multipart prompts. Today that means Gemini, which is what NANO_BANANA is under the hood. OpenAI’s gpt-image-2 edits endpoint and BFL’s FLUX both flatten the image array before reading the prompt, so the interleaving never reaches them. On Gemini, V4 climbs from V3’s 98% / 78% (identity / identity + positions) to a perfect 100% / 100%.
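
The interleaving described above can be sketched as an ordered list of parts: caption, image, caption, image, and finally the instruction. The sketch below deliberately stops short of any real SDK call. `TextPart`, `ImagePart`, and `build_multipart` are hypothetical names, and the PNG default is an assumption; mapping the parts onto a provider's request format is left to the caller.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Ref:
    label: str    # e.g. "Pib"
    role: str     # "character", "setting", or "object"
    image: bytes  # raw bytes of the reference image


@dataclass(frozen=True)
class TextPart:
    text: str


@dataclass(frozen=True)
class ImagePart:
    data: bytes
    mime_type: str = "image/png"  # assumption: references are PNGs


Part = Union[TextPart, ImagePart]


def build_multipart(refs: list[Ref], instruction: str) -> list[Part]:
    """Put each caption directly before its image, then the instruction."""
    parts: list[Part] = []
    for ref in refs:
        # Caption shape follows the article's tiles, e.g. "[Character (dash) Pib:]";
        # \u2014 is the dash used as the separator there.
        parts.append(TextPart(f"[{ref.role.capitalize()} \u2014 {ref.label}:]"))
        parts.append(ImagePart(ref.image))
    parts.append(TextPart(instruction))
    return parts


refs = [Ref("Pib", "character", b"<png bytes>"),
        Ref("Fendle", "setting", b"<png bytes>")]
parts = build_multipart(refs, "Draw Pib on the right, at Fendle.")
print(len(parts))  # 5
```

Because the caption sits next to its image in the part list, the prompt text no longer needs an "image N" index at all.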

Takeaway

There is no single prompt format that’s right for every image provider. Between the two bare-text shapes, V1 wins on NANO_BANANA and V2 wins on FLUX2_MAX. Anyone shipping multi-image generation across more than one backend either branches per provider or finds phrasing every backend can read. V3 is that phrasing: eight extra points of identity over V2 for one parenthetical per reference, no architecture work involved.

The broader thing we learned: image models reward structure more than clever wording. The whole gap closed once we handed the model a role and an index — worth trying anywhere you’re stitching multiple references into a single prompt. V3 is what Blooper runs on every @-mention today; V4 layers on automatically when the provider is Gemini. Identity is mostly solved. Positions still aren’t, except where the API gives us multipart.
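
As a closing illustration, that routing rule (V3 everywhere, V4 where multipart is available) fits in a few lines. This is a sketch under the article's stated provider split; `MULTIPART_PROVIDERS` and `pick_method` are hypothetical names, not Blooper's code.

```python
# Providers that natively read interleaved multipart prompts.
# Per the article, today that is only the Gemini-backed NANO_BANANA.
MULTIPART_PROVIDERS = frozenset({"NANO_BANANA"})


def pick_method(provider: str) -> str:
    """Return "V4" where multipart is supported, otherwise "V3"."""
    return "V4" if provider in MULTIPART_PROVIDERS else "V3"


for name in ("NANO_BANANA", "OPENAI_IMAGE", "FLUX2_MAX"):
    print(name, pick_method(name))
# NANO_BANANA V4
# OPENAI_IMAGE V3
# FLUX2_MAX V3
```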