Two contrasting AI outputs — over-generic cinematic shot versus tightly controlled minimal composition — Generated with Nano Banana

Quit Guessing — Make AI Give You What You Want

Prompt structure · May 2026 · 10 min read · Liminalshort.org

The most common AI generation frustration isn't that the output is bad — it's that it's almost right. Close composition, wrong light direction. Right mood, wrong texture. You regenerate. Still off. Fifteen attempts later you either settle or give up. This cycle is not about the model being broken. It's about the interaction model being misunderstood.

This piece explains why the almost-right problem happens and offers a structural approach to ending it. It is not a list of tricks but a way of thinking about what the model is actually doing with your words.

What the model is actually doing when it reads your prompt

Modern text-to-image and video diffusion models typically condition on text through cross-attention between text token embeddings and image feature maps. This means your words don't work as switches or instructions. They work as weights that bias each denoising step toward particular visual features in particular parts of the frame. Words with strong, consistent training associations push the generation hard; vague words leave enormous space for default behaviour.
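To make the idea of words as weights concrete, here is a toy version of the scaled dot-product step inside cross-attention, for one image patch and three prompt tokens. The two-dimensional embeddings are invented purely to show the mechanics. Real models learn high-dimensional embeddings, use many attention heads, and combine the weights with value vectors, so the numbers here say nothing about how any actual model scores these words.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_weights(patch_query, token_keys):
    """Scaled dot-product score of one image patch against every text token."""
    scale = math.sqrt(len(patch_query))
    scores = [
        sum(q * k for q, k in zip(patch_query, key)) / scale
        for key in token_keys
    ]
    return softmax(scores)

# Made-up embeddings (assumption): not taken from any real text encoder.
tokens = ["beautiful", "tungsten", "lamp"]
token_keys = [[0.1, 0.1], [2.0, 0.5], [1.5, 1.0]]
patch_query = [1.0, 0.5]

weights = cross_attention_weights(patch_query, token_keys)
for token, weight in zip(tokens, weights):
    print(f"{token:>10}: {weight:.2f}")
```

The point of the sketch is only that every token gets a share of influence over each patch, and that the shares are relative: a token that aligns weakly with what the patch is becoming contributes little, however important it felt when you typed it.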

When you write "beautiful woman in a kitchen," the model does not construct that scene from first principles, and it does not literally look anything up. Loosely speaking, the prompt pulls generation toward the cluster of training images associated with kitchen + woman + that emotional register. "Beautiful" narrows the cluster toward high-production-value aesthetics, the most common visual treatment of that word in training data. You get the median. You get stock.

Specificity works by shrinking the searchable cluster. The more unusual your combination of physical constraints, the fewer training images satisfy all of them simultaneously — and the model has to synthesise, rather than recall.

"I thought I needed a better model. Same model, same parameters — I just described the exact light source, the exact surface, the exact time of day. My hit rate went from roughly 1 in 20 to about 1 in 4."
r/StableDiffusion · 6.1k upvotes
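Taking the quoted hit rates at face value, and assuming each generation is an independent try with the same chance of success (a simplification), the arithmetic looks like this. The function names are just for illustration.

```python
def expected_attempts(hit_rate):
    """Mean number of tries until the first success (geometric distribution)."""
    return 1 / hit_rate

def chance_of_hit_within(hit_rate, tries):
    """Probability that at least one of `tries` generations is usable."""
    return 1 - (1 - hit_rate) ** tries

for label, rate in [("before", 1 / 20), ("after", 1 / 4)]:
    attempts = expected_attempts(rate)
    within_five = chance_of_hit_within(rate, 5)
    print(f"{label}: ~{attempts:.0f} attempts on average, "
          f"{within_five:.0%} chance of a hit within 5 tries")
```

Under those assumptions, a 1-in-20 hit rate means about 20 generations on average and roughly a 23% chance of a usable image within five tries. At 1 in 4, that becomes about 4 generations and roughly 76%.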

Average generations to first usable result: by approach

The difference in generation efficiency between approaches is not marginal. A structured prompt written around concrete physical states reaches a usable result in significantly fewer attempts than an iterative prompt-patching approach:

[Chart: average generations needed to reach first usable result, by approach]

Based on self-reported generation counts from r/StableDiffusion and r/PromptEngineering surveys. "Usable" = result requiring no further structural prompt changes.

Actions over descriptions: the highest-leverage change

The single most impactful prompt change you can make is switching from describing how things look to describing what is physically happening. This is the difference between an adjective and a verifiable fact — and it's the difference between the model interpreting your intent and the model executing your constraint.

Description (weak)	Physical state (strong)	Why it works
sad expression	eyes downcast, mouth slightly open, weight shifted left	Three verifiable physical states — one emotional interpretation replaced by constraints
dynamic pose	right arm raised past shoulder, torso rotated 30° left, feet planted	Specific geometry the model satisfies — "dynamic" is a judgement, not a constraint
messy room	three coffee mugs on floor, open laptop on desk, clothes on chair	Specific objects in specific states — not a general condition the model fills its own way
warm lighting	single practical lamp bottom-right, tungsten 2700K, hard shadow thrown left	Source, colour temperature, shadow direction — all verifiable facts
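One way to catch descriptions before they cost you generations is a quick check for judgement words. This is a minimal sketch: the word list is a hypothetical starter set drawn from the weak column above plus "beautiful" from earlier, not a vetted vocabulary, and a flagged word is a prompt to rewrite, not an error.

```python
import re

# Hypothetical starter list (assumption); extend it with your own habits.
JUDGEMENT_WORDS = {"beautiful", "sad", "dynamic", "messy", "warm"}

def flag_judgement_words(prompt):
    """Return judgement words found in the prompt, in order of first appearance."""
    found = []
    for word in re.findall(r"[a-z]+", prompt.lower()):
        if word in JUDGEMENT_WORDS and word not in found:
            found.append(word)
    return found

print(flag_judgement_words("beautiful woman in a kitchen, warm lighting"))
# ['beautiful', 'warm']
print(flag_judgement_words("single practical lamp bottom-right, tungsten 2700K"))
# []
```

Each flagged word marks a spot where the model gets to decide what you meant. Replace it with the physical facts you were picturing when you wrote it.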

The hand pose problem: total constraint in practice

Nothing reveals a weak prompt like hands. Models have extreme variance on hands because training data does too — every conceivable position, style, angle, and number-of-fingers exists somewhere. If your prompt gives the model any ambiguity on hands, it will fill it with a default that usually looks wrong.

The approach that works is specifying the final state of every visible variable — as if briefing someone who has never seen a hand before:

Instead of: "fix the left hand" — use: "left hand open palm facing camera, all five fingers extended and together, thumb at 45°. Right hand holding cup handle from below, four fingers wrapped, thumb on top. Exactly two hands in frame. No additional hands. No blurred regions where hands should be."
Community-shared approach — common recommendation on r/StableDiffusion hand threads

The reason total constraint works: you have specified which hand, its orientation, its finger state, the thumb position, what the other hand is doing, the count, and explicit negatives for the two most common failure modes. There is no gap for the model to interpret. Its job is now constraint satisfaction, not expression.
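If you build prompts in code, the same discipline can be enforced mechanically. The sketch below refuses to render a hand description until orientation, finger state, and thumb position are all filled in, then appends the count and the negative. The field names and the function are made up for illustration; adapt them to whatever variables are visible in your shot.

```python
# Hypothetical field names (assumption): one entry per visible variable.
REQUIRED_HAND_FIELDS = ("orientation", "fingers", "thumb")

def hand_constraints(hands):
    """Render one constraint sentence per hand, refusing to leave a field blank."""
    sentences = []
    for side, spec in hands.items():
        missing = [f for f in REQUIRED_HAND_FIELDS if not spec.get(f)]
        if missing:
            raise ValueError(f"{side} hand is missing: {', '.join(missing)}")
        sentences.append(
            f"{side.capitalize()} hand {spec['orientation']}, "
            f"{spec['fingers']}, {spec['thumb']}."
        )
    count = len(hands)
    noun = "hand" if count == 1 else "hands"
    sentences.append(f"Exactly {count} {noun} in frame. No additional hands.")
    return " ".join(sentences)

print(hand_constraints({
    "left": {
        "orientation": "open palm facing camera",
        "fingers": "all five fingers extended and together",
        "thumb": "thumb at 45°",
    },
    "right": {
        "orientation": "holding cup handle from below",
        "fingers": "four fingers wrapped",
        "thumb": "thumb on top",
    },
}))
```

The error on a missing field is the useful part: it turns "I forgot to say what the thumb is doing" from a wasted generation into a message you see before you generate.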

A prompt structure that removes defaults

This five-slot framework works across models and scales to any subject. Fill each slot before generating:

Slot	What goes here	Example
Subject + state	Who/what, in what physical configuration	empty chair, slight forward tilt, cushion compressed left
Environment	Specific space with surface details	third-floor apartment hall, linoleum, painted brick, scuffed baseboards
Light source	One or two named sources, direction, quality	single overhead fluorescent, centred, no fill, hard downward shadow
Medium + era	Camera or film format, approximate year	2008 security camera still, fixed lens, barrel distortion
Negatives	What the model should not add	no people, no decorations, no sunlight, no plants
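For anyone scripting their generations, the five slots map naturally onto a small data structure. This is one possible sketch, not a required format: the class name, field names, and the choice to join slots with full stops are all assumptions, and some models or interfaces expect negatives in a separate field rather than inline.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class PromptSlots:
    """The five slots from the table, in the same order."""
    subject_state: str
    environment: str
    light_source: str
    medium_era: str
    negatives: str

    def render(self):
        """Join the slots in table order, rejecting any that are empty."""
        parts = []
        for f in fields(self):
            value = getattr(self, f.name).strip()
            if not value:
                raise ValueError(f"slot '{f.name}' is empty")
            parts.append(value)
        return ". ".join(parts) + "."

chair = PromptSlots(
    subject_state="empty chair, slight forward tilt, cushion compressed left",
    environment="third-floor apartment hall, linoleum, painted brick, scuffed baseboards",
    light_source="single overhead fluorescent, centred, no fill, hard downward shadow",
    medium_era="2008 security camera still, fixed lens, barrel distortion",
    negatives="no people, no decorations, no sunlight, no plants",
)
print(chair.render())
```

Making every slot mandatory is deliberate. An empty slot is exactly where the model's defaults get in.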

Important nuance

Prompt length does not equal prompt quality. A 60-word prompt full of mood adjectives will underperform a 25-word prompt where every word is a physical constraint. Front-load your most specific detail: many models appear to weight earlier tokens more heavily, and text encoders with a fixed window (77 tokens for CLIP) truncate whatever falls past the limit.

When the output is still wrong: the single-axis edit

If your first result is close but off in one specific way, resist the urge to rewrite. Change one element and one element only. Swap the light source. Add one negative. Change the era. Then compare the before and after. This isolates causality — you'll know which variable controls what, and you'll stop wasting generations on unknowns.
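The one-change rule is easy to break without noticing, so it can help to check it before spending a generation. A minimal sketch, assuming you keep each prompt version as a dictionary of named parts (the slot names and helper functions here are hypothetical):

```python
def changed_slots(before, after):
    """Names of the slots whose text differs between two prompt versions."""
    keys = before.keys() | after.keys()
    return sorted(k for k in keys if before.get(k) != after.get(k))

def require_single_axis(before, after):
    """Return the one slot that changed, or raise if zero or several did."""
    diff = changed_slots(before, after)
    if len(diff) != 1:
        raise ValueError(f"expected exactly one changed slot, got {diff}")
    return diff[0]

base = {
    "subject": "empty chair, slight forward tilt",
    "light": "single overhead fluorescent, centred, no fill",
    "era": "2008 security camera still",
}
variant = dict(base, light="single practical lamp bottom-right, tungsten 2700K")

print(require_single_axis(base, variant))  # light
```

Logging the returned slot name next to each output gives you a record of which variable produced which change, which is the whole point of the exercise.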

"Treat prompt editing like A/B testing. One variable changes per run. If you change three things and the image improves, you don't know what fixed it. Change one — and you'll get consistent results in 6 generations instead of 60."
r/PromptEngineering · 3.4k upvotes

When the model matters more than the prompt

A well-structured prompt will outperform a weak one across virtually all current models. That said, there are specific capability gaps where model selection genuinely matters:

What you need	Where models diverge most
Realistic human motion in video	Kling 2.0 and Hailuo handle this better than most Western models
Specific architecture without people	Flux and Seedream tend toward cleaner geometry on unusual subjects
Genuine film grain and analog texture	Any model can produce this — but some need explicit "analog scan artifact" language to activate it
Prompt adherence for unusual scenes	GPT-4o image generation tends to follow unusual prompts most closely; open-weight models need more explicit constraint language