A clean minimal prompt circle versus an over-complicated scribble — the paradox of more words producing worse results — Generated with Nano Banana

Why Every AI Image Looks the Same

Prompt craft · May 2026 · 9 min read · Liminalshort.org

There is a recognisable look to most AI-generated images. Not a bad look necessarily — but a familiar one. A certain rendered-ness. A specific way light falls, a particular approach to hair, a compositional confidence that feels slightly too perfect. You've seen a thousand of them. After a while they all blur together.

The people who escape this aren't using different models or secret settings. They're writing different prompts: specifically, prompts built to constrain output rather than to describe aspiration. The gap between "looks AI-generated" and "looks like it was actually taken or made by someone with a specific vision" lives almost entirely in how the prompt was written.

What a generic prompt actually does to a model

A prompt like "atmospheric abandoned mall, cinematic lighting, beautiful, melancholic, dreamcore aesthetic, 4k" is a list of desiderata: things the creator wants to feel when looking at the output. It is not a description of a scene. It gives the model maximum freedom to interpret, and models given maximum freedom default to their training priors: the most statistically common visual treatment of each concept in the training data.

"Cinematic" pulls toward a narrow, familiar cluster of treatments. "4k" activates the sharpened, over-resolved look of stock photography at max settings. "Dreamcore" has a near-canonical aesthetic by now: the internet has largely settled on what it looks like. The output is a blend of the most confident guesses for each of those words, which is why it looks like every other output generated with similar words.

"I had a breakthrough moment where I realised I was writing prompts for humans to read, not for models to execute. Once I started writing like I was briefing a DP — specific light source, specific surface, specific time, specific gear — my outputs became genuinely controllable."
r/VideoEditing · 7.3k upvotes

The anatomy of a prompt that works

A constrained prompt operates like a shot list. It specifies what exists in the frame, what light is doing, what medium captured it, and what should not be there. Five components, each doing different work:

Component	What it controls	Example
Subject + state	The anchor — what the model builds everything around	empty office chair, slightly pushed back, one armrest bent inward
Environment	Surfaces, architecture, depth — the model's spatial reference	late-90s government office, grey cubicle fabric, drop ceiling, one missing tile
Light	The single most powerful variable — determines mood, era, surface legibility	single overhead fluorescent, centre frame, no fill, hard shadow on floor
Medium + era	Texture, grain, colour science — makes the image feel made, not generated	1998 security camera still, fixed wide lens, slight barrel distortion, low saturation
Negatives	Trims probability space — forces the model away from its default additions	no people, no daylight, no windows, no plants, no clocks
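
The five slots in the table can be kept apart until the last moment and only then joined into a prompt string. The sketch below is one way to do that; the `ShotListPrompt` class and its field names are hypothetical, chosen to mirror the table, and the negatives are inlined as "no X" clauses as in the table's example (some tools take negatives in a separate field instead).

```python
from dataclasses import dataclass, field


@dataclass
class ShotListPrompt:
    """Five-slot prompt; fields mirror the table (hypothetical helper)."""
    subject: str
    environment: str
    light: str
    medium: str
    negatives: list[str] = field(default_factory=list)

    def render(self) -> str:
        # Positive slots first, in table order; negatives last as "no X" clauses.
        parts = [self.subject, self.environment, self.light, self.medium]
        parts += [f"no {item}" for item in self.negatives]
        # Skip empty slots so a missing component does not leave a stray comma.
        return ", ".join(p.strip() for p in parts if p.strip())


prompt = ShotListPrompt(
    subject="empty office chair, slightly pushed back, one armrest bent inward",
    environment="late-90s government office, grey cubicle fabric, drop ceiling, one missing tile",
    light="single overhead fluorescent, centre frame, no fill, hard shadow on floor",
    medium="1998 security camera still, fixed wide lens, slight barrel distortion, low saturation",
    negatives=["people", "daylight", "windows", "plants", "clocks"],
)
print(prompt.render())
```

Keeping each slot as its own field makes it easy to see which component is missing and, later, to change one component without touching the others.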

Prompt length vs prompt quality: the token count curve

There is a persistent myth that longer prompts produce better results. Research on CLIP-based architectures tells a more nuanced story. Prompt length beyond approximately 40 tokens produces diminishing returns — and in some architectures, the most important concepts begin to compete with surrounding text for cross-attention weight, diluting their effect.

The correct relationship is not length to quality; it is specificity per token. The chart below shows how prompt quality (measured as first-generation adherence to intent) peaks in the 25–35 token range before plateauing and eventually declining:

Prompt quality vs token count (composite adherence score, normalised)

Composite of reported adherence scores from CLIP-based model studies (2023–2024) and community surveys. Adherence = first-generation match to stated visual intent. Peaks in the 25–35 token range for most current architectures.

What this means practically: one well-specified light source beats ten mood adjectives. One concrete surface beats a paragraph of atmosphere. The goal is not to fill the prompt — it is to maximise constraint per token.
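
A rough way to check "constraint per token" on your own prompts is to measure how much of the prompt is spent on mood labels. The sketch below rests on two loud assumptions: a whitespace split stands in for a real tokenizer (actual model tokenizers split text differently), and `MOOD_WORDS` is a small hand-picked list, not a standard vocabulary.

```python
# Hypothetical, hand-picked list of mood labels; extend it to taste.
MOOD_WORDS = {
    "atmospheric", "beautiful", "cinematic", "melancholic", "dreamy",
    "scary", "eerie", "surreal", "masterpiece", "aesthetic", "4k",
}


def rough_tokens(prompt: str) -> list[str]:
    # Whitespace split is only a proxy for the model's own tokenizer.
    words = (w.strip(",.").lower() for w in prompt.split())
    return [w for w in words if w]


def mood_share(prompt: str) -> float:
    """Fraction of rough tokens that are mood labels rather than constraints."""
    tokens = rough_tokens(prompt)
    if not tokens:
        return 0.0
    return sum(t in MOOD_WORDS for t in tokens) / len(tokens)


weak = ("atmospheric abandoned mall, cinematic lighting, beautiful, "
        "melancholic, dreamcore aesthetic, 4k")
strong = "single overhead fluorescent, centre frame, no fill, hard shadow on floor"

print(len(rough_tokens(weak)), mood_share(weak))
print(len(rough_tokens(strong)), mood_share(strong))
```

A high share suggests the prompt is describing a feeling rather than a scene; a share near zero does not guarantee a good prompt, only that the words are at least doing constraining work.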

Before and after: what the structure change actually looks like

Scene: liminal office hallway

	Prompt
Weak	liminal space office hallway scary abandoned AI masterpiece 4k cinematic
Strong	empty office hallway at 2am, green carpet, dropped ceiling tiles, one flickering fluorescent at the far end, rows of dark doorways, no people, no text, no windows, security camera angle from corner, 2003 CCTV still, slight barrel distortion

Scene: surreal outdoor / dreamcore

	Prompt
Weak	surreal dreamcore beautiful sky landscape dreamy pastel aesthetic
Strong	empty playground at dusk, primary-coloured equipment, sky oversaturated to the wrong hue for the time of day, swings stationary with no wind, shadows pointing the wrong direction, no people, no animals, no movement, handheld phone video 2011, soft vignette

The "strong" versions don't spend their extra words describing the feeling. They replace feeling-words with concrete physical details, some of them deliberately contradictory. "Scary" becomes empty + 2am + one light source far away. "Dreamy" becomes wrong sky colour + stationary swings + contradictory shadows. The uncanny emerges from the specific wrongness, not from labelling it. This is why it works. You cannot tell a model to render "eerie" reliably. You can tell it to render a specific arrangement of things that produces eeriness as a consequence.
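
The same substitution can be kept as a small lookup you run over a draft before generating. This is a sketch only: the `SUBSTITUTES` table holds just the two examples from this article, and the function name is hypothetical.

```python
# Substitutions taken from the two examples in this article; illustrative only.
SUBSTITUTES = {
    "scary": ["empty", "2am", "one light source far away"],
    "dreamy": ["wrong sky colour", "stationary swings", "contradictory shadows"],
}


def suggest_rewrites(prompt: str) -> dict[str, list[str]]:
    """Return each feeling-word found in the prompt with concrete replacements."""
    words = {w.strip(",.").lower() for w in prompt.split()}
    return {word: subs for word, subs in SUBSTITUTES.items() if word in words}


draft = "liminal space office hallway scary abandoned AI masterpiece 4k cinematic"
for word, subs in suggest_rewrites(draft).items():
    print(f"replace '{word}' with: {' + '.join(subs)}")
```

The table is the part worth building up over time: each entry records a physical arrangement that has produced the feeling for you before.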

"The moment I stopped writing the mood and started writing what was physically wrong with the scene, everything became more interesting. A playground where the swings don't move — that's more unsettling than any amount of 'eerie atmosphere' in a prompt."
r/midjourney · 8.9k upvotes

Convergence vs lottery: a different approach to iteration

Most people treat generation as a lottery: run enough attempts and eventually get lucky. This is slow and expensive. The alternative is treating generation as a convergence process — each attempt tells you something about how the model interprets your current prompt, and you adjust one variable at a time based on that information.

1. Write a structured prompt using the five-slot shot-list components.
2. Generate 2–3 variants to get a read on what the model does with it.
3. Identify the single element most off from what you want.
4. Change only that element and regenerate.
5. If it improved, keep the change, then identify the next element to adjust.
6. Once in the right territory, use image editing to fix fine details without regenerating.

The underlying principle

Every generation is a measurement. If you change one thing between runs, you know what that thing controls. If you change five things, you've learned nothing except that something changed. Convergence requires information — and information requires isolating variables.
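
One way to enforce the one-variable rule is to store each run's prompt as named slots and diff them before generating. The sketch below assumes prompts are kept as plain dictionaries keyed by slot name; the function names and the example runs are hypothetical.

```python
def changed_slots(previous: dict, current: dict) -> list:
    """Names of slots whose text differs between two runs, in sorted order."""
    keys = sorted(set(previous) | set(current))
    return [k for k in keys if previous.get(k) != current.get(k)]


def is_single_variable_change(previous: dict, current: dict) -> bool:
    # Exactly one changed slot means a change in output can be attributed to it.
    return len(changed_slots(previous, current)) == 1


run_1 = {
    "subject": "empty office hallway at 2am",
    "environment": "green carpet, dropped ceiling tiles, rows of dark doorways",
    "light": "one flickering fluorescent at the far end",
    "medium": "2003 CCTV still, slight barrel distortion",
    "negatives": "no people, no text, no windows",
}
run_2 = {**run_1, "light": "one steady fluorescent at the far end"}
run_3 = {**run_2, "medium": "handheld phone video 2011",
         "subject": "empty stairwell at 2am"}

print(changed_slots(run_1, run_2))  # ['light']
print(changed_slots(run_2, run_3))  # ['medium', 'subject']
```

The step from `run_1` to `run_2` is a clean measurement; the step from `run_2` to `run_3` changes two slots at once, so any difference in the output cannot be pinned on either one.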

With this approach, most creators report reaching a usable result in 6–10 generations rather than 30–50 on the same concept. The same technique extends to fine details — a specific hand position, a specific background element — because you know exactly which clause in your prompt corresponds to which visual element.

"Stop regenerating. Start diagnosing. Every bad output is data. What in your prompt caused this? Change that one thing. You'll get there in 8 generations instead of 80."
r/AIArt · 11.4k upvotes