Why Every AI Image Looks the Same
There is a recognisable look to most AI-generated images. Not a bad look necessarily — but a familiar one. A certain rendered-ness. A specific way light falls, a particular approach to hair, a compositional confidence that feels slightly too perfect. You've seen a thousand of them. After a while they all blur together.
The people who escape this aren't using different models or secret settings. They're writing different prompts: specifically, prompts built to constrain output rather than to describe aspiration. The gap between "looks AI-generated" and "looks like it was actually taken or made by someone with a specific vision" lives almost entirely in how the prompt was written.
What a generic prompt actually does to a model
A prompt like "atmospheric abandoned mall, cinematic lighting, beautiful, melancholic, dreamcore aesthetic, 4k" is a list of desiderata — things the creator wants to feel when looking at the output. It is not a description of a scene. It gives the model maximum freedom to interpret, and models given maximum freedom default to their training priors: the most statistically common visual treatment of each concept present in the training data.
"Cinematic" pulls toward a very specific cluster. "4k" activates the sharpened, over-resolved look of stock photography at max settings. "Dreamcore" has a near-canonical aesthetic by now — someone has trained the internet on what it looks like. The output is a blend of the most confident guesses for each of those words. Which is why it looks like every other output generated with similar words.
"I had a breakthrough moment where I realised I was writing prompts for humans to read, not for models to execute. Once I started writing like I was briefing a DP — specific light source, specific surface, specific time, specific gear — my outputs became genuinely controllable."
r/VideoEditing · 7.3k upvotes
The anatomy of a prompt that works
A constrained prompt operates like a shot list. It specifies what exists in the frame, what light is doing, what medium captured it, and what should not be there. Five components, each one doing different work:
| Component | What it controls | Example |
|---|---|---|
| Subject + state | The anchor — what the model builds everything around | empty office chair, slightly pushed back, one armrest bent inward |
| Environment | Surfaces, architecture, depth — the model's spatial reference | late-90s government office, grey cubicle fabric, drop ceiling, one missing tile |
| Light | The single most powerful variable — determines mood, era, surface legibility | single overhead fluorescent, centre frame, no fill, hard shadow on floor |
| Medium + era | Texture, grain, colour science — makes the image feel made, not generated | 1998 security camera still, fixed wide lens, slight barrel distortion, low saturation |
| Negatives | Trims probability space — forces the model away from its default additions | no people, no daylight, no windows, no plants, no clocks |
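One way to keep the five slots separate while drafting is to hold them in a small structure and join them only at the end. The sketch below is a minimal illustration, not a required format: the class name `ShotPrompt`, its field names, and the comma-joined output are all assumptions made for this example, and the slot text is taken from the table above. Some tools accept negatives in a separate field, which is not modelled here.

```python
from dataclasses import dataclass, field


@dataclass
class ShotPrompt:
    """Five-slot prompt. Field names are illustrative, not a standard."""
    subject: str
    environment: str
    light: str
    medium: str
    negatives: list = field(default_factory=list)

    def render(self) -> str:
        # Positive slots first, in table order; negatives last as "no X".
        parts = [self.subject, self.environment, self.light, self.medium]
        parts += [f"no {item}" for item in self.negatives]
        return ", ".join(p.strip() for p in parts if p.strip())


prompt = ShotPrompt(
    subject="empty office chair, slightly pushed back, one armrest bent inward",
    environment="late-90s government office, grey cubicle fabric, drop ceiling, one missing tile",
    light="single overhead fluorescent, centre frame, no fill, hard shadow on floor",
    medium="1998 security camera still, fixed wide lens, slight barrel distortion, low saturation",
    negatives=["people", "daylight", "windows", "plants", "clocks"],
)
print(prompt.render())
```

Keeping the slots as separate fields makes it easy to see which one is empty or doing too little work before anything is generated.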
Prompt length vs prompt quality: the token count curve
There is a persistent myth that longer prompts produce better results. Research on CLIP-based architectures tells a more nuanced story. Prompt length beyond approximately 40 tokens produces diminishing returns — and in some architectures, the most important concepts begin to compete with surrounding text for cross-attention weight, diluting their effect.
The correct relationship is not length to quality; it is specificity per token. The chart below shows how prompt quality (measured as first-generation adherence to intent) peaks in the 15–35 token range, most often at 25–35 for current architectures, before plateauing and eventually declining:
Chart: adherence vs prompt length in tokens. Composite of reported adherence scores from CLIP-based model studies (2023–2024) and community surveys. Adherence = first-generation match to stated visual intent.
What this means practically: one well-specified light source beats ten mood adjectives. One concrete surface beats a paragraph of atmosphere. The goal is not to fill the prompt — it is to maximise constraint per token.
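A rough length check can be scripted before generating. The sketch below is only a crude proxy: it counts word-like runs, whereas a real model tokenizer splits text differently and its count may not match. The function names and the default budget of 40 (borrowed from the approximate figure mentioned above) are assumptions for illustration.

```python
import re


def rough_token_count(prompt: str) -> int:
    # Counts word-like runs only. A real tokenizer may report a different number.
    return len(re.findall(r"\w+", prompt))


def within_budget(prompt: str, budget: int = 40) -> bool:
    # The budget is a soft guide, not a hard limit of any particular model.
    return rough_token_count(prompt) <= budget


draft = "empty office chair, single overhead fluorescent, 1998 security camera still"
print(rough_token_count(draft), within_budget(draft))
```

A short prompt passing this check says nothing about its quality. The count is only useful for noticing when a draft has drifted into paragraph territory.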
Before and after: what the structure change actually looks like
Scene: liminal office hallway
| Version | Prompt |
|---|---|
| Weak | liminal space office hallway scary abandoned AI masterpiece 4k cinematic |
| Strong | empty office hallway at 2am, green carpet, dropped ceiling tiles, one flickering fluorescent at the far end, rows of dark doorways, no people, no text, no windows, security camera angle from corner, 2003 CCTV still, slight barrel distortion |
Scene: surreal outdoor / dreamcore
| Version | Prompt |
|---|---|
| Weak | surreal dreamcore beautiful sky landscape dreamy pastel aesthetic |
| Strong | empty playground at dusk, primary-coloured equipment, sky oversaturated to the wrong hue for the time of day, swings stationary with no wind, shadows pointing the wrong direction, no people, no animals, no movement, handheld phone video 2011, soft vignette |
The "strong" versions don't spend extra words describing the feeling. They replace feeling-words with physical specifics, some of them deliberately contradictory. "Scary" becomes empty + 2am + one light source far away. "Dreamy" becomes wrong sky colour + stationary swings + contradictory shadows. The uncanny emerges from the specific wrongness, not from labelling it. This is why it works. You cannot tell a model to render "eerie" reliably. You can tell it to render a specific arrangement of things that produces eeriness as a consequence.
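The substitution habit described above can be sketched as a small linter that flags feeling-words in a draft and, where the article offers one, suggests the physical rewrite. Everything here is hypothetical scaffolding: the word list, the lookup table, and the function name `flag_feeling_words` are assumptions for illustration, and the two suggested rewrites are simply the ones given in the text.

```python
import re
from typing import Dict, Optional

# Hypothetical lookup built from this article's two examples; extend per project.
PHYSICAL_SUBSTITUTES = {
    "scary": "empty + 2am + one light source far away",
    "dreamy": "wrong sky colour + stationary swings + contradictory shadows",
}

# Other mood or quality words from the weak prompts, with no suggested rewrite.
FEELING_WORDS = set(PHYSICAL_SUBSTITUTES) | {
    "beautiful", "melancholic", "atmospheric", "cinematic", "eerie", "surreal",
}


def flag_feeling_words(prompt: str) -> Dict[str, Optional[str]]:
    """Map each feeling-word found in the prompt to a suggested rewrite, if any."""
    words = re.findall(r"[a-z0-9]+", prompt.lower())
    return {w: PHYSICAL_SUBSTITUTES.get(w) for w in words if w in FEELING_WORDS}


weak = "liminal space office hallway scary abandoned AI masterpiece 4k cinematic"
for word, suggestion in flag_feeling_words(weak).items():
    print(word, "->", suggestion or "describe what is physically in the frame")
```

A word with no suggestion is still worth flagging: it marks a place where the prompt states a desired feeling instead of something a camera could record.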
"The moment I stopped writing the mood and started writing what was physically wrong with the scene, everything became more interesting. A playground where the swings don't move — that's more unsettling than any amount of 'eerie atmosphere' in a prompt."
r/midjourney · 8.9k upvotes
Convergence vs lottery: a different approach to iteration
Most people treat generation as a lottery: run enough attempts and eventually get lucky. This is slow and expensive. The alternative is treating generation as a convergence process — each attempt tells you something about how the model interprets your current prompt, and you adjust one variable at a time based on that information.
- Write a structured prompt using the five-slot shot-list components
- Generate 2–3 variants to get a read on what the model does with it
- Identify the single element most off from what you want
- Change only that element and regenerate
- If it improved, keep the change — now identify the next element to adjust
- Once in the right territory, use image editing to fix fine details without regenerating
Every generation is a measurement. If you change one thing between runs, you know what that thing controls. If you change five things, you've learned nothing except that something changed. Convergence requires information — and information requires isolating variables.
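The one-variable rule is easy to check mechanically if each run's prompt is stored by slot. The sketch below assumes prompts are kept as plain dictionaries keyed by the five component names; the helper names `changed_slots` and `is_controlled_step`, and the altered slot values in `run_2` and `run_3`, are hypothetical and exist only to illustrate the comparison.

```python
SLOTS = ("subject", "environment", "light", "medium", "negatives")


def changed_slots(previous: dict, current: dict) -> list:
    """Return the slot names whose text differs between two runs."""
    return [slot for slot in SLOTS if previous.get(slot) != current.get(slot)]


def is_controlled_step(previous: dict, current: dict) -> bool:
    """True when exactly one slot changed, so the visual difference is attributable."""
    return len(changed_slots(previous, current)) == 1


run_1 = {
    "subject": "empty office hallway at 2am, rows of dark doorways",
    "environment": "green carpet, dropped ceiling tiles",
    "light": "one flickering fluorescent at the far end",
    "medium": "security camera angle from corner, 2003 CCTV still, slight barrel distortion",
    "negatives": "no people, no text, no windows",
}
# Hypothetical follow-up runs: the first changes one slot, the second changes two.
run_2 = dict(run_1, light="one steady fluorescent at the near end")
run_3 = dict(run_2, medium="handheld phone video 2011", negatives="no people")

print(changed_slots(run_1, run_2), is_controlled_step(run_1, run_2))
print(changed_slots(run_2, run_3), is_controlled_step(run_2, run_3))
```

When the check fails, the practical move is to split the step: apply one of the changes, generate, then apply the next.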
With this approach, most creators report reaching a usable result in 6–10 generations rather than 30–50 on the same concept. The same technique extends to fine details — a specific hand position, a specific background element — because you know exactly which clause in your prompt corresponds to which visual element.
"Stop regenerating. Start diagnosing. Every bad output is data. What in your prompt caused this? Change that one thing. You'll get there in 8 generations instead of 80."
r/AIArt · 11.4k upvotes