AI Image Consistency: Moving from Vibes to Pipelines

Over the last couple of years, almost everyone in tech has had to interact with AI in some capacity—whether it's generating code, reviewing it, finding inspiration, or 'vibe coding' an idea into a terribly structured app. A lot has been said about whether AI is inherently good or bad, but I won't get into that debate too deeply here.

Instead, I want to talk about how we actually build reliable software with these tools. As a developer, chatting with an AI is one thing; relying on it in an automated pipeline is another. A major hurdle in image generation is consistency. A quick technical caveat: 'hallucination' usually refers to text models confidently stating made-up facts. Image generation has a related but different source of inconsistency, which is seed variation. Most current image generators start each image from random static, and that static is determined by a seed. If the seed and the other generation parameters are not held fixed, the same prompt can produce a noticeably different image, because every detail the prompt leaves unspecified is filled in afresh on each run.
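To make the seed idea concrete, here is a toy sketch in Python. It is not a real image model: the hypothetical `initial_noise` function only stands in for the random static a generator starts from, and it uses the standard `random` module rather than any image library.

```python
import random


def initial_noise(seed: int, size: int = 4) -> list[float]:
    """Return a toy 'noise' vector standing in for the random starting image."""
    rng = random.Random(seed)  # independent generator, seeded explicitly
    return [rng.random() for _ in range(size)]


same_a = initial_noise(seed=42)
same_b = initial_noise(seed=42)
different = initial_noise(seed=43)

print(same_a == same_b)     # same seed, identical starting noise
print(same_a == different)  # different seed, different starting noise
```

Whether you can control the seed at all depends on the tool. Some hosted generators do not expose it, which is part of why the rest of this post leans on structured prompts instead.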

Given that we are talking to a computer, we can try speaking its language a bit more to enforce this consistency—for example, by giving instructions in JSON format.

The Approach

In this experiment, I asked an AI to generate an image of a golden retriever:

Next, I asked the AI to provide a strict JSON-formatted description of the generated picture, and I received this:

{
  "image_filename": "watermarked_img_15660457155382619529.png",
  "subject": "Golden Retriever",
  "description": "A medium-to-large sized Golden Retriever sitting in a grassy, open field.",
  "physical_characteristics": {
    "coat_color": "Golden-tan",
    "coat_texture": "Medium length, dense fur",
    "posture": "Sitting upright",
    "expression": "Happy, with mouth slightly open and tongue visible"
  },
  "accessories": {
    "collar": "Blue collar with a small tag"
  },
  "setting": {
    "environment": "Outdoors, a grassy hill",
    "background": "Rolling green hills and a bright, soft-focus sky",
    "lighting": "Natural daylight, suggesting golden hour"
  }
}

I copied this JSON into an editor, manually changed the collar colour to 'red', and fed the JSON block back as my next prompt. This was the result:

As you can see, the images are fundamentally similar. The AI modified exactly what I asked for, rather than guessing what to change.
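I made that edit by hand, but the same step is easy to script. The following is a minimal sketch, assuming a trimmed copy of the description above. The helper name `with_change` and the exact replacement text for the collar are my own choices for illustration, not part of the original experiment.

```python
import copy
import json

# Trimmed copy of the description returned in the experiment above.
description = {
    "subject": "Golden Retriever",
    "accessories": {"collar": "Blue collar with a small tag"},
    "setting": {"lighting": "Natural daylight, suggesting golden hour"},
}


def with_change(desc: dict, path: list[str], value: str) -> dict:
    """Return a deep copy of desc with exactly one nested field replaced."""
    updated = copy.deepcopy(desc)  # leave the original description untouched
    node = updated
    for key in path[:-1]:
        node = node[key]  # raises KeyError if the path does not exist
    if path[-1] not in node:
        raise KeyError(path[-1])  # refuse to silently add a new field
    node[path[-1]] = value
    return updated


# Assumed wording for the edited field; only the colour differs.
red = with_change(description, ["accessories", "collar"], "Red collar with a small tag")
prompt = json.dumps(red, indent=2)  # text to send as the next prompt
print(prompt)
```

Working on a copy keeps the original description available as a baseline, which helps when you want to generate several variations from the same starting point.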

Why Does This Matter for Developers?

It is true that modern AI models—like Gemini, Midjourney, or DALL-E 3—are getting much better at handling natural language edits. You can often just say 'make the collar red' and get a decent result. However, if you are building an automated pipeline—say, a background script that generates 100 asset variations for a marketing dashboard—you cannot rely on conversational 'vibes'.
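As a rough sketch of what the prompt-building half of such a script might look like, the Python below expands one base description into a batch of JSON prompts. The variation axes, the values in them, and the function name `build_prompts` are all hypothetical, and the step that actually sends each prompt to an image model is deliberately left out because it depends on the service you use.

```python
import copy
import json
from itertools import product

base = {
    "subject": "Golden Retriever",
    "accessories": {"collar": "Blue collar with a small tag"},
    "setting": {"lighting": "Natural daylight, suggesting golden hour"},
}

# Hypothetical variation axes; replace with whatever the assets need.
collar_colors = ["Red", "Green", "Yellow", "Purple", "Orange"]
lightings = ["Natural daylight, suggesting golden hour", "Overcast midday light"]


def build_prompts(base: dict, colors: list[str], lights: list[str]) -> list[str]:
    """Build one JSON prompt string per (collar colour, lighting) combination."""
    prompts = []
    for color, lighting in product(colors, lights):
        variant = copy.deepcopy(base)  # every variant starts from the same base
        variant["accessories"]["collar"] = f"{color} collar with a small tag"
        variant["setting"]["lighting"] = lighting
        prompts.append(json.dumps(variant, sort_keys=True))
    return prompts


prompts = build_prompts(base, collar_colors, lightings)
print(len(prompts))  # one prompt per combination of the two axes
```

Because every prompt comes from the same base object, the only differences between variants are the fields the script changed on purpose.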

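You need strict, structured data. By having the model read and output JSON, you reduce its tendency to act as a 'creative middleman' that rewrites your prompt: every attribute you care about is spelled out as a named field, so there is less left for the model to reinterpret. This does not make the output deterministic, but it gives your code something it can parse, compare, and validate. It shifts the AI from being a conversational assistant toward acting like a strict API endpoint.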
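Treating the model like an API endpoint also means checking what comes back. Here is a small sketch, assuming you ask the model to describe the regenerated image in the same JSON shape and then compare that reply with what you sent. The `reply` string is a made-up example, and `parse_description` and `changed_paths` are hypothetical helper names.

```python
import json


def parse_description(raw: str) -> dict:
    """Parse model output, rejecting anything that is not a JSON object."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed text
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data


def changed_paths(before: dict, after: dict, prefix: str = "") -> list[str]:
    """List dotted paths whose values differ between two nested dicts."""
    paths = []
    for key in sorted(set(before) | set(after)):
        path = f"{prefix}.{key}" if prefix else key
        old, new = before.get(key), after.get(key)
        if isinstance(old, dict) and isinstance(new, dict):
            paths.extend(changed_paths(old, new, path))  # recurse into nested objects
        elif old != new:
            paths.append(path)
    return paths


sent = {
    "subject": "Golden Retriever",
    "accessories": {"collar": "Red collar"},
    "setting": {"environment": "Outdoors, a grassy hill"},
}
# Made-up model reply describing the regenerated image.
reply = (
    '{"subject": "Golden Retriever", '
    '"accessories": {"collar": "Red collar"}, '
    '"setting": {"environment": "Outdoors, a sandy beach"}}'
)

drift = changed_paths(sent, parse_description(reply))
print(drift)  # paths where the reply disagrees with what was sent
```

An empty list means the description round-tripped cleanly. Anything else is a field the pipeline can flag or retry instead of shipping an asset that quietly drifted.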

To me, this demonstrates the importance of providing the right context and programmatic structure. It also shows why software developer roles, and the need for precise, logical systems architecture, won't be disappearing; they are just evolving.

Thank you for reading! Feel free to reach out if you'd like to chat about AI pipelines or web development.