GenAI Model Showdown

A rigorous comparison of state-of-the-art generative image models on specific prompts and challenges with a strong emphasis on prompt adherence.

Created by Shaun Pedicini

GenAI Model Showdown - Comparative Analysis

Generating yet another photorealistic human portrait is a largely solved problem in the GenAI space. But what happens when we push these models beyond their comfort zone? This project compiles a comprehensive comparison of various state-of-the-art generative image models (Imagen, OpenAI DALL-E, Midjourney, Flux, Qwen, and others) against highly specific challenges with a strong emphasis placed on prompt adherence.

Beyond Pretty Pictures

While modern generative AI models excel at creating beautiful, photorealistic images, they often struggle with specific, unusual, or historically anachronistic requests. This project systematically tests these boundaries to understand not just what models can create, but how well they follow precise instructions.

Example Challenge: Alexander's Unusual Mount

Prompt:

"A historical oil painting of Alexander the Great riding a hippity hop toy into battle. The hippity hop toy was an old children's toy that looked like a giant rubber ball that a child could straddle and hold onto rubber handles. If the model did not have knowledge of a hippity hop in its training data, we would further prompt with a description of the hippity hop toy."

"The core workout of riding a hippity hop made Alexander the Great's march through the Indian subcontinent significantly more difficult but think of the GAINS!"
Imagen 4 - Alexander the Great on hippity hop
Imagen 4
5 attempts - ✅ Passed
Midjourney 7 - Alexander the Great attempt
Midjourney 7
16 attempts - ❌ Failed
QwenImage - Alexander the Great on hippity hop
QwenImage
2 attempts - ✅ Passed

Testing Methodology

  • Prompt Adherence: Each model is evaluated on how accurately it follows specific instructions, especially unusual or anachronistic elements.
  • Attempt Tracking: We record how many attempts were needed to achieve a satisfactory result, providing insight into model consistency.
  • Edge Case Testing: Deliberately challenging prompts that focus more on prompts that exhibit complex actions, testing the models' ability to blend disparate concepts.
  • Fair Comparison: All models receive identical prompts, with additional clarification provided only when necessary for models lacking specific training data.

Why This Matters

Understanding the boundaries and capabilities of generative AI models is crucial for:

  • Researchers studying the limitations and biases in current AI systems
  • Developers choosing which models to integrate into their applications
  • Anyone curious about the real capabilities versus marketing claims of these AI systems

Visit the Full Comparison

Explore the complete collection of prompts, model responses, and detailed analysis. See which models excel at following instructions and which prioritize aesthetic appeal over accuracy.

View All Comparisons

Key Features

  • Side-by-side comparison of SOTA models
  • Emphasis on prompt adherence testing
  • Challenging and specific prompts
  • Pass/fail criteria with attempt tracking

Technologies & Tags

AI GenAI Other