Qwen3 vs XKCD
A quick test of the Qwen3 VLM illustrated through the historical challenges of image recognition.
The Qwen3-VL framework , which was released in September of 2025, integrates a vision encoder and a language model decoder to process multimodal inputs, including text, images, and video. In addition to its impressive multimodal capabilities, it is open-weight and quantized models can easily run on modest hardware.
I’m sure everyone remembers this famous XKCD comic around image classification and recognition some years back - well before the advent of LLMs.
For kicks let’s try feeding the comic itself into Qwen3-VL running on a modest Mac M1 with 32GB of RAM.
Full Response from Qwen3-VL
This is a black-and-white comic strip with minimalist stick-figure characters and text-based humor, centered on a Computer Science (CS) joke about task complexity. Below is a detailed breakdown:
1. Visual Elements & Characters
-
Left Character (Standing):
- A minimalist stick figure with a circular head and three straight lines for the body (implying two legs and one arm, though drawn simply).
- No facial features; the figure is entirely defined by basic geometric shapes.
- Positioned standing (no chair), facing the seated character.
-
Right Character (Seated):
- A stick figure with a ponytail (a single line extending from the circular head to represent the ponytail).
- Seated in an office chair with a circular seat, backrest, and four legs (drawn with simple lines).
- At a desk with:
- A rectangular computer monitor (screen + base).
- A small rectangular keyboard on the desk.
- The seated character’s posture suggests engagement with the computer (hands implied near the keyboard).
2. Text & Dialogue (All in Uppercase, Typical of Comic Strips)
-
Top Panel (Narrative Setup):
“WHEN A USER TAKES A PHOTO, THE APP SHOULD CHECK WHETHER THEY’RE IN A NATIONAL PARK …”
- This introduces the scenario: an app’s intended functionality (geotagging photos for national parks).
-
Right Character’s First Response (Speech Bubble):
“SURE, EASY GIS LOOKUP. GIMME A FEW HOURS.”
- The character (likely a developer or tech expert) claims the task is straightforward: using GIS (Geographic Information Systems) to verify national park location.
- “GIMME A FEW HOURS” emphasizes perceived ease (a quick, technical task).
-
Left Character’s Follow-Up (Speech Bubble):
“… AND CHECK WHETHER THE PHOTO IS OF A BIRD.”
- The standing character adds a second requirement: the app must identify if the photo contains a bird.
-
Right Character’s Second Response (Speech Bubble):
“I’LL NEED A RESEARCH TEAM AND FIVE YEARS.”
- This contrasts sharply with the first response: identifying birds in photos is extremely complex (requires expertise, data, and time—not a simple tech task).
-
Bottom Caption (Contextual Explanation):
“IN CS, IT CAN BE HARD TO EXPLAIN THE DIFFERENCE BETWEEN THE EASY AND THE VIRTUALLY IMPOSSIBLE.”
- “CS” = Computer Science. The caption explains the joke: some tasks (like GIS lookups) seem easy but are trivial to implement, while others (like bird identification) are virtually impossible without massive resources.
3. Specific Details & Context
-
Animals/Objects:
- The only animal referenced is birds (the target of the photo-check task).
- Objects include:
- A computer monitor (rectangular screen + base).
- A keyboard (small rectangle on the desk).
- An office chair (circular seat, backrest, four legs).
- A desk (rectangular surface).
-
Humor & Juxtaposition:
- The comic’s joke hinges on misleading simplicity:
- GIS lookup for national parks → easy (uses existing tech, quick to implement).
- Bird identification in photos → virtually impossible (requires expertise, training data, and time—not a simple algorithm).
- This highlights a core CS challenge: explaining the gap between intuitive tasks and their technical realities.
- The comic’s joke hinges on misleading simplicity:
-
Art Style:
- All elements are drawn with bold black lines on a white background (typical of web comics).
- Characters are minimalist stick figures (no facial features, simple shapes), emphasizing the joke’s focus on process, not appearance.
4. Broader Interpretation
The comic satirizes how technical tasks can be deceptively easy or impossibly hard. It targets CS professionals (or anyone familiar with tech limitations) by contrasting:
- GIS lookups (a well-established, efficient tech task).
- Bird identification (a problem requiring deep learning, massive datasets, and expert validation—not a simple “if-else” check).
The phrase “VIRTUALLY IMPOSSIBLE” underscores the absurdity of expecting a simple app to handle complex, real-world tasks (like identifying species in photos), which often require years of research and specialized tools.
In short, the image is a clever, minimalist comic that uses stick figures and text to illustrate a universal CS struggle: the gap between what seems easy and what is actually feasible.
Conclusion
We’ve come a long way since the days of the ImageNet challenge and convolutional models such as AlexNet which came out nearly 15 years ago.