
Llama 4 Stumbles While Gemini 2.5 Pro Soars: A Tale of Two Recent AI Releases

  • Kai Haase
  • May 14
  • 4 min read

[Image: Visual comparison of two recent AI model launches, with Meta's Llama 4 stumbling while Google's Gemini 2.5 Pro soars.]

The AI landscape moves at lightning speed, and the last few weeks have been no exception. Two giants, Meta and Google, dropped significant new language model updates: Meta with its long-awaited Llama 4 family and Google with the impressive Gemini 2.5 Pro. While both represent advancements, the reception and perceived trajectory couldn't be more different. Meta's release seems marred by disappointment and controversy, while Google continues its steady climb, delivering capabilities that feel genuinely useful.


Meta's Llama 4: High Hopes Meet Underwhelming Reality

Meta has historically been a cornerstone of the open-source AI movement. Their Llama models, released with permissive licenses, became incredibly popular, fueling innovation across the tech field. Anticipation for Llama 4 was high, especially after a significant gap since the last major release.


However, the road to Llama 4 seems to have been bumpy. Rumors swirled about delays, suggesting internal challenges. Some sources even whispered of panic after the surprise drop of DeepSeek's R1 model, which reportedly outperformed Meta's internal Llama 4 benchmarks, forcing a scramble to incorporate new techniques.


When Llama 4 finally arrived – notably, released over a weekend, a timing choice many interpreted as an attempt to minimize initial scrutiny – it boasted some impressive specs on paper:


  • Natively Multimodal: Understanding image and video inputs.

  • Multiple Sizes: Scout (small), Maverick (medium), and Behemoth (large, still in training).

  • Massive Context (Scout): A headline-grabbing 10 million token context window.

  • Mixture-of-Experts Architecture: Only a handful of expert sub-networks are activated for each token, keeping per-token compute well below what the total parameter count suggests (a toy sketch of the routing idea follows this list).
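
For readers who haven't met the term, a mixture-of-experts layer sends each token through only a few specialized sub-networks ("experts") chosen by a small router, instead of one dense feed-forward block. The toy NumPy sketch below illustrates top-k routing with made-up sizes; it shows the general idea, not Llama 4's actual implementation.

```python
import numpy as np

# Toy mixture-of-experts layer: each token is processed by only its top-k experts.
# Sizes, the softmax gate, and the plain matrix-multiply "experts" are illustrative;
# this is not Llama 4's actual architecture.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]  # expert weights
router = rng.normal(size=(d_model, n_experts))                             # gating weights

def moe_layer(x: np.ndarray) -> np.ndarray:
    """x: (n_tokens, d_model) -> (n_tokens, d_model), activating top_k experts per token."""
    logits = x @ router
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # softmax gate
    out = np.zeros_like(x)
    for t, token in enumerate(x):
        chosen = np.argsort(probs[t])[-top_k:]               # indices of the top-k experts
        weights = probs[t, chosen] / probs[t, chosen].sum()  # renormalize over chosen experts
        for w, e in zip(weights, chosen):
            out[t] += w * (token @ experts[e])                # weighted sum of expert outputs
    return out

tokens = rng.normal(size=(3, d_model))
print(moe_layer(tokens).shape)  # (3, 8): same shape as a dense layer, but only 2 of 4 experts ran per token
```

The appeal of this design is that total parameter count can grow with the number of experts while compute per token stays roughly flat, which is how MoE models can be very large on paper yet comparatively cheap to run.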


Despite these specs, the community "vibe check" has been largely negative. While the Scout model's 10M context window allows for impressive "needle-in-a-haystack" retrieval (like finding a password hidden in the Harry Potter series), users report it struggles with deeper comprehension over long texts. Furthermore, the practical memory requirements for such a large context put it out of reach for most. The medium-sized Maverick model offers a more modest 1 million token context.
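
To make the distinction concrete: a needle-in-a-haystack test can be as crude as burying one sentence in an enormous block of filler and asking the model to repeat it back. The Python sketch below builds such a prompt; the filler text, the "needle", and the commented-out model call are placeholders of my own, not the evaluation Llama 4 was actually judged on.

```python
import random

# Minimal needle-in-a-haystack prompt builder (illustrative only; real evaluations
# hide the "needle" inside long natural text such as novels rather than repeated filler).
FILLER = "The sky was grey and nothing of note happened that day. "
NEEDLE = "The secret passphrase is 'violet-citadel-42'. "

def build_haystack(n_filler_sentences: int, seed: int = 0) -> str:
    """Return a long document with the needle inserted at a random position."""
    random.seed(seed)
    sentences = [FILLER] * n_filler_sentences
    sentences.insert(random.randrange(len(sentences)), NEEDLE)
    return "".join(sentences)

prompt = (
    build_haystack(n_filler_sentences=50_000)  # hundreds of thousands of tokens of filler
    + "\n\nQuestion: What is the secret passphrase mentioned in the text above?"
)

# The prompt would then go to the model under test, e.g. (hypothetical client call):
#   answer = client.generate(model="llama-4-scout", prompt=prompt)
# Passing this kind of test only shows the model can locate a fact, not that it can
# reason about everything it has read.
```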


More concerning are the accusations of benchmark manipulation. Llama 4 initially shot up leaderboards like the LMSys Chatbot Arena. However, it emerged that the model submitted wasn't the base open-weight model, but a version specifically fine-tuned for human preference on that leaderboard. LMSys publicly stated, "Meta's interpretation of our policy did not match what we expect from model providers." This, coupled with broader suspicions of training on benchmark test data (which Meta denies), has cast a shadow over the reported performance gains. Speculation arose connecting the Llama 4 controversy to the departure of Meta's Head of AI Research, Joelle Pineau, though Meta leadership has refuted claims of benchmark hacking. Poor results on benchmarks like the Aider Polyglot coding benchmark (where Llama 4 Maverick scored a low 15.6%) further fuel the disappointment.


While Llama 4 might still hold potential as a base model, the launch feels anticlimactic and somewhat clumsy, failing to live up to the high expectations set by its predecessors.



Google's Gemini 2.5 Pro: Quietly Delivering the Goods


In stark contrast, Google's recent release of Gemini 2.5 Pro has been met with considerable enthusiasm, including my own. It feels like a significant step forward, particularly in areas crucial for real-world applications.


What stands out immediately is Google's apparent fairness in reporting. They highlighted benchmarks where Gemini 2.5 Pro performs well, but also openly included results from benchmarks like LiveCodeBench v5 and SWE-bench Verified where competitors edged it out slightly. This transparency builds confidence.


For me, the standout feature is Gemini's mastery of long context. While Llama 4 Scout touts a 10M token window mainly for retrieval, Gemini 2.5 Pro (tested internally up to 10M, publicly available up to 2M tokens) demonstrates remarkable understanding across vast amounts of text. This was powerfully illustrated by the Fiction LiveBench benchmark. The test requires models to piece together plot points and character motivations scattered across tens or hundreds of thousands of words in a fictional story – a task demanding true comprehension, not just finding keywords. Gemini 2.5 Pro excelled here, pulling far ahead of competitors, while Llama 4 models struggled significantly. This ability to genuinely process and reason over entire documents, codebases, or books unlocks powerful use cases.


Beyond long context, Gemini 2.5 Pro has shown SOTA or near-SOTA performance across various challenging benchmarks, including a record high score on SimpleBench (testing tricky logic, spatial reasoning, and common sense) and topping the charts on the WeirdML benchmark. It also brings practical advantages like handling YouTube URLs directly and having a very recent knowledge cutoff (January 2025).
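
As a small illustration of the YouTube point: with Google's google-genai Python SDK you can pass a video URL directly as part of the prompt. The snippet below is a sketch based on my understanding of that SDK; the exact field names and the model identifier are assumptions on my part and should be checked against the current documentation.

```python
# Sketch: asking Gemini 2.5 Pro about a YouTube video by URL with the google-genai
# Python SDK. Field names and the model id reflect my understanding of the SDK and
# should be treated as assumptions, not a verified recipe.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder API key

response = client.models.generate_content(
    model="gemini-2.5-pro",  # placeholder; use the exact model id exposed by the API
    contents=types.Content(parts=[
        types.Part(file_data=types.FileData(
            file_uri="https://www.youtube.com/watch?v=EXAMPLE_VIDEO_ID")),  # any public video URL
        types.Part(text="Summarize the key arguments made in this video."),
    ]),
)
print(response.text)
```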


The Takeaway: Momentum Shifts


Comparing these two releases paints a clear picture. Meta, once the undisputed open-weight champion, appears to be facing headwinds. The Llama 4 launch feels less like a confident stride forward and more like a hurried attempt to stay relevant, undermined by questionable tactics and performance that doesn't quite match the hype.


Google, on the other hand, is steadily building momentum. With Gemini 2.5 Pro, they've delivered a model that isn't just incrementally better on standard benchmarks but offers a leap in a genuinely useful capability – deep, long-context understanding. I find myself increasingly turning to Gemini for tasks involving document analysis or reasoning over large text bodies.


The AI race is far from over, and new breakthroughs can come from anywhere. But based on this latest round, Google seems to have positioned itself as a more reliable and impressive innovator in the language model space, while Meta has some ground to make up, both in performance and in rebuilding community trust.

 
 