AI Models Still Struggle With Reasoning — And Here’s Why

AI models have achieved remarkable feats in the last three years. ChatGPT generates text at speed and with increasing creativity, Aiva composes music, Sora generates video and Midjourney produces art, and many of these models even pass standardized tests.

Yet when it comes to advanced mathematical reasoning, the kind that requires applying logic beyond surface-level patterns and in unfamiliar contexts, these models often falter. They may excel at routine problems, but they struggle with tasks that demand deep understanding and break down when logic must be applied in unfamiliar or unstructured ways.

And that raises a question about whether current AI benchmarks are truly measuring intelligence or merely assessing a model’s ability to mimic patterns.

Today’s Benchmarking Problem

Traditional benchmarks like GSM8K (Grade School Math 8K), a widely recognized dataset designed to evaluate the mathematical reasoning capabilities of large language models, have been the gold standard for measuring AI’s progress in math reasoning, with reported accuracies climbing past 90%. But a study by Apple researchers suggests that GSM8K may not adequately assess genuine reasoning.

The researchers developed the GSM-Symbolic benchmark, which generates variations of standard problems to test a model’s adaptability. When names and numerical values are changed while the underlying logic stays the same, the same models show a significant drop in performance.
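
To make the idea concrete, here is a minimal sketch of the kind of perturbation GSM-Symbolic applies: take a templated word problem, swap in new names and numbers, and recompute the ground-truth answer so only the surface details move. The template, the name pool and the number ranges below are illustrative assumptions, not material from the actual benchmark.

```python
import random

# Illustrative GSM8K-style template; placeholders are filled in at generation time.
# This is a sketch of the perturbation idea, not the actual GSM-Symbolic pipeline.
TEMPLATE = (
    "{name} has {a} apples. She buys {b} more bags with {c} apples each. "
    "How many apples does {name} have now?"
)

NAMES = ["Sofia", "Liam", "Priya", "Chen"]  # hypothetical name pool


def make_variant(rng: random.Random) -> tuple[str, int]:
    """Return a perturbed problem and its recomputed ground-truth answer."""
    name = rng.choice(NAMES)
    a, b, c = rng.randint(2, 20), rng.randint(1, 5), rng.randint(2, 12)
    question = TEMPLATE.format(name=name, a=a, b=b, c=c)
    answer = a + b * c  # the logic is fixed; only names and numbers change
    return question, answer


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        question, answer = make_variant(rng)
        print(question, "->", answer)
```

A model that genuinely reasons should score the same on every variant; the Apple study found that accuracy drops when only the surface details are shuffled.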

As Dr. Matthew Yip, a mathematics educator and assessment designer with the GMMO, noted in an interview, “We’re rewarding models for replaying training data, not reasoning from first principles. We need to move beyond this approach into the very roots of how exactly AI models ‘think’ and process logic. Benchmarks must force adaptability — change the rules and see if they still understand.”

Another study, on the UTMath benchmark, which frames each of its 1,053 problems as a suite of unit tests, found the best models solving just 32.57% of cases, roughly one in three, reinforcing Yip’s point.
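
The unit-test framing is easy to picture: a model’s generated solution only counts as solved if it passes every case in a suite, not just a single spot-checked answer. Below is a minimal sketch under that framing, using a hypothetical task (“return the n-th triangular number”) and a hypothetical candidate solution; neither is taken from UTMath itself.

```python
# Sketch of unit-test-style evaluation: a generated solution counts as solved
# only if it passes every case in the suite. The task and the candidate below
# are hypothetical placeholders, not actual UTMath content.

def candidate_triangular(n: int) -> int:
    """Stand-in for model-generated code for 'return the n-th triangular number'."""
    return n * (n + 1) // 2


TEST_CASES = [(0, 0), (1, 1), (4, 10), (10, 55), (100, 5050)]


def passes_suite(fn) -> bool:
    """A single failed case means the whole problem is marked unsolved."""
    return all(fn(n) == expected for n, expected in TEST_CASES)


if __name__ == "__main__":
    print("solved" if passes_suite(candidate_triangular) else "failed")
```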

And on the FrontierMath benchmark, co-developed by more than 60 mathematicians and made up of hundreds of original, exceptionally challenging mathematical problems, a recent study showed that today’s state-of-the-art AI models solved less than 2% of the problems. That is despite their impressive performance on traditional benchmarks such as GSM8K, MMLU (Massive Multitask Language Understanding), HumanEval, Big-Bench Lite (BBL) and others.

These results highlight the challenges AI faces in understanding mathematical concepts and reasoning through complex problems. The bigger question, though, is this: if slight rewrites trip up AI on grade-school math, what happens when these models are used to optimize loans, diagnose illness or write code for mission-critical systems? Can we truly trust the outcomes of AI models that people across all walks of life increasingly rely on in their day-to-day activities?

While model developers like OpenAI often add a disclaimer that AI outputs aren’t always accurate and must be verified, some experts consider this a poor attempt at masking the inadequacies of current models. For many, the problem has to be addressed at the root.

The Broader Implications

Cognitive scientist Gary Marcus has long warned that narrow testing with AI benchmarks gives a false sense of progress. “The largest problem, the longest-term problem, is we have poor control over these systems,” he told the Wall Street Journal, arguing that current benchmarking metrics let brittle models slip through evaluation gates.

The truth is that the shortcomings of current AI benchmarks have far-reaching implications across several industries. In fields like finance, healthcare and scientific research, for example, decisions often hinge on complex reasoning. If AI systems are trained and evaluated using flawed benchmarks, there’s a risk of deploying models that perform well in controlled settings but fail in real-world applications.

Moreover, the overreliance on benchmarks that reward pattern recognition can hinder the development of AI systems capable of genuine reasoning. As AI continues to integrate into various aspects of society, ensuring that these systems possess true understanding becomes paramount.

That’s why Yip said it’s critical to develop more robust benchmarks that assess reasoning and adaptability. “We must trace each step of a model’s reasoning, not just judge whether the final answer matches,” he explained, adding that “robust benchmarks vary context, rules and data distributions — forcing models to prove they understand and not just recall.”

Such benchmarks, he noted, should incorporate diverse problem sets, evaluate the reasoning process and minimize the chances of models relying solely on memorization.

Beyond Memorization

What today’s traditional benchmarks reveal isn’t really intelligence, but clever mimicry — and cleverness alone doesn’t build trust. For Yip, true machine intelligence must move beyond memorization to transferable understanding, such that models can adapt even when conditions change. “Most current assessments — either for students or machines — focus too much on whether the answer is correct, not how the reasoning process itself unfolds,” he said in an interview.

Yip noted that integrating principles from human education into AI assessment can be beneficial to how AI benchmark tests are structured, adding that “understanding comes not from getting one right answer, but from being able to generalize that logic across levels.”

Some practical steps that he suggested include:

  1. Process-centric scoring: Rather than scoring only the final answer, evaluate chains of thought step by step, rewarding sound logic even if the arithmetic slips (see the sketch after this list).
  2. Adaptive adversarial prompts: Continually generate new problem variants that exploit known weaknesses, keeping benchmarks one step ahead of overfitting.
  3. Cross-domain suites: Blend math, language, vision and code reasoning into unified tests, reflecting the multifaceted nature of real tasks.
  4. Expert-in-the-loop validation: Periodic human review of model reasoning traces ensures benchmarks remain meaningful and free from contamination.
  5. Dynamic evolution: Like standardized exams for human learners, AI benchmarks should evolve — retiring problems once models master them and introducing fresh challenges.
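
As a rough illustration of the first point, here is a minimal sketch of process-centric scoring: each step of a reasoning trace is checked against a rubric, so sound intermediate logic earns partial credit even when a final arithmetic slip spoils the answer. The step labels, checks and weights are hypothetical, not a standardized rubric.

```python
# Minimal sketch of process-centric scoring: award partial credit per reasoning
# step instead of grading only the final answer. Steps, checks and weights are
# hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    description: str           # model's stated reasoning for this step
    check: Callable[[], bool]  # rubric check for this step
    weight: float              # contribution of this step to the process score


def score_trace(steps: list[Step], final_correct: bool, answer_weight: float = 0.3) -> float:
    """Blend step-level credit with a smaller bonus for the final answer."""
    process_total = sum(s.weight for s in steps)
    process_score = sum(s.weight for s in steps if s.check()) / process_total
    return (1 - answer_weight) * process_score + answer_weight * float(final_correct)


if __name__ == "__main__":
    # Example trace: correct setup and method, but a slip in the last multiplication.
    trace = [
        Step("Identify quantities: 3 bags of 12 apples", lambda: True, 1.0),
        Step("Choose operation: total = 3 * 12", lambda: True, 1.0),
        Step("Compute: 3 * 12 = 38", lambda: False, 1.0),  # arithmetic slip
    ]
    print(f"{score_trace(trace, final_correct=False):.2f}")  # ~0.47 rather than 0
```

An answer-only grader would give this trace a zero; a process-centric one records that the setup and method were sound and only the last computation failed.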

Toward Better Benchmarking

AI’s march forward is real, but so too is the gap between surface performance and genuine understanding. As AI systems become increasingly prevalent, ensuring their ability to reason and understand becomes even more crucial.

As Yip noted, if we keep rewarding models for repeating patterns, we’ll deploy systems that crumble outside controlled tests. By rethinking how we evaluate reasoning — drawing on insights from education, cognitive science and adversarial testing — we can push models toward true intelligence, rather than clever mimicry.

“The next wave of AI will be judged not by the checkmarks it accumulates on old tests, but by its ability to tackle problems we’ve never imagined,” he concluded.
