It is probably already clear that large language models that use “reasoning” (chain of thought, for example) can deliver much better accuracy than non-reasoning models.
Reasoning models consistently perform well on simple queries. Benchmarks such as MMLU (Massive Multitask Language Understanding) and ARC (the AI2 Reasoning Challenge) show that models reach near-human or superhuman accuracy on tasks that do not require multi-step or abstract reasoning.
But reasoning models frequently exhibit a "reasoning is not correctness" gap as query complexity grows. Performance degrades sharply as the number of required steps increases, and complex tasks demand exactly those longer chains. Token usage grows three to five times for complex queries, while accuracy drops 30 percent to 40 percent compared with simple tasks.
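A simple compounding-error model helps explain why step count matters so much: if each reasoning step is correct with some probability, the odds of the whole chain being correct shrink multiplicatively. The sketch below is purely illustrative, and the 95 percent per-step accuracy it assumes is an invented figure, not a measurement.

```python
# Illustrative compounding-error model: if each reasoning step is correct
# independently with probability p, an n-step chain is correct with p**n.
# The per-step accuracy below is an assumed figure, not a benchmark result.

def chain_accuracy(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in an n-step reasoning chain is correct."""
    return per_step_accuracy ** steps

p = 0.95  # assumed per-step accuracy
for steps in (1, 5, 10, 20, 40):
    print(f"{steps:>2} steps -> {chain_accuracy(p, steps):.0%} end-to-end accuracy")
```

Even an assumed 95 percent per-step accuracy collapses to roughly 36 percent over 20 steps, which is at least consistent with the sharp drops reported for long multi-step tasks.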
Reasoning models using techniques like Chain-of-Thought (CoT) can break down moderately complex queries into steps, improving accuracy and interpretability. However, this comes with increased latency and sometimes only modest gains in factuality or retrieval quality, especially when the domain requires specialized tool usage or external knowledge.
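For readers who have not seen it in practice, Chain-of-Thought prompting is mostly a matter of how the request is worded: the prompt asks the model to write out intermediate steps before committing to an answer. The sketch below illustrates the pattern only; call_model is a hypothetical placeholder, not a real client library.

```python
# Minimal sketch of direct prompting vs. Chain-of-Thought prompting.
# call_model() is a hypothetical stand-in for whatever LLM client is in use.

def call_model(prompt: str) -> str:
    """Hypothetical placeholder for a real LLM API call."""
    raise NotImplementedError("wire this up to your LLM client of choice")

def direct_answer(question: str) -> str:
    # One-shot request: cheapest and fastest, but no visible reasoning.
    return call_model(f"Answer concisely: {question}")

def chain_of_thought_answer(question: str) -> str:
    # Asking for numbered intermediate steps trades extra tokens and latency
    # for better accuracy on multi-step problems and an inspectable trace.
    return call_model(
        "Work through this step by step, numbering each step, "
        "then give the final answer on its own line.\n\n"
        f"Question: {question}"
    )
```

The trade-off in the code mirrors the one described above: the longer prompt and longer response buy interpretability and, usually, accuracy at the cost of latency and tokens.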
The biggest issues come with the most complex tasks. Research shows that even state-of-the-art models fall prey to a "reasoning does not equal experience" fallacy: they can articulate step-by-step logic, yet lack the knowledge or procedural experience that domain-specific reasoning requires.
That happens because reasoning models can produce chains of logically coherent steps that nonetheless contain critical factual errors. In scientific problem-solving, for example, 40 percent of model-generated solutions pass syntactic checks but fail empirical validation.
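The gap between "looks right" and "is right" is easiest to see when the model's output is code: a generated solution can parse cleanly and still fail the moment it runs against known cases. The sketch below uses Python's standard ast module for the syntactic check and a few known inputs for the empirical one; the generated_solution string is an invented example, not real model output.

```python
import ast

# A "model-generated" solution that is syntactically valid Python but
# empirically wrong: it computes a rectangle's perimeter, not its area.
# The string is invented here purely for illustration.
generated_solution = """
def rectangle_area(width, height):
    return 2 * (width + height)
"""

def passes_syntactic_check(source: str) -> bool:
    """Syntactic validation: does the code even parse?"""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

def passes_empirical_check(source: str) -> bool:
    """Empirical validation: does it give correct answers on known cases?"""
    namespace: dict = {}
    exec(source, namespace)  # run the generated code in an isolated namespace
    area = namespace["rectangle_area"]
    return all(area(w, h) == w * h for w, h in [(2, 3), (5, 5), (1, 10)])

print("syntactic check:", passes_syntactic_check(generated_solution))  # True
print("empirical check:", passes_empirical_check(generated_solution))  # False
```

Only the second check catches the error, which is exactly the kind of gap the 40 percent figure describes.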
Such issues are likely to be more concerning for some use cases than for others. Advanced mathematics problems, for example, such as proving novel theorems or solving high-dimensional optimization problems, often require formal reasoning that goes beyond pattern matching.
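As a point of contrast, here is what formal reasoning looks like when it is mechanically checked: a proof assistant such as Lean accepts a theorem only if every step verifies, so a plausible-sounding but invalid argument simply fails to compile. The deliberately trivial example below is standard Lean 4 and reuses a lemma (Nat.add_comm) from the core library.

```lean
-- A deliberately tiny formal proof: addition of natural numbers is commutative.
-- Lean's kernel checks every step; a merely plausible argument would not compile.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Novel theorems, of course, have no ready-made lemma to reuse, which is where pattern matching on familiar proofs stops helping.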
Problems involving multiple interacting agents (economic simulations, game theory with many players) can overwhelm LLMs due to the exponential growth of possible outcomes.
In a complex negotiation scenario, an LLM might fail to account for second-order effects of one agent’s actions on others.
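The combinatorics behind that failure mode are easy to state: with n agents each choosing among b actions per round, the number of distinct joint trajectories after r rounds is (b^n)^r. The parameters in the back-of-the-envelope sketch below are arbitrary, chosen only to show the scale involved.

```python
# Back-of-the-envelope count of joint outcomes in a multi-agent scenario.
# All parameters are arbitrary and purely illustrative.
agents = 5    # interacting agents
actions = 10  # actions available to each agent per round
rounds = 4    # rounds of interaction

joint_actions_per_round = actions ** agents              # 10**5 = 100,000
total_trajectories = joint_actions_per_round ** rounds   # (10**5)**4 = 10**20

print(f"{total_trajectories:.1e} possible joint trajectories")  # 1.0e+20
```

No model, reasoning or otherwise, enumerates a space like that; it has to approximate, and second-order effects are exactly what such approximations tend to lose.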
Also, problems spanning multiple domains (designing a sustainable energy grid, say, which involves engineering, economics, and policy) require integrating diverse knowledge that LLMs may never have been trained to combine.
Of course, one might counter that humans working without LLMs also make mistakes when assessing complex problems, and likewise produce logically reasoned but still "incorrect" conclusions.
But many complex queries will probably still benefit from reasoning models, because most queries do not test the limits of novel theorems, economic simulations, many-player game theory, cross-domain integration or unexpected human behavior.
So for many use cases, even complexity might not be a practical issue for a reasoning LLM, even though such models demonstrably become less proficient as problem complexity rises. And, of course, researchers are working on ways to ameliorate these issues.