Somewhere in the archive of Humanity’s Last Exam sits a question about a Roman tombstone. Not the Latin inscription, which would be too easy, but the Palmyrene text running alongside it, written in a language spoken in ancient Syria and dead for seventeen centuries. The question was written by a classicist at Oxford, tested against every major AI system available, and admitted to the dataset only after all of them failed. It is one of 2,500 such questions. Together they constitute the most rigorous attempt yet to find the ceiling of what artificial intelligence actually knows.
The need for such an exercise became urgent because the ceiling of existing tests had long since been scraped. Models from OpenAI, Google, Anthropic and others now exceed 90 per cent accuracy on MMLU — the Massive Multitask Language Understanding benchmark that was, only a few years ago, considered a meaningful measure of machine intelligence. A test that most PhD students would struggle with has become, for frontier AI, something close to routine.
So a global consortium — nearly 1,000 subject-matter experts affiliated with more than 500 institutions across 50 countries — spent months designing questions that might actually matter. The result, Humanity’s Last Exam (HLE), was published in Nature in January and covers mathematics, biology, linguistics, chemistry, history, computer science and much else besides. Its questions require not internet retrieval but genuine reasoning: how many paired tendons are supported by a specific sesamoid bone unique to hummingbirds? Which class of graphs satisfies a particular convergence property in Markov chains? The question-setters were mostly professors and graduate researchers, each one working in territory AI couldn’t easily follow. Those it could follow were cut.
The filtering process alone tells you something. More than 70,000 candidate questions were run against frontier models during development; roughly 13,000 stumped the models sufficiently to proceed to human expert review. Of those, 2,500 survived to become HLE. Each surviving question had to have a single unambiguous answer, be verifiable by a domain expert, and resist web search.
When the benchmark was published and the frontier models were finally tested against it, the scores were not encouraging, at least for anyone hoping for superintelligence. GPT-4o managed 2.7 per cent. Claude 3.5 Sonnet reached 4.1 per cent. OpenAI’s reasoning-specialist o1, the system explicitly designed to think harder before answering, achieved 8 per cent. More recent models have done better: GPT-5, released after HLE was made public, scored around 25 per cent. But the benchmark was engineered to resist saturation. Even a quarter correct leaves three-quarters wrong.
What makes the results especially telling isn’t just the accuracy numbers. It’s the calibration. When a model is wrong, does it know it’s probably wrong? Well-calibrated systems should hedge on hard questions, expressing lower confidence when they’re guessing. HLE found the opposite: most frontier models exhibited calibration errors above 70 per cent, meaning they were consistently wrong in ways they didn’t recognise. Confident and mistaken is a more alarming failure mode than uncertain and mistaken.
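One common way to quantify this is expected calibration error: group a model’s answers by its stated confidence, then measure the gap between average confidence and actual accuracy in each group. The sketch below is illustrative only; the function name, data, and binning scheme are assumptions for demonstration, not HLE’s published methodology.

```python
# Illustrative calibration-error sketch (hypothetical data and metric,
# not HLE's exact method): bin answers by stated confidence, then compare
# each bin's mean confidence with its observed accuracy.

def calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error: the bin-weighted average gap between
    stated confidence and observed accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Map a confidence in [0, 1] to a bin index (1.0 lands in the last bin).
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

# A model that answers with 95% confidence but is right only 25% of the
# time has a calibration error of 0.7 -- confident and mistaken, exactly
# the failure mode HLE surfaced.
confs = [0.95] * 20
right = [1] * 5 + [0] * 15  # 25% accuracy
print(round(calibration_error(confs, right), 2))  # → 0.7
```

On this measure, a humble model that says “50 per cent sure” and is right half the time scores near zero, even though its raw accuracy is mediocre; calibration rewards knowing what you don’t know.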
There is one more curious finding in the data. Reasoning models — those designed to generate extended chains of thought before committing to an answer — do improve with more thinking, up to a point. Feed them more tokens to reason with and accuracy climbs on a roughly log-linear curve. But beyond about 16,000 reasoning tokens, the trend reverses. More deliberation starts to hurt. Why this happens isn’t yet understood, but it suggests that simply scaling up compute at inference time isn’t a path to expert-level knowledge.
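The reported shape can be sketched with a toy curve: accuracy rising roughly with the logarithm of the token budget, then declining past a turning point. The 16,000-token figure comes from the article; the coefficients and function below are invented purely to illustrate the shape, not fitted to any real data.

```python
import math

# Toy illustration of the reported trend: accuracy climbs roughly
# log-linearly with the reasoning-token budget, then falls past a
# turning point (~16,000 tokens per the article). All coefficients
# here are hypothetical.
PEAK_TOKENS = 16_000

def sketch_accuracy(tokens, slope=0.03, decay=0.02, base=0.0):
    """Log-linear rise up to PEAK_TOKENS, then a log-linear fall."""
    if tokens <= PEAK_TOKENS:
        return base + slope * math.log(tokens)
    peak = base + slope * math.log(PEAK_TOKENS)
    return peak - decay * math.log(tokens / PEAK_TOKENS)

for budget in (1_000, 4_000, 16_000, 64_000):
    print(budget, round(sketch_accuracy(budget), 3))
```

The point of the shape is the reversal: on a curve like this, quadrupling the budget from 4,000 to 16,000 tokens helps, while quadrupling it again from 16,000 to 64,000 hurts.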
Tung Nguyen, an instructional associate professor in computer science and engineering at Texas A&M University, contributed more questions to HLE’s mathematics and computer science sections than almost anyone else — 73 in total, the second-highest count among nearly 1,000 contributors. He is cautious about what the results mean. “When AI systems start performing extremely well on human benchmarks, it’s tempting to think they’re approaching human-level understanding,” he said. “But HLE reminds us that intelligence isn’t just about pattern recognition — it’s about depth, context and specialized expertise.”
He is equally cautious about the practical stakes. Without rigorous benchmarks, he argues, the risks multiply. “Without accurate assessment tools, policymakers, developers and users risk misinterpreting what AI systems can actually do. Benchmarks provide the foundation for measuring progress and identifying risks.”
The name Humanity’s Last Exam invites a particular kind of reading — the final test before the machines win, the last line of human intellectual defence. Nguyen pushes back on this. “This isn’t a race against AI. It’s a method for understanding where these systems are strong and where they struggle. That understanding helps us build safer, more reliable technologies.” The diversity of question-setters was itself the point: historians, physicists, linguists and medical researchers all probing different corners of knowledge, precisely because different corners catch different failures. “Perhaps ironically,” Nguyen said, “it’s humans working together” that exposes the gaps.
There is a $500,000 prize pool attached to the effort, with top-ranked questions earning $5,000 each — a signal of how seriously the organisers took the quality problem. The questions keep coming in. And the models keep improving, which means HLE-Rolling, a dynamic version of the dataset, will update as frontier performance nudges upward. “For now,” Nguyen said, “Humanity’s Last Exam stands as one of the clearest assessments of the gap between AI and human intelligence — and despite rapid technological advances, it remains wide.”