‘Humanity’s Last Exam’ benchmark is stumping top AI models – can you do any better? [Video]

Are artificial intelligence (AI) models really surpassing human ability? Or are current tests just too easy for them?

On Thursday, Scale AI and the Center for AI Safety (CAIS) released Humanity’s Last Exam (HLE), a new academic benchmark aiming to “test the limits of AI knowledge at the frontiers of human expertise,” Scale AI said in a release. The test consists of 3,000 text and multi-modal questions spanning more than 100 subjects, such as math, science, and the humanities, submitted by experts in a variety of fields.

Anthropic’s Michael Gerstenhaber, head of API technologies, noted to Bloomberg last fall that AI models frequently outpace benchmarks, which is part of why the Chatbot Arena leaderboard changes so rapidly when new models are released. For example, many LLMs now score over 90% on Massive Multitask Language Understanding (MMLU), a commonly used benchmark. This is known as …
