OpenAI published its proof attempts on February 14 for First Proof, a challenge put together by 11 leading mathematicians from institutions including Harvard, Yale, Stanford, and MIT.
The challenge is a set of 10 unpublished, research-level math problems designed specifically so that no artificial intelligence (AI) model could have seen them before. The aim is to determine whether an AI is truly reasoning through a problem rather than recalling the solution to one it has already encountered.
The results of the grueling challenge were mixed but remarkable.
OpenAI initially said that at least five of its models’ proof attempts (problems 4, 5, 6, 9, and 10) had a high chance of being correct, with several others still under review.
Chief scientist Jakub Pachocki later updated that number to six, writing: “Based on feedback from experts, we believe at least six solutions (2, 4, 5, 6, 9, 10) have a high chance of being correct, and some further ones look promising.”
He added that the attempt “was a side-sprint executed in a week mostly by querying one of the models we’re currently training,” acknowledging that “the methodology we employed leaves a lot to be desired.”
Mathematicians, however, have already identified potential holes in at least one of those six, and the final verdict will depend on a formal peer review. No specific timeline for the peer review has been publicly announced.
The attempt was the result of a weeklong sprint in which OpenAI’s latest in-house models worked alongside human mathematicians with expertise in the relevant fields.
Importantly, the First Proof rules require that all mathematical ideas come from the AI autonomously; human input for mathematical content is explicitly disallowed, which makes the collaboration with experts a point of scrutiny.
The First Proof team itself found that, without restrictions, AI systems, including GPT-5.2 Pro and Gemini 3.0 Deepthink, confidently produced proofs for all 10 problems, but only two held up under expert review.
This effort fits into OpenAI’s broader push into scientific and numerical reasoning.
In July 2025, one of its models achieved gold-medal performance at the International Mathematical Olympiad, scoring 35 out of 42 points.
More recently, GPT-5.2 helped researchers solve an open theoretical problem in statistics, with the human role limited to verification and writing rather than mathematical scaffolding.
What is First Proof?
First Proof is a benchmark created by 11 research mathematicians, including Fields Medal-adjacent names from Harvard, Yale, Columbia, and EPFL, to test whether AI can autonomously solve the kind of problems that arise naturally in academic research.
Unlike competition math (think Olympiad problems with clean numerical answers), these problems are lemmas: smaller, technical statements that mathematicians need to prove along the way to larger results.
The problems span fields like algebraic topology, spectral graph theory, symplectic geometry, and numerical linear algebra.
Crucially, none of the solutions had ever been published online before the challenge, sharply reducing the risk that the answers appeared in any AI model’s training data.
The solutions were released in encrypted form on February 13, and participants had exactly one week to tackle the proofs.
The challenge gets its name from a baking metaphor: the first proof is the bulk fermentation of dough, the first step before anything gets shaped into a final product.
The lesson from First Proof isn’t that AI failed. It’s that the bar for what counts as “solving” a math problem just got significantly higher.

