OpenAI submitted models to the hardest math test yet for AI

On February 14, OpenAI published its proof attempts for First Proof, a challenge put together by 11 leading mathematicians from institutions including Harvard, Yale, Stanford, and MIT.

The challenge is a set of 10 unpublished, research-level math problems chosen so that artificial intelligence (AI) models could not have seen them before. The aim is to find out whether an AI is truly reasoning when it solves a problem, or merely recalling the solution to a math problem that is already known.

The results of the grueling quiz were mixed, but remarkable. 

Based on feedback from experts, OpenAI believes at least five of its models’ proof attempts (problems 4, 5, 6, 9, and 10) have a high chance of being correct, with several others still under review. 

Chief scientist Jakub Pachocki later updated that number to six, writing: “Based on feedback from experts, we believe at least six solutions (2, 4, 5, 6, 9, 10) have a high chance of being correct, and some further ones look promising.” 

He added that the attempt “was a side-sprint executed in a week mostly by querying one of the models we’re currently training,” acknowledging that “the methodology we employed leaves a lot to be desired.”

Mathematicians, however, have already identified potential holes in at least one of those six, and the final verdict will depend on a formal peer review. No specific timeline for the peer review has been publicly announced.

During that weeklong sprint, OpenAI’s latest in-house models worked alongside human mathematicians with expertise in the relevant fields.

Importantly, the First Proof rules require that all mathematical ideas come from the AI autonomously; human input for mathematical content is explicitly disallowed, which makes the collaboration with experts a point of scrutiny. 

The First Proof team itself found that, without restrictions, AI systems including GPT-5.2 Pro and Gemini 3.0 Deep Think confidently produced proofs for all 10 problems, but only two held up under expert review.

This effort fits into OpenAI’s broader push into scientific and mathematical reasoning.

In July 2025, one of its models reached gold-medal-level performance on the International Mathematical Olympiad problems, scoring 35 out of 42 points.

More recently, GPT-5.2 helped researchers solve an open theoretical problem in statistics, with the human role limited to verification and writing rather than mathematical scaffolding.

What is First Proof?

First Proof is a benchmark created by 11 research mathematicians, including Fields Medal-adjacent names from Harvard, Yale, Columbia, and EPFL, to test whether AI can autonomously solve the kind of problems that arise naturally in academic research. 

Unlike competition math (think Olympiad problems with clean numerical answers), these are lemmas: smaller, technical results that mathematicians need to prove while working toward larger ones.

The problems span fields like algebraic topology, spectral graph theory, symplectic geometry, and numerical linear algebra. 

Crucially, none of the solutions had been published online before the challenge, a safeguard against contamination of AI training data.

The organizers’ solutions were kept encrypted until their release on February 13, giving participants one week to tackle the proofs.

The challenge gets its name from a baking metaphor: the first proof is the bulk fermentation of dough, the first step before anything gets shaped into a final product.

The lesson from First Proof isn’t that AI failed. It’s that the bar for what counts as “solving” a math problem just got significantly higher.

Juan Pablo Aguirre Osorio

Juan Pablo Aguirre Osorio is a contributing reporter to Espacio Media Incubator. With a background in full-stack engineering, Juan Pablo brings a technical perspective to his reporting on cutting-edge technologies, including AI. His work has been featured in HackerNoon, The Sociable, and others, and he was previously a Student Ambassador at Microsoft.
