OpenAI published its proof attempts on February 14 for First Proof, a challenge put together by 11 leading mathematicians from institutions including Harvard, Yale, Stanford, and MIT.
The challenge is a set of 10 unpublished, research-level math problems designed specifically so that no artificial intelligence (AI) model could have seen them before. The aim is to determine whether an AI is truly reasoning through a problem rather than recalling the solution to one it has already encountered.
The results of the grueling challenge were mixed but remarkable.
OpenAI initially said that at least five of its models’ proof attempts (problems 4, 5, 6, 9, and 10) had a high chance of being correct, with several others still under review.
Chief scientist Jakub Pachocki later updated that number to six, writing: “Based on feedback from experts, we believe at least six solutions (2, 4, 5, 6, 9, 10) have a high chance of being correct, and some further ones look promising.”
He added that the attempt “was a side-sprint executed in a week mostly by querying one of the models we’re currently training,” acknowledging that “the methodology we employed leaves a lot to be desired.”
Mathematicians, however, have already identified potential holes in at least one of those six, and the final verdict will depend on a formal peer review. No specific timeline for the peer review has been publicly announced.
The attempt was the result of a weeklong sprint in which OpenAI’s latest in-house models worked alongside human mathematicians with expertise in the relevant fields.
Importantly, the First Proof rules require that all mathematical ideas come from the AI autonomously; human input for mathematical content is explicitly disallowed, which makes the collaboration with experts a point of scrutiny.
The First Proof team itself found that, without restrictions, AI systems, including GPT-5.2 Pro and Gemini 3.0 Deepthink, confidently produced proofs for all 10 problems, but only two held up under expert review.
This effort fits into OpenAI’s broader push into scientific and numerical reasoning.
In July 2025, one of its models achieved gold-medal performance at the International Mathematical Olympiad, scoring 35 out of 42 points.
More recently, GPT-5.2 helped researchers solve an open theoretical problem in statistics, with the human role limited to verification and writing rather than mathematical scaffolding.
What is First Proof?
First Proof is a benchmark created by 11 research mathematicians, including Fields Medal-adjacent names from Harvard, Yale, Columbia, and EPFL, to test whether AI can autonomously solve the kind of problems that arise naturally in academic research.
Unlike competition math (think Olympiad problems with clean numerical answers), these problems are lemmas: smaller, technical statements that mathematicians need to prove along the way to larger results.
The problems span fields like algebraic topology, spectral graph theory, symplectic geometry, and numerical linear algebra.
Crucially, none of the solutions had ever been published online before the challenge, sharply reducing the risk that the answers appeared in any AI model’s training data.
The solutions were released in encrypted form on February 13, and participants had exactly one week to tackle the proofs.
The challenge gets its name from a baking metaphor: the first proof is the bulk fermentation of dough, the first step before anything gets shaped into a final product.
The lesson from First Proof isn’t that AI failed. It’s that the bar for what counts as “solving” a math problem just got significantly higher.

