Science

OpenAI submitted models to the hardest math test yet for AI

OpenAI published its proof attempts for First Proof on February 14. The challenge was put together by 11 leading mathematicians from institutions including Harvard, Yale, Stanford, and MIT.

The challenge is a set of 10 unpublished, research-level math problems designed so that artificial intelligence (AI) models cannot have seen them before. The aim is to determine whether an AI is genuinely reasoning when it solves a problem, rather than recalling the solution to an already known one.

The results of the grueling quiz were mixed, but remarkable.

Based on feedback from experts, OpenAI believes at least five of its models’ proof attempts (problems 4, 5, 6, 9, and 10) have a high chance of being correct, with several others still under review.

Chief scientist Jakub Pachocki later updated that number to six, writing: “Based on feedback from experts, we believe at least six solutions (2, 4, 5, 6, 9, 10) have a high chance of being correct, and some further ones look promising.”

He added that the attempt “was a side-sprint executed in a week mostly by querying one of the models we’re currently training,” acknowledging that “the methodology we employed leaves a lot to be desired.”

Mathematicians, however, have already identified potential holes in at least one of those six, and the final verdict will depend on a formal peer review. No specific timeline for the peer review has been publicly announced.

The attempt was the result of a weeklong sprint in which OpenAI’s latest in-house models worked alongside human mathematicians with expertise in the relevant fields.

Importantly, the First Proof rules require that all mathematical ideas come from the AI autonomously; human input for mathematical content is explicitly disallowed, which makes the collaboration with experts a point of scrutiny.

The First Proof team itself found that, without restrictions, AI systems including GPT-5.2 Pro and Gemini 3.0 Deepthink confidently produced proofs for all 10 problems, but only two held up under expert review.

This effort fits into OpenAI’s broader push into scientific and numerical reasoning.

In July 2025, one of its models achieved gold-medal-level performance on the International Mathematical Olympiad problems, scoring 35 out of 42 points.

More recently, GPT-5.2 helped researchers solve an open theoretical problem in statistics, with the human role limited to verification and writing rather than mathematical scaffolding.

What is First Proof?

First Proof is a benchmark created by 11 research mathematicians, including prominent names from Harvard, Yale, Columbia, and EPFL, to test whether AI can autonomously solve the kind of problems that arise naturally in academic research.

Unlike competition math (think Olympiad problems with clean numerical answers), these problems are closer to lemmas: smaller, technical results that mathematicians need to prove while working toward larger ones.

The problems span fields like algebraic topology, spectral graph theory, symplectic geometry, and numerical linear algebra.

Crucially, none of the solutions had been published online before the challenge, which guards against contamination from AI training data.

The authors' solutions were kept encrypted until their release on February 13, giving participants about one week to tackle the proofs.

The challenge gets its name from a baking metaphor: the first proof is the bulk fermentation of dough, the first step before anything gets shaped into a final product.

The lesson from First Proof isn’t that AI failed. It’s that the bar for what counts as “solving” a math problem just got significantly higher.

Juan Pablo Aguirre Osorio

Juan Pablo Aguirre Osorio is a contributing reporter to Espacio Media Incubator. With a background in full-stack engineering, Juan Pablo brings a technical perspective to his reporting on cutting-edge technologies, including AI. His work has been featured in HackerNoon, The Sociable, and others, and he was previously a Student Ambassador at Microsoft.

Published by

Juan Pablo Aguirre Osorio

Tags: AI, First Proof, OpenAI

2 months ago
