GPT-5.4 outperforms human experts in 83% of tasks: AI is no longer practicing, it's working

On March 5, OpenAI launched GPT-5.4 with a figure that does not leave much room for interpretation: on GDPval, the benchmark measuring real professional work across 44 occupations, the model matched or surpassed human specialists in 83% of comparisons. Three days later, Anthropic published that Claude Opus 4.6 had identified 22 new vulnerabilities in Firefox over two weeks, 14 of them high-severity, and that it found the first one in twenty minutes. Two different news stories. The same message.

GPT-5.4: the model that no longer waits for instructions at every step

GPT-5.4's architecture consolidates something OpenAI had been building in pieces: it unified the reasoning capabilities of GPT-5.2 with the agentic programming capabilities of GPT-5.3-Codex into a single model. The result is a system that operates computers autonomously, navigates applications unsupervised, and completes complex workflows without anyone needing to guide it step by step.

According to OpenAI, GPT-5.4 is 33% less likely to make errors in individual statements than its predecessor GPT-5.2, and complete responses contain 18% fewer errors. Performance numbers in professional environments are even more striking: on GDPval, which evaluates well-specified work across 44 occupations in the nine sectors contributing most to US GDP, GPT-5.4 reached 83% matching or exceeding human expert criteria, up from the 70.9% achieved by GPT-5.2.

For anyone working with financial documents, presentations, or legal analysis, this is not abstract. In investment and banking models, 87.3% of evaluators preferred GPT-5.4 compared to 68.4% who preferred GPT-5.2. The context window rises to one million tokens in the API, allowing entire contracts, code repositories, or clinical records to be ingested in a single interaction.

The key operational feature is a new function called forward planning: before responding, the model shows its reasoning plan. The user can redirect it mid-process without starting from scratch. That is not a minor detail. It is what distinguishes an assistant from a collaborator.

Metric / Benchmark	GPT-5.4	GPT-5.3-Codex	GPT-5.2
GDPval (wins or ties)	83.0 %	70.9 %	70.9 %
SWE-Bench Pro (Public)	57.7 %	56.8 %	55.6 %
OSWorld-Verified	75.0 %	74.0 %*	47.3 %
Toolathlon	54.6 %	51.9 %	46.3 %
BrowseComp	82.7 %	77.3 %	65.8 %

* Previously reported as 64.7%. GPT-5.3-Codex reaches 74.0% with a newly introduced API parameter that preserves the original image resolution.

Claude found in Firefox what humans spent a year patching

While OpenAI was presenting its model, Anthropic was publishing something with a different dimension: not a benchmark number, but a real case with real consequences for hundreds of millions of users.

Claude Opus 4.6 found 22 vulnerabilities in Firefox during February 2026, more than were reported in any single month of 2025, and its fixes reached users through Firefox 148.0. Of those 22, 14 were classified as high-severity, representing nearly a fifth of all vulnerabilities in that category patched in Firefox throughout all of 2025.

The detail that has circulated most among security researchers: after just twenty minutes of autonomous exploration, the model reported having identified a "use-after-free" error in the browser's JavaScript engine, which was validated by a human researcher in a virtualized environment before being submitted to Mozilla.

Of the 112 total reports Anthropic sent to Mozilla, 22 resulted in official CVEs for security flaws, while the remaining ninety corresponded to non-critical issues such as crashes and logic errors. The team scanned around 6,000 C++ source files. The complete operation lasted two weeks.

Here comes the part that Anthropic did not highlight, but which the report makes clear: the model is far better at finding flaws than exploiting them. To test the system's offensive capabilities, researchers tried to have Claude develop functional exploits for the discovered flaws, spending approximately $4,000 in API credits over hundreds of attempts; Opus 4.6 only managed to create a working exploit in two cases. Both worked only in test environments with the browser's protections deliberately disabled.

Terminal screen with C++ code and automated security analysis results from Claude Opus

What neither press release says directly

The official narrative from both companies this week is optimistic: AI helps humans, AI strengthens security, AI does the boring work. All true. And at the same time, there is a subtext worth reading without anesthetic.

That a model finds 20% of a project like Firefox's annual security work in fourteen days, a project that Mozilla has spent decades hardening with specialized engineers, is not only a story of progress. It is also a story about what happens when that same capability is used by someone who does not have a coordinated disclosure agreement with Mozilla. Anthropic writes it in its own report: "If future models break the barrier between finding vulnerabilities and exploiting them, additional measures will need to be considered."

The transition from generative AI to agentic AI, systems that do not wait but act, has been announced for months in conferences and papers. This week it happened in production, with paid invoices and patches distributed to hundreds of millions of people.

The market that already changed hands before the debate

The launch of GPT-5.4 puts OpenAI in direct competition with Anthropic, which had dominated the enterprise segment with similar tools; both companies are competing to capture the corporate market with systems capable of doing real work in sectors ready to adopt AI.

Metric	GPT-5.4 (xhigh)	Claude Opus 4.6 (No reasoning)	Analysis
Creator	OpenAI	Anthropic	GPT-5.4 (xhigh) is developed by OpenAI and Claude Opus 4.6 by Anthropic
Context window	1050k tokens (~1,575 A4 pages, Arial 12)	200k tokens (~300 A4 pages, Arial 12)	GPT-5.4 (xhigh) has a significantly larger context window
Launch date	March 2026	February 2026	GPT-5.4 (xhigh) has a more recent launch date
Image input support	Yes	Yes	Both models support image input
Open source (model weights)	No	No	Both models are proprietary

The difference is one of positioning, not technical capability. OpenAI enters enterprise territory with a unified model that consolidates its entire previous stack. Anthropic arrives with a practical demonstration that its model can act as an autonomous security auditor. Meta's Llama 4 Maverick, with its ten-million-token context window, remains the go-to option for those who want to run everything on their own infrastructure without depending on either company.

The debate about whether AI replaces jobs or transforms them will remain a debate for a while. What is no longer a debate is whether it can do the work. This week it did.

What remains pending is more uncomfortable: knowing who else, with access to the same models, is scanning the same repositories right now.

GPT-5.4 outperforms human experts in 83% of tasks: AI is no longer practicing, it's working

GPT-5.4: the model that no longer waits for instructions at every step

Claude found in Firefox what humans spent a year patching

What neither press release says directly

The market that already changed hands before the debate

Sources

Related News

NASA confirms the date and menu for Artemis II

Meta lays off 16,000 people. Its stock rises 3%

AI already solves centuries-old math. What your bank encrypts is being rebuilt from scratch.

Other News

One third of video game developers lost their jobs: GDC 2026 opens with the hardest data in its history

Cancer has been buying time for decades. AI just designed proteins to attack it in 4 weeks.

The Kepler telescope has been off for eight years. Its data just delivered the most Earth-like exoplanet we've seen orbiting a Sun-like star.

AI promised to accelerate science. Nature just proved it can also destroy it.

Congress gave the casino industry what it asked for. Then added the fine print.

Six days of war in the Middle East: Trump wants to choose Iran's next leader

The Academy has resisted horror films for decades. Sinners just won the war.

Pokémon is no longer just about catching creatures. Pokopia asks you to build their home.

Polymarket Recorded the Attack on Iran Before the Press Did. Now the Senate Wants to Know Who Bet.

Iran had been threatening this war for decades. Nobody calculated it would begin with girls killed at a school in Minab.

The U.S. and Israel Attacked Iran at Dawn. Khamenei is Dead and the World Holds Its Breath.

The Federal Reserve Has Spent Decades Relying on Surveys and Futures. A Betting Market Just Beat Them Both

GPT-5.4: the model that no longer waits for instructions at every step

Claude found in Firefox what humans spent a year patching

What neither press release says directly

The market that already changed hands before the debate

Sources

The most important news while you enjoy a cup of coffee.

Related News

NASA confirms the date and menu for Artemis II

Meta lays off 16,000 people. Its stock rises 3%

AI already solves centuries-old math. What your bank encrypts is being rebuilt from scratch.

Other News

One third of video game developers lost their jobs: GDC 2026 opens with the hardest data in its history

Cancer has been buying time for decades. AI just designed proteins to attack it in 4 weeks.

The Kepler telescope has been off for eight years. Its data just delivered the most Earth-like exoplanet we've seen orbiting a Sun-like star.

AI promised to accelerate science. Nature just proved it can also destroy it.

Congress gave the casino industry what it asked for. Then added the fine print.

Six days of war in the Middle East: Trump wants to choose Iran's next leader

The Academy has resisted horror films for decades. Sinners just won the war.

Pokémon is no longer just about catching creatures. Pokopia asks you to build their home.

Polymarket Recorded the Attack on Iran Before the Press Did. Now the Senate Wants to Know Who Bet.

Iran had been threatening this war for decades. Nobody calculated it would begin with girls killed at a school in Minab.

The U.S. and Israel Attacked Iran at Dawn. Khamenei is Dead and the World Holds Its Breath.

The Federal Reserve Has Spent Decades Relying on Surveys and Futures. A Betting Market Just Beat Them Both