OpenAI Launches GPT‑5.4, Surpassing Human Pro‑Level Performance by 83% in Tests
Where earlier GPT models still trailed human experts, OpenAI reports its new GPT‑5.4 now beats pro‑level performance by 83% in internal tests.
Key Facts
- Key company: OpenAI
- New model: GPT‑5.4, reported to beat pro‑level human performance by 83% in internal tests
- Efficiency: 1 million‑token context window with a 47% reduction in token consumption per query
OpenAI’s GPT‑5.4 represents a marked shift toward agent‑style reasoning, with the company’s “Thinking System Card” noting a 1 million‑token context window and a 47% reduction in token consumption per query, enabling more complex, multi‑step workflows without proportional cost growth (OpenAI). The model’s architecture emphasizes “steerability,” allowing developers to interrupt and adjust outputs mid‑generation, a capability that OpenAI says is essential for dynamic knowledge‑work applications such as automated code review or real‑time data analysis (OpenAI).

Benchmarks released alongside the launch illustrate these advances: on the OSWorld‑Verified computer‑use suite, GPT‑5.4 achieved a 75% success rate, surpassing the human baseline of 72.4% and indicating a higher proficiency in orchestrating OS‑level actions than any prior public model (report). In the BrowseComp benchmark, which measures a model’s ability to retrieve, synthesize, and reason over web content, GPT‑5.4 scored 82.7%, again outpacing the human benchmark of 78% and suggesting that the system can reliably navigate and extract information from the open web with near‑expert accuracy (report).
Error metrics also show a tangible reliability jump. According to OpenAI’s internal testing, GPT‑5.4 produces 18% fewer factual errors than its predecessor GPT‑5.2, while false‑claim generation drops by 33% (Zdnet). These reductions are attributed to a tighter alignment pipeline and expanded verification layers that cross‑check model outputs against curated knowledge bases before they are returned. The company’s internal “pro‑level” evaluation, which pits the model against domain experts across coding, data science, and legal reasoning tasks, reports an 83% performance advantage over the same human cohort, a figure the firm frames as a relative margin over expert performance rather than an absolute IQ‑type score (Zdnet). The testing methodology, while proprietary, follows a standard protocol of blind assessment in which human experts and the model solve identical problem sets under time constraints, with scoring based on correctness, completeness, and methodological soundness.
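Because the protocol is proprietary, the exact formula behind the 83% figure is not public. One plausible reading is a relative margin over the human cohort’s composite score; the sketch below illustrates that interpretation with invented scores and equal (assumed) weights for the three scoring axes.

```python
# Illustrative only: OpenAI's evaluation is proprietary, so the equal
# weighting and the margin formula here are assumptions, and the scores
# are invented numbers chosen to land near the reported figure.

def composite_score(correctness: float, completeness: float, soundness: float) -> float:
    """Combine the three reported scoring axes with equal (assumed) weights."""
    return (correctness + completeness + soundness) / 3.0

def relative_margin(model_score: float, human_score: float) -> float:
    """Model's advantage over the human cohort, as a fraction of the human score."""
    return (model_score - human_score) / human_score

model = composite_score(0.92, 0.90, 0.88)    # = 0.90
human = composite_score(0.50, 0.49, 0.4875)  # = 0.4925
print(f"{relative_margin(model, human):.0%}")
# → 83%
```

Under this reading, "83% advantage" means the model scored roughly 1.83× the human composite, not that it answered 83% more questions correctly.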
Beyond raw performance, GPT‑5.4’s efficiency gains are poised to reshape enterprise deployment economics. The 47% token‑use improvement translates directly into lower inference costs for API customers, a claim OpenAI backs with internal cost‑model simulations that project a 30% reduction in per‑token pricing for high‑volume users (OpenAI). This efficiency, combined with the expanded context window, enables “long‑form” tasks such as full‑document analysis or multi‑turn planning without the external chunking strategies that previously fragmented context. Early adopters in the software development sector have reported that the model can generate and debug code snippets up to 1.5× faster than GPT‑4o, with fewer compilation errors, though these anecdotal observations have not yet been independently verified (report).
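It is worth noting how the two reported efficiency figures compound. The 47% token reduction and the projected 30% price cut come from the article; the combined per‑query saving below is simple arithmetic derived here, not a claim OpenAI has made.

```python
# Back-of-the-envelope sketch: fewer tokens per query and a lower
# per-token price multiply together, so the per-query saving is larger
# than either figure alone. The 10,000-token query size is arbitrary.

def cost_per_query(tokens: float, price_per_token: float) -> float:
    return tokens * price_per_token

baseline = cost_per_query(tokens=10_000, price_per_token=1.0)
new = cost_per_query(
    tokens=10_000 * (1 - 0.47),      # 47% fewer tokens per query
    price_per_token=1.0 * (1 - 0.30) # 30% lower per-token price
)

saving = 1 - new / baseline  # 1 - 0.53 * 0.70 = 0.629
print(f"per-query cost falls by {saving:.0%}")
# → per-query cost falls by 63%
```

In other words, a high‑volume user who sees both improvements would pay roughly 37 cents on the dollar per query, all else being equal.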
The launch arrives amid a broader push by OpenAI to embed transactional capabilities into its consumer products. While GPT‑5.4 itself is a backend model, OpenAI has concurrently rolled out “instant checkout” features for ChatGPT, initially integrating with e‑commerce platforms like Etsy and Shopify (CNET). Wired notes that this move is intended to position ChatGPT as a direct competitor to Google’s shopping assistants, leveraging the model’s improved reasoning to handle purchase‑related queries end‑to‑end (Wired). VentureBeat, however, reports a mixed reception from enterprise users who have seen older models like GPT‑4o temporarily withdrawn as the company reallocates compute resources to support the new rollout (VentureBeat). This operational turbulence underscores the trade‑off between rapid model iteration and service stability, a balance OpenAI’s leadership acknowledges as “bumpy” in recent public statements (VentureBeat).
This article was created using AI technology and reviewed by the SectorHQ editorial team for accuracy and quality.