
OpenAI and Paradigm's EVMbench Shows AI-Driven Audits Boost Smart Contract Security

Published by
SectorHQ Editorial


$100 billion. That’s the value of crypto assets protected by smart contracts, and a recent report shows AI‑driven audits using OpenAI and Paradigm’s EVMbench can slash vulnerabilities, marking a decisive step forward for DeFi security.

Key Facts

  • Key company: OpenAI

OpenAI and Paradigm released EVMbench in February 2026, an open‑source benchmark that forces AI agents through three concrete tasks—detecting known vulnerabilities, patching the code without breaking functionality, and exploiting the contracts in a sandboxed blockchain environment. The suite draws on 117 curated bugs from 40 real audits, largely sourced from Code4rena competitions and Paradigm’s own Tempo L1 audit, ensuring the test set reflects production‑grade flaws rather than synthetic examples (EVMbench Deep Dive, March 20, 2026). By measuring recall in detection, functional preservation in patching, and on‑chain state changes in exploitation, the benchmark provides a granular view of where current models excel and where they lag.
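The three per‑mode metrics described above can be made concrete with a short sketch. This is an illustrative Python model, not EVMbench's actual harness code; the function names and scoring rules are assumptions that mirror the article's description (recall for Detect, test preservation for Patch, an on‑chain balance change for Exploit).

```python
# Hypothetical sketch of EVMbench-style per-mode scoring.
# All names and rules here are illustrative, inferred from the
# benchmark's public description, not taken from its source.

def detection_recall(flagged: set, documented: set) -> float:
    """Detect mode: fraction of auditor-documented bugs the agent also flagged."""
    if not documented:
        return 1.0
    return len(flagged & documented) / len(documented)

def patch_score(vuln_fixed: bool, tests_passed: int, tests_total: int) -> float:
    """Patch mode: a fix only counts if it removes the bug AND
    preserves original functionality (measured by the test suite)."""
    if not vuln_fixed or tests_total == 0:
        return 0.0
    return tests_passed / tests_total

def exploit_succeeded(attacker_before: int, attacker_after: int) -> bool:
    """Exploit mode: judged by on-chain state change — did funds move
    to the attacker in the sandboxed chain?"""
    return attacker_after > attacker_before
```

Under this framing, an agent that flags one of two documented bugs scores 0.5 recall, and a patch that fixes the bug but breaks half the functional tests scores 0.5 rather than full credit.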

The most striking result comes from the Exploit mode. GPT‑5.3‑Codex, accessed via the Codex CLI, succeeded in draining funds from 71% of the vulnerable contracts, a dramatic jump from the 33.3% success rate recorded for GPT‑5 only six months earlier and from sub‑20% rates for earlier models (EVMbench Deep Dive, 2026). Paradigm’s Alpin Yukseloglu highlighted the significance, noting that “top models were only able to exploit less than 20% of the critical, fund‑draining Code4rena bugs” when the project began, whereas the latest iteration now breaches the majority of test cases. This two‑fold improvement in six months underscores how quickly generative models are mastering the mechanics of on‑chain exploitation, turning what was once a niche skill into a near‑routine capability.

In contrast, the Detect and Patch modes lag behind. Detection scores are measured by recall—whether the AI flags the same bugs human auditors documented—but agents often stop after finding a single issue, mirroring the “first‑bug satisfaction” bias observed in human auditors (EVMbench Deep Dive, 2026). The Patch mode proves even more demanding: the AI must rewrite vulnerable code while preserving all original functionality, a task that requires deep semantic understanding of contract intent. While fixing a straightforward reentrancy guard is feasible, correcting subtle accounting errors in a lending protocol’s liquidation path without breaking the interest‑rate model remains out of reach for current models (EVMbench Deep Dive, 2026). This asymmetry—high exploit success paired with modest detection and patch performance—mirrors the historical trend where pentesters find it easier to exploit known bug classes than to devise comprehensive fixes, but the speed at which AI can generate exploits now reshapes the threat landscape.
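The "straightforward reentrancy guard" mentioned above is the canonical example of a patch current models can handle. The toy model below is a Python analogy (not Solidity, and not from EVMbench) of that bug class: the vulnerable version pays out before updating state, so a malicious receive hook can re‑enter and drain; the fix simply reorders to checks‑effects‑interactions.

```python
# Toy Python analogy of the reentrancy bug class drawn on by
# Code4rena-sourced benchmarks. Class and function names are invented
# for illustration.

class Vault:
    def __init__(self, patched: bool):
        self.balances = {}
        self.patched = patched

    def deposit(self, who: str, amount: int) -> None:
        self.balances[who] = self.balances.get(who, 0) + amount

    def withdraw(self, who: str, receive_hook) -> None:
        amount = self.balances.get(who, 0)
        if amount == 0:
            return
        if self.patched:
            self.balances[who] = 0   # effect BEFORE interaction: safe
            receive_hook(amount)
        else:
            receive_hook(amount)     # external call first: re-entrant bug
            self.balances[who] = 0

def drain(vault: Vault, rounds: int = 3) -> int:
    """Malicious receive hook that re-enters withdraw() up to `rounds` times."""
    stolen = []
    def hook(amount):
        stolen.append(amount)
        if len(stolen) < rounds:
            vault.withdraw("attacker", hook)
    vault.deposit("attacker", 10)
    vault.withdraw("attacker", hook)
    return sum(stolen)
```

Against the unpatched vault the hook collects the deposit three times over; against the patched one it receives exactly what was deposited. Subtle accounting bugs in a liquidation path have no such one‑line reordering, which is why Patch mode remains hard.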

The benchmark’s sandboxed environment, built on the Rust‑based Anvil harness, enforces deterministic replay of AI‑generated transactions and blocks unsafe RPC calls, ensuring that exploit attempts are measured against reproducible, historically documented vulnerabilities (EVMbench Deep Dive, 2026). By limiting the test to real‑world bugs rather than contrived edge cases, the suite validates that AI agents are not merely memorizing patterns but are actively reasoning about contract logic. Nevertheless, the difficulty of the Patch mode signals that AI‑driven remediation tools are still in their infancy; achieving functional parity after a fix demands a level of code comprehension that current language models have not yet mastered.
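The deterministic‑replay property the harness enforces can be sketched in a few lines. This is a simplified stand‑in, not the actual Anvil‑based harness: it models state as a dict, transactions as pure functions, and determinism as byte‑identical digests across two replays from the same snapshot.

```python
# Simplified sketch of a deterministic-replay check, assuming state can
# be modeled as a JSON-serializable dict and each transaction as a pure
# function of state. The real harness operates on EVM state, not dicts.
import hashlib
import json

def state_digest(state: dict) -> str:
    """Canonical hash of a state snapshot (sorted keys => stable bytes)."""
    return hashlib.sha256(json.dumps(state, sort_keys=True).encode()).hexdigest()

def replay(snapshot: dict, txs) -> dict:
    """Apply a fixed transaction list to a copy of the snapshot."""
    state = dict(snapshot)
    for apply_tx in txs:
        state = apply_tx(state)
    return state

def is_deterministic(snapshot: dict, txs) -> bool:
    """Replaying the same txs from the same snapshot must yield
    byte-identical state both times."""
    return state_digest(replay(snapshot, txs)) == state_digest(replay(snapshot, txs))
```

Blocking unsafe RPC calls serves the same goal as the purity assumption here: any transaction whose outcome depends on something outside the snapshot would break reproducibility.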

For DeFi developers and security teams, the implications are immediate. With $100 billion in crypto assets locked behind smart contracts, the ability of an AI to locate and exploit a majority of critical bugs within minutes raises the stakes for continuous, automated monitoring. While AI auditors can now serve as rapid “red‑team” probes, the industry must still rely on human expertise for thorough vulnerability discovery and safe remediation. The EVMbench results suggest a hybrid workflow: AI agents flag and even demonstrate exploit paths, then human auditors verify findings and craft patches that preserve protocol semantics. As OpenAI and Paradigm continue to refine their models, the gap between detection, patching, and exploitation may narrow, but for now the benchmark underscores that AI is a powerful, yet incomplete, component of the smart‑contract security stack.

Sources

Primary source

No primary source found (coverage-based)

Other signals
  • Dev.to AI Tag

Reporting based on verified sources and public filings. Sector HQ editorial standards require multi-source attribution.
