EVMbench: OpenAI’s Stress Test for Smart Contract Security

Smart contracts are the backbone of decentralized finance, securing over $100 billion in crypto assets. But how do we measure whether AI can truly detect, patch, and even exploit vulnerabilities in these high‑stakes environments? Enter EVMbench, OpenAI’s new benchmark developed in collaboration with Paradigm.

What is EVMbench?

EVMbench is a curated benchmark built from 120 vulnerabilities across 40 security audits, including competitions on Code4rena and audits of the Tempo blockchain. It’s designed to test AI agents across the full lifecycle of smart contract security:

Mode	Description
Detect	Agents audit repositories, scored on recall of known vulnerabilities and on the audit rewards attached to them.
Patch	Agents fix flaws while preserving functionality, verified via automated tests.
Exploit	Agents execute sandboxed fund‑draining attacks, graded via transaction replay.
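
As a rough illustration of the Detect row, the sketch below computes recall against a set of known findings and the share of audit reward those findings represent. The `Finding` schema, the IDs, the dollar amounts, and the reward-weighting formula are all assumptions made for this example. They are not EVMbench's actual data format or scoring code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """A known vulnerability from an audit (hypothetical schema)."""
    vuln_id: str
    reward_usd: float  # payout the original audit awarded for this finding


def detect_scores(known: list[Finding], reported_ids: set[str]) -> tuple[float, float]:
    """Return (recall, reward_fraction) for one audited repository."""
    if not known:
        raise ValueError("need at least one known vulnerability")
    # Only reports that match a known finding count toward the score.
    found = [f for f in known if f.vuln_id in reported_ids]
    recall = len(found) / len(known)
    total_reward = sum(f.reward_usd for f in known)
    reward_fraction = (
        sum(f.reward_usd for f in found) / total_reward if total_reward else 0.0
    )
    return recall, reward_fraction


# Made-up findings for illustration only.
known = [
    Finding("H-01", 6000.0),
    Finding("H-02", 3000.0),
    Finding("M-01", 1000.0),
]

# An agent that reports one known issue plus one issue outside the known set.
recall, reward_fraction = detect_scores(known, {"H-01", "X-99"})
print(f"recall={recall:.2f} reward_fraction={reward_fraction:.2f}")
# prints: recall=0.33 reward_fraction=0.60
```

In this toy setup, finding the single highest-paying issue earns 60% of the reward but only a third of the recall. The extra report `X-99` neither helps nor hurts, because the sketch only counts matches against the known set.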

To ensure reproducibility, OpenAI built a Rust‑based harness that deploys contracts deterministically and runs exploits in isolated Anvil environments—never on live networks.
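
To give a feel for why deterministic replay makes grading reproducible, here is a deliberately simplified sketch. It is not the Rust harness and it models no EVM semantics: the `Transfer` record, the in-memory ledger, and the `drained_fraction` metric are hypothetical stand-ins. The only idea it demonstrates is that replaying the same transactions against the same starting state always yields the same result, so a grader can score the outcome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    """One recorded transaction in the toy model (hypothetical format)."""
    sender: str
    recipient: str
    amount: int


def replay(initial: dict[str, int], txs: list[Transfer]) -> dict[str, int]:
    """Apply txs in order to a copy of the initial state."""
    balances = dict(initial)  # never mutate the caller's starting state
    for tx in txs:
        if tx.amount < 0 or balances.get(tx.sender, 0) < tx.amount:
            continue  # skipped, loosely analogous to a reverted transaction
        balances[tx.sender] -= tx.amount
        balances[tx.recipient] = balances.get(tx.recipient, 0) + tx.amount
    return balances


def drained_fraction(initial: dict[str, int], txs: list[Transfer], victim: str) -> float:
    """Share of the victim's starting balance that is gone after replay."""
    final = replay(initial, txs)
    start = initial[victim]
    return (start - final[victim]) / start if start else 0.0


# Made-up scenario: a vault holding 1000 units and an attacker holding none.
initial = {"vault": 1000, "attacker": 0}
txs = [
    Transfer("vault", "attacker", 600),
    Transfer("vault", "attacker", 500),  # exceeds the remaining 400, so skipped
    Transfer("vault", "attacker", 400),
]

final = replay(initial, txs)
print(final, drained_fraction(initial, txs, "vault"))
# prints: {'vault': 0, 'attacker': 1000} 1.0
```

Because `replay` works on a copy and depends only on its inputs, running it twice gives identical balances. That is the property an isolated local environment is meant to provide in the real harness.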

How AI Performs

Exploit mode shines: GPT‑5.3‑Codex scored 72.2%, more than double the 31.9% of GPT‑5, a model released about six months earlier.
Detection & patching lag: Agents often stop after finding one vulnerability, or break functionality while attempting fixes.
False positives remain a challenge: When an agent reports an issue that the human auditors did not, the grading system can’t yet tell whether it is a genuine flaw or a false positive.

Why It Matters

Economic impact: Smart contracts secure over $100 billion in assets, so a single error can mean catastrophic losses.
AI as attacker & defender: EVMbench tests both sides of the equation, showing how AI can exploit flaws but also patch them.
Benchmarking progress: Provides a reproducible framework for measuring AI’s evolving cyber capabilities.

Beyond the Benchmark

OpenAI paired the release with:

$10M in API credits through its Cybersecurity Grant Program to accelerate defensive research.
Expansion of Aardvark, its security research agent, now in private beta.
Public release of EVMbench’s tasks, tooling, and framework to encourage community collaboration.

Final Thought

EVMbench is more than a benchmark—it’s a stress test for AI in economically consequential environments. By measuring how well AI can detect, patch, and exploit vulnerabilities, OpenAI is pushing the boundaries of what agentic systems can achieve in blockchain security. The takeaway? AI is learning fast—but defense must evolve just as quickly.