Smart contracts are the backbone of decentralized finance, securing over $100 billion in crypto assets. But how do we measure whether AI can truly detect, patch, and even exploit vulnerabilities in these high‑stakes environments? Enter EVMbench, OpenAI’s new benchmark developed in collaboration with Paradigm.
What is EVMbench?
EVMbench is a curated benchmark built from 120 vulnerabilities across 40 security audits, including competitions on Code4rena and audits of the Tempo blockchain. It’s designed to test AI agents across the full lifecycle of smart contract security:
| Mode | Description |
|---|---|
| Detect | Agents audit repositories and are scored on recall of the ground-truth vulnerabilities and on the audit rewards associated with them. |
| Patch | Agents fix flaws while preserving functionality, verified via automated tests. |
| Exploit | Agents execute sandboxed fund‑draining attacks, graded via transaction replay. |
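To make the Detect row concrete, here is a minimal sketch of recall-style scoring in Python. It is an illustration only, not EVMbench's actual grader: the `Vulnerability` class, the `score_detect` function, the finding IDs, and the reward amounts are all hypothetical, and it assumes findings have already been matched to ground-truth IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vulnerability:
    """A ground-truth finding from a human audit (hypothetical structure)."""
    vuln_id: str
    reward_usd: float


def score_detect(ground_truth: list[Vulnerability], found_ids: list[str]) -> dict:
    """Score an agent's report against the human-auditor baseline.

    recall       = fraction of ground-truth vulnerabilities the agent found
    reward_share = fraction of the total audit reward those findings carry
    unverified   = reported IDs absent from the baseline (not judged here)
    """
    found = set(found_ids)
    known = {v.vuln_id for v in ground_truth}
    matched = [v for v in ground_truth if v.vuln_id in found]

    total_reward = sum(v.reward_usd for v in ground_truth)
    recall = len(matched) / len(ground_truth) if ground_truth else 0.0
    reward_share = (
        sum(v.reward_usd for v in matched) / total_reward if total_reward else 0.0
    )
    return {
        "recall": recall,
        "reward_share": reward_share,
        "unverified": sorted(found - known),
    }


# Made-up example: three baseline findings, the agent reports one of them
# plus one issue ("X-99") that the human auditors never listed.
baseline = [
    Vulnerability("H-01", 6000.0),
    Vulnerability("H-02", 3000.0),
    Vulnerability("M-01", 1000.0),
]
result = score_detect(baseline, ["H-01", "X-99"])
print(result)
```

Weighting by reward as well as by count means an agent that finds the single most valuable bug scores differently from one that finds a minor issue. Findings outside the baseline are set aside rather than counted as errors, because this sketch has no way to verify them.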
To ensure reproducibility, OpenAI built a Rust‑based harness that deploys contracts deterministically and runs exploits in isolated Anvil environments—never on live networks.
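As a rough illustration of what a deterministic, isolated Anvil setup can look like, the Python sketch below only builds a command line for a local node. It is not the EVMbench harness, which is written in Rust and whose configuration is not described here. The `anvil_command` helper and the chosen port, chain ID, and timestamp are assumptions. The flags used (`--port`, `--chain-id`, `--mnemonic`, `--timestamp`) are standard Anvil options.

```python
import shlex

# Well-known development mnemonic used by local test chains; never use it for real funds.
TEST_MNEMONIC = "test test test test test test test test test test test junk"


def anvil_command(
    port: int,
    chain_id: int = 31337,
    genesis_timestamp: int = 1_700_000_000,
) -> list[str]:
    """Build an argument list for a local, reproducible Anvil node.

    Fixing the mnemonic, chain ID, and genesis timestamp keeps accounts and
    block metadata identical across runs. No --fork-url is passed, so the
    node starts from an empty local chain instead of mirroring a live network.
    """
    return [
        "anvil",
        "--port", str(port),
        "--chain-id", str(chain_id),
        "--mnemonic", TEST_MNEMONIC,
        "--timestamp", str(genesis_timestamp),
    ]


cmd = anvil_command(8545)
print(shlex.join(cmd))
```

In practice a harness would start one such node per task, for example with `subprocess.Popen(cmd)`, and tear it down after grading, so that no state leaks between runs.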
How AI Performs
- Exploit mode shines: GPT‑5.3‑Codex scored 72.2%, a huge leap from GPT‑5’s 31.9% just six months earlier.
- Detection & patching lag: Agents often stop after finding one vulnerability, or break functionality while attempting fixes.
- False positives remain a challenge: When an agent reports an issue that is not in the human-auditor baseline, the grading system can't yet tell whether it is a genuine flaw or a false positive.
Why It Matters
- Economic impact: Smart contracts underpin billions in assets—errors can mean catastrophic losses.
- AI as attacker & defender: EVMbench tests both sides of the equation, showing how AI can exploit flaws but also patch them.
- Benchmarking progress: Provides a reproducible framework for measuring AI’s evolving cyber capabilities.
Beyond the Benchmark
OpenAI paired the release with:
- $10M in API credits through its Cybersecurity Grant Program to accelerate defensive research.
- Expansion of Aardvark, its security research agent, now in private beta.
- Public release of EVMbench’s tasks, tooling, and framework to encourage community collaboration.
Final Thought
EVMbench is more than a benchmark—it’s a stress test for AI in economically consequential environments. By measuring how well AI can detect, patch, and exploit vulnerabilities, OpenAI is pushing the boundaries of what agentic systems can achieve in blockchain security. The takeaway? AI is learning fast—but defense must evolve just as quickly.