OpenZeppelin finds data contamination in OpenAI’s EVMbench

Blockchain security firm OpenZeppelin says it has found methodological flaws and data contamination after auditing OpenAI’s new artificial intelligence blockchain security benchmark, EVMbench.

EVMbench was launched in partnership with crypto investment firm Paradigm in mid-February. It was built to evaluate how well different artificial intelligence models can identify, patch, and exploit smart contract vulnerabilities.

In an X post on Monday, OpenZeppelin said it welcomed the initiative but recently decided to put EVMbench “through the same scrutiny” it applies to all the protocols it helps secure, which includes the likes of decentralized finance heavyweights Aave, Lido and Uniswap.

From its audit, OpenZeppelin said it found two key issues: training data contamination and classification issues around several high-severity vulnerabilities.

“We reviewed the dataset and identified methodological flaws and invalid vulnerability classifications including at least four issues labeled high severity that are not exploitable in practice,” OpenZeppelin said.

The release of the EVMbench saw an evaluation of how well AI agents could theoretically exploit smart contract vulnerabilities. Anthropic’s Claude Open 4.6 topped the list, followed by OpenAI’s OC-GPT-5.2 and Google’s Gemini 3 Pro.

EVMbench testing may need revising

Looking at the first issue in data contamination, OpenZeppelin said the most important capability in “AI security is finding novel vulnerabilities in code the model has never seen before.”

However, during the EVMbench’s testing of AI agents, OpenZeppelin said that all the AI agents that scored the highest had “likely been exposed to the benchmark’s vulnerability reports during pretraining.”

The EVMbench’s testing saw internet access cut off for the AI agents, meaning they couldn’t simply search for solutions to problems. However, the benchmark had been based on curated vulnerabilities from 120 audits dating between 2024 and mid-2025, with the knowledge training cutoffs for these agents generally being mid-2025.

As such, it ran the risk that the AI agents already had the answers to all of the problems stored in their memory.

“While this does not necessarily enable the model to identify the issue immediately, it reduces the quality of the test. The dataset’s limited size further narrows the evaluation surface, making these contamination concerns more significant,” OpenZeppelin said.

Finally, OpenZeppelin said that there had been some significant factual errors in the EVMbench’s dataset, arguing that several “high-severity vulnerabilities” were invalid.

OpenZeppelin said it had assessed at least four vulnerabilities that were given a high-risk classification by EVMbench, but don’t actually work. However, EVMbench had been scoring AI agents correctly for finding these supposedly false vulnerabilities.

“These aren’t subjective severity disagreements, they are findings where the described exploit doesn’t work.”

Ultimately, OpenZeppelin reiterated that AI will have a significant impact on bolstering blockchain security, but stressed the importance of applying the tech and testing it in the correct way to maximize its potential.

“The question isn’t whether AI will transform smart contract security — it will. The question is whether the data and benchmarks we use to build and evaluate these tools are held to the same standard as the contracts they’re meant to protect.”

Magazine: AI won’t make you rich but crypto games might, Axie founder steps down: Web3 Gamer