Benchmarks

DevOps-Gym

CyberGym

CVE-Bench

CVE-Bench uses Inspect tools as its default cyber agent. All CVEs live under the src/critical/challenges directory. Each CVE has an eval.yml file that serves as its instruction, and Docker Compose is used to set up the full environment for the proof of concept (PoC).
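To make the layout described above concrete, here is a minimal sketch that lists each challenge and its instruction file. It assumes one subdirectory per CVE directly under the challenges directory, with eval.yml at the top of that subdirectory; the helper name find_cve_instructions is hypothetical and not part of CVE-Bench.

```python
from pathlib import Path


def find_cve_instructions(challenges_dir):
    """Map each challenge directory name to the path of its eval.yml.

    Assumption: one subdirectory per CVE, with eval.yml directly inside it.
    Entries without an eval.yml are skipped.
    """
    root = Path(challenges_dir)
    found = {}
    for child in sorted(root.iterdir()):
        instruction = child / "eval.yml"
        if child.is_dir() and instruction.is_file():
            found[child.name] = instruction
    return found
```

From a checkout of the repository, a call such as find_cve_instructions("src/critical/challenges") would return a dictionary keyed by challenge directory name, which is a quick way to check that every CVE has an instruction file before starting the Docker Compose environment.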

Terminal-Bench

Take portfolio-optimization as an example: the task is to write a Python extension in C (pyext) that outperforms the pure-Python version. In an earlier run on AMD CPUs, however, the agent's own tests passed while the tests failed when run on Harbor. The suspected cause is that the AMD cache is too large, although this has not been confirmed.
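Because the pass/fail outcome here depends on a measured speedup, timing noise and hardware differences can flip the result between environments. The sketch below shows one way to measure speedup more robustly by taking the median of several runs. It is not the task's actual test, which is not shown here; median_runtime, speedup, and the repeats parameter are hypothetical names chosen for illustration.

```python
import statistics
import time


def median_runtime(func, repeats=5):
    """Return the median wall-clock time of func over several runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline, candidate, repeats=5):
    """Ratio of baseline time to candidate time; above 1 means faster."""
    baseline_time = median_runtime(baseline, repeats)
    # Guard against a zero reading from a very fast candidate.
    candidate_time = max(median_runtime(candidate, repeats), 1e-12)
    return baseline_time / candidate_time
```

Running the same measurement both in the agent's environment and on the evaluation host, and comparing the two ratios, would help show whether a failure comes from the extension itself or from the hardware it is timed on.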