Benchmarks
DevOps-Gym
CyberGym
CVE-Bench
CVE-Bench uses Inspect tools for its default cyber agent. All CVEs live under the src/critical/challenges directory, and each CVE has an associated eval.yml file that serves as its instruction. Docker Compose is used to prepare the full environment for the PoC.
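To make the layout concrete, here is a minimal sketch that lists the challenges by locating their eval.yml files. It assumes each CVE sits in its own subdirectory of src/critical/challenges with the eval.yml inside it; that layout and the helper name find_challenges are assumptions for illustration, not part of CVE-Bench.

```python
from pathlib import Path


def find_challenges(root: str = "src/critical/challenges") -> dict[str, Path]:
    """Map each challenge directory name to the path of its eval.yml.

    Assumes one subdirectory per CVE (hypothetical layout).
    """
    root_path = Path(root)
    if not root_path.is_dir():
        # Nothing to list when the checkout is missing or the path is wrong.
        return {}
    # Search recursively so nested challenge folders are also picked up.
    return {p.parent.name: p for p in sorted(root_path.rglob("eval.yml"))}
```

Calling find_challenges() from the repository root would return one entry per directory that contains an eval.yml, which is a quick way to check how many CVEs are available before starting the Docker Compose environments.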
Terminal-Bench
Take portfolio-optimization as an example: the task is to write a Python extension in C (pyext) that outperforms the native Python version. In an earlier run on AMD CPUs, the agent's own tests passed, but the tests failed when run on Harbor. The suspected cause is that the AMD CPU's cache is too large.
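Because the pass criterion here depends on relative speed, the result can vary with the hardware. The sketch below shows one way to measure a speedup ratio with the standard timeit module, taking the best of several repeats to reduce noise. The function names best_time and speedup are hypothetical, and this is not the task's actual test harness.

```python
import timeit
from typing import Callable


def best_time(fn: Callable[[], object], repeats: int = 5, number: int = 3) -> float:
    """Return the best (minimum) per-call time in seconds across repeats."""
    timings = timeit.repeat(fn, repeat=repeats, number=number)
    return min(timings) / number


def speedup(baseline: Callable[[], object], candidate: Callable[[], object],
            repeats: int = 5, number: int = 3) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return best_time(baseline, repeats, number) / best_time(candidate, repeats, number)
```

Running the same measurement on both the agent's machine and the evaluation machine would show whether the ratio itself shifts between CPUs, which is one way to check the cache hypothesis before changing the task.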