We have released the AutoCodeBench-V2 benchmark, built on the original dataset and iteratively refined with top proprietary models and a sandbox to produce 1,000 higher-quality problems.
Additionally, we have updated the sandbox with improved performance and fixed several bugs in programming language parsing.
2026-02-17: We have updated AutoCodeBench v2 and fixed several bugs, leading to performance improvements across all models. The dataset is available as autocodebench-v2-260217.jsonl.
Below is a complete workflow for evaluation using AutoCodeBench-V2.
Download the dataset from Hugging Face:
Obtain model outputs and replace the output field in autocodebench-v2.jsonl.
Note: You must use the following system prompt:
You are an expert programmer. Your task is to provide a code solution within a single Markdown code block for the given programming problem. Do not include any direct execution commands, test cases, or usage examples within the code block.
Pull the V2 sandbox image:
docker pull hunyuansandbox/multi-language-sandbox:v2

Start the sandbox service:
docker run -d \
--name sandbox-service \
-p 8080:8080 \
--cap-add=NET_ADMIN \
hunyuansandbox/multi-language-sandbox:v2

Verify the service is running:
# Check container status
docker ps | grep sandbox
# Test service health
curl -X POST http://localhost:8080/submit \
-H "Content-Type: application/json" \
-d '{"src_uid": "test-001", "lang": "python", "source_code": "print(\"Hello World\")"}'

If the response contains "exec_outcome": "PASSED", the service is running successfully.
Run the verification script:
cd AutoCodeBench-V2
python3 call_sandbox.py \
--input_file autocodebench-v2.jsonl \
--output exec.jsonl \
--server_ip localhost \
--server_port 8080 \
--concurrency 32