We are the ByteDance Seed team.
You can get to know us better through the following channels:
We have modified the original OpenHands (version 0.25.0) to be compatible with Multi-SWE-Bench! MopenHands can be used to evaluate the performance of LLMs across seven languages (C++, C, Java, Go, Rust, TypeScript, JavaScript) on the Multi-SWE-Bench dataset.
conda create -n openhands python=3.12 conda-forge::nodejs conda-forge::poetry
conda activate openhands
make build
Make sure you have a Docker environment on your local device. You should first create a file named config.toml and add your model key to it, for example:
[llm.YYY]
model = "llm.xxx"
base_url = "xxx"
api_key = "xxx"
You should first download the Multi-SWE-Bench dataset and then convert it using /evaluation/benchmarks/swe_bench/data/data_change.py.
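For the download step, a minimal sketch using `huggingface_hub` is shown below. The dataset repo id is an assumption on our part, so substitute the official id published for Multi-SWE-Bench:

```python
# Sketch: download the Multi-SWE-Bench dataset files from Hugging Face.
# The repo_id below is an assumption; replace it with the official dataset id.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="ByteDance-Seed/Multi-SWE-bench",  # assumed id, verify before use
    repo_type="dataset",
    local_dir="./data/multi_swe_bench",
)
print(f"Dataset downloaded to {local_path}")
```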
bash evaluation/benchmarks/swe_bench/infer.sh
where:

- `models`, e.g. `llm.eval_gpt4_1106_preview`, is the config group name for your LLM settings, as defined in your `config.toml`.
- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenHands version you would like to evaluate. It could also be a release tag like `0.6.2`.
- `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting to `CodeActAgent`.
- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances. By default, the script evaluates 500 issues, which does not exceed the number of instances in the dataset.
- `max_iter`, e.g. `20`, is the maximum number of iterations for the agent to run. By default, it is set to 50.
- `num_workers`, e.g. `3`, is the number of parallel workers to run the evaluation. By default, it is set to 1.
- `language` is the language of the dataset you are evaluating.
- `dataset` is the absolute path to the dataset JSONL file.
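For concreteness, here is a minimal sketch (expressed in Python) of how these parameters could be passed to `infer.sh`. It assumes the script accepts them as positional arguments in the order listed above, which you should verify against the script itself; all values are illustrative:

```python
# Hypothetical invocation of infer.sh; argument order and values are assumptions.
import subprocess

cmd = [
    "bash", "evaluation/benchmarks/swe_bench/infer.sh",
    "llm.eval_gpt4_1106_preview",       # models: LLM config group from config.toml
    "HEAD",                             # git-version
    "CodeActAgent",                     # agent
    "10",                               # eval_limit
    "30",                               # max_iter
    "3",                                # num_workers
    "java",                             # language
    "/abs/path/to/java_dataset.jsonl",  # dataset (absolute path to the JSONL)
]
subprocess.run(cmd, check=True)
```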
We provide prebuilt Docker images for each instance. You can use the following command to download the images directly from our Docker Hub rather than build them locally.
After running the agent, all the predicted patches will be saved in the `evaluation/evaluation_outputs` directory in a file named `output.jsonl`. You can extract the `git_patch` of each instance and then run the evaluation in the multi-swe-bench repo (a conversion sketch is given after the list below).
To run the evaluation, you need to prepare the following:
- Patch Files: Some patch files in JSONL format, each item containing:
  - `org`: Organization Name
  - `repo`: Repository Name
  - `number`: Pull Request Number
  - `fix_patch`: Fix Patch Content
- Dataset Files: Dataset files in JSONL format available on Hugging Face, such as Multi-SWE-Bench
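As a sketch of how such a patch file can be produced from MopenHands' `output.jsonl`: the record layout assumed below (an `instance_id` of the form `org__repo-number` and the patch stored under `test_result["git_patch"]`) is our assumption based on typical OpenHands outputs, so adjust the field access to match your actual file:

```python
# Sketch: convert MopenHands output.jsonl into a Multi-SWE-bench patch file.
# Assumed record layout (verify against your output.jsonl):
#   {"instance_id": "org__repo-number", "test_result": {"git_patch": "..."}}
import json
import sys

def convert(output_jsonl: str, patch_jsonl: str) -> None:
    with open(output_jsonl) as fin, open(patch_jsonl, "w") as fout:
        for line in fin:
            record = json.loads(line)
            org_repo, number = record["instance_id"].rsplit("-", 1)
            org, repo = org_repo.split("__", 1)
            fout.write(json.dumps({
                "org": org,
                "repo": repo,
                "number": int(number),
                "fix_patch": record["test_result"]["git_patch"],
            }) + "\n")

if __name__ == "__main__":
    # e.g. python convert_patches.py <run_dir>/output.jsonl patches.jsonl
    convert(sys.argv[1], sys.argv[2])
```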
Then you can run the evaluation using the following command:
cd multi-swe-bench
python -m multi_swe_bench.harness.run_evaluation --config /path/to/your/config.json
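The exact schema of `config.json` is defined by the multi-swe-bench repo. As a rough, assumption-laden sketch, the field names below are guesses tied to the requirements above and must be checked against that repo:

```python
# Sketch only: writes a hypothetical config.json for the multi-swe-bench harness.
# Every field name here is an assumption; consult the multi-swe-bench repo
# for the authoritative configuration schema.
import json

config = {
    "mode": "evaluation",                            # assumed mode name
    "workdir": "./data/workdir",
    "patch_files": ["./patches.jsonl"],              # the patch JSONL built above
    "dataset_files": ["./data/java_dataset.jsonl"],  # the Multi-SWE-Bench dataset JSONL
    "output_dir": "./data/output",
    "log_dir": "./data/logs",
    "max_workers": 4,
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=4)
```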
This project is licensed under Apache License 2.0. See the LICENSE file for details.
If you find our Multi-SWE-bench and MopenHands useful for your research and applications, feel free to give us a star ⭐ or cite us using:
@misc{zan2025multiswebench,
title={Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving},
author={Daoguang Zan and Zhirong Huang and Wei Liu and Hanwu Chen and Linhao Zhang and Shulin Xin and Lu Chen and Qi Liu and Xiaojian Zhong and Aoyan Li and Siyao Liu and Yongsheng Xiao and Liangqiang Chen and Yuyu Zhang and Jing Su and Tianyu Liu and Rui Long and Kai Shen and Liang Xiang},
year={2025},
eprint={2504.02605},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2504.02605},
}
About ByteDance Seed Team
Founded in 2023, ByteDance Seed Team is dedicated to crafting the industry's most advanced AI foundation models. The team aspires to become a world-class research team and make significant contributions to the advancement of science and society.