Horangi is an open-source benchmark framework for comprehensively evaluating Korean LLM performance.
By integrating WandB/Weave and Inspect AI, it evaluates Korean LLMs along two axes: General Language Performance (GLP) and Alignment Performance (ALT), providing standardized benchmark datasets and evaluation pipelines.
- 📦 Over 20 Korean benchmarks are registered in Weave, allowing you to start evaluation immediately without separate data preparation.
- You can add new benchmarks. See the Adding a New Benchmark guide for details.
- 🔓 You can evaluate API models (OpenAI, Anthropic, Google, etc.) as well as open-source models served via vLLM using the same standards.
- 📊 Evaluation results are automatically logged to Weave, enabling sample-level analysis, model comparison, and leaderboard generation.
- 🏆 Check out the official leaderboard operated by W&B at Horangi Leaderboard.
- Evaluation runs are managed with W&B Models and results are tracked with Weave, which together provide a fully automated leaderboard.
- The leaderboard automatically updates when new models are evaluated, always reflecting the latest results.
| Leaderboard Registration | Application Form |
| Enterprise Inquiries | contact-kr@wandb.com |
- Features
- Viewing Results
- Supported Benchmarks
- Project Structure
- Installation
- Quick Start
- Configuration Guide
- SWE-bench Evaluation (Code Generation)
- 🇰🇷 20+ Korean benchmarks supported
- 📊 Automatic WandB/Weave logging - Experiment tracking and result comparison
- 🚀 Various model support - OpenAI, Claude, Gemini, Solar, EXAONE, etc.
- 📈 Automatic leaderboard generation - Model comparison in Weave UI
After evaluation completes, you can view detailed results at the Weave URL in the output, and view comprehensive evaluation result tables in the Models workspace. See the Weave guide for more details.
- Per-sample scores and responses
- Model comparison
- Aggregated metrics
- Automatic leaderboard generation
General Language Performance (GLP) evaluates general language model capabilities, including language understanding, knowledge, reasoning, coding, and function calling.
| Evaluation Area | Benchmark | Description | Samples | Source |
|---|---|---|---|---|
| Syntax Analysis | ko_balt_700_syntax | Sentence structure analysis, grammatical validity evaluation | 100 | snunlp/KoBALT-700 |
| Semantic Analysis | ko_balt_700_semantic | Context-based inference, semantic consistency evaluation | 100 | snunlp/KoBALT-700 |
| Semantic Analysis | haerae_bench_v1_rc | Reading comprehension-based semantic interpretation | 100 | HAERAE-HUB/HAE_RAE_BENCH_1.0 |
| Expression | ko_mtbench | Writing, roleplay, humanities expression (LLM Judge) | 80 | LGAI-EXAONE/KoMT-Bench |
| Information Retrieval | squad_kor_v1 | QA-based information retrieval | 100 | KorQuAD/squad_kor_v1 |
| General Knowledge | kmmlu | Common sense, STEM fundamentals | 100 | HAERAE-HUB/KMMLU |
| General Knowledge | haerae_bench_v1_wo_rc | Multi-turn QA-based knowledge evaluation | 100 | HAERAE-HUB/HAE_RAE_BENCH_1.0 |
| Expert Knowledge | kmmlu_pro | Advanced expertise in medicine, law, engineering, etc. | 100 | LGAI-EXAONE/KMMLU-Pro |
| Expert Knowledge | ko_hle | Korean expert-level difficult problems | 100 | cais/hle + Custom translation |
| Common Sense Reasoning | ko_hellaswag | Sentence completion, next sentence prediction | 100 | davidkim205/ko_hellaswag |
| Mathematical Reasoning | hrm8k | Korean math reasoning (GSM8K, KSM, MATH, MMMLU, OMNI_MATH combined) | 100 | HAERAE-HUB/HRM8K |
| Mathematical Reasoning | ko_aime2025 | AIME 2025 advanced math | 30 | allganize/AIME2025-ko |
| Abstract Reasoning | ko_arc_agi | Visual/structural reasoning, abstract problem solving | 100 | ARC-AGI |
| Coding | swebench_verified_official_80 | GitHub issue resolution | 80 | SWE-bench |
| Coding | humaneval_100 | Python code generation (HumanEval) | 100 | openai/human-eval |
| Coding | bigcodebench_100 | Complex coding problem solving | 100 | bigcode-project/bigcodebench |
| Function Calling | bfcl | Function calling accuracy (single, multi-turn, irrelevance detection) | 258 | BFCL |
Alignment Performance (ALT) evaluates model safety and alignment, including controllability, ethics, harm/bias prevention, and hallucination prevention.
| Evaluation Area | Benchmark | Description | Samples | Source |
|---|---|---|---|---|
| Controllability | ifeval_ko | Instruction following, command compliance | 100 | allganize/IFEval-Ko |
| Ethics/Morality | ko_moral | Social norm compliance, safe language generation | 100 | AI Hub Ethics Data |
| Harm Prevention | korean_hate_speech | Hate speech, offensive speech detection and suppression | 100 | kocohub/korean-hate-speech |
| Bias Prevention | kobbq | Bias evaluation against specific groups/attributes | 100 | naver-ai/kobbq |
| Hallucination Prevention | ko_truthful_qa | Factuality verification, evidence-based response | 100 | Custom translation |
| Hallucination Prevention | ko_hallulens_wikiqa | Wikipedia QA-based hallucination evaluation | 100 | facebookresearch/HalluLens + Custom translation |
| Hallucination Prevention | ko_hallulens_longwiki | Long context Wikipedia hallucination evaluation | 100 | facebookresearch/HalluLens + Custom translation |
| Hallucination Prevention | ko_hallulens_nonexistent | Fictional entity refusal ability evaluation | 100 | facebookresearch/HalluLens + Custom translation |
📦 Dataset References (Weave)
Datasets are uploaded to the horangi/horangi4 project:
| Dataset | Weave Ref |
|---|---|
| KoHellaSwag_mini | weave:///horangi/horangi4/object/KoHellaSwag_mini:latest |
| KoAIME2025_mini | weave:///horangi/horangi4/object/KoAIME2025_mini:latest |
| IFEval_Ko_mini | weave:///horangi/horangi4/object/IFEval_Ko_mini:latest |
| HAERAE_Bench_v1_mini | weave:///horangi/horangi4/object/HAERAE_Bench_v1_mini:latest |
| KoBALT_700_mini | weave:///horangi/horangi4/object/KoBALT_700_mini:latest |
| KMMLU_mini | weave:///horangi/horangi4/object/KMMLU_mini:latest |
| KMMLU_Pro_mini | weave:///horangi/horangi4/object/KMMLU_Pro_mini:latest |
| SQuAD_Kor_v1_mini | weave:///horangi/horangi4/object/SQuAD_Kor_v1_mini:latest |
| KoTruthfulQA_mini | weave:///horangi/horangi4/object/KoTruthfulQA_mini:latest |
| KoMoral_mini | weave:///horangi/horangi4/object/KoMoral_mini:latest |
| KoARC_AGI_mini | weave:///horangi/horangi4/object/KoARC_AGI_mini:latest |
| HRM8K_mini | weave:///horangi/horangi4/object/HRM8K_mini:latest |
| KoreanHateSpeech_mini | weave:///horangi/horangi4/object/KoreanHateSpeech_mini:latest |
| KoBBQ_mini | weave:///horangi/horangi4/object/KoBBQ_mini:latest |
| KoHLE_mini | weave:///horangi/horangi4/object/KoHLE_mini:latest |
| KoHalluLens_WikiQA_mini | weave:///horangi/horangi4/object/KoHalluLens_WikiQA_mini:latest |
| KoHalluLens_LongWiki_mini | weave:///horangi/horangi4/object/KoHalluLens_LongWiki_mini:latest |
| KoHalluLens_NonExistent_mini | weave:///horangi/horangi4/object/KoHalluLens_NonExistent_mini:latest |
| BFCL_mini | weave:///horangi/horangi4/object/BFCL_mini:latest |
| KoMTBench_mini | weave:///horangi/horangi4/object/KoMTBench_mini:latest |
| SWEBench_Verified_80_mini | weave:///horangi/horangi4/object/SWEBench_Verified_80_mini:latest |
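The refs in the table above all follow one pattern, so they can be built and taken apart programmatically. The sketch below is illustrative only: `dataset_ref`, `split_ref`, and `load_dataset` are hypothetical helper names, not part of Horangi. `load_dataset` assumes the `weave` package is installed, that you are logged in to W&B, and that the `horangi/horangi4` project is readable from your account.

```python
# Entity/project taken from the dataset table above.
WEAVE_PROJECT = "horangi/horangi4"
REF_PREFIX = "weave:///"


def dataset_ref(name: str, version: str = "latest") -> str:
    """Build a Weave object ref in the same shape as the table entries."""
    return f"{REF_PREFIX}{WEAVE_PROJECT}/object/{name}:{version}"


def split_ref(ref: str) -> tuple[str, str, str, str]:
    """Split a ref into (entity, project, object name, version)."""
    if not ref.startswith(REF_PREFIX):
        raise ValueError(f"not a weave ref: {ref!r}")
    entity, project, kind, tail = ref[len(REF_PREFIX):].split("/", 3)
    if kind != "object":
        raise ValueError(f"not an object ref: {ref!r}")
    name, _, version = tail.partition(":")
    return entity, project, name, version


def load_dataset(name: str):
    """Fetch a dataset object from Weave (needs network access and W&B credentials)."""
    import weave  # imported lazily so the helpers above work without the package

    weave.init(WEAVE_PROJECT)
    return weave.ref(dataset_ref(name)).get()
```

For example, `dataset_ref("KMMLU_mini")` reproduces the KMMLU_mini entry in the table, and `load_dataset("KMMLU_mini")` would fetch that object if the assumptions above hold.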
horangi/
├── run_eval.py # Evaluation execution script
├── configs/
│ ├── base_config.yaml # Global default settings
│ └── models/ # Model configuration files
├── src/
│ ├── benchmarks/
│ │ └── horangi.py # @task function definitions (benchmark entry point)
│ ├── core/ # Core logic
│ ├── scorers/ # Custom Scorers
│ └── solvers/ # Custom Solvers
└── logs/ # Evaluation logs
📖 Extension guides:
- Add a new model → docs/README_models_en.md
- Add a new benchmark → docs/README_benchmark_en.md
- Python 3.12+
- uv (recommended) or pip
# Install uv (if not installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone repository
git clone https://github.com/wandb/llm-leaderboard-korean.git
cd llm-leaderboard-korean
# Install dependencies
uv sync
From zero to your first evaluation in about 5 minutes. Follow these four steps in order.
cp .env.sample .env
The three W&B variables are required. Horangi records all results to W&B Models + Weave, so the run aborts if any of them is missing.
# Required
WANDB_API_KEY=... # https://wandb.ai/authorize
WANDB_ENTITY=your-entity
WANDB_PROJECT=your-project
# Only fill in the keys for the providers you plan to evaluate
OPENAI_API_KEY=...
ANTHROPIC_API_KEY=...
# ...
WANDB_MODE=offline|disabled|dryrun is not supported.
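The rules above (three required W&B variables, no offline-style `WANDB_MODE`) can be checked before starting a run. This is a minimal sketch, not Horangi's own validation: `check_env` is a hypothetical helper, and it inspects whatever mapping you pass in, so values that exist only in `.env` must already be loaded into the process environment.

```python
from collections.abc import Mapping

# Variable names and unsupported modes are taken from the text above.
REQUIRED_VARS = ("WANDB_API_KEY", "WANDB_ENTITY", "WANDB_PROJECT")
UNSUPPORTED_MODES = {"offline", "disabled", "dryrun"}


def check_env(env: Mapping[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the W&B settings look usable."""
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    mode = env.get("WANDB_MODE", "").strip().lower()
    if mode in UNSUPPORTED_MODES:
        problems.append(f"WANDB_MODE={mode} is not supported")
    return problems
```

Calling `check_env(os.environ)` after `import os` reports anything that would make the run abort. Provider keys such as `OPENAI_API_KEY` are deliberately not checked, since only the providers you evaluate need them.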
Each YAML file under configs/models/ (without the extension) is a valid --config value.
ls configs/models/
# claude-opus-4-5-20251101_high-effort.yaml
# gpt-4o.yaml
# ...
If the model you want is not in the repo, see the Adding a New Model guide.
Start with a single benchmark at 5 samples to verify the setup.
uv run python run_eval.py --config gpt-4o --only kmmlu --limit 5
If a W&B run URL and a Weave URL are printed, the setup is working. Open the links to confirm the traces landed.
uv run python run_eval.py --config gpt-4o
This runs every benchmark sequentially. When finished, a summary table is posted to W&B Models and per-sample traces plus the leaderboard are uploaded to Weave.
| Option | Description | Example |
|---|---|---|
| --config | Model config filename (required) | --config gpt-4o |
| --only | Run a subset of benchmarks (comma-separated) | --only kmmlu,kobbq |
| --limit | Cap the sample count per benchmark | --limit 10 |
| --resume | Continue an interrupted W&B run | --resume abc123xy |
| --tag | Add W&B tags (repeatable) | --tag exp1 --tag test |
| --log-dir | Directory for inspect_ai logs | --log-dir /tmp/my_logs |
- vLLM models auto-start their server at the beginning of a run and shut it down at the end.
- Each benchmark's results stream into W&B in real time.
- The Weave Leaderboard is updated automatically when the run completes.
Per-task guides live in docs/.
| Goal | Doc |
|---|---|
| Add a new model and evaluate it | Adding a New Model |
| Add a new benchmark | Adding a New Benchmark |
| Set up the SWE-bench evaluation server | SWE-bench Guide |
| Explore results in Weave | Weave Guide |
Configuration layout:
configs/
├── base_config.yaml # Global defaults (shared across benchmarks)
└── models/
├── _template_api.yaml # API model template
├── _template_vllm.yaml # vLLM model template
└── <model-name>.yaml # Used as --config <model-name>
SWE-bench is a benchmark that evaluates the ability to fix bugs in real open-source projects.
📖 Detailed setup guide: docs/README_swebench_en.md
# 1. Run server (Linux environment with Docker)
uv run python src/server/swebench_server.py --host 0.0.0.0 --port 8000
# 2. Client setup (macOS, etc.)
export SWE_SERVER_URL=http://YOUR_SERVER:8000
# 3. Run evaluation
uv run python run_eval.py --config gpt-4o --only swebench_verified_official_80 --limit 5
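Because the client reaches the evaluation server through `SWE_SERVER_URL`, a quick sanity check on that variable can catch a missing or malformed value before a run starts. The sketch below is an assumption-laden illustration: `swe_server_url` is a hypothetical helper, and it only validates the shape of the URL without contacting the server.

```python
from collections.abc import Mapping
from urllib.parse import urlparse


def swe_server_url(env: Mapping[str, str]) -> str:
    """Return SWE_SERVER_URL without a trailing slash, or raise if it is unusable."""
    url = env.get("SWE_SERVER_URL", "").strip()
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError("SWE_SERVER_URL must look like http://YOUR_SERVER:8000")
    return url.rstrip("/")
```

Calling `swe_server_url(os.environ)` after `import os` on the client machine confirms that step 2 above took effect in the current shell.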
