
Commit c4d12c1

expand evaluation

1 parent f2d3f96 · commit c4d12c1

File tree

6 files changed, +1191 -10 lines changed


hf_model_evaluation/plugin.json

Lines changed: 6 additions & 3 deletions
@@ -1,7 +1,7 @@
 {
   "name": "hugging-face-evaluation-manager",
-  "version": "1.0.0",
-  "description": "Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content and importing scores from Artificial Analysis API.",
+  "version": "1.3.0",
+  "description": "Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval/inspect-ai.",
   "author": {
     "name": "Hugging Face"
   },
@@ -12,6 +12,9 @@
     "evaluation",
     "benchmarks",
     "model-cards",
-    "leaderboard"
+    "leaderboard",
+    "vllm",
+    "lighteval",
+    "inspect-ai"
   ]
 }

hf_model_evaluation/skills/hugging-face-evaluation-manager/SKILL.md

Lines changed: 223 additions & 7 deletions
@@ -1,29 +1,50 @@
 ---
 name: hugging-face-evaluation-manager
-description: Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content and importing scores from Artificial Analysis API. Works with the model-index metadata format.
+description: Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format.
 ---

 # Overview
-This skill provides tools to add structured evaluation results to Hugging Face model cards. It supports two primary methods for adding evaluation data: extracting existing evaluation tables from README content and importing benchmark scores from Artificial Analysis.
+This skill provides tools to add structured evaluation results to Hugging Face model cards. It supports multiple methods for adding evaluation data:
+- Extracting existing evaluation tables from README content
+- Importing benchmark scores from Artificial Analysis
+- Running custom model evaluations with vLLM or accelerate backends (lighteval/inspect-ai)

 ## Integration with HF Ecosystem
 - **Model Cards**: Updates model-index metadata for leaderboard integration
 - **Artificial Analysis**: Direct API integration for benchmark imports
 - **Papers with Code**: Compatible with their model-index specification
 - **Jobs**: Run evaluations directly on Hugging Face Jobs with `uv` integration
+- **vLLM**: Efficient GPU inference for custom model evaluation
+- **lighteval**: HuggingFace's evaluation library with vLLM/accelerate backends
+- **inspect-ai**: UK AI Safety Institute's evaluation framework

 # Version
-1.2.0
+1.3.0

 # Dependencies
+
+## Core Dependencies
 - huggingface_hub>=0.26.0
 - markdown-it-py>=3.0.0
 - python-dotenv>=1.2.1
 - pyyaml>=6.0.3
 - requests>=2.32.5
-- inspect-ai>=0.3.0
 - re (built-in)

+## Inference Provider Evaluation
+- inspect-ai>=0.3.0
+- inspect-evals
+- openai
+
+## vLLM Custom Model Evaluation (GPU required)
+- lighteval[accelerate,vllm]>=0.6.0
+- vllm>=0.4.0
+- torch>=2.0.0
+- transformers>=4.40.0
+- accelerate>=0.30.0
+
+Note: vLLM dependencies are installed automatically via PEP 723 script headers when using `uv run`.
+
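For reference, a PEP 723 inline-metadata header is just a comment block at the top of a Python script; a minimal sketch matching the dependency list above (the Python version constraint is illustrative) would be:

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "lighteval[accelerate,vllm]>=0.6.0",
#     "vllm>=0.4.0",
#     "torch>=2.0.0",
#     "transformers>=4.40.0",
#     "accelerate>=0.30.0",
# ]
# ///
# `uv run script.py` reads this block and installs the listed dependencies
# into an ephemeral environment before executing the script.
print("dependencies resolved by uv at launch time")
```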
 # IMPORTANT: Using This Skill

 ## ⚠️ CRITICAL: Check for Existing PRs Before Creating New Ones
@@ -80,13 +101,37 @@ Key workflow (matches CLI help):
 - **Validation**: Ensure compliance with Papers with Code specification
 - **Batch Operations**: Process multiple models efficiently

-## 4. Run Evaluations on HF Jobs
+## 4. Run Evaluations on HF Jobs (Inference Providers)
 - **Inspect-AI Integration**: Run standard evaluations using the `inspect-ai` library
 - **UV Integration**: Seamlessly run Python scripts with ephemeral dependencies on HF infrastructure
 - **Zero-Config**: No Dockerfiles or Space management required
 - **Hardware Selection**: Configure CPU or GPU hardware for the evaluation job
 - **Secure Execution**: Handles API tokens safely via secrets passed through the CLI

+## 5. Run Custom Model Evaluations with vLLM (NEW)
+
+⚠️ **Important:** This approach requires a machine with `uv` installed and sufficient GPU memory.
+**Benefits:** No `hf_jobs()` MCP tool needed; scripts can be run directly in the terminal.
+**When to use:** The user is working directly on a local machine with an available GPU.
+
+### Before running the script
+
+- Check the script path
+- Check that `uv` is installed
+- Check that a GPU is available with `nvidia-smi` (see the pre-flight sketch below)
+
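A minimal pre-flight sketch of these checks (a hypothetical helper, not one of the skill's scripts) could be:

```python
# Hypothetical pre-flight helper: confirms `uv` is on PATH and a GPU is visible.
import shutil
import subprocess

def preflight() -> bool:
    if shutil.which("uv") is None:
        print("uv is not installed")
        return False
    try:
        # `nvidia-smi` is missing or exits non-zero when no NVIDIA GPU is usable
        subprocess.run(["nvidia-smi"], check=True, capture_output=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("no NVIDIA GPU detected")
        return False
    return True

if __name__ == "__main__":
    raise SystemExit(0 if preflight() else 1)
```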
+### Running the script
+
+```bash
+uv run scripts/lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B --tasks "leaderboard|mmlu|5"
+```
+### Features
+
+- **vLLM Backend**: High-performance GPU inference (5-10x faster than standard HF methods)
+- **lighteval Framework**: HuggingFace's evaluation library with Open LLM Leaderboard tasks
+- **inspect-ai Framework**: UK AI Safety Institute's evaluation library
+- **Standalone or Jobs**: Run locally or submit to HF Jobs infrastructure
+
 # Usage Instructions

 The skill includes Python scripts in `scripts/` to perform operations.
@@ -195,6 +240,142 @@ python scripts/run_eval_job.py \
   --hardware "t4-small"
 ```

+### Method 4: Run Custom Model Evaluation with vLLM
+
+Evaluate custom HuggingFace models directly on GPU using vLLM or accelerate backends. These scripts are **separate from inference provider scripts** and run models locally on the job's hardware.
+
+#### When to Use vLLM Evaluation (vs Inference Providers)
+
+| Feature | vLLM Scripts | Inference Provider Scripts |
+|---------|-------------|---------------------------|
+| Model access | Any HF model | Models with API endpoints |
+| Hardware | Your GPU (or HF Jobs GPU) | Provider's infrastructure |
+| Cost | HF Jobs compute cost | API usage fees |
+| Speed | vLLM optimized | Depends on provider |
+| Offline | Yes (after download) | No |
+
+#### Option A: lighteval with vLLM Backend
+
+lighteval is HuggingFace's evaluation library, supporting Open LLM Leaderboard tasks.
+
+**Standalone (local GPU):**
+```bash
+# Run MMLU 5-shot with vLLM
+python scripts/lighteval_vllm_uv.py \
+  --model meta-llama/Llama-3.2-1B \
+  --tasks "leaderboard|mmlu|5"
+
+# Run multiple tasks
+python scripts/lighteval_vllm_uv.py \
+  --model meta-llama/Llama-3.2-1B \
+  --tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"
+
+# Use accelerate backend instead of vLLM
+python scripts/lighteval_vllm_uv.py \
+  --model meta-llama/Llama-3.2-1B \
+  --tasks "leaderboard|mmlu|5" \
+  --backend accelerate
+
+# Chat/instruction-tuned models
+python scripts/lighteval_vllm_uv.py \
+  --model meta-llama/Llama-3.2-1B-Instruct \
+  --tasks "leaderboard|mmlu|5" \
+  --use-chat-template
+```
+
+**Via HF Jobs:**
+```bash
+hf jobs uv run scripts/lighteval_vllm_uv.py \
+  --flavor a10g-small \
+  --secrets HF_TOKEN=$HF_TOKEN \
+  -- --model meta-llama/Llama-3.2-1B \
+  --tasks "leaderboard|mmlu|5"
+```
+
+**lighteval Task Format:**
+Tasks use the format `suite|task|num_fewshot`:
+- `leaderboard|mmlu|5` - MMLU with 5-shot
+- `leaderboard|gsm8k|5` - GSM8K with 5-shot
+- `lighteval|hellaswag|0` - HellaSwag zero-shot
+- `leaderboard|arc_challenge|25` - ARC-Challenge with 25-shot
+
+#### Option B: inspect-ai with vLLM Backend
+
+inspect-ai is the UK AI Safety Institute's evaluation framework.
+
+**Standalone (local GPU):**
+```bash
+# Run MMLU with vLLM
+python scripts/inspect_vllm_uv.py \
+  --model meta-llama/Llama-3.2-1B \
+  --task mmlu
+
+# Use HuggingFace Transformers backend
+python scripts/inspect_vllm_uv.py \
+  --model meta-llama/Llama-3.2-1B \
+  --task mmlu \
+  --backend hf
+
+# Multi-GPU with tensor parallelism
+python scripts/inspect_vllm_uv.py \
+  --model meta-llama/Llama-3.2-70B \
+  --task mmlu \
+  --tensor-parallel-size 4
+```
+
+**Via HF Jobs:**
+```bash
+hf jobs uv run scripts/inspect_vllm_uv.py \
+  --flavor a10g-small \
+  --secrets HF_TOKEN=$HF_TOKEN \
+  -- --model meta-llama/Llama-3.2-1B \
+  --task mmlu
+```
+
+**Available inspect-ai Tasks:**
+- `mmlu` - Massive Multitask Language Understanding
+- `gsm8k` - Grade School Math
+- `hellaswag` - Common sense reasoning
+- `arc_challenge` - AI2 Reasoning Challenge
+- `truthfulqa` - TruthfulQA benchmark
+- `winogrande` - Winograd Schema Challenge
+- `humaneval` - Code generation
+
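The CLI examples above wrap inspect-ai; if the Python API is preferred, a rough sketch (assuming `inspect-ai` and `inspect-evals` are installed and that the benchmark is published under the `inspect_evals/` registry name shown) might be:

```python
# Rough sketch of the inspect-ai Python API with the vLLM provider.
# Task and model names are illustrative; `limit` keeps the run small while iterating.
from inspect_ai import eval

logs = eval(
    "inspect_evals/gsm8k",                  # task from the inspect_evals registry
    model="vllm/meta-llama/Llama-3.2-1B",   # vLLM backend serving a Hugging Face model
    limit=50,
)
print(logs[0].results)
```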
+#### Option C: Python Helper Script
+
+The helper script auto-selects hardware and simplifies job submission:
+
+```bash
+# Auto-detect hardware based on model size
+python scripts/run_vllm_eval_job.py \
+  --model meta-llama/Llama-3.2-1B \
+  --task "leaderboard|mmlu|5" \
+  --framework lighteval
+
+# Explicit hardware selection
+python scripts/run_vllm_eval_job.py \
+  --model meta-llama/Llama-3.2-70B \
+  --task mmlu \
+  --framework inspect \
+  --hardware a100-large \
+  --tensor-parallel-size 4
+
+# Use HF Transformers backend
+python scripts/run_vllm_eval_job.py \
+  --model microsoft/phi-2 \
+  --task mmlu \
+  --framework inspect \
+  --backend hf
+```
+
+**Hardware Recommendations:**
+| Model Size | Recommended Hardware |
+|------------|---------------------|
+| < 3B params | `t4-small` |
+| 3B - 13B | `a10g-small` |
+| 13B - 34B | `a10g-large` |
+| 34B+ | `a100-large` |
+
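The size-to-flavor mapping could be implemented roughly as follows; this is a hypothetical sketch mirroring the table above, not the actual logic inside `run_vllm_eval_job.py`:

```python
# Hypothetical hardware auto-selection mirroring the table above.
# Thresholds (in billions of parameters) and flavor names are illustrative.
def pick_hardware(params_billions: float) -> str:
    if params_billions < 3:
        return "t4-small"
    if params_billions < 13:
        return "a10g-small"
    if params_billions < 34:
        return "a10g-large"
    return "a100-large"

assert pick_hardware(1.2) == "t4-small"
assert pick_hardware(70) == "a100-large"
```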
 ### Commands Reference

 **Top-level help and version:**
@@ -241,9 +422,9 @@ uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
 ```
 Lists all open pull requests for the model repository. Shows PR number, title, author, date, and URL.

-**Run Evaluation Job:**
+**Run Evaluation Job (Inference Providers):**
 ```bash
-hf jobs uv run hf_model_evaluation/scripts/inspect_eval_uv.py \
+hf jobs uv run scripts/inspect_eval_uv.py \
   --flavor "cpu-basic|t4-small|..." \
   --secret HF_TOKEN=$HF_TOKEN \
   -- --model "model-id" \
@@ -259,6 +440,29 @@ python scripts/run_eval_job.py \
   --hardware "cpu-basic|t4-small|..."
 ```

+**Run vLLM Evaluation (Custom Models):**
+```bash
+# lighteval with vLLM
+hf jobs uv run scripts/lighteval_vllm_uv.py \
+  --flavor "a10g-small" \
+  --secrets HF_TOKEN=$HF_TOKEN \
+  -- --model "model-id" \
+  --tasks "leaderboard|mmlu|5"
+
+# inspect-ai with vLLM
+hf jobs uv run scripts/inspect_vllm_uv.py \
+  --flavor "a10g-small" \
+  --secrets HF_TOKEN=$HF_TOKEN \
+  -- --model "model-id" \
+  --task "mmlu"
+
+# Helper script (auto hardware selection)
+python scripts/run_vllm_eval_job.py \
+  --model "model-id" \
+  --task "leaderboard|mmlu|5" \
+  --framework lighteval
+```
+
 ### Model-Index Format

 The generated model-index follows this structure:
@@ -389,6 +593,18 @@ AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
 **Issue**: "Payment required for hardware"
 - **Solution**: Add a payment method to your Hugging Face account to use non-CPU hardware

+**Issue**: "vLLM out of memory" or CUDA OOM
+- **Solution**: Use a larger hardware flavor, reduce `--gpu-memory-utilization`, or use `--tensor-parallel-size` for multi-GPU
+
+**Issue**: "Model architecture not supported by vLLM"
+- **Solution**: Use `--backend hf` (inspect-ai) or `--backend accelerate` (lighteval) for HuggingFace Transformers
+
+**Issue**: "Trust remote code required"
+- **Solution**: Add `--trust-remote-code` flag for models with custom code (e.g., Phi-2, Qwen)
+
+**Issue**: "Chat template not found"
+- **Solution**: Only use `--use-chat-template` for instruction-tuned models that include a chat template
+
 ### Integration Examples

 **Python Script Integration:**
Lines changed: 15 additions & 0 deletions
@@ -1,5 +1,20 @@
+# Core dependencies
 huggingface-hub>=0.26.0
 python-dotenv>=1.2.1
 pyyaml>=6.0.3
 requests>=2.32.5
+markdown-it-py>=3.0.0
+
+# Inference provider evaluation
 inspect-ai>=0.3.0
+inspect-evals
+openai
+
+# vLLM custom model evaluation (optional, GPU required)
+# Note: These are auto-installed via PEP 723 headers when using `uv run`
+# Uncomment if installing manually:
+# lighteval[accelerate,vllm]>=0.6.0
+# vllm>=0.4.0
+# torch>=2.0.0
+# transformers>=4.40.0
+# accelerate>=0.30.0
