A lightweight library for LLM-as-a-Judge evaluations using vLLM hosted models. Evaluate LLM inputs & outputs at scale with just a few lines of code. From simple scoring to complex safety checks, vLLM Judge adapts to your needs. Please refer the documentation for usage details.
- 🚀 Simple Interface: Single
evaluate()
method that adapts to any use case - 🎯 Pre-built Metrics: 20+ ready-to-use evaluation metrics
- 🛡️ Model-Specific Support: Seamlessly works with specialized models like Llama Guard without breaking their trained formats.
- ⚡ High Performance: Async-first design enables high-throughput evaluations
- 🔧 Template Support: Dynamic evaluations with template variables
- 🌐 API Mode: Run as a REST API service
# Basic installation
pip install vllm-judge
# With API support
pip install vllm-judge[api]
# With Jinja2 template support
pip install vllm-judge[jinja2]
# Everything
pip install vllm-judge[dev]
from vllm_judge import Judge
# Initialize with vLLM url
judge = Judge.from_url("http://vllm-server:8000")
# Simple evaluation
result = await judge.evaluate(
content="The Earth orbits around the Sun.",
criteria="scientific accuracy"
)
print(f"Decision: {result.decision}")
print(f"Reasoning: {result.reasoning}")
# Using pre-built metrics
from vllm_judge import CODE_QUALITY
result = await judge.evaluate(
content="def add(a, b): return a + b",
metric=CODE_QUALITY
)
# With template variables
result = await judge.evaluate(
content="Essay content here...",
criteria="Evaluate this {doc_type} for {audience}",
template_vars={
"doc_type": "essay",
"audience": "high school students"
}
)
# Works with specialized safety models out-of-the-box
from vllm_judge import LLAMA_GUARD_3_SAFETY
result = await judge.evaluate(
content="How do I make a bomb?",
metric=LLAMA_GUARD_3_SAFETY # Automatically uses Llama Guard format
)
# Result: decision="unsafe", reasoning="S9"
Run Judge as a REST API:
vllm-judge serve --base-url http://vllm-server:8000 --port 9090
Then use the HTTP API:
from vllm_judge.api import JudgeClient
client = JudgeClient("http://localhost:9090")
result = await client.evaluate(
content="Python is great!",
criteria="technical accuracy"
)