This repository can be used as a benchmark harness for your own model, retrieval pipeline, or agentic workflow. The core pattern is:
- define a dataset and scoring harness
- host the MetivtaEval stack
- let users register, obtain API keys, and submit their system endpoints
- publish the leaderboard produced by the evaluation service
Everything in this guide maps to files and routes that exist in the current repository state.
Important:
- this guide is for the runnable self-hosted stack
- the public metivta.co site is documentation-only and is not the execution surface
make compose-dev
Use this for the normal product surface without the demo answer service or demo seeder.
make compose-demo
This starts the full local experience:
- gateway
- FastAPI v2
- Flask compatibility API
- Celery worker
- Postgres
- Redis
- demo answer service
- demo seeder
Important:
- make compose-demo sets METIVTA_DATASET_MAX_EXAMPLES=2 so local verification finishes quickly.
- For a full local benchmark run, launch the same stack without that cap:
docker compose --profile legacy --profile demo up -d --build
make compose-faults
Use this before production to confirm that dataset misconfiguration, Redis outages, Postgres outages, and invalid user endpoints degrade cleanly in the live stack.
To stop the development stack:
make compose-dev-down
For the seeded demo stack:
make compose-demo-down
The stack exposes these local URLs by default:
- gateway root: http://localhost:18000
- gateway health: http://localhost:18000/health
- gateway readiness: http://localhost:18000/ready
- Scalar docs: http://localhost:18000/api/v2/docs
- OpenAPI JSON: http://localhost:18000/api/v2/openapi.json
- runtime signup page: http://localhost:18080/signup
- legacy leaderboard page: http://localhost:18080/leaderboard
- dataset info: http://localhost:18080/dataset-info
If you want different local ports without editing code, override them at launch:
GATEWAY_PORT=28000 FLASK_PORT=28080 docker compose --profile legacy --profile demo up -d --build
That changes the host-facing URLs while leaving container-internal routing untouched.
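As a rough sketch of what the port overrides change, the helper below rebuilds the host-facing URLs from GATEWAY_PORT and FLASK_PORT. The function name `local_urls` and the dictionary keys are illustrative and not part of the repository; the defaults 18000 and 18080 are taken from the URLs in this guide.

```python
import os


def local_urls(env=None):
    """Build host-facing URLs from the GATEWAY_PORT / FLASK_PORT overrides.

    Hypothetical helper: only the two variable names and the default
    ports come from the guide.
    """
    env = os.environ if env is None else env
    gateway = f"http://localhost:{env.get('GATEWAY_PORT', '18000')}"
    flask = f"http://localhost:{env.get('FLASK_PORT', '18080')}"
    return {
        "ready": f"{gateway}/ready",
        "docs": f"{gateway}/api/v2/docs",
        "signup": f"{flask}/signup",
        "leaderboard": f"{flask}/leaderboard",
        "dataset_info": f"{flask}/dataset-info",
    }
```

Calling `local_urls({"GATEWAY_PORT": "28000", "FLASK_PORT": "28080"})` gives the URLs to use after the override shown above; container-internal addresses are unaffected.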
If you do not want to edit config.toml for a one-off launch, Compose now passes through the
dataset and evaluator override environment variables used by the Python services:
METIVTA_DATASET_NAME=My-Benchmark \
METIVTA_DATASET_LOCAL_PATH=/app/custom-dataset \
METIVTA_DATASET_FILES_QUESTIONS=questions.json \
METIVTA_DATASET_FILES_QUESTIONS_ONLY=questions-only.json \
METIVTA_DATASET_FILES_FORMAT_RUBRIC=format_rubric.json \
METIVTA_EVALUATION_DAAT_EVALUATORS=hebrew_presence,url_format,response_length,daat_score \
docker compose --profile legacy --profile demo up -d --build
Before exposing the system publicly, verify these locally:
curl -fsS http://localhost:18000/ready | jq .
curl -fsS http://localhost:18080/signup >/dev/null && echo signup_ok
curl -fsS http://localhost:18080/dataset-info | jq .
Use this to check whether a DAAT answer endpoint behaves as the configured dataset harness expects:
curl -fsS \
-X POST http://localhost:18080/validate-endpoint \
-H 'Content-Type: application/json' \
-d '{"endpoint_url":"http://demo-answer:5001/answer"}' | jq .
If you are validating a host-side service instead of a container service, use http://host.docker.internal:<port>/<path> so the Flask container can reach it.
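The host-versus-container distinction is easy to get wrong, so here is a minimal sketch that rewrites a localhost URL to host.docker.internal and builds the same POST as the curl command. The helper names are hypothetical, and the response format of /validate-endpoint is not described here, so the sketch stops at building the request.

```python
import json
import urllib.request
from urllib.parse import urlsplit, urlunsplit

VALIDATE_URL = "http://localhost:18080/validate-endpoint"


def container_reachable(url):
    """Rewrite a localhost URL so the Flask container can reach a host-side service."""
    parts = urlsplit(url)
    if parts.hostname in {"localhost", "127.0.0.1"}:
        netloc = "host.docker.internal"
        if parts.port:
            netloc += f":{parts.port}"
        parts = parts._replace(netloc=netloc)
    return urlunsplit(parts)


def build_validation_request(endpoint_url):
    """Build the POST request that mirrors the curl example (hypothetical helper)."""
    body = json.dumps({"endpoint_url": container_reachable(endpoint_url)})
    return urllib.request.Request(
        VALIDATE_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With the stack running, `urllib.request.urlopen(build_validation_request("http://localhost:5001/answer"))` would send the request; container service URLs such as http://demo-answer:5001/answer pass through unchanged.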
The public documentation site does not issue API keys. There are two supported onboarding paths in the runnable stack.
- page: GET /signup
- submit: POST /register
This issues an API key immediately. That makes it the easiest onboarding flow when you want a public submission page.
Verified legacy registration payload:
{
"email": "team@example.com",
"name": "Team Name",
"organization": "Optional Org"
}
Users can also create accounts and issue scoped keys programmatically:
- POST /api/v2/auth/register
- POST /api/v2/auth/login
- POST /api/v2/auth/api-keys
Use this path for first-party clients, automation, or productized integrations.
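For the legacy registration path, a client could build the POST /register request along the following lines. This is a sketch under assumptions: the helper name is hypothetical, the route is assumed to live on the same local port as the signup page (18080), the body is assumed to be sent as JSON, and the response format is not documented here, so it is not parsed.

```python
import json
import urllib.request


def build_register_request(email, name, organization=None,
                           base_url="http://localhost:18080"):
    """Build a legacy POST /register request (hypothetical helper).

    The field names come from the verified payload in this guide;
    base_url and the JSON content type are assumptions.
    """
    payload = {"email": email, "name": name}
    if organization:
        # Treated as optional, following the "Optional Org" example value.
        payload["organization"] = organization
    return urllib.request.Request(
        f"{base_url}/register",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it with `urllib.request.urlopen(...)` against a running stack should return the issued API key in whatever format the service uses; inspect the raw response before relying on a particular field name.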
| If you want to change... | Change this |
|---|---|
| benchmark name and version | config.toml -> [dataset].name, [dataset].version |
| DAAT questions and ground truth | config.toml -> [dataset.files].questions |
| public question-only export | config.toml -> [dataset.files].questions_only |
| scholarly rubric | config.toml -> [dataset.files].format_rubric |
| MTEB benchmark corpus/queries/qrels | config.toml -> [dataset.mteb] |
| default script target | config.toml -> [evaluation].endpoint_url |
| enabled DAAT evaluators | config.toml -> [evaluation.daat].evaluators |
| DAAT weight mix | config.toml -> [evaluation.daat.weights] |
| web validation behavior | config.toml -> [evaluation.web_validator] |
| scoring implementation | src/metivta_eval/evaluators/ |
There are two benchmark modes, and each has a different system contract.
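As a non-authoritative sketch of what a submitted system has to implement in each mode, the two handlers below mirror the request and response shapes of the DAAT and MTEB contracts in this section. The function names, the placeholder answer, and the toy term-overlap scoring are illustrative only; a real system would call its own model or retriever and expose these handlers over HTTP at /answer and /retrieve.

```python
def handle_answer(payload):
    """DAAT contract: {"question": str} -> {"answer": str}."""
    question = payload.get("question")
    if not isinstance(question, str) or not question.strip():
        raise ValueError('expected {"question": "<non-empty string>"}')
    # Placeholder: a real system would call its model or pipeline here.
    return {"answer": f"No answer available for: {question}"}


def handle_retrieve(payload, index):
    """MTEB contract: {"query": str, "top_k": int} -> {"results": [{"id", "score"}]}.

    `index` is a hypothetical in-memory mapping of document id -> text.
    """
    query = payload.get("query")
    top_k = payload.get("top_k", 100)
    if not isinstance(query, str) or not isinstance(top_k, int) or top_k < 1:
        raise ValueError('expected {"query": "<string>", "top_k": <positive int>}')
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, text in index.items():
        # Toy relevance: fraction of query terms that appear in the document.
        overlap = query_terms & set(text.lower().split())
        score = len(overlap) / len(query_terms) if query_terms else 0.0
        scored.append({"id": doc_id, "score": score})
    # Highest score first, truncated to the requested depth.
    scored.sort(key=lambda item: item["score"], reverse=True)
    return {"results": scored[:top_k]}
```

Keeping the handlers as plain functions makes the contract testable without a web framework; any HTTP layer only needs to decode the JSON body, call the handler, and encode the result.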
In DAAT mode, your users submit an endpoint_url that accepts:
POST /answer
Content-Type: application/json
{"question":"..."}
and returns:
{"answer":"..."}
The submitted URL goes in the evaluation request body:
{
"system_name": "My QA System",
"endpoint_url": "https://my-system.example.com/answer",
"mode": "daat"
}
In MTEB mode, your users submit an endpoint_url that accepts:
POST /retrieve
Content-Type: application/json
{"query":"...","top_k":100}
and returns:
{
"results": [
{"id":"doc_1","score":0.91},
{"id":"doc_2","score":0.77}
]
}
The submitted URL goes in the evaluation request body:
{
"system_name": "My Retriever",
"endpoint_url": "https://my-system.example.com/retrieve",
"mode": "mteb"
}
The main control plane for datasets is config.toml.
For a DAAT benchmark, change these keys:
[dataset]
name = "My-Benchmark"
version = "1.0"
local_path = "src/metivta_eval/dataset"
[dataset.files]
questions = "my-dataset.json"
questions_only = "my-questions-only.json"
format_rubric = "my-format-rubric.json"
What each file does:
- questions: main DAAT dataset used for runtime evaluation
- questions_only: safe question template for public distribution
- format_rubric: scholarly-format rubric used by the standards evaluator
Expected DAAT dataset shape:
[
{
"inputs": {"question": "Your benchmark question"},
"outputs": {"answer": "Ground-truth answer"}
}
]
The loader also accepts simplified input like:
[
{
"question": "Your benchmark question",
"answer": "Ground-truth answer"
}
]
For an MTEB retrieval benchmark, change these keys:
[dataset.mteb]
corpus = "mteb/corpus.jsonl"
queries = "mteb/queries.jsonl"
qrels = "mteb/qrels.tsv"
That lets you publish your own retrieval benchmark and leaderboard.
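Returning to the DAAT question files: since two entry shapes are accepted, it can help to normalize your own files to the expected shape before committing them. The sketch below is not the repository's loader; `normalize_example` and `normalize_dataset` are hypothetical names, and only the two shapes shown in this guide are handled.

```python
def normalize_example(entry):
    """Convert either accepted DAAT entry shape to the expected nested shape.

    Hypothetical helper, not the repository's loader.
    """
    if "inputs" in entry and "outputs" in entry:
        question = entry["inputs"]["question"]
        answer = entry["outputs"]["answer"]
    else:
        # Simplified shape: flat "question" / "answer" keys.
        question = entry["question"]
        answer = entry["answer"]
    return {"inputs": {"question": question}, "outputs": {"answer": answer}}


def normalize_dataset(entries):
    """Normalize every entry; a KeyError points at a malformed entry."""
    return [normalize_example(entry) for entry in entries]
```

Running this over a file loaded with `json.load` before launch surfaces malformed entries early instead of during an evaluation run.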
You can now choose the DAAT evaluator set from config.toml:
[evaluation.daat]
enabled = true
evaluators = ["all"]
Available evaluator keys in the current repo:
- hebrew_presence
- url_format
- response_length
- scholarly_format
- correctness
- web_validation
- daat_score
If you want a more deterministic harness with no LLM-backed or remote-browser-backed scoring, use:
[evaluation.daat]
enabled = true
evaluators = ["hebrew_presence", "url_format", "response_length", "daat_score"]
That keeps the benchmark centered on:
- local dataset loading
- code-based checks
- deterministic DAAT attribution scoring
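Before pinning a profile, it may be worth checking the evaluator list for typos. The sketch below parses a comma-separated value like the one passed in METIVTA_EVALUATION_DAAT_EVALUATORS and checks it against the keys listed in this guide. The function names are hypothetical, and treating "all" as an expansion to every key is an assumption about how the repository interprets it.

```python
AVAILABLE = (
    "hebrew_presence", "url_format", "response_length",
    "scholarly_format", "correctness", "web_validation", "daat_score",
)
# The deterministic profile named in this guide (no LLM or remote browser).
DETERMINISTIC = ("hebrew_presence", "url_format", "response_length", "daat_score")


def parse_evaluators(value):
    """Parse a comma-separated evaluator list and reject unknown keys."""
    keys = [key.strip() for key in value.split(",") if key.strip()]
    if keys == ["all"]:
        # Assumption: "all" selects every available evaluator.
        return list(AVAILABLE)
    unknown = [key for key in keys if key not in AVAILABLE]
    if unknown:
        raise ValueError(f"unknown evaluator keys: {unknown}")
    return keys


def is_deterministic(keys):
    """True when every selected evaluator is in the deterministic profile."""
    return set(keys) <= set(DETERMINISTIC)
```

For example, `is_deterministic(parse_evaluators("hebrew_presence,url_format,response_length,daat_score"))` holds, while a list that includes `correctness` or `web_validation` does not.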
You can tune the composite DAAT weighting here:
[evaluation.daat.weights]
dai = 0.60
mla = 0.40
You can tune or disable web validation here:
[evaluation.web_validator]
enabled = true
timeout_ms = 15000
min_keyword_matches = 15
concurrency = 5
If you want to change scoring logic itself, the main implementation points are:
- code_evaluators.py
- standards_evaluators.py
- controlled_evaluators.py
- daat_evaluator.py
- mteb_evaluators.py
Two settings are often confused:
- the submitted system URL: this is per-evaluation input and lives in the request body as endpoint_url
- the local evaluation target: this is the default system target used by script-based local evaluation commands
The script-level target is configured here:
[evaluation]
target = "endpoint"
endpoint_url = "http://localhost:5001/answer"
dev_mode = false
Use that when you want to run local harness checks against a specific system without going through the public submission path.
Once you have your dataset and evaluator profile set:
- update config.toml
- replace the dataset files under your chosen dataset.local_path
- launch the stack locally
- verify /ready, /dataset-info, /signup, /api/v2/docs, and /leaderboard
- create a user and issue an API key
- submit DAAT or MTEB systems
- publish the leaderboard URL
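The verification step in the list above can be scripted. The following is a minimal sketch, assuming the default local ports from this guide and treating any status other than 200 as a failure; `smoke_check` and `http_status` are hypothetical names, and the `fetch` parameter exists only so the check can be exercised without a running stack.

```python
import urllib.error
import urllib.request

# (surface, path) pairs taken from the routes named in this guide.
CHECKS = [
    ("gateway", "/ready"),
    ("gateway", "/api/v2/docs"),
    ("flask", "/dataset-info"),
    ("flask", "/signup"),
    ("flask", "/leaderboard"),
]


def http_status(url, timeout=5.0):
    """Return the HTTP status for a GET, or None if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code
    except (urllib.error.URLError, OSError):
        return None


def smoke_check(gateway="http://localhost:18000",
                flask="http://localhost:18080",
                fetch=http_status):
    """Return the list of URLs that did not answer with status 200."""
    bases = {"gateway": gateway, "flask": flask}
    failures = []
    for surface, path in CHECKS:
        url = bases[surface] + path
        if fetch(url) != 200:
            failures.append(url)
    return failures
```

An empty list from `smoke_check()` means all five routes answered; pass different base URLs if the ports were overridden at launch.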
The two leaderboard surfaces are:
- modern API: /api/v2/leaderboard/
- legacy browser dashboard: /leaderboard
- use the FastAPI v2 surface for new clients
- keep /submit only if you need backward compatibility
- keep full-answer datasets private and publish questions_only templates publicly
- pin your evaluator profile in config.toml so public submissions are scored consistently
- treat the README and this guide as operator contracts; verify every public instruction after changing dataset, evaluators, or routes