This repository is used to build verifiers environments and tools for the MedARC medical language model project.
It also contains the medarc-verifiers package, which provides additional tools for creating verifiers environments.
The steps below guide you through creating a new environment package under environments/[my-new-env], installing it locally, testing it with Verifiers tooling, and optionally publishing it through Prime Intellect's Environments Hub.
- Python 3.11 or 3.12
- `uv` for dependency management
- The `prime` CLI for scaffolding and publishing
- An OpenAI-compatible API key (export it as `OPENAI_API_KEY`) or an OpenAI-compatible model for testing the environment with `vf-eval`
Create and activate a virtual environment, then install the required tooling:
```bash
uv venv --python 3.12
source .venv/bin/activate
uv tool install prime
uv pip install verifiers
```

After this setup the `prime env`, `vf-install`, and `vf-eval` commands will be available (or runnable via `uv run <command>`).
Always place new Verifiers packages under `environments/` (e.g., `environments/my-new-env`); the Prime CLI does this by default:
```bash
# from the repository root
prime env init my-new-env
```

The template produces:

```text
environments/my_new_env/
├── my_new_env.py
├── pyproject.toml
└── README.md
```
Edit `my_new_env.py` to configure datasets, parsers, and rubrics, and update the package metadata in `pyproject.toml` (name, version, dependencies, tags, etc.).
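For orientation, here is a minimal sketch of what a `load_environment` entry point could look like. The dataset, column mapping, and exact-match reward below are placeholders, and class or method names may differ slightly across `verifiers` versions; treat it as a starting point rather than a drop-in implementation.

```python
# my_new_env.py — illustrative sketch only; adapt the dataset and rewards to your task
import verifiers as vf
from datasets import load_dataset


def load_environment(split: str = "train"):
    # Hypothetical dataset and column names; replace with your own source.
    dataset = load_dataset("openai/gsm8k", "main", split=split)
    dataset = dataset.map(
        lambda x: {"question": x["question"], "answer": x["answer"].split("####")[-1].strip()}
    )

    parser = vf.Parser()  # or vf.XMLParser / vf.ThinkParser, depending on the output format

    def exact_match(completion, answer, **kwargs) -> float:
        # 1.0 when the parsed model answer matches the reference answer, else 0.0
        parsed = parser.parse_answer(completion) or ""
        return 1.0 if parsed.strip() == str(answer).strip() else 0.0

    rubric = vf.Rubric(funcs=[exact_match], weights=[1.0])
    return vf.SingleTurnEnv(dataset=dataset, parser=parser, rubric=rubric)
```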
If `prime env init` doesn't add it, add the following metadata so that `prime` and Verifiers can locate the environment in a flat repository layout:
```toml
[tool.prime.environment]
loader = "my_new_env:load_environment"
display_name = "My New Env"
visibility = "PUBLIC"
```

Install your new environment in editable mode so changes are picked up immediately:
```bash
vf-install my-new-env
# equivalent to:
# uv pip install -e ./environments/my_new_env
```

You can now import it from Python or let Verifiers discover it with `verifiers.load_environment("my-new-env")`.
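For example, a quick sanity check from Python using the call mentioned above:

```python
import verifiers as vf

# Loads the locally installed environment package.
env = vf.load_environment("my-new-env")
print(type(env))
```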
Run a small batch of rollouts to confirm the environment behaves as expected. Set `OPENAI_API_KEY` (or whichever OpenAI-compatible credentials you plan to use) before invoking the CLI.
```bash
export OPENAI_API_KEY=sk-...
vf-eval my-new-env -m gpt-4.1-mini -n 5 -s
```

A few useful arguments:
- `-m` selects the inference model
- `-n` sets the number of examples to evaluate
- `-s` saves results locally

Use `vf-eval -h` for the full set of options (rollouts per example, max concurrency, etc.).
During development you can iterate quickly by tweaking prompts, parser logic, or reward functions, reinstalling with `vf-install` if dependencies change, and rerunning `vf-eval` to view the results.
After running with `-s`, inspect saved runs with `vf-tui`, which provides a terminal UI for browsing prompts, completions, and rewards under the generated `outputs/evals` folders.
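Put together, a typical inner loop looks like this (the same commands introduced above; adjust the model and example count to taste):

```bash
# edit prompts / parser / rewards in environments/my_new_env/my_new_env.py, then:
vf-install my-new-env                       # only needed again if dependencies changed
vf-eval my-new-env -m gpt-4.1-mini -n 5 -s  # small saved run
vf-tui                                      # browse saved runs under outputs/evals/
```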
Once your tooling is set up you can install MedARC-maintained environments directly from the Prime Environments Hub (for example medarc/medcasereasoning or medarc/metamedqa).
- **Install from the Hub.** Run `prime env install medarc/medcasereasoning` to pull the latest published version (add `@version` to pin a release).
- **Run an evaluation.** Execute `vf-eval medcasereasoning -m gpt-4.1-mini -n 10 -s` to generate a small batch of rollouts.
- **Load programmatically.** Environments installed via the Hub are importable like any other Verifiers module:

```python
import verifiers as vf

env = vf.load_environment("medcasereasoning", split="validation")
results = env.evaluate(model_client, "gpt-4.1-mini", num_examples=5)
```
`medarc-eval` wraps the upstream `vf-eval` flow and adds environment-specific CLI flags generated from each environment's `load_environment` signature, instead of requiring a JSON blob via `--env-args`.
```bash
uv run medarc-eval medqa -m gpt-4.1-mini -n 5
uv run medarc-eval medbullets --help
uv run medarc-eval medbullets --num-options 4 --env-args '{"shuffle": true}'
```

Explicit flags always override JSON input. For list parameters, repeat the flag to replace the default entirely:

```bash
uv run medarc-eval longhealth --section cardio --section neuro
```

Use `--env-args` for complex structures (dicts, nested generics) that cannot be mapped to simple flags:

```bash
uv run medarc-eval medagentbench --env-args '{"config": {"mode": "fast"}}'
```

Print the detected environment schema:

```bash
uv run medarc-eval mmlu_pro_health --print-env-schema
```

Token usage is automatically tracked when using `medarc-eval`. Each result/rollout includes a `token_usage` column containing a nested dict:
```json
{
  "token_usage": {
    "model": {"prompt": 450, "completion": 280, "total": 730},
    "judge": {"prompt": 3200, "completion": 150, "total": 3350},
    "total": {"prompt": 3650, "completion": 430, "total": 4080}
  }
}
```

Token tracking works automatically:
```bash
uv run medarc-eval medqa -m gpt-4.1-mini -n 5 -s
```

To enable token tracking with `vf-eval`, add `medarc_verifiers` as a dependency in your environment's `pyproject.toml`:
```toml
[project]
dependencies = [
    "verifiers>=0.1.2.post0",
    "medarc_verifiers>=0.1.0",
]

[tool.uv.sources]
medarc_verifiers = { git = "https://github.com/MedARC-AI/med-lm-envs" }
```

Then reinstall the environment:
```bash
uv pip install -e ./environments/your-env
vf-eval your-env -m gpt-4.1-mini -n 5 -s
```

To disable token tracking, set:

```bash
export MEDARC_DISABLE_TOKEN_TRACKING=true
```

Notes on token accounting:

- Works with any OpenAI-compatible provider
- Tokens are extracted from the API `response.usage` field
- If a provider doesn't return usage data, counts default to 0
- Model tokens include all inference API calls
- Judge tokens include all LLM-as-judge calls made via the `judge()` method (e.g., FactScore: 6-20 verification calls per example)
- Note: some judge implementations (e.g., FactScore claim extraction) make additional API calls that are currently not tracked because they are neither part of `judge()` calls nor stored in `state["responses"]`. This is a small overhead (~10-20% of total judge tokens) and applies to existing implementations such as MedRedQA, so keep it in mind when interpreting token totals.
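If you want to aggregate these numbers across a saved run, a small script along the following lines works. It assumes the run was saved with `-s` as JSONL rows containing a `token_usage` field; the output path and file format depend on your `vf-eval`/`medarc-eval` version, so adjust accordingly.

```python
# Sketch: sum model vs. judge token usage over a saved run.
# The results path below is hypothetical — point it at whatever your run produced.
import json
from collections import Counter
from pathlib import Path

results_path = Path("outputs/evals/medqa--gpt-4.1-mini/results.jsonl")  # adjust

totals: dict[str, Counter] = {}
with results_path.open() as f:
    for line in f:
        if not line.strip():
            continue
        usage = json.loads(line).get("token_usage") or {}
        for section, counts in usage.items():  # "model", "judge", "total"
            totals.setdefault(section, Counter()).update(counts)

for section, counts in totals.items():
    print(f"{section}: {dict(counts)}")
```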