- `--batch_size` works for both open-source and API model evaluation. When evaluating open-source models, you have to adjust `batch_size` according to the GPU memory; when evaluating API models, `--batch_size` specifies the number of parallel calls to the target API model. You should set it properly according to your OpenAI user tier to avoid rate limits.
- `--api_parallel_num` specifies the number of parallel calls to the model parser API. In general, if you are a Tier-5 user, you can set `--api_parallel_num` to 100 or more to parse results in 30 seconds.
- Specify the `--api_base_url` if you wish to use another API such as the llama.cpp server or the Azure OpenAI API.
- You can use `--max_gpu_memory` to specify the maximum memory per GPU for storing model weights. This allows more memory to be allocated for activations, so you can use longer context lengths or a larger `batch_size`. E.g., with 4 GPUs, we can set `--max_gpu_memory 5GiB` for `gemma_11_7b_instruct`.
- Model response files and scores will be saved to `<output_folder>/<model_name>/<benchmark>/<version>/`, for example, `mix_eval/data/model_responses/gemma_11_7b_instruct/mixeval_hard/2024-06-01/`. We take the **overall score** as the reported score in the Leaderboard.
- There is a resuming mechanism: if you run an evaluation with the same config as the run you want to resume, it will resume from where it stopped last time.
- If you are evaluating base models, set the `--extract_base_model_response` flag to retain only the meaningful part of the models' responses when parsing, which gives more stabilized parsing results.
- If you are evaluating API models, you should add a line to `.env`. E.g., for an OpenAI key, you should add `k_oai=<your openai api key>`. The key name here is `k_oai`. You can find the key name in the model's class. For example, `claude_3_haiku`'s key can be found in `mixeval.models.claude_3_haiku`'s `__init__` function: `api_key=os.getenv('k_ant')`, where `k_ant` is the key name.
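The key lookup above follows a simple pattern: each model class reads its key from an environment variable populated from `.env`. A minimal sketch of that pattern (the key name `k_ant` mirrors the `claude_3_haiku` example above; the placeholder value is an assumption):

```python
import os

# Each model class hard-codes the name of its environment variable
# (e.g. 'k_ant') in __init__; the value normally comes from the .env file.
os.environ.setdefault("k_ant", "<your anthropic api key>")  # stand-in for .env

api_key = os.getenv("k_ant")
if api_key is None:
    raise RuntimeError("add k_ant=<your key> to your .env file")
```

If `os.getenv` returns `None`, the corresponding line is missing from your `.env`.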
If you are evaluating a local checkpoint, specify `--model_path <your model path>` and `--model_name local_chat` (or `--model_name local_base` if you are evaluating a base model):
```bash
python -m mix_eval.evaluate \
    --model_name local_chat \
    --model_path <your model path> \
    --benchmark mixeval_hard \
    --version 2024-06-01 \
    --batch_size 20 \
    --max_gpu_memory 5GiB \
    --output_dir mix_eval/data/model_responses/ \
    --api_parallel_num 20
```
Modify `mix_eval/models/local_chat.py` or `mix_eval/models/local_base.py` according to your model config. You need to overwrite the `build_model` function if your checkpoint cannot be loaded by `transformers.AutoModelForCausalLM.from_pretrained`. The same applies to `build_tokenizer`.
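Overwriting `build_model` is just a subclass-style method override. A hedged sketch of the shape this takes (the `LocalChatModel` base class and the string return values are illustrative stand-ins, not the actual classes in `mix_eval/models/local_chat.py`):

```python
# Illustrative stand-in for the model class in mix_eval/models/local_chat.py.
class LocalChatModel:
    def __init__(self, model_path):
        self.model_path = model_path
        self.model = self.build_model()
        self.tokenizer = self.build_tokenizer()

    def build_model(self):
        # default path: transformers.AutoModelForCausalLM.from_pretrained(...)
        raise NotImplementedError

    def build_tokenizer(self):
        # default path: transformers.AutoTokenizer.from_pretrained(...)
        raise NotImplementedError


class MyLocalChat(LocalChatModel):
    def build_model(self):
        # Replace with your own loading logic, e.g. a checkpoint format
        # that from_pretrained cannot read. Returning a string here only
        # to keep the sketch self-contained.
        return f"custom model loaded from {self.model_path}"

    def build_tokenizer(self):
        return f"custom tokenizer loaded from {self.model_path}"
```

The evaluator only cares that `build_model` and `build_tokenizer` return usable model and tokenizer objects, so the override point is the whole customization surface.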
Some of you might use Azure OpenAI endpoint instead of direct usage of OpenAI API.
You can simply drop your Azure credentials in the `.env` like this:

```
OPENAI_API_TYPE=azure
OPENAI_API_KEY=xyz
OPENAI_API_BASE=xyz
OPENAI_API_VERSION=2023-07-01-preview
```
❗ If you are using Azure, there shouldn't be a `MODEL_PARSER_API` entry in `.env`, otherwise it will still use the OpenAI API.
Specify the `--api_base_url` if you wish to use another API such as the llama.cpp server.
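For instance, pointing the parser calls at a local llama.cpp server might look like the following. This is a sketch, not a verified invocation: the URL, port, and `/v1` path are assumptions about your server setup, and the other flags simply mirror the example command above.

```shell
# llama.cpp's server exposes an OpenAI-compatible endpoint (commonly
# http://localhost:8080/v1); pass it via --api_base_url so parsing
# requests go to the local server instead of the OpenAI API.
python -m mix_eval.evaluate \
    --model_name local_chat \
    --model_path <your model path> \
    --benchmark mixeval_hard \
    --version 2024-06-01 \
    --batch_size 20 \
    --output_dir mix_eval/data/model_responses/ \
    --api_parallel_num 20 \
    --api_base_url http://localhost:8080/v1
```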