An open-source platform for evaluating and comparing different language models in head-to-head battles.
- Battle Arena: Test the same prompt across multiple LLM models simultaneously
- Custom Evaluation: Define your own judging criteria for comparing responses
- Multiple Providers: Support for OpenAI, Anthropic, Google, Mistral, and Llama models
- Battle History: Save and review past evaluations
- Environment-based Configuration: Configure API keys via environment variables
- Beautiful UI: Modern, responsive interface with a battle theme
- Open Source: Free to use, modify, and contribute
- Node.js 22.x or higher
- yarn
- API keys for at least one of the supported LLM providers
- Clone the repository:

  ```bash
  git clone https://github.com/lemonberrylabs/evals-arena
  cd evals-arena
  ```

- Install dependencies:

  ```bash
  yarn install
  ```

- Set up your environment variables. Create a `.env.local` file in the project root with your API keys (see ENV_SETUP.md for details):

  ```bash
  OPENAI_API_KEY=your-openai-api-key
  NEXT_PUBLIC_ENABLED_PROVIDERS=openai
  ```

- Start the development server:

  ```bash
  yarn dev
  ```

- Open http://localhost:3000 in your browser
See ENV_SETUP.md for complete details on configuring the application with environment variables.
- Enter a user prompt - the task or question you want to test
- (Optional) Add developer instructions - system instructions for the models
- Define judging criteria - how responses should be evaluated
- Select models to participate in the battle
- Click "Battle" to start the evaluation
- See all model responses side by side
- Review the judge's evaluation and scoring
- See which model won based on the criteria
- Save results for future reference
The application also supports a REST API for running battles.
- route: `/api/battle` (see route.ts)
- method: `POST`
- body: `BattleSetup` (see BattleSetup)
  - `userPrompt` (required): The prompt to test
  - `selectedModels` (required): An array of model IDs to include in the battle
  - `judgeCriteria` (optional): The criteria for judging the responses; if not provided, the default judging criteria are used (see defaultJudgeCriteria)
  - `developerPrompt` (optional): System instructions for the models
- response: `BattleResponse` (see BattleResponse)
  - `modelResponses`: An array of model responses (see ModelResponse)
  - `judgeEvaluation`: An array of judge evaluations (see JudgeEvaluation)
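For reference, the request and response bodies roughly correspond to shapes like the following. This is a hypothetical sketch, not the project's actual type definitions: the field names inside `ModelResponse` and `JudgeEvaluation` are assumptions, and the authoritative definitions live in the source files linked above.

```ts
// Hypothetical sketch of the API shapes described above.
// The real definitions are in the linked source files.
interface BattleSetup {
  userPrompt: string       // required: the prompt to test
  selectedModels: string[] // required: model IDs to include in the battle
  judgeCriteria?: string   // optional: falls back to defaultJudgeCriteria
  developerPrompt?: string // optional: system instructions for the models
}

interface ModelResponse {
  modelId: string // assumed field names
  response: string
}

interface JudgeEvaluation {
  modelId: string // assumed field names
  score: number
  reasoning: string
}

interface BattleResponse {
  modelResponses: ModelResponse[]
  judgeEvaluation: JudgeEvaluation[]
}
```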
Example:

```bash
curl localhost:3000/api/battle -H "Content-Type: application/json" -d '
{
  "userPrompt": "tell me about the moors like I was 8 years old",
  "selectedModels": ["gpt-4o", "gemini-1.5-pro"],
  "developerPrompt": "The response should be in the style of a childrens book."
}
'
```
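The same request from TypeScript, as a minimal sketch using the built-in `fetch` API (assumes the dev server is running on localhost:3000):

```ts
// Minimal sketch: calling the battle endpoint from TypeScript.
const res = await fetch('http://localhost:3000/api/battle', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    userPrompt: 'tell me about the moors like I was 8 years old',
    selectedModels: ['gpt-4o', 'gemini-1.5-pro'],
    developerPrompt: "The response should be in the style of a children's book.",
  }),
})
const battle = await res.json()
console.log(battle.modelResponses, battle.judgeEvaluation)
```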
All of the configuration for supported providers should be centralized in src/config/models.ts. If you are adding a new provider, make sure that no other files are affected (a rough sketch follows the checklist below):
- Add the provider to the `Provider` enum.
- Add the provider config to the `configs` array and include at least one model config.
- Make sure to follow naming conventions.
- Update README.md and ENV_SETUP.md to reflect the new provider.
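As a rough illustration, adding a hypothetical provider called Acme might look like the sketch below. Only the `Provider` enum and `configs` array are named in this README; every other identifier and field is an assumption, so check src/config/models.ts for the real shapes.

```ts
// Hypothetical sketch of src/config/models.ts after adding a provider.
// Field names other than `Provider` and `configs` are assumptions.
export enum Provider {
  OpenAI = 'openai',
  Acme = 'acme', // the new provider
}

interface ModelConfig {
  id: string
  displayName: string
}

interface ProviderConfig {
  provider: Provider
  apiKeyEnvVar: string  // e.g. ACME_API_KEY, following the naming convention
  baseURL: string       // OpenAI-compatible endpoint (see the note below)
  models: ModelConfig[] // at least one model config
}

export const configs: ProviderConfig[] = [
  {
    provider: Provider.Acme,
    apiKeyEnvVar: 'ACME_API_KEY',
    baseURL: 'https://api.acme.example/v1',
    models: [{ id: 'acme-large', displayName: 'Acme Large' }],
  },
]
```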
Note: the code currently assumes that all providers support an OpenAI-compatible API, since a single OpenAI client implementation is used for every provider. If you are adding a provider that does not support the OpenAI-compatible API, you will need to abstract the API calls for all providers.
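This single-client approach works because the OpenAI SDK accepts a custom `baseURL`. A minimal sketch of the pattern, with an illustrative endpoint and environment variable:

```ts
import OpenAI from 'openai'

// One OpenAI client implementation, reused for any OpenAI-compatible
// provider by swapping the base URL. The endpoint and env var below are
// illustrative, not from this project.
const client = new OpenAI({
  apiKey: process.env.ACME_API_KEY,
  baseURL: 'https://api.acme.example/v1',
})

const completion = await client.chat.completions.create({
  model: 'acme-large',
  messages: [{ role: 'user', content: 'Hello!' }],
})
console.log(completion.choices[0].message.content)
```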
This application is designed for local development and educational purposes. In its current state:
- API keys are stored in environment variables that are not exposed to the client
- For production use, consider implementing a backend API gateway to proxy requests (a minimal sketch of the idea follows)
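A minimal sketch of that gateway idea as a Next.js route handler; the file path, upstream URL, and pass-through payload are illustrative, not part of this project:

```ts
// app/api/proxy/route.ts (hypothetical) — the server holds the API key
// and forwards requests, so the key is never shipped to the client.
import { NextResponse } from 'next/server'

export async function POST(req: Request) {
  const body = await req.json()
  const upstream = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`, // stays server-side
    },
    body: JSON.stringify(body),
  })
  return NextResponse.json(await upstream.json(), { status: upstream.status })
}
```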
- Next.js - React framework
- TypeScript - Type safety
- Tailwind CSS - Styling
- Zustand - State management
- React Hook Form - Form handling
- Zod - Schema validation
- Lucide Icons - Beautiful icons
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Inspired by the need for better tools to compare LLM performance
- Thanks to all the LLM providers for their amazing models
- Built with ❤️ by lemonberry labs for the open-source community



