An open-source platform for evaluating and comparing different language models in head-to-head battles.
- Battle Arena: Test the same prompt across multiple LLM models simultaneously
- Custom Evaluation: Define your own judging criteria for comparing responses
- Multiple Providers: Support for OpenAI, Anthropic, Google, Mistral, and Llama models
- Battle History: Save and review past evaluations
- Environment-based Configuration: Configure API keys via environment variables
- Beautiful UI: Modern, responsive interface with a battle theme
- Open Source: Free to use, modify, and contribute
- Node.js 22.x or higher
- yarn
- API keys for at least one of the supported LLM providers
- Clone the repository:

  ```bash
  git clone https://github.com/lemonberrylabs/evals-arena
  cd evals-arena
  ```

- Install dependencies:

  ```bash
  yarn install
  ```

- Set up your environment variables. Create a `.env.local` file in the project root with your API keys (see ENV_SETUP.md for details):

  ```bash
  OPENAI_API_KEY=your-openai-api-key
  NEXT_PUBLIC_ENABLED_PROVIDERS=openai
  ```

- Start the development server:

  ```bash
  yarn dev
  ```

- Open http://localhost:3000 in your browser
See ENV_SETUP.md for complete details on configuring the application with environment variables.
- Enter a user prompt - the task or question you want to test
- (Optional) Add developer instructions - system instructions for the models
- Define judging criteria - how responses should be evaluated
- Select models to participate in the battle
- Click "Battle" to start the evaluation
- See all model responses side by side
- Review the judge's evaluation and scoring
- See which model won based on the criteria
- Save results for future reference
The application also supports a REST API for running battles.
- route: `/api/battle` (see route.ts)
- method: `POST`
- body: `BattleSetup` (see BattleSetup)
  - `userPrompt` (required): The prompt to test
  - `selectedModels` (required): An array of model IDs to include in the battle
  - `judgeCriteria` (optional): The criteria for judging the responses; if not provided, the default judging criteria are used (see defaultJudgeCriteria)
  - `developerPrompt` (optional): System instructions for the models
- response: `BattleResponse` (see BattleResponse)
  - `modelResponses`: An array of model responses (see ModelResponse)
  - `judgeEvaluation`: An array of judge evaluations (see JudgeEvaluation)
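For reference, the request and response bodies roughly correspond to shapes like the following. This is a hypothetical sketch, not the project's actual type definitions: the field names inside `ModelResponse` and `JudgeEvaluation` are assumptions, and the authoritative definitions live in the source files linked above.

```ts
// Hypothetical sketch of the API shapes described above.
// The real definitions are in the linked source files.
interface BattleSetup {
  userPrompt: string       // required: the prompt to test
  selectedModels: string[] // required: model IDs to include in the battle
  judgeCriteria?: string   // optional: falls back to defaultJudgeCriteria
  developerPrompt?: string // optional: system instructions for the models
}

interface ModelResponse {
  modelId: string // assumed field names
  response: string
}

interface JudgeEvaluation {
  modelId: string // assumed field names
  score: number
  reasoning: string
}

interface BattleResponse {
  modelResponses: ModelResponse[]
  judgeEvaluation: JudgeEvaluation[]
}
```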
Example:

```bash
curl localhost:3000/api/battle -H "Content-Type: application/json" -d '
{
  "userPrompt": "tell me about the moors like I was 8 years old",
  "selectedModels": ["gpt-4o", "gemini-1.5-pro"],
  "developerPrompt": "The response should be in the style of a childrens book."
}
'
```
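The same request from TypeScript, as a minimal sketch using the built-in `fetch` API (assumes the dev server is running on localhost:3000):

```ts
// Minimal sketch: calling the battle endpoint from TypeScript.
const res = await fetch('http://localhost:3000/api/battle', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    userPrompt: 'tell me about the moors like I was 8 years old',
    selectedModels: ['gpt-4o', 'gemini-1.5-pro'],
    developerPrompt: "The response should be in the style of a children's book.",
  }),
})
const battle = await res.json()
console.log(battle.modelResponses, battle.judgeEvaluation)
```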
All of the configuration for supported providers should be centralized in src/config/models.ts. If you are adding a new provider, make sure that no other files are affected (a rough sketch follows the checklist below):
- Add the provider to the `Provider` enum.
- Add the provider config to the `configs` array and include at least one model config.
- Make sure to follow naming conventions.
- Update README.md and ENV_SETUP.md to reflect the new provider.
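As a rough illustration, adding a hypothetical provider called Acme might look like the sketch below. Only the `Provider` enum and `configs` array are named in this README; every other identifier and field is an assumption, so check src/config/models.ts for the real shapes.

```ts
// Hypothetical sketch of src/config/models.ts after adding a provider.
// Field names other than `Provider` and `configs` are assumptions.
export enum Provider {
  OpenAI = 'openai',
  Acme = 'acme', // the new provider
}

interface ModelConfig {
  id: string
  displayName: string
}

interface ProviderConfig {
  provider: Provider
  apiKeyEnvVar: string  // e.g. ACME_API_KEY, following the naming convention
  baseURL: string       // OpenAI-compatible endpoint (see the note below)
  models: ModelConfig[] // at least one model config
}

export const configs: ProviderConfig[] = [
  {
    provider: Provider.Acme,
    apiKeyEnvVar: 'ACME_API_KEY',
    baseURL: 'https://api.acme.example/v1',
    models: [{ id: 'acme-large', displayName: 'Acme Large' }],
  },
]
```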
Note: the code currently assumes that all providers support an OpenAI-compatible API, since a single OpenAI client implementation is used for every provider. If you are adding a provider that does not support the OpenAI-compatible API, you will need to abstract the API calls for all providers.
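This single-client approach works because the OpenAI SDK accepts a custom `baseURL`. A minimal sketch of the pattern, with an illustrative endpoint and environment variable:

```ts
import OpenAI from 'openai'

// One OpenAI client implementation, reused for any OpenAI-compatible
// provider by swapping the base URL. The endpoint and env var below are
// illustrative, not from this project.
const client = new OpenAI({
  apiKey: process.env.ACME_API_KEY,
  baseURL: 'https://api.acme.example/v1',
})

const completion = await client.chat.completions.create({
  model: 'acme-large',
  messages: [{ role: 'user', content: 'Hello!' }],
})
console.log(completion.choices[0].message.content)
```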
This application is designed for local development and educational purposes. In its current state:
- API keys are stored in environment variables that are not exposed to the client
- For production use, consider implementing a backend API gateway to proxy requests (a minimal sketch of the idea follows)
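A minimal sketch of that gateway idea as a Next.js route handler; the file path, upstream URL, and pass-through payload are illustrative, not part of this project:

```ts
// app/api/proxy/route.ts (hypothetical) — the server holds the API key
// and forwards requests, so the key is never shipped to the client.
import { NextResponse } from 'next/server'

export async function POST(req: Request) {
  const body = await req.json()
  const upstream = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`, // stays server-side
    },
    body: JSON.stringify(body),
  })
  return NextResponse.json(await upstream.json(), { status: upstream.status })
}
```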
- Next.js - React framework
- TypeScript - Type safety
- Tailwind CSS - Styling
- Zustand - State management
- React Hook Form - Form handling
- Zod - Schema validation
- Lucide Icons - Beautiful icons
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Inspired by the need for better tools to compare LLM performance
- Thanks to all the LLM providers for their amazing models
- Built with ❤️ by lemonberry labs for the open-source community



