Open Deep Research is an open-source agentic framework developed by Hugging Face that autonomously conducts web-based research. Built using the smolagents framework, it can browse the internet, extract and analyze information, and perform data manipulations to answer complex queries.
We optimized the Open Deep Research framework using our proposed SEW (Self-Evolving Workflow) optimizer, with a primary focus on improving the prompts within the framework. The performance of Open Deep Research with original and optimized prompts on the full GAIA validation set is shown in the following figure:
Figure 1: Performance comparison between original and optimized Open Deep Research on the full GAIA validation set

The results indicate that our optimized prompts improve performance by 18.41% on average, with noticeable improvements on tasks from all three levels of the GAIA benchmark.
In our experiments, we leveraged the OpenAI `o3` model to optimize the prompts and used `gpt-4o-mini` to run the framework during evaluation. The total cost of the optimization process was approximately $45, with the majority (about $42) coming from running `gpt-4o-mini` for validation. These results indicate that our optimization process is cost-effective and achieves remarkable performance improvements.
We chose Open Deep Research because it is one of the few open-source, runnable frameworks on the GAIA leaderboard. Most other submissions are either closed-source or lack runnable code. Alongside OWL, Open Deep Research offers a strong baseline for evaluating and improving web-based research agents. While OWL is optimized in another teammate's repository, this work focuses on optimizing Open Deep Research for the GAIA leaderboard.
Figure 2: GAIA Leaderboard showing Open Deep Research performance and ranking among other submissions

We made the following modifications to the original framework:
- **LLM Backbone**

  We replaced the original `o1` model (used in the leaderboard submission) with `gpt-4o-mini` in our experiments. The main reason is the extremely high token consumption of this framework: in our preliminary tests, running just 50 samples with `o1` incurred about $150 in API fees, so running the full validation set of 165 examples would cost approximately $495 (about $3 per sample × 165), making it impractical for iterative optimization. To reduce cost while preserving reasonable performance, we switched to the more cost-effective `gpt-4o-mini`. However, even with this smaller model, running the full validation set still costs around $55, highlighting the inherently high token consumption of the Open Deep Research framework.
- **Optimized Prompts**

  We optimized the prompts within the Open Deep Research framework using our proposed `SEWOptimizer`. In our experiments, we randomly sampled 25 questions from the GAIA validation set and used them as a validation subset for optimization. The optimized prompts can be found in the `src/smolagents/prompts` folder:
  - `code_agent_4o_mini_optimized.yaml`
  - `toolcalling_agent_4o_mini_optimized.yaml`
- **Easier Run with `--optimized` Flag**

  We modified the script `run_gaia.py` in the `examples/open_deep_research` folder to accept an `--optimized` argument, which lets users switch between the original and optimized prompts effortlessly (see the sketch after this list). You can follow the instructions in the `examples/open_deep_research` folder to run the framework.
- **Evaluation Script**

  We added a new script `evaluate.py` in the `examples/open_deep_research` folder to facilitate the evaluation of model outputs.
- **Evaluation Results**

  To facilitate evaluation, we provide the results of the original and optimized prompts on the full GAIA validation set in the `output/validation` folder:
  - results with original prompts: `gpt-4o-mini_results.jsonl`
  - results with optimized prompts: `gpt-4o-mini_optimized_results.jsonl`

  These files can be used directly for quick comparison and to reproduce the reported results.
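For illustration, here is a minimal sketch of how a flag like `--optimized` can switch between prompt sets. The optimized file names come from the list above; the original template names (`code_agent.yaml`, `toolcalling_agent.yaml`) and the loading code itself are assumptions for this sketch, not the exact implementation in `run_gaia.py`:

```python
# Sketch: select prompt templates based on an --optimized flag.
# File names for the optimized prompts match src/smolagents/prompts;
# the original template names and this loading logic are assumed.
import argparse
import yaml

PROMPT_DIR = "src/smolagents/prompts"

def load_prompt_templates(optimized: bool) -> dict:
    """Load the code-agent and tool-calling-agent prompt templates."""
    suffix = "_4o_mini_optimized" if optimized else ""
    templates = {}
    for agent in ("code_agent", "toolcalling_agent"):
        path = f"{PROMPT_DIR}/{agent}{suffix}.yaml"
        with open(path, "r", encoding="utf-8") as f:
            templates[agent] = yaml.safe_load(f)
    return templates

parser = argparse.ArgumentParser()
parser.add_argument("--optimized", action="store_true",
                    help="Use the SEW-optimized prompts instead of the originals")
args, _ = parser.parse_known_args()
prompts = load_prompt_templates(args.optimized)
```

Keeping both prompt sets as separate YAML files and toggling with a single flag makes A/B comparison on the benchmark a one-argument change.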
Follow the instructions in the `examples/open_deep_research` folder to set up the environment. Then create a file `.env` with the following content:

```
OPENAI_API_KEY=your_openai_api_key
SERPER_API_KEY=your_serper_api_key
```
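If you want to confirm the keys are visible before launching a long run, a quick check with `python-dotenv` works (this helper is an assumption for convenience, not part of the repository):

```python
# Sanity check that the API keys from .env are visible to the process.
# Assumes python-dotenv is installed (pip install python-dotenv).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
for key in ("OPENAI_API_KEY", "SERPER_API_KEY"):
    assert os.getenv(key), f"{key} is missing - check your .env file"
print("Both API keys loaded.")
```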
Run the following commands to reproduce the original Open Deep Research implementation on the validation set of the GAIA benchmark:

```bash
cd examples/open_deep_research
python run_gaia.py --concurrency 10 --model-id gpt-4o-mini --run-name gpt-4o-mini_results
```
Run the following command to reproduce our optimized implementation on the validation set of the GAIA benchmark:

```bash
cd examples/open_deep_research
python run_gaia.py --concurrency 10 --model-id gpt-4o-mini --run-name gpt-4o-mini_optimized_results --optimized
```
Run the following command to evaluate the performance:

```bash
python evaluate.py --output_file /path/to/your/results.jsonl
```
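For reference, here is a minimal sketch of the kind of scoring such an evaluation performs: reading a results JSONL and computing exact-match accuracy per GAIA level. The record field names (`prediction`, `true_answer`, `task`) are assumptions about the output format, not guaranteed to match the actual `evaluate.py`:

```python
# Sketch: exact-match scoring over a results JSONL, grouped by GAIA level.
# Field names ("prediction", "true_answer", "task") are assumed.
import json
from collections import defaultdict

def normalize(answer) -> str:
    """Lowercase and strip whitespace/trailing period for a lenient match."""
    return str(answer).strip().lower().rstrip(".")

def score_file(path: str) -> None:
    correct, total = defaultdict(int), defaultdict(int)
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            level = record.get("task", "unknown")
            total[level] += 1
            if normalize(record["prediction"]) == normalize(record["true_answer"]):
                correct[level] += 1
    for level in sorted(total):
        print(f"Level {level}: {correct[level]}/{total[level]} "
              f"({100 * correct[level] / total[level]:.1f}%)")
    print(f"Overall: {sum(correct.values())}/{sum(total.values())} "
          f"({100 * sum(correct.values()) / sum(total.values()):.1f}%)")

score_file("output/validation/gpt-4o-mini_optimized_results.jsonl")
```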