
Commit 36e3021

bugsz, XuhuiZhou, openhands-agent, and autofix-ci[bot] authored
Add customizable evaluation dimensions (sotopia-lab#256)
* add customizable evaluation dimensions
* add docs
* fix mypy error & refactor examples
* add docs for evaluation dimensions
* update docs and examples
* add test cases and fix mypy issue
* fix mypy issue
* Fix test_create_custom_dimension to use CustomEvaluationDimension.get(pk) (sotopia-lab#262)

  Co-authored-by: openhands <[email protected]>

* Fix/custom eval dimension test (sotopia-lab#263)

  * Fix test_create_custom_dimension to use CustomEvaluationDimension.get(pk)
  * Update documentation for SotopiaDimension and EvaluationDimensionBuilder
  * [autofix.ci] apply automated fixes
  * Add API documentation for evaluation dimensions
  * Refine API documentation for evaluation_dimensions.py to match style
  * [autofix.ci] apply automated fixes

  ---------

  Co-authored-by: openhands <[email protected]>
  Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>

* add doc

---------

Co-authored-by: XuhuiZhou <[email protected]>
Co-authored-by: openhands <[email protected]>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
1 parent ba88f9c commit 36e3021

File tree

7 files changed: +595 -1 lines changed

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
## Overview

Evaluation dimensions are used to evaluate the quality of social interactions.
In the original Sotopia paper, there are 7 dimensions for evaluating the quality of social interactions, which we call the `sotopia` evaluation dimensions:
- believability
- relationship
- knowledge
- secret
- social rules
- financial and material benefits
- goal

The `SotopiaDimensions` class can be used directly without initializing the database. It provides a set of predefined evaluation dimensions that are ready to use for evaluating social interactions. For example:

```python
from sotopia.envs.parallel import ParallelSotopiaEnv
from sotopia.envs.evaluators import EvaluationForTwoAgents, ReachGoalLLMEvaluator, RuleBasedTerminatedEvaluator, SotopiaDimensions

env = ParallelSotopiaEnv(
    env_profile=env_profile,
    model_name=model_names["env"],
    action_order="round-robin",
    evaluators=[
        RuleBasedTerminatedEvaluator(max_turn_number=20, max_stale_turn=2),
    ],
    terminal_evaluators=[
        ReachGoalLLMEvaluator(
            model_names["env"],
            EvaluationForTwoAgents[SotopiaDimensions],  # type: ignore
            # TODO check how to do type annotation
        ),
    ],
)
```

However, we observe that in many use cases people may want to evaluate with customized metrics, so we provide a way to build custom evaluation dimensions.
For a quick reference, you can directly check out `examples/use_custom_dimensions.py`.

### CustomEvaluationDimension
The [`CustomEvaluationDimension`](/python_API/database/evaluation_dimensions) is a class that can be used to create a custom evaluation dimension.
There are four parameters:
- name: the name of the dimension
- description: the description of the dimension
- range_low: the minimum score of the dimension (should be an integer)
- range_high: the maximum score of the dimension (should be an integer)
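
For instance, a new dimension could be defined and saved like the following minimal sketch (the description and score range are placeholders, and the `.save()`/`.pk` calls assume the usual persistence interface of Sotopia's Redis-backed database models):

```python
from sotopia.database import CustomEvaluationDimension

# Placeholder dimension for illustration; the description and range are made up.
transactivity = CustomEvaluationDimension(
    name="transactivity",
    description="The extent to which an agent builds on the other agent's ideas",
    range_low=0,
    range_high=10,
)
transactivity.save()  # assumed: the standard save() of Sotopia's database models
print(transactivity.pk)  # the primary key can later be grouped into a dimension list
```
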
### CustomEvaluationDimensionList
The [`CustomEvaluationDimensionList`](/python_API/database/evaluation_dimensions) is a class that can be used to create a custom evaluation dimension list based on the existing dimensions. It helps one to group multiple dimensions together for a specific use case.
There are two parameters:
- name: the name of the dimension list
- dimension_pks: the primary keys of the dimensions in the dimension list
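
For example, existing dimensions could be grouped by primary key like this (a hedged sketch; the list name and primary keys are placeholders):

```python
from sotopia.database import CustomEvaluationDimensionList

# Placeholder primary keys; in practice these come from saved CustomEvaluationDimension entries.
collaboration_dimensions = CustomEvaluationDimensionList(
    name="collaboration_metrics",
    dimension_pks=["placeholder_pk_1", "placeholder_pk_2"],
)
collaboration_dimensions.save()
```
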
### EvaluationDimensionBuilder
The [`EvaluationDimensionBuilder`](/python_API/database/evaluation_dimensions) is a class that can be used to generate a custom evaluation dimension model based on the existing dimensions.

## Usage
### Initialize the database
The default evaluation metric is still `SotopiaDimensions` in `sotopia.envs.evaluators`. There is no `CustomEvaluationDimension` in the database by default. To initialize the database, please refer to `examples/use_custom_dimensions.py`.
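
As a rough sketch of what such an initialization script does (the dimension names reuse the ones from the examples in this document, the descriptions are placeholders, and `examples/use_custom_dimensions.py` remains the authoritative reference):

```python
from sotopia.database import CustomEvaluationDimension, CustomEvaluationDimensionList

# Save a few placeholder dimensions, then group them under a list name.
dimension_pks = []
for name in ["transactivity", "verbal_equity"]:
    dimension = CustomEvaluationDimension(
        name=name,
        description=f"Placeholder description for {name}",
        range_low=0,
        range_high=10,
    )
    dimension.save()
    dimension_pks.append(dimension.pk)

CustomEvaluationDimensionList(name="custom_example", dimension_pks=dimension_pks).save()
```
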
### Use the custom evaluation dimensions
After you initialize your customized evaluation dimensions, you can use any one of the methods below:

#### Method 1: Choose dimensions by names
```python
evaluation_dimensions = (
    EvaluationDimensionBuilder.select_existing_dimension_model_by_name(
        ["transactivity", "verbal_equity"]
    )
)
```

#### Method 2: Directly choose the grouped evaluation dimension list
```python
evaluation_dimensions = (
    EvaluationDimensionBuilder.select_existing_dimension_model_by_list_name(
        "sotopia"
    )
)
```
#### Method 3: Build a custom evaluation dimension model temporarily
We provide multiple ways to build a custom evaluation dimension model with `EvaluationDimensionBuilder`, specifically:
- `generate_dimension_model`: build an evaluation dimension model from existing dimension primary keys.
- `generate_dimension_model_from_dict`: build an evaluation dimension model from a dictionary that specifies the parameters of the `CustomEvaluationDimension` (see the sketch after this list). For example:
```json
[
    {
        "name": "believability",
        "description": "The believability of the interaction",
        "range_low": 0,
        "range_high": 10
    },
    ...
]
```
- `select_existing_dimension_model_by_name`: build an evaluation dimension model from existing dimension names, for example `['believability', 'goal']`.
- `select_existing_dimension_model_by_list_name`: build an evaluation dimension model from an existing `CustomEvaluationDimensionList` list name, for example `sotopia`.
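
For instance, building a temporary model from the dictionary form might look like this (a hedged sketch: it assumes the `generate_dimension_model_from_dict` method named above accepts the list of dictionaries directly, mirroring the JSON example):

```python
from sotopia.database import EvaluationDimensionBuilder

# Assumed call shape: pass the same list of dictionaries shown above.
temporary_dimensions = EvaluationDimensionBuilder.generate_dimension_model_from_dict(
    [
        {
            "name": "believability",
            "description": "The believability of the interaction",
            "range_low": 0,
            "range_high": 10,
        },
    ]
)
```
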
After you get the evaluation dimension model, you can pass it as a parameter for the `Evaluator`, for example:
```python
evaluation_dimensions = (
    EvaluationDimensionBuilder.select_existing_dimension_model_by_list_name(
        "sotopia"
    )
)
terminal_evaluators=[
    ReachGoalLLMEvaluator(
        model_names["env"],
        EvaluationForTwoAgents[evaluation_dimensions],  # type: ignore
    ),
],
```
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
# `evaluation_dimensions.py`

This module provides classes and utilities for defining and managing custom evaluation dimensions within the Sotopia environment. It includes classes for individual dimensions, lists of dimensions, and a builder for creating dimension models.

## Classes

### `CustomEvaluationDimension`

Represents a custom evaluation dimension with specific attributes such as name, description, and score range.

#### Attributes
- `name`: `str`. The name of the dimension.
- `description`: `str`. A brief description of the dimension.
- `range_low`: `int`. The minimum score for the dimension.
- `range_high`: `int`. The maximum score for the dimension.

### `CustomEvaluationDimensionList`

Groups multiple custom evaluation dimensions together.

#### Attributes
- `name`: `str`. The name of the dimension list.
- `dimension_pks`: `list[str]`. A list of primary keys for the dimensions included in the list.

### `EvaluationDimensionBuilder`

Provides utility methods to create and manage evaluation dimension models.

#### Methods
- `create_range_validator(low: int, high: int)`: Creates a validator for score ranges.

  **Arguments:**
  - `low`: `int`. The minimum score allowed.
  - `high`: `int`. The maximum score allowed.

- `build_dimension_model(dimension_ids: list[str])`: Builds a dimension model from primary keys (see the sketch after this list).

  **Arguments:**
  - `dimension_ids`: `list[str]`. A list of dimension primary keys.

- `build_dimension_model_from_dict(dimensions: list[dict[str, Union[str, int]]])`: Builds a dimension model from a dictionary.

  **Arguments:**
  - `dimensions`: `list[dict[str, Union[str, int]]]`. A list of dictionaries specifying dimension attributes.

- `select_existing_dimension_model_by_name(dimension_names: list[str])`: Selects a dimension model by dimension names.

  **Arguments:**
  - `dimension_names`: `list[str]`. A list of dimension names.

- `select_existing_dimension_model_by_list_name(list_name: str)`: Selects a dimension model by list name.

  **Arguments:**
  - `list_name`: `str`. The name of the dimension list.
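
As a rough illustration of how the builder is typically invoked (a hedged sketch, not the module's own example; the primary keys are placeholders, and the named dimensions are assumed to already exist in the database):

```python
from sotopia.database import EvaluationDimensionBuilder

# Build a model from primary keys of previously saved dimensions (placeholder pks).
dimensions_model = EvaluationDimensionBuilder.build_dimension_model(
    ["placeholder_pk_1", "placeholder_pk_2"]
)

# Or select by dimension names that are assumed to exist in the database.
dimensions_model = EvaluationDimensionBuilder.select_existing_dimension_model_by_name(
    ["believability", "goal"]
)
```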

examples/experiment_eval.py

Lines changed: 16 additions & 1 deletion
@@ -17,6 +17,7 @@
     EnvAgentComboStorage,
     EnvironmentProfile,
     EpisodeLog,
+    EvaluationDimensionBuilder,
 )
 from sotopia.envs.evaluators import (
     EvaluationForTwoAgents,
@@ -34,6 +35,7 @@
 )
 from sotopia.server import run_async_server
 from sotopia_conf.gin_utils import parse_gin_flags, run
+# from sotopia.database import EvaluationDimensionBuilder

 _DEFAULT_GIN_SEARCH_PATHS = [
     os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
@@ -109,6 +111,18 @@ def _iterate_env_agent_combo_not_in_db(
     tag: str | None = None,
 ) -> Generator[EnvAgentCombo[Observation, AgentAction], None, None]:
     """We iterate over each environment and return the **first** env-agent combo that is not in the database."""
+    # loading evaluation metric
+    try:
+        evaluation_dimensions = EvaluationDimensionBuilder.select_existing_dimension_model_by_list_name(
+            "sotopia"
+        )  # Initialize your customized dimension, please refer to `examples/use_custom_dimensions.py`
+    except Exception as e:
+        print(
+            "No customized evaluation dimensions found, using default SotopiaDimensions",
+            e,
+        )
+        evaluation_dimensions = SotopiaDimensions
+
     if not env_ids:
         env_ids = list(EnvironmentProfile.all_pks())
     for env_id in env_ids:
@@ -152,7 +166,8 @@ def _iterate_env_agent_combo_not_in_db(
             terminal_evaluators=[
                 ReachGoalLLMEvaluator(
                     model_names["env"],
-                    EvaluationForTwoAgents[SotopiaDimensions],
+                    EvaluationForTwoAgents[evaluation_dimensions],  # type: ignore
+                    # TODO check how to do type annotation
                 ),
             ],
         )
