This repository was archived by the owner on May 1, 2025. It is now read-only.
BotSIM is a Bot SIMulation toolkit for performing large-scale data-efficient end-to-end evaluation, diagnosis and remediation of commercial task-oriented dialog (TOD) systems to accelerate bot development and evaluation, reduce cost and time-to-market.
39
-
As a modular framework, BotSIM can be extended by bot developers to support new bot platforms. As a toolkit, it offers an easy-to-use App and a suite of command line tools for bot admins or practitioners to readily perform evaluation and remediation of their bots.
38
+
BotSIM is a Bot SIMulation toolkit for performing large-scale data-efficient end-to-end evaluation, diagnosis and remediation of commercial task-oriented dialog (TOD) systems, a.k.a. "Chatbots".
39
+
As a modular framework, BotSIM can be extended by bot developers to support new bot platforms. As a toolkit, it offers an easy-to-use App and a suite of command line tools for bot admins or practitioners to readily perform evaluation and remediation of their bots at scale. Consequently, BotSIM can accelerate bot development and evaluation and reduce both cost and time-to-market.
40
40
41
41
Key features of BotSIM include:
42
42
@@ -57,15 +57,15 @@ Key features of BotSIM include:
The most straightforward way of getting started with BotSIM is the Streamlit Web App. The app is developed as a multi-page app to guide users to leverage BotSIM's "generation-simulation-remediation" pipeline for evaluation, diagnosis and remediation of their bots.
66
+
The most straightforward way of getting started with BotSIM is the Streamlit Web App. The multi-page App guides users through BotSIM's "generation-simulation-remediation" pipeline for evaluation, diagnosis and remediation of their bots.
The following commands can be used to start the Streamlit Web App locally:
@@ -82,22 +82,22 @@ The App can also be deployed as a docker image:
82
82
docker run -p 8501:8501 botsim-streamlit
83
83
```
84
84
### Command Line Tools
85
-
Alternatively, users can also use the command line tools to deep-dive into BotSIM's generation-simulation-remediation pipeline.
85
+
Alternatively, users can deep-dive into BotSIM's system components through the command line tools. Details are given in the [tutorial section](https://opensource.salesforce.com/botsim//latest/tutorials.html#botsim-command-line-tools) of the [code documentation](https://opensource.salesforce.com/botsim//latest/tutorials.html).
86
86
87
87
## Tutorial
88
-
We provide the following tutorials in the [tutorial section](https:///latest/tutorials.html) of the [code documentation]().
89
-
-[Streamlit Web App](https://latest/tutorials.html#streamlit-web-app)
90
-
-[BotSIM command line tools](https://latest/tutorials.html#botsim-command-line-tools)
91
-
-[Bot health dashboard navigation](https://atest/dashboard.html)
For more details of the system components and advanced usages, please refer to [code documentation]((https://opensource.salesforce.com/botsim//latest/index.html#)]).
95
+
For more details of the system components and advanced usages, please refer to the [code documentation](https://opensource.salesforce.com/botsim//latest/index.html#).
96
96
We welcome contributions from the open-source community to improve the toolkit! To support new bot platforms, please also follow the guidelines detailed in the code documentation.
97
97
98
98
## System Demo Paper and Technical Report
99
-
You can find more details in our technical report and system demo paper.
100
-
If you're using BotSIM in your research or applications, please cite using this BibTeX for technical report:
99
+
You can find more details of system designs in our technical report. Detailed system descriptions are given in our EMNLP 2022 system demo paper.
100
+
If you're using BotSIM in your research or applications, please cite using this BibTeX for the technical report:
101
101
```
102
102
@article{guangsen2022-botsim-tr,
103
103
author = {Guangsen Wang and Junnan Li and Shafiq Joty and Steven Hoi},
@@ -108,7 +108,7 @@ If you're using BotSIM in your research or applications, please cite using this
108
108
archivePrefix = {arXiv},
109
109
}
110
110
```
111
-
or the following BibTex for our system demo paper:
111
+
or the following BibTeX for the system demo paper:
112
112
```
113
113
@article{guangsen2022-botsim-demo,
114
114
author = {Guangsen Wang and Samson Tan and Shafiq Joty and Guang Wu and Jimmy Au and Steven Hoi},
The parser interface is defined in generator.parser and has the following important functions to implement.
9
-
As these functions are highly platform dependent, the implementation might be non-trivial and require access to bot design documentations from the bot platform provider.
9
+
As these functions are highly platform dependent, the implementation might be non-trivial and require access to bot design documentation from the bot platform provider.
10
10
We provide our initial parser implementations for the Einstein BotBuilder (``platform.botbuilder``) and Google DialogFlow CX (``platform.dialogflow_cx``) platforms.
11
11
The utility functions supporting the parsers are under ``modules.generator.utils.<platform-name>/parser_utilities.py``
12
12
13
-
1. ``extract_local_dialog_act_map`` function generates a “local” dialog act map by ignoring incoming and outputting transitions. In other words, the local map only considers the messages/actions explicitly defined within the dialog. These local dialog act maps are modelled as graph nodes during the subsequent conversation graph modelling. In particular, the messages for the two special dialog acts, namely "intent_success_message"and "dialog_success_message" are also generated here according to the following heuristics: "intent_success_message" contains the first request message and all its previous normal messages "dialog_success_message" contains the last messages.
14
-
2. ``conversation_graph_modelling`` models the entire bot design as a graph. Each individual dialog is represented by its local dialog act maps and modelled as the graph nodes. Transitions among the individual dialogs are modelled as the graph edges. The graph modelling is based on the networkx package. There are two outputs from the function: the final dialog act maps and the graph data for conversation path visualisation.
13
+
1. ``extract_local_dialog_act_map`` function generates a “local” dialog act map by ignoring incoming and outgoing transitions. In other words, the local map only considers the messages/actions explicitly defined within the dialog. These local dialog act maps are modelled as graph nodes during the subsequent conversation graph modelling. The messages for the two special dialog acts, "intent_success_message" and "dialog_success_message", are also generated here according to the following heuristics: "intent_success_message" contains the first request message and all the normal messages preceding it; "dialog_success_message" contains the last messages.
14
+
2. ``conversation_graph_modelling`` models the entire bot design as a graph. Each individual dialog is represented by its local dialog act maps and modelled as the graph nodes. Transitions among the individual dialogs are modelled as the graph edges. The graph modelling is based on the ``networkx`` package. There are two outputs from the function: the final dialog act maps and the graph data for conversation path visualisation.
15
15
3. ``parse`` function defines a general parser pipeline for all platforms starting from parsed local dialog act maps.
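The graph-modelling step described above can be sketched as follows. This is only an illustration and not BotSIM's implementation: the toolkit builds its graph with ``networkx``, whereas the sketch uses a plain adjacency dictionary to stay dependency-free. The dialog names, the ``transitions`` key and both helper functions are hypothetical.

```python
# Hypothetical local dialog act maps: each dialog lists the dialogs it can
# transition to. Real maps would also carry the dialog's messages and actions.
local_maps = {
    "Welcome": {"transitions": ["Main Menu"]},
    "Main Menu": {"transitions": ["Report an Issue", "Check Order Status"]},
    "Report an Issue": {"transitions": ["End Chat"]},
    "Check Order Status": {"transitions": ["End Chat"]},
    "End Chat": {"transitions": []},
}


def build_graph(local_maps):
    """Model dialogs as nodes and transitions between dialogs as directed edges."""
    graph = {name: [] for name in local_maps}
    for name, info in local_maps.items():
        for target in info["transitions"]:
            graph.setdefault(target, [])  # keep targets that have no local map
            graph[name].append(target)
    return graph


def simple_paths(graph, source, target, _path=None):
    """Yield every cycle-free conversation path from source to target."""
    path = (_path or []) + [source]
    if source == target:
        yield path
        return
    for nxt in graph.get(source, []):
        if nxt not in path:  # never revisit a dialog within one path
            yield from simple_paths(graph, nxt, target, path)


graph = build_graph(local_maps)
paths = sorted(simple_paths(graph, "Welcome", "End Chat"))
for path in paths:
    print(" -> ".join(path))
```

Enumerating the paths between a source and a target dialog is the kind of query a conversation path visualisation would need. With ``networkx``, ``all_simple_paths`` on a ``DiGraph`` plays the role of the hand-written ``simple_paths`` helper.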
The bot health dashboard consists of a set of multi-level performance reports. At the highest level, users can have a historical view of most recent simulation/test sessions (e.g., after each major bot update). The historical performance comparison can help users evaluate the impacts of bot changes quantitatively, from which they can make decisions like whether or not keep certain changes.
8
-
In the session-specific performance summary, users can zoom in for more details of a selected test session including the data distribution, overall dialog performance metrics. Furthermore, one can select a dialog/intent of the specific testing session to investigate the detailed intent and NER performance in the dialog-specific performance summary. Through the dialog-specific performance report, one can quickly identify the most confusing intents and entities. This saves significant efforts and helps better allocation of resources for troubleshooting and bot improvement.
7
+
The bot health dashboard consists of a set of multi-level performance reports. At the highest level,
8
+
users can have a historical view of the most recent simulation/test sessions (e.g., after each major bot update).
9
+
The historical performance comparison can help users evaluate the impacts of bot changes quantitatively,
10
+
from which they can make decisions like whether or not to keep certain changes.
11
+
In the session-specific performance summary, users can zoom in for more details of a selected test session
12
+
including the data distribution and overall dialog performance metrics. Furthermore, one can select a dialog/intent of
13
+
the specific testing session to investigate the detailed intent and NER performance in the dialog-specific performance summary.
14
+
Through the dialog-specific performance report, one can quickly identify the most confusing intents and entities.
15
+
This saves significant effort and helps allocate resources better for troubleshooting and bot improvement.
9
16
10
17
.. image:: _static/BotSIM_Performance_Report.png
11
18
:width: 550
@@ -16,8 +23,8 @@ In addition to the diagnosis reports, the remediator also provides actionable in
16
23
The remediation dashboards given below allow detailed investigation of all intent or NER errors along with their corresponding simulated chat logs.
17
24
The root causes of the failed conversations are identified via backtracking of the simulation agenda.
18
25
For troubleshooting intent models, the remediator attempts to identify the intent utterances and paraphrases that are wrongly predicted by the current model. Depending on the wrongly classified intent classes, the remediator suggests follow-up actions including 1) augmenting the intent training set with the queries deemed to be out-of-domain by the current intent model, and 2) moving the intent utterance to another intent if most of the paraphrases of the former intent utterance are classified as the latter intent.
19
-
Similarly for NER model, the remediator collects all the wrongly extracted entities and the messages with such entities. Depending on the entity extraction method, users can follow the suggestions to troubleshooting or improving the bot NER capabilities.
20
-
Note the suggestions are meant to be used as guidelines rather than strictly followed. More importantly, they can always be extended by users to include domain expertise in troubleshooting bots related to their products/services.
26
+
Similarly, for the NER model, the remediator collects all the wrongly extracted entities and the messages containing such entities. Depending on the entity extraction method, users can follow the suggestions to troubleshoot or improve the bot's NER capabilities.
27
+
Note that the suggestions are meant to be used as guidelines rather than followed strictly. More importantly, users can always extend them to include domain expertise in troubleshooting bots related to their products/services.
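The two intent-model suggestions can be illustrated with a small decision rule. This is a sketch under stated assumptions rather than the remediator's actual logic: the function name, the ``out_of_domain`` label and the reading of "most" as a strict majority (more than 50% of the paraphrases) are all hypothetical.

```python
from collections import Counter


def suggest_action(true_intent, predicted_intents,
                   ood_label="out_of_domain", majority=0.5):
    """Suggest a follow-up action for one intent utterance.

    predicted_intents holds the intent predicted for each paraphrase of
    the utterance. The threshold and label names are assumptions.
    """
    wrong = [p for p in predicted_intents if p != true_intent]
    if not wrong:
        return "no action"
    # Most frequent wrong prediction among the paraphrases.
    label, count = Counter(wrong).most_common(1)[0]
    if label != ood_label and count / len(predicted_intents) > majority:
        return f"move utterance to '{label}'"
    if ood_label in wrong:
        return f"augment '{true_intent}' training set with out-of-domain paraphrases"
    return "review manually"


print(suggest_action("Report an Issue",
                     ["Check Order Status"] * 3 + ["Report an Issue"]))
print(suggest_action("Report an Issue",
                     ["Report an Issue", "out_of_domain", "Report an Issue"]))
```

Keeping the rule as a plain function makes it easy for users to extend it with their own domain-specific checks, in line with treating the suggestions as guidelines.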
Another useful component of the Remediator is the suite of conversation analytical tools. They further help bot practitioners gain more insights for troubleshooting and
31
38
improving their dialog systems. The confusion matrix analysis breaks down the intent model performance into (sortable) recall, precision and F1 scores to help identify the
32
39
worst-performing intents. Another useful analytical tool is the tSNE clustering of the intent utterances using sentence transformer embeddings. The tSNE visualisation enables users
33
-
to gauge the training data quality. It is also an effective tool in identifying overlapping intents and can potential benefit new intent discovery as well.
40
+
to gauge the training data quality. It is also an effective tool in identifying overlapping intents and can potentially benefit new intent discovery as well.
34
41
Lastly, powered by parsers' conversation graph modelling capability, the dialog path explorer can be used to visualise different conversation flows of the current bot design.
35
42
For example, users can select the source and target dialogs and investigate the generated dialog paths. Not only is the tool valuable for comprehensive testing coverage of conversation paths,
36
43
it also offers a controllable approach to troubleshooting dialog design related errors or even improving the current design.
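The confusion matrix breakdown mentioned above can be sketched in a few lines. The function name, the intent labels and the counts are made up for illustration; only the precision, recall and F1 definitions are standard.

```python
def per_intent_scores(labels, confusion):
    """Compute precision, recall and F1 per intent.

    confusion[i][j] is the number of utterances whose true intent is
    labels[i] and whose predicted intent is labels[j].
    """
    scores = {}
    for i, label in enumerate(labels):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                  # true intent i, predicted otherwise
        fp = sum(row[i] for row in confusion) - tp   # predicted i, true intent otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


labels = ["Report an Issue", "Check Order Status"]
confusion = [
    [8, 2],  # true "Report an Issue"
    [1, 9],  # true "Check Order Status"
]
scores = per_intent_scores(labels, confusion)

# Sort ascending by F1 so the weakest intents come first.
ranked = sorted(scores, key=lambda name: scores[name]["f1"])
for name in ranked:
    print(name, round(scores[name]["f1"], 3))
```

Sorting by any of the three metrics surfaces the intents that most need attention, which mirrors the sortable view described in the text.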
@@ -44,16 +51,16 @@ Apply intent model remediation suggestions
44
51
The most straightforward approach of applying remediation suggestions is to add the recommended misclassified paraphrases to the original
45
52
training set to retrain the intent model.
46
53
47
-
For Einstein BotBuilder platform, new intent sets can be created as a csv file to include the augmented training set. The csv file can be deployed
54
+
For the Einstein BotBuilder platform, new intent sets can be created as a csv file to include the augmented training set. The csv file can be deployed
48
55
to users' org via `Salesforce Workbench <https://workbench.developerforce.com/login.php>`_. The new intent model can be retrained by associating the
49
56
new intent set name ``report_issue_dev_augmented`` with the ``Report an Issue`` intent.
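Building the augmented training set amounts to merging the original utterances with the recommended paraphrases and writing the result to a csv file. The sketch below shows one way to do this. The helper names and the two-column layout (``intent_set``, ``utterance``) are hypothetical and do not reflect the exact schema expected by Einstein BotBuilder.

```python
import csv
import io


def augment_intent_set(original, recommended):
    """Merge utterances, dropping blanks and case-insensitive duplicates."""
    seen = set()
    merged = []
    for utterance in list(original) + list(recommended):
        cleaned = utterance.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            merged.append(cleaned)
    return merged


def to_csv(intent_set_name, utterances):
    """Serialise the intent set as csv text (hypothetical column layout)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["intent_set", "utterance"])
    for utterance in utterances:
        writer.writerow([intent_set_name, utterance])
    return buffer.getvalue()


merged = augment_intent_set(
    ["I want to report a problem"],
    ["i want to report a problem ", "Something is broken"],
)
csv_text = to_csv("report_issue_dev_augmented", merged)
print(csv_text)
```

De-duplicating before writing avoids inflating the training set with paraphrases that only differ in case or surrounding whitespace.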
50
57
51
-
.. csv-table:: Snippet of augmented intent set csv file for Einstein BotBuilder Platform
58
+
.. csv-table:: Snippet of augmented intent set csv file for the Einstein BotBuilder Platform
52
59
:file: augmented.csv
53
60
:widths: 5,5,90
54
61
:header-rows: 1
55
62
56
-
For DialogFlow CX, the recommanded paraphrases can be add back to the corresponding training set and the intent model will be automatically retrained.
63
+
For DialogFlow CX, the recommended paraphrases can be added back to the corresponding training set and the intent model will be automatically retrained.
57
64
58
65
The table below shows the intent F1 score comparison before and after intent model retraining based on the simulation goals created from the same evaluation set.