Commit e0695a5

Merge pull request #44 from georgian-io/ai21/jurassic
Ai21/jurassic
2 parents 3d31322 + 9f63c51 commit e0695a5

6 files changed: +522 −0 lines changed

jurassic/README.md

Lines changed: 185 additions & 0 deletions
@@ -0,0 +1,185 @@
# Contents:

- [Contents:](#contents)
- [What is Jurassic?](#what-is-jurassic)
- [Variations of Jurassic and Parameters](#variations-of-jurassic-and-parameters)
- [What does this folder contain?](#what-does-this-folder-contain)
- [Evaluation Framework](#evaluation-framework)
  - [ Performance ](#-performance-)
    - [Classification](#classification)
    - [Summarization](#summarization)
  - [ Time \& Cost to Train ](#--time--cost-to-train--)
  - [ Inference ](#-inference-)

## What is Jurassic?

Jurassic-2, created by [AI21 Labs](https://www.ai21.com/), is a family of large language models designed to give maximum flexibility for building AI-first reading and writing experiences. The Jurassic line of models is available to developers via the [AI21 Studio](https://www.ai21.com/studio).

## Variations of Jurassic and Parameters

Jurassic models are available via API in a range of sizes, so you can pick the variant that best fits the task at hand:

* Jurassic-2 Ultra: Unmatched quality
* Jurassic-2 Mid: Optimal balance of quality, speed, and cost
* Jurassic-2 Light: Fast and cost-effective

More information about each model can be found on their documentation [website](https://docs.ai21.com/docs/jurassic-2-models).

In parallel, AI21 offers a set of three models that we believe correspond to the three above, but under different names:

| Jurassic variation | Parameters |
|:------------------:|:----------:|
| Jurassic-2 Light   | 7B         |
| Jurassic-2 Grande  | 17B        |
| Jurassic-2 Jumbo   | 178B       |

There are also other, more specialized instruction-tuned models that can be found on their website. In this repository, we have experimented with all variations of Jurassic-2 models, both out-of-the-box and for fine-tuning purposes.

## What does this folder contain?

This folder contains ready-to-use scripts with which you can do the following:

* Prompts used:
  * ```prompts.py```: Zero-shot, few-shot and instruction-tuning prompts for classification and summarization
* Infer Jurassic-2 models:
  * ```jurassic_baseline_inference.py```: Zero-shot, few-shot and fine-tuned versions for the summarization and classification tasks
  * ```baseline_inference.sh```: Shell script to loop over all Jurassic-2 models for out-of-the-box summarization and classification
  * ```custom_model_inference.sh```: Shell script to loop over the fine-tuned Jurassic-2 models

Note: To learn how to fine-tune the Jurassic-2 line of models, please follow the simple and easy-to-understand [tutorial](https://docs.ai21.com/docs/custom-models) on AI21's website.
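
For reference, the scripts above ultimately call AI21's completion API. Below is a minimal sketch of such a call, assuming the v1-style `ai21` Python SDK (`pip install ai21`) and an illustrative prompt; the exact client interface depends on your SDK version, so consult AI21's docs.

```python
import ai21

ai21.api_key = "YOUR_AI21_API_KEY"  # assumption: key set via the module-level attribute

# Zero-shot summarization call (prompt text is illustrative, not taken from this repo)
response = ai21.Completion.execute(
    model="j2-ultra",  # one of: j2-light, j2-mid, j2-ultra
    prompt="Summarize the following dialogue:\n<dialogue here>\nSummary:",
    maxTokens=100,
    temperature=0,
)
print(response["completions"][0]["data"]["text"])  # layout per AI21's v1 response schema
```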

## Evaluation Framework

In this section, we bring you our insights from extensively experimenting with the Jurassic-2 family of models across different tasks. For a thorough evaluation, we need to evaluate the __four pillars__:

* Performance
* Cost to Train
* Time to Train
* Inference Costs

### <img src="../assets/rocket.gif" width="32" height="32"/> Performance <img src="../assets/rocket.gif" width="32" height="32"/>

We evaluated the Jurassic-2 models under the following conditions:

* Tasks & Datasets:
  * Summarization: the SAMSum dataset.
* Experiments:
  * Zero-shot prompting vs. few-shot prompting vs. fine-tuning
* Hardware:
  * No requirements, as all the foundational and custom models are hosted by AI21 Labs.

#### Classification ####

We track accuracy to evaluate each model's performance on the classification task.
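
Accuracy here can be read as the percentage of test examples whose predicted label matches the gold label. A minimal sketch, assuming a simple normalized exact-match comparison (the repo's actual scoring may differ):

```python
def accuracy(predictions: list[str], labels: list[str]) -> float:
    """Percentage of predictions that exactly match the gold labels after normalization."""
    matches = sum(
        pred.strip().lower() == gold.strip().lower()
        for pred, gold in zip(predictions, labels)
    )
    return 100.0 * matches / len(labels)
```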

<u> Table 1: Zero-shot prompting </u>

| Model    | Accuracy (in %) |
|:--------:|:---------------:|
| J2-Light | 1.82            |
| J2-Mid   | 22.93           |
| J2-Ultra | 43.62           |

<u> Insight: </u>

* Jurassic-2 models show consistent improvement in the zero-shot prompting setting as model size increases.
* J2-Ultra achieves the highest accuracy, followed by J2-Mid and J2-Light.
* In our opinion, these numbers are very high considering these models are used out-of-the-box.

#### Summarization ####

We track the ROUGE-1 and ROUGE-2 metrics to evaluate each model's performance on the summarization task.
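
As an illustration, ROUGE can be computed with the Hugging Face `evaluate` library; this is an assumption for the sketch below, not necessarily what this repo uses, and the example strings are invented.

```python
import evaluate  # pip install evaluate rouge_score

rouge = evaluate.load("rouge")

# Illustrative inputs; in practice these would be model outputs and SAMSum references.
predictions = ["Amanda baked cookies and will bring some to Jerry tomorrow."]
references = ["Amanda baked cookies and will bring Jerry some tomorrow."]

scores = rouge.compute(predictions=predictions, references=references)
# Recent versions return scalar F1 scores in [0, 1]; multiply by 100 for percentages.
print(scores["rouge1"], scores["rouge2"])
```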

<u> Table 2: Zero-shot prompting </u>

| Model    | ROUGE-1 (in %) | ROUGE-2 (in %) |
|:--------:|:--------------:|:--------------:|
| J2-Light | 38.218         | 14.780         |
| J2-Mid   | 39.119         | 15.591         |
| J2-Ultra | 41.635         | 17.273         |

<u> Table 3: Few-shot prompting </u>

| Model    | ROUGE-1 (in %) | ROUGE-2 (in %) |
|:--------:|:--------------:|:--------------:|
| J2-Light | 40.736         | 17.092         |
| J2-Mid   | 43.390         | 18.346         |
| J2-Ultra | 45.317         | 19.276         |

<u> Table 4: Fine-tuning (custom models) </u>

| Model     | ROUGE-1 (in %) | ROUGE-2 (in %) |
|:---------:|:--------------:|:--------------:|
| J2-Light  | 44.694         | 20.153         |
| J2-Grande | 48.385         | 23.901         |
| J2-Jumbo  | DNF\*          | DNF\*          |

Note: All models were fine-tuned for 1 epoch.

<u> Insight: </u>

* For out-of-the-box performance, Jurassic-2's biggest available model, J2-Ultra, achieves the best ROUGE-1 and ROUGE-2 scores compared to the other models.
* Jurassic-2 model performance steadily increases from Light to Mid to Ultra, across both the zero-shot and few-shot settings.
* Few-shot prompting consistently performs better than zero-shot prompting across all model variations.
* As expected, the fine-tuned (custom) J2-Light and J2-Grande models achieve better performance than their out-of-the-box counterparts, i.e., zero-shot and few-shot.
* Despite debugging efforts, the custom version of J2-Jumbo (the biggest model) did not generate anything (DNF) for the exact same inputs.

### <img src="../assets/time.gif" width="32" height="32"/> <img src="../assets/money.gif" width="32" height="32"/> Time & Cost to Train <img src="../assets/money.gif" width="32" height="32"/> <img src="../assets/time.gif" width="32" height="32"/>

* Fine-tuning Jurassic-2 custom models for 1 epoch took between 15 minutes and 1 hour, depending on the size of the model in consideration.

Following is the cost breakdown for fine-tuning the different Jurassic-2 models, as charged by AI21:

| Model     | MB / Epoch | Cost   |
|:---------:|:----------:|:------:|
| J2-Light  | 8          | $0.88  |
| J2-Grande | 8          | $4.00  |
| J2-Jumbo  | 8          | $26.47 |

<u> Insights: </u>

* As noted above, the fine-tuned J2-Jumbo (the biggest model) did not generate anything for the exact same inputs.
* MB / Epoch is the size of the training data used per epoch. In the case of summarization, SAMSum's training split is roughly 8 MB in size, so the costs above work out to roughly $0.11 (J2-Light), $0.50 (J2-Grande), and $3.31 (J2-Jumbo) per MB of training data per epoch.

### <img src="../assets/progress.gif" width="32" height="32"/> Inference <img src="../assets/progress.gif" width="32" height="32"/>

Below are tables of the API's mean response time (MRT), i.e., how long, on average, the API took to return a response.
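
For context, here is a hedged sketch of how such latencies could be measured around the SDK call shown earlier; the measurement loop and the use of the sample standard deviation for the ± values are our illustration, not necessarily the repo's method.

```python
import statistics
import time

import ai21  # assumes ai21.api_key has been set as in the earlier sketch


def mean_response_time(prompts: list[str], model: str) -> str:
    """Time each completion call and report mean ± standard deviation in seconds."""
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        ai21.Completion.execute(model=model, prompt=prompt, maxTokens=100)
        latencies.append(time.perf_counter() - start)
    return f"{statistics.mean(latencies):.2f} ± {statistics.stdev(latencies):.2f} sec"
```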

#### Classification ####

<u> Table 5: Zero-shot prompting </u>

| Model    | Mean Response Time (MRT) (in sec) |
|:--------:|:---------------------------------:|
| J2-Light | 0.58 ± 0.19                       |
| J2-Mid   | 0.53 ± 0.27                       |
| J2-Ultra | 0.66 ± 0.45                       |

#### Summarization ####

<u> Table 6: Zero-shot prompting vs. few-shot prompting vs. fine-tuning </u>

| Model    | Zero-Shot MRT | Few-Shot MRT | Fine-tune MRT |
|:--------:|:-------------:|:------------:|:-------------:|
| J2-Light | 0.63 ± 0.18   | 0.56 ± 0.17  | 0.48 ± 0.08   |
| J2-Mid   | 1.23 ± 6.45   | 0.83 ± 0.31  | 1.76 ± 1.01   |
| J2-Ultra | 1.00 ± 0.44   | 0.92 ± 0.51  | DNF\*         |

<u> Insights: </u>

* The MRT for few-shot summarization is lower across all three models than the zero-shot MRT. This could be because the models generate longer summaries in the zero-shot setting, as they have no sense of how long the summaries should be. Since the few-shot setting provides examples from the SAMSum training split, these examples guide the models to generate shorter summaries, similar in length to their ground-truth counterparts.
* At this time, conclusive insights cannot be drawn from the fine-tuning MRTs.
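
To make the few-shot mechanism above concrete, here is a hypothetical prompt builder in the spirit of ```prompts.py```; the function name and template wording are our illustration, not the repo's actual code.

```python
def build_few_shot_prompt(examples: list[tuple[str, str]], dialogue: str) -> str:
    """Prepend (dialogue, summary) pairs from the training split before the query.

    The in-context examples implicitly show the model how long a summary should be.
    """
    parts = []
    for example_dialogue, example_summary in examples:
        parts.append(f"Dialogue:\n{example_dialogue}\nSummary: {example_summary}\n")
    parts.append(f"Dialogue:\n{dialogue}\nSummary:")
    return "\n".join(parts)
```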

jurassic/baseline_inference.sh

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
# Out-of-the-box (zero-shot) summarization across all three Jurassic-2 API models.
# Each command runs in the background and is immediately waited on, so the jobs execute sequentially.
python jurassic_baseline_inference.py --model_type j2-light --task_type summarization --prompt_type zero-shot & wait
python jurassic_baseline_inference.py --model_type j2-mid --task_type summarization --prompt_type zero-shot & wait
python jurassic_baseline_inference.py --model_type j2-ultra --task_type summarization --prompt_type zero-shot & wait

# Few-shot summarization. (The original script duplicated the zero-shot block here;
# few-shot is assumed, matching the experiments reported in the README.)
python jurassic_baseline_inference.py --model_type j2-light --task_type summarization --prompt_type few-shot & wait
python jurassic_baseline_inference.py --model_type j2-mid --task_type summarization --prompt_type few-shot & wait
python jurassic_baseline_inference.py --model_type j2-ultra --task_type summarization --prompt_type few-shot & wait

# Zero-shot classification.
python jurassic_baseline_inference.py --model_type j2-light --task_type classification --prompt_type zero-shot & wait
python jurassic_baseline_inference.py --model_type j2-mid --task_type classification --prompt_type zero-shot & wait
python jurassic_baseline_inference.py --model_type j2-ultra --task_type classification --prompt_type zero-shot

jurassic/custom_model_inference.sh

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
# Run inference with the fine-tuned (custom) Jurassic-2 models, one at a time.
python jurassic_baseline_inference.py --prompt_type fine-tuned --task_type summarization --model_type j2-large --custom_model j2-light-samsum & wait
python jurassic_baseline_inference.py --prompt_type fine-tuned --task_type summarization --model_type j2-grande --custom_model j2-mid-samsum & wait
python jurassic_baseline_inference.py --prompt_type fine-tuned --task_type summarization --model_type j2-jumbo --custom_model j2-ultra-samsum
