# Contents:

- [Contents:](#contents)
  - [What is Jurassic?](#what-is-jurassic)
  - [Variations of Jurassic and Parameters](#variations-of-jurassic-and-parameters)
  - [What does this folder contain?](#what-does-this-folder-contain)
  - [Evaluation Framework](#evaluation-framework)
  - [Performance](#-performance-)
    - [Classification](#classification)
    - [Summarization](#summarization)
  - [Time \& Cost to Train](#--time--cost-to-train--)
  - [Inference](#-inference-)

## What is Jurassic?

Jurassic-2, created by [AI21 Labs](https://www.ai21.com/), is a family of large language models that gives developers maximum flexibility in building AI-first reading and writing experiences. The Jurassic line of models is available via [AI21 Studio](https://www.ai21.com/studio).

## Variations of Jurassic and Parameters

Jurassic models are available through the API in a variety of sizes and can be chosen based on the task at hand:

* Jurassic-2 Ultra: Unmatched quality
* Jurassic-2 Mid: Optimal balance of quality, speed, and cost
* Jurassic-2 Light: Fast and cost-effective

More information about each model can be found on the documentation [website](https://docs.ai21.com/docs/jurassic-2-models).

In parallel, there is a set of three models that appear to correspond to the three above, but under different names:

| Jurassic variation | Parameters |
|:------------------:|:----------:|
| Jurassic-2 Light   | 7B         |
| Jurassic-2 Grande  | 17B        |
| Jurassic-2 Jumbo   | 178B       |

There are also other, more specialized instruction-tuned models, which can be found on their website. In this repository, we have experimented with all variations of Jurassic-2 models for out-of-the-box tasks as well as for fine-tuning purposes.

## What does this folder contain?

This folder contains ready-to-use scripts with which you can do the following (a prompt-template sketch follows the list):

* Prompts used:
  * ```prompts.py```: Zero-shot, few-shot, and instruction-tuning prompts for classification and summarization
* Infer Jurassic-2 models:
  * ```jurassic_baseline_inference.py```: Zero-shot, few-shot, and fine-tuned versions for the summarization task
  * ```baseline_inference.sh```: Shell script to loop over all Jurassic-2 models for out-of-the-box summarization
  * ```custom_inference.sh```: Shell script to loop over fine-tuned Jurassic-2 models
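
Below is a minimal, hypothetical sketch of what the zero-shot and few-shot summarization prompts can look like; the actual templates live in ```prompts.py``` and may differ.

```python
# Illustrative prompt templates (hypothetical; see prompts.py for the real ones).
ZERO_SHOT_TEMPLATE = "Summarize the following dialogue:\n\n{dialogue}\n\nSummary:"

def build_few_shot_prompt(dialogue, examples):
    """Prepend (dialogue, summary) pairs from the training split as in-context shots."""
    shots = "".join(f"Dialogue:\n{d}\n\nSummary: {s}\n\n" for d, s in examples)
    return f"{shots}Dialogue:\n{dialogue}\n\nSummary:"
```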


Note: To learn how to fine-tune the Jurassic-2 line of models, please follow the simple and easy-to-understand [tutorial](https://docs.ai21.com/docs/custom-models) on AI21's website.
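
Below is a minimal sketch of calling a Jurassic-2 model through AI21's REST completion endpoint, similar in spirit to what ```jurassic_baseline_inference.py``` does. The endpoint path, field names (e.g., ```maxTokens```), and response shape follow AI21's Studio docs at the time of writing; verify them against the current documentation before relying on this.

```python
# Minimal sketch: Jurassic-2 completion via AI21's REST API.
# Assumes the AI21_API_KEY environment variable holds a valid Studio key.
import os
import requests

API_KEY = os.environ["AI21_API_KEY"]

def complete(model: str, prompt: str, max_tokens: int = 100) -> str:
    """Send a completion request to an AI21 Studio Jurassic-2 model."""
    resp = requests.post(
        f"https://api.ai21.com/studio/v1/{model}/complete",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"prompt": prompt, "maxTokens": max_tokens, "temperature": 0},
    )
    resp.raise_for_status()
    return resp.json()["completions"][0]["data"]["text"]

print(complete("j2-ultra", "Summarize the following dialogue:\n..."))
```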

## Evaluation Framework

In this section, we present our insights from extensively experimenting with the Jurassic-2 models across different tasks. For a thorough evaluation, we need to evaluate the __four pillars__:

* Performance
* Cost to Train
* Time to Train
* Inference Costs


### <img src="../assets/rocket.gif" width="32" height="32"/> Performance <img src="../assets/rocket.gif" width="32" height="32"/>

We evaluated the Jurassic-2 models under the following conditions (a dataset-loading sketch follows the list):

* Tasks & Datasets:
  * Summarization: the Samsum dataset.
* Experiments:
  * Zero-shot prompting vs. few-shot prompting vs. fine-tuning
* Hardware:
  * No requirements, as all the foundational and custom models are hosted by AI21 Labs.
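
As a reference for the dataset setup, here is a minimal sketch of loading Samsum from the Hugging Face Hub. This assumes ```pip install datasets py7zr``` and the Hub dataset id ```samsum```; the repo's scripts may load the data differently.

```python
# Load the Samsum dialogue-summarization dataset (Hugging Face Hub id "samsum").
from datasets import load_dataset

samsum = load_dataset("samsum")
example = samsum["train"][0]
print(example["dialogue"])  # the chat transcript to be summarized
print(example["summary"])   # the reference (ground-truth) summary
```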

#### Classification ####

We track accuracy to evaluate the models' performance on the classification task; a scoring sketch is below.
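
As a rough illustration (not the exact evaluation code), accuracy here can be computed by normalizing the generated label and comparing it with the gold label:

```python
# Illustrative accuracy computation: exact match after simple normalization.
def accuracy(predictions, references):
    """Return accuracy (in %) of generated labels against gold labels."""
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)
```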

<u> Table 1: Zero-Shot prompting </u>

|Model | Accuracy (in %) |
|:-------------:|:---------------:|
|J2-Light | 1.82 |
|J2-Mid | 22.93 |
|J2-Ultra | 43.62 |


<u> Insight: </u>

* Jurassic-2 models show consistent improvement in the zero-shot prompting setting as the model size increases.
* J2-Ultra achieves the highest accuracy, followed by J2-Mid and J2-Light.
* In our opinion, the numbers for the larger models are quite high considering these models are used out-of-the-box.

#### Summarization ####

We track ROUGE-1 and ROUGE-2 to evaluate the models' performance on the summarization task; a scoring sketch is below.
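
For reference, ROUGE-1 and ROUGE-2 can be computed with Google's ```rouge-score``` package (```pip install rouge-score```); the repo's scripts may use a different ROUGE implementation, so exact numbers are implementation-dependent.

```python
# Compute ROUGE-1/ROUGE-2 F1 between a reference summary and a model output.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
scores = scorer.score(
    "Amanda baked cookies and will bring Jerry some tomorrow.",     # reference
    "Amanda baked cookies and will bring some to Jerry tomorrow.",  # prediction
)
print(scores["rouge1"].fmeasure, scores["rouge2"].fmeasure)
```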

<u> Table 2: Zero-Shot prompting </u>

|Model | ROUGE-1 (in %) | ROUGE-2 (in %) |
|:-------------:|:--------------:|:--------------:|
|J2-Light | 38.218 | 14.780 |
|J2-Mid | 39.119 | 15.591 |
|J2-Ultra | 41.635 | 17.273 |


<u> Table 3: Few-Shot prompting </u>

|Model | ROUGE-1 (in %) | ROUGE-2 (in %) |
|:-------------:|:--------------:|:--------------:|
|J2-Light | 40.736 | 17.092 |
|J2-Mid | 43.390 | 18.346 |
|J2-Ultra | 45.317 | 19.276 |


<u> Table 4: Fine-tuning (custom model) </u>

|Model | ROUGE-1 (in %) | ROUGE-2 (in %) |
|:-------------:|:--------------:|:--------------:|
|J2-Light | 44.694 | 20.153 |
|J2-Grande | 48.385 | 23.901 |
|J2-Jumbo | DNF\* | DNF\* |

\*DNF: the fine-tuned model did not generate any output (see the insights below).

Note: All models were fine-tuned for 1 epoch.


<u> Insight: </u>

* For out-of-the-box performance, Jurassic-2's biggest model, J2-Ultra, achieves the best ROUGE-1 and ROUGE-2 scores compared to the other models.
* Jurassic-2 model performance steadily increases from Light to Mid to Ultra, across both zero-shot and few-shot settings.
* Few-shot prompting consistently performs better than zero-shot prompting across all model variations.
* As expected, the fine-tuned (custom) J2-Light and J2-Grande models achieve better performance than their out-of-the-box counterparts, i.e., zero-shot and few-shot.
* Despite debugging efforts, the custom version of J2-Jumbo (the biggest model) did not generate anything (DNF) for the exact same inputs.



### <img src="../assets/time.gif" width="32" height="32"/> <img src="../assets/money.gif" width="32" height="32"/> Time & Cost to Train <img src="../assets/money.gif" width="32" height="32"/> <img src="../assets/time.gif" width="32" height="32"/>


* Fine-tuning Jurassic-2 custom models for 1 epoch took between 15 minutes and 1 hour, depending on the size of the model under consideration.


The following is the cost breakdown for fine-tuning the different Jurassic-2 models, as charged by AI21:

|Model | Training data per epoch (MB) | Cost |
|:-------------:|:----------------------------:|:------:|
|J2-Light | 8 | $0.88 |
|J2-Grande | 8 | $4.00 |
|J2-Jumbo | 8 | $26.47 |


<u> Insights: </u>

* Even though the fine-tuned J2-Jumbo (the biggest model) did not generate anything for the exact same inputs, fine-tuning it still incurred the cost above.
* Training data per epoch is the size of the training data used per epoch. In the case of summarization, Samsum's training split is roughly 8 MB in size; a back-of-the-envelope rate calculation follows.
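
A quick calculation (ours, not an official AI21 price list) turns the table above into an implied per-MB-per-epoch rate:

```python
# Implied fine-tuning rate per MB per epoch, derived from the cost table above.
costs = {"J2-Light": 0.88, "J2-Grande": 4.00, "J2-Jumbo": 26.47}
mb_per_epoch = 8  # Samsum's training split is roughly 8 MB
for model, cost in costs.items():
    print(f"{model}: ~${cost / mb_per_epoch:.2f} per MB per epoch")
# J2-Light: ~$0.11, J2-Grande: ~$0.50, J2-Jumbo: ~$3.31
```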


### <img src="../assets/progress.gif" width="32" height="32"/> Inference <img src="../assets/progress.gif" width="32" height="32"/>

The following tables report the APIs' mean response time (MRT), i.e., how long it took the API to return responses on average; a timing sketch is below.
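
A sketch of how MRT can be measured: time repeated API calls and report mean ± standard deviation. The ```complete``` helper is the hypothetical one from the inference sketch earlier; it is not part of the repo's scripts.

```python
# Time repeated completion calls and report mean +/- standard deviation.
import statistics
import time

def mean_response_time(prompts, model="j2-ultra"):
    """Return (mean, stdev) of API latency in seconds over the given prompts."""
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        complete(model, prompt)  # response text is discarded; only latency matters
        latencies.append(time.perf_counter() - start)
    return statistics.mean(latencies), statistics.stdev(latencies)
```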

#### Classification ####

<u> Table 5: Zero-Shot prompting </u>

|Model | Mean Response Time (MRT) (in sec) |
|:-------------:|:---------------------------------:|
|J2-Light | 0.58 ± 0.19 |
|J2-Mid | 0.53 ± 0.27 |
|J2-Ultra | 0.66 ± 0.45 |


#### Summarization ####

<u> Table 6: Zero-Shot prompting vs. Few-Shot prompting vs. Fine-tuning </u>

|Model | Zero-Shot MRT | Few-Shot MRT | Fine-tune MRT |
|:-------------:|:-------------:|:------------:|:-------------:|
|J2-Light | 0.63 ± 0.18 | 0.56 ± 0.17 | 0.48 ± 0.08 |
|J2-Mid | 1.23 ± 6.45 | 0.83 ± 0.31 | 1.76 ± 1.01 |
|J2-Ultra | 1.00 ± 0.44 | 0.92 ± 0.51 | DNF\* |

<u> Insights: </u>

* The MRT for few-shot summarization is lower across all three models than the zero-shot MRT. This could be because, in the zero-shot setting, the models generate longer summaries, as they have no sense of how long a summary should be. In the few-shot setting, the examples from Samsum's training split guide the models to generate shorter summaries, similar in length to their ground-truth counterparts.
* At this time, no conclusive insights can be drawn from the fine-tuning MRTs.