# 1. Problem
Large Language Model - GPT-3 175B

# 2. Directions

Our codebase is capable of training large language models with both model and data parallelism.

### Steps to configure machine

To use this repository, please install a supported version of PyTorch with GPU support (Python 3.8, PyTorch 1.12, CUDA 11.6.2, and NCCL 2.12.10 or later) and NVIDIA [APEX](https://github.com/NVIDIA/apex#quick-start). We recommend using one of [NGC's PyTorch containers](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch); the latest tested compatible version is `nvcr.io/nvidia/pytorch:22.04-py3`.
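
For convenience, the recommended container can be pulled and started with Docker as sketched below. This is a hedged example for interactive setup only: the image tag comes from above, while the mount path, working directory, and interactive flags are illustrative assumptions (the actual training jobs are launched through Slurm, as described next).

```bash
# Pull the latest tested NGC PyTorch container
docker pull nvcr.io/nvidia/pytorch:22.04-py3

# Start an interactive shell with GPU access, mounting the current checkout
# (the mount target /workspace/megatron is an illustrative choice, not required by the scripts)
docker run --gpus all -it --rm \
  -v "$PWD":/workspace/megatron \
  -w /workspace/megatron \
  nvcr.io/nvidia/pytorch:22.04-py3
```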

### Steps to run and time

To train GPT-3, set `COM_DIR` in `gpt3_blend.sh` to point to the location of the preprocessed C4 dataset.

```bash
sbatch run_gpt3.sh <path to log directory> <path to BPE processed directory> <container>
```

Use the script `run_gpt3.sh` as shown above to run GPT-3 175B on clusters using Slurm. You can adjust the number of nodes (tested only with nodes >= 8) and the job run time in the sbatch command on line #3 of the `run_gpt3.sh` script.
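
For illustration only, a submission with hypothetical paths (none of these values is mandated by the scripts) might look like:

```bash
# Placeholder paths plus the container tag recommended above; substitute your own
sbatch run_gpt3.sh /lustre/logs/gpt3_175b /lustre/datasets/c4/preprocessed nvcr.io/nvidia/pytorch:22.04-py3
```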

Note that the model trains for 15 minutes less than the requested run time, because the last 15 minutes are set aside for storing a checkpoint of the last iteration.

Command-line arguments are described in detail in the source file [`arguments.py`](./megatron/arguments.py).


# 3. Dataset/Environment

### Training and test data separation

We use the C4/en/3.0.1 dataset from [HuggingFace/AllenAI](https://huggingface.co/datasets/allenai/c4).
We do not host any datasets for GPT training.
For validation, a subset of the validation dataset has been selected. Details are as follows:
24,567 examples were [selected](https://github.com/sgpyc/training/blob/paxml-llm-draft/large_language_model/paxml/utils/select_example.md) in the validation split to form a smaller eval set. The resulting tfrecord file is at gs://mlperf-llm-public2/c4/en/3.0.1/c4-validation_24567exp.tfrecord, with hashes of the text at gs://mlperf-llm-public2/c4/en/3.0.1/c4-validation_24567exp.hash.

The benchmarking region uses only a quarter of the 1024 original `json.gz` files. Specifically, only the last quarter of the files (`c4-train.00768-of-01024.json.gz` through `c4-train.01023-of-01024.json.gz`) is required.
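
As a quick sanity check, the snippet below verifies that all 256 required shards are present (a hedged sketch; `C4_EN_DIR` is a hypothetical path to the downloaded `en/` split):

```bash
# Hypothetical location of the downloaded C4 en/ split; adjust as needed
C4_EN_DIR=/data/c4/en

# The required shards are c4-train.00768 ... c4-train.01023 (256 files in total)
missing=0
for ind in $(seq -f "%05g" 768 1023); do
  f="$C4_EN_DIR/c4-train.${ind}-of-01024.json.gz"
  [ -e "$f" ] || { echo "missing: $f"; missing=$((missing + 1)); }
done
echo "$missing of 256 required shards missing"
```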

### Data Preprocessing

Run the following commands to merge these 256 files into 2 `json.gz` files. Each of the `json.gz` files will be preprocessed into a pair of Megatron dataset files (`.bin` and `.idx`).

```bash
cd <path to C4>

# create softlinks to store each shard before merging
mkdir -p softlinks
for shard in {6..7}; do
  start=$((shard * 128))
  end=$((shard * 128 + 127))
  mkdir -p softlinks/en_$shard
  for ind in $(seq -f "%05g" $start $end); do
    ln -s ../../en/c4-train.${ind}-of-01024.json.gz softlinks/en_${shard}/c4-train.${ind}-of-01024.json.gz
  done
done

# merge
mkdir -p en_merge
for shard in {6..7}; do
  cat softlinks/en_${shard}/*gz > en_merge/c4-train.en_${shard}.json.gz
done
cat en/c4-validation.0000* > en_merge/c4-validation.json.gz
```

After preparing the data folder, download the tokenizer model.
Currently, the SentencePiece model (SPM) trained by Google using [these](https://github.com/sgpyc/training/blob/paxml-llm-draft/large_language_model/paxml/utils/generate_spm.md) instructions is used.

Modify `C4_PATH` in `preprocess.sh` and `preprocess_val.sh` to specify
the correct input/output paths, then run preprocessing as follows:
```bash
cd scripts
sbatch preprocess.sh <path to c4>
sbatch preprocess_val.sh <path to c4>
```
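
Once both jobs finish, each merged `json.gz` file should have produced a `.bin`/`.idx` pair. A hedged check (the output directory below is a hypothetical placeholder; use whatever output path your `preprocess.sh` is configured to write to):

```bash
# Hypothetical output location; point this at the directory preprocess.sh writes to
PREPROC_DIR=/data/c4/preprocessed

# Expect one .bin and one .idx file per merged training shard plus the validation split
ls -lh "$PREPROC_DIR"/*.bin "$PREPROC_DIR"/*.idx
```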

Currently, the training script expects the BPE [vocab.json](https://huggingface.co/gpt2/resolve/main/vocab.json) and [merges.txt](https://huggingface.co/gpt2/resolve/main/merges.txt) files. Since tokenization is already done in the preprocessing step above, these files are used to create a BPE tokenizer that serves only two purposes at this point in the code (a quick way to confirm both values is sketched after the list):

1. To find out the eod entry index (value is 50256)
2. To find out the vocab size (value is 50257)
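
For instance, assuming `jq` is available (an assumption, not a repository requirement), both values can be read straight from the downloaded `vocab.json`:

```bash
# The end-of-document token is GPT-2's <|endoftext|>; its id should print as 50256
jq '."<|endoftext|>"' vocab.json

# The vocab size is the number of entries in the mapping; expect 50257
jq 'length' vocab.json
```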

Correctness of the dataset preprocessing can be verified by comparing against the checksums provided [here](./checksums/dataset_checksum.log).
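
A minimal sketch of such a comparison, assuming the log stores standard `md5sum` output and that its file names match your local layout (both are assumptions; adapt to the actual log format):

```bash
# Recompute checksums for the preprocessed files and diff them against the reference log
# (run from the directory the log's entries are relative to)
md5sum *.bin *.idx | sort > my_dataset_checksum.log
diff <(sort <path to repo>/checksums/dataset_checksum.log) my_dataset_checksum.log
```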

# 4. Model
### Publication/Attribution
Megatron ([1](https://arxiv.org/pdf/1909.08053.pdf), [2](https://arxiv.org/pdf/2104.04473.pdf), and [3](https://arxiv.org/pdf/2205.05198.pdf)) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA.

### List of Layers

The model largely follows the GPT-3 paper; refer [here](https://docs.google.com/spreadsheets/d/1VdMXogbmoR-LWQJvdQ0BgIeK0Npe0qk50qVT7VpqIyo/edit?resourcekey=0-F8loESsxQtGsHMNNXMohTw#gid=620389348) for model details.

### Model checkpoint
#### Conversion
In the benchmarking region, training should resume from a PAXML checkpoint trained with a global batch size of 1536 for 4000 iterations.
The PAXML checkpoint is available at: gs://mlperf-llm-public2/gpt3_spmd1x64x24_tpuv4-3072_v84_20221101/checkpoints/checkpoint_00004000
To resume training from this checkpoint with Megatron, it must first be converted into a format suitable for Megatron (this step only needs to be done once).
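
The checkpoint can be fetched from the public bucket with `gsutil`, for example (this assumes the Google Cloud SDK is installed and that `$PAXML_CKPT_PATH` is a local destination of your choice; the checkpoint is very large, so check available disk space first):

```bash
# Download the reference PAXML checkpoint (trained to 4000 iterations at GBS 1536)
gsutil -m cp -r \
  gs://mlperf-llm-public2/gpt3_spmd1x64x24_tpuv4-3072_v84_20221101/checkpoints/checkpoint_00004000 \
  $PAXML_CKPT_PATH
```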

To convert the PAXML checkpoint to Megatron format, a [script](scripts/convert_paxml_to_megatron_distributed.py) has been provided:
```bash
# Convert model and optimizer parameters to Megatron format (runs in ~40 minutes on DGXA100, requires 1TB of CPU memory):
python -u convert_paxml_to_megatron_distributed.py -gckpt $PAXML_CKPT_PATH -o $EXTERNAL_MODEL_CHECKPOINT_DIR --dtype fp32 # or `--dtype bf16` for a BF16 checkpoint
# Add the framework-specific common.pt file to the checkpoint (instantaneous):
python json_to_torch.py -i common_fp32.json -o $EXTERNAL_MODEL_CHECKPOINT_DIR/common.pt # or `-i common_bf16.json` for a BF16 checkpoint
```
Correctness of the Megatron format checkpoint can be verified by comparing against the checksums provided [here](./checksums/fp32_checkpoint_checksum.log). Validation log perplexity can also be used as a metric to verify the correctness of the checkpoint and the loading scripts. To do this, the model should be evaluated on the entire validation dataset after loading weights from the checkpoint. We have observed an average log perplexity of 2.7767 and a standard deviation of 0.00035 (data obtained from 16 runs).

**Note: For BF16 training, the conversion scripts need to be run again with the BF16 arguments specified above.**

#### Checkpoint Parameters
There are four groups of parameters in the checkpoint:
1. model FP32 weights (or BF16 weights)
2. first moments of the optimizer state
3. second moments of the optimizer state
4. model FP32 weights copy (created only for BF16 training)

For each model layer we store a separate directory for each of those groups, e.g. for position embeddings:
1. `language_model.embedding.position_embeddings.weight`
2. `optimizer.state.exp_avg.language_model.embedding.position_embeddings.weight` (first moments of the optimizer state)
3. `optimizer.state.exp_avg_sq.language_model.embedding.position_embeddings.weight` (second moments of the optimizer state)
4. `optimizer.state.fp32_from_fp16.language_model.embedding.position_embeddings.weight` (model FP32 weights copy, created only for BF16 training)

Each directory contains a single Zarr array (see the Zarr format section below) and corresponds to a single parameter tensor
(which might be split across different devices during model training).
Pipeline-parallel layers are stacked together in a single array.
E.g. for a model with 96 transformer layers, the array corresponding to the self-attention QKV weight
(`language_model.encoder.layers.self_attention.query_key_value.weight`) has shape [**96**, 36864, 12288].

#### Checkpoint Metadata
All non-parameter data is stored in a `common.pt` torch file and contains framework-specific information.
Example content of a Megatron-specific `common.pt` file is presented in the `scripts/common_bf16.json` file.

Apart from that, the checkpoint metadata is stored in a `metadata.json` file.

#### Checkpoint Zarr format
Each parameter is stored in a separate directory as a [Zarr](https://zarr.readthedocs.io/) array to allow parallel access.
The content of a single directory is an array fragmented into multiple files (e.g. `0.0`, `0.1`, ...) and should be manipulated
only with Zarr or Zarr-compatible libraries such as [TensorStore](https://google.github.io/tensorstore/).
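
For a quick orientation, the layout can be inspected directly from the shell (a hedged sketch; `$EXTERNAL_MODEL_CHECKPOINT_DIR` is the converted checkpoint from the conversion step above, and the parameter name is the position-embedding example used earlier):

```bash
# One sub-directory per parameter tensor (and per optimizer-state copy)
ls $EXTERNAL_MODEL_CHECKPOINT_DIR | head

# Inside a single parameter directory: the Zarr metadata file (.zarray) plus chunk files (0.0, 0.1, ...)
ls -a $EXTERNAL_MODEL_CHECKPOINT_DIR/language_model.embedding.position_embeddings.weight
```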

Megatron features a small library in `megatron.core.dist_checkpointing` that builds on the Zarr and TensorStore primitives
and allows operating on arrays split across different devices (in tensor- or pipeline-parallel groups).

We recommend familiarizing yourself with the aforementioned libraries, but for convenience
here is a snippet that reads a single layer array into a NumPy array with either TensorStore or Zarr:
```python
import tensorstore as ts
import zarr

def open_with_ts(layer_dir):
    # Open the Zarr array through TensorStore and read it fully into a NumPy array
    spec = {'driver': 'zarr',
            'metadata_key': '.zarray',
            'kvstore': {'driver': 'file', 'path': layer_dir}}
    return ts.open(ts.Spec(spec), open=True).result().read().result()

def open_with_zarr(layer_dir):
    # Open the same array directly with the zarr library and materialize it with [:]
    return zarr.open(layer_dir)[:]

# e.g.
layer_norm_weights_optim_state = open_with_ts('/llm_checkpoint/optimizer.state.exp_avg.language_model.encoder.final_layernorm.weight')
```

Currently NumPy does not support the BF16 datatype natively, but support can be added simply by importing the TensorStore library (`import tensorstore`).

### How to run
To load an external Megatron format checkpoint (in this case, a PAXML checkpoint converted to Megatron format) before training, set the following env variables (a combined example is sketched below):
- `EXTERNAL_MODEL_CHECKPOINT_DIR` pointing to the checkpoint directory
- `EXTERNAL_TRAINING_ITERATIONS` set to the number of iterations the external checkpoint was trained for (default: 4000)
- `EXTERNAL_GBS` set to the global batch size the external checkpoint was trained with, used to determine the number of samples already consumed (default: 1536)

Note that an external checkpoint is needed only when training starts from a checkpoint that was not generated during the current training process in the benchmarking region. When _resuming_ Megatron training (e.g. after hitting a preset node time limit), `EXTERNAL_MODEL_CHECKPOINT_DIR` should not be set.

- Set the `USE_BF16` env variable to true for BF16 training.
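
For example (the checkpoint path is a hypothetical placeholder; whether the variables are picked up from the submission environment or set inside `run_gpt3.sh` depends on your cluster setup):

```bash
# Resume from the converted external checkpoint (path is illustrative)
export EXTERNAL_MODEL_CHECKPOINT_DIR=/checkpoints/gpt3_megatron_fp32
export EXTERNAL_TRAINING_ITERATIONS=4000   # iterations the external checkpoint was trained for
export EXTERNAL_GBS=1536                   # global batch size used to train the external checkpoint
export USE_BF16=false                      # set to true for BF16 training

sbatch run_gpt3.sh <path to log directory> <path to BPE processed directory> <container>
```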

# 5. Quality

### Quality metric
Log Perplexity

### Quality target
2.69

### Evaluation frequency
Evaluate after every 24576 samples (= 24576 × 2048 tokens ≈ 50.33M tokens)

### Evaluation thoroughness
Evaluation on the validation subset that consists of 24,567 examples.
| 185 | + |