Commit db5f2e6

ShriyaRishab, mikolajblaz, and anmolgupt authored
[LLM] NVIDIA Megatron reference (pytorch#562)
* [llm] Init draft NVIDIA reference
* [LLM] Add exact HPs used to match NVIDIA's convergence curves
* [LLM] Add data preprocessing steps and remove dropout
* [LLM] fix eval, add ckpt load util, remove unnecessary files
* [LLM] Update data preprocessing stage in README
* Full validation and google settings
* Apply review comments
* Anmolgupt/nvidia llm reference update (pytorch#3)
* Update Nvidia LLM reference code version (Co-authored-by: Anmol Gupta <[email protected]>)
* fixes to imports (pytorch#5) (Co-authored-by: Anmol Gupta <[email protected]>)
* distributed checkpoint and mlperf logger support (pytorch#6)
* readme and mllogger keywords update (pytorch#7) (Co-authored-by: Anmol Gupta <[email protected]>)
* Update fp32_checkpoint_checksum.log
* Update README.md
* Update README.md
* Update README.md
* mlperf logger keywords update (pytorch#8) (Co-authored-by: Anmol Gupta <[email protected]>)
* [LLM] Create framework folder
* [LLM] Update README to follow reference template
* Describe LLM checkpoint format in README (pytorch#9)
* [LLM] Readme updates, small fixes
* readme update and run script eval update (pytorch#10) (Co-authored-by: Anmol Gupta <[email protected]>)

---------

Co-authored-by: Mikołaj Błaż <[email protected]>
Co-authored-by: anmolgupt <[email protected]>
Co-authored-by: Anmol Gupta <[email protected]>
Co-authored-by: mikolajblaz <[email protected]>
1 parent a9056b8 commit db5f2e6

File tree: 160 files changed, +125746 −0 lines

large_language_model/megatron-lm/Dockerfile

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:22.04-py3
FROM ${FROM_IMAGE_NAME}

# Copy code
WORKDIR /workspace/llm
COPY . .
RUN pip install -r requirements.txt
ENV PYTHONPATH "/workspace/llm:${PYTHONPATH}"

large_language_model/megatron-lm/LICENSE

Lines changed: 376 additions & 0 deletions
Large diffs are not rendered by default.

large_language_model/megatron-lm/README.md

Lines changed: 185 additions & 0 deletions
@@ -0,0 +1,185 @@
# 1. Problem

Large Language Model - GPT-3 175B

# 2. Directions

Our codebase is capable of training large language models with both model and data parallelism.

### Steps to configure machine

To use this repository, please install a supported version of PyTorch with GPU support (Python 3.8, PyTorch 1.12, CUDA 11.6.2, and NCCL 2.12.10 or above) and NVIDIA [APEX](https://github.com/NVIDIA/apex#quick-start). We recommend using one of [NGC's PyTorch containers](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch); the latest tested compatible version is `nvcr.io/nvidia/pytorch:22.04-py3`.

### Steps to run and time

To train GPT-3, set `COM_DIR` in `gpt3_blend.sh` to point to the C4 dataset location that contains the preprocessed dataset.

```
sbatch run_gpt3.sh <path to log directory> <path to BPE processed directory> <container>
```

Use the `run_gpt3.sh` script as shown above to run GPT-3 175B on clusters using Slurm. You can adjust the number of nodes (tested only with nodes >= 8) and the job run time in the sbatch command on line #3 of the `run_gpt3.sh` script.

Note that the model trains for 15 minutes less than the actual run time, because the last 15 minutes are set aside for storing a checkpoint of the last iteration.

Command line arguments are described in detail in the source file [`arguments.py`](./megatron/arguments.py).

# 3. Dataset/Environment

### Training and test data separation

We use the C4/en/3.0.1 dataset from [HuggingFace/AllenAI](https://huggingface.co/datasets/allenai/c4).
We do not host any datasets for GPT training.
For validation, a subset of the validation dataset has been selected. Details are as follows:
24,567 examples were [selected](https://github.com/sgpyc/training/blob/paxml-llm-draft/large_language_model/paxml/utils/select_example.md) from the validation split to form a smaller eval set. The resulting tfrecord file is at gs://mlperf-llm-public2/c4/en/3.0.1/c4-validation_24567exp.tfrecord, with hashes of the text at gs://mlperf-llm-public2/c4/en/3.0.1/c4-validation_24567exp.hash.

The benchmarking region uses only one quarter of the 1024 original `json.gz` files. Specifically, the last quarter of the files (indices 00768 through 01023) is required.

### Data Preprocessing

Run the following commands to merge these 256 files into 2 `json.gz` files. Each of the `json.gz` files will be preprocessed into a pair of Megatron dataset files (`.bin` and `.idx`).

```bash
cd <path to C4>

# create softlinks to store each shard before merging
mkdir -p softlinks
for shard in {6..7}; do
  start=$((shard * 128))
  end=$((shard * 128 + 127))
  mkdir -p softlinks/en_$shard
  for ind in $(seq -f "%05g" $start $end); do
    ln -s ../../en/c4-train.${ind}-of-01024.json.gz softlinks/en_${shard}/c4-train.${ind}-of-01024.json.gz
  done
done

# merge
mkdir -p en_merge
for shard in {6..7}; do
  cat softlinks/en_${shard}/*gz > en_merge/c4-train.en_${shard}.json.gz
done
cat en/c4-validation.0000* > en_merge/c4-validation.json.gz
```

After preparing the data folder, download the tokenizer model.
Currently, the SentencePiece model (SPM) trained by Google using [these](https://github.com/sgpyc/training/blob/paxml-llm-draft/large_language_model/paxml/utils/generate_spm.md) instructions is used.
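
The downloaded tokenizer model can be sanity-checked with the `sentencepiece` Python package; a minimal sketch, with a placeholder model filename (not the actual file name used by the scripts):

```python
import sentencepiece as spm

# Placeholder filename -- substitute the SPM file produced by the linked instructions.
sp = spm.SentencePieceProcessor(model_file="c4_spm.model")

print(sp.get_piece_size())                    # tokenizer vocabulary size
print(sp.encode("MLPerf LLM", out_type=int))  # token ids for a sample string
```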

Modify `C4_PATH` in `preprocess.sh` and `preprocess_val.sh` to specify
the correct input/output paths, and run preprocessing as follows:
```bash
cd scripts
sbatch preprocess.sh <path to c4>
sbatch preprocess_val.sh <path to c4>
```

Currently, the training script expects BPE [vocab.json](https://huggingface.co/gpt2/resolve/main/vocab.json) and [merges.txt](https://huggingface.co/gpt2/resolve/main/merges.txt) files. These files are used to create a BPE tokenizer which, since tokenization is already done in the step above, is only used for two things at this point in the code (a small sketch follows the list):

1. To find out the end-of-document (eod) entry index (value is 50256)
2. To find out the vocab size (value is 50257)
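
For reference, both values can be read directly from the downloaded `vocab.json`; a minimal sketch, assuming the file sits in the working directory:

```python
import json

# GPT-2 BPE vocabulary: a JSON object mapping token strings to integer ids.
with open("vocab.json") as f:
    vocab = json.load(f)

print(vocab["<|endoftext|>"])  # eod entry index: 50256
print(len(vocab))              # vocab size: 50257
```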

Correctness of the dataset preprocessing can be verified by comparing against the checksums provided [here](./checksums/dataset_checksum.log).
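
One possible way to run the comparison from Python (a sketch; `DATA_DIR` is a hypothetical location of the preprocessed `.bin`/`.idx` files, and the log is assumed to use the `<md5> <filename>` layout of the file in this repository):

```python
import hashlib
from pathlib import Path

DATA_DIR = Path("/path/to/preprocessed/c4")  # hypothetical output directory of preprocess.sh

def md5sum(path, chunk_size=1 << 24):
    """Compute an MD5 digest without loading the whole file into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

for line in Path("checksums/dataset_checksum.log").read_text().splitlines():
    expected, name = line.split()
    actual = md5sum(DATA_DIR / name)
    print(f"{name}: {'OK' if actual == expected else 'MISMATCH'}")
```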

# 4. Model
### Publication/Attribution
Megatron ([1](https://arxiv.org/pdf/1909.08053.pdf), [2](https://arxiv.org/pdf/2104.04473.pdf), and [3](https://arxiv.org/pdf/2205.05198.pdf)) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA.

### List of Layers

The model largely follows the GPT-3 paper; refer [here](https://docs.google.com/spreadsheets/d/1VdMXogbmoR-LWQJvdQ0BgIeK0Npe0qk50qVT7VpqIyo/edit?resourcekey=0-F8loESsxQtGsHMNNXMohTw#gid=620389348) for model details.

### Model checkpoint
#### Conversion
In the benchmarking region, training resumes from a PAXML checkpoint that was trained with a global batch size of 1536 for 4000 iterations.
The PAXML checkpoint is available at: gs://mlperf-llm-public2/gpt3_spmd1x64x24_tpuv4-3072_v84_20221101/checkpoints/checkpoint_00004000
To resume training from the above checkpoint in Megatron, it must be converted into a format suitable for Megatron (this step only needs to be done once).

To convert the PAXML checkpoint to Megatron's format, a [script](scripts/convert_paxml_to_megatron_distributed.py) is provided:
```bash
# Convert model and optimizer parameters to Megatron format (runs in ~40 minutes on DGXA100, requires 1TB of CPU memory):
python -u convert_paxml_to_megatron_distributed.py -gckpt $PAXML_CKPT_PATH -o $EXTERNAL_MODEL_CHECKPOINT_DIR --dtype fp32 # or `--dtype bf16` for BF16 checkpoint
# Add framework-specific common.pt file to the checkpoint (instantaneous):
python json_to_torch.py -i common_fp32.json -o $EXTERNAL_MODEL_CHECKPOINT_DIR/common.pt # or `-i common_bf16.json` for BF16 checkpoint
```
Correctness of the Megatron format checkpoint can be verified by comparing against the checksums provided [here](./checksums/fp32_checkpoint_checksum.log). Validation log perplexity can also be used as a metric to verify the correctness of the checkpoint and the loading scripts. To do this, the model should be evaluated on the entire validation dataset after loading weights from the checkpoint. We have observed an average log perplexity of 2.7767 with a standard deviation of 0.00035 (data obtained from 16 runs).

**Note: For BF16 training, the conversion scripts need to be run again with the BF16 arguments specified above.**

#### Checkpoint Parameters
There are four groups of parameters in the checkpoint:
1. model FP32 weights (or BF16 weights)
2. first moments of the optimizer state
3. second moments of the optimizer state
4. model FP32 weights copy (created only for BF16 training)

For each model layer we store a separate directory for each of those groups, e.g. for position embeddings:
1. `language_model.embedding.position_embeddings.weight`
2. `optimizer.state.exp_avg.language_model.embedding.position_embeddings.weight` (first moments of the optimizer state)
3. `optimizer.state.exp_avg_sq.language_model.embedding.position_embeddings.weight` (second moments of the optimizer state)
4. `optimizer.state.fp32_from_fp16.language_model.embedding.position_embeddings.weight` (model FP32 weights copy, created only for BF16 training)

Each directory contains a single Zarr array (see the Zarr section below) and corresponds to a single parameter tensor
(which might be split across different devices during model training).
Pipeline-parallel layers are stacked together in a single array.
E.g. for a model with 96 transformer layers, the array corresponding to the self-attention QKV weights
(`language_model.encoder.layers.self_attention.query_key_value.weight`) has shape [**96**, 36864, 12288].
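
To make the stacking concrete, here is a small sketch (using the same hypothetical `/llm_checkpoint` path as the snippet further below) that opens the stacked array and slices out a single layer:

```python
import zarr

# Hypothetical checkpoint location; each parameter lives in its own Zarr directory.
qkv_weight = zarr.open(
    "/llm_checkpoint/language_model.encoder.layers.self_attention.query_key_value.weight"
)

print(qkv_weight.shape)     # (96, 36864, 12288): one slice per transformer layer
layer0_qkv = qkv_weight[0]  # parameters of the first transformer layer, shape (36864, 12288)
```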

#### Checkpoint Metadata
All non-parameter data is stored in a `common.pt` torch file, which contains framework-specific information.
Example content of a Megatron-specific `common.pt` file is presented in the `scripts/common_bf16.json` file.

Apart from that, the checkpoint metadata is stored in a `metadata.json` file.

#### Checkpoint Zarr format
Each parameter is stored in a separate directory as a [Zarr](https://zarr.readthedocs.io/) array to allow parallel access.
The content of a single directory is an array fragmented into multiple files (e.g. `0.0`, `0.1`, ...) and should be manipulated
only with Zarr or Zarr-compatible libraries such as [TensorStore](https://google.github.io/tensorstore/).

Megatron features a small library in `megatron.core.dist_checkpointing` that builds on the Zarr and TensorStore primitives
and allows operating on arrays split across different devices (in tensor or pipeline parallel groups).

We recommend familiarizing yourself with the aforementioned libraries, but for convenience
here is a snippet that reads a single layer array into a NumPy array with either TensorStore or Zarr:
```python
import tensorstore as ts
import zarr


def open_with_ts(layer_dir):
    spec = {'driver': 'zarr',
            'metadata_key': '.zarray',
            'kvstore': {'driver': 'file', 'path': layer_dir}}
    return ts.open(ts.Spec(spec), open=True).result().read().result()


def open_with_zarr(layer_dir):
    return zarr.open(layer_dir)[:]


# e.g.
layer_norm_weights_optim_state = open_with_ts('/llm_checkpoint/optimizer.state.exp_avg.language_model.encoder.final_layernorm.weight')
```

Currently NumPy does not support the BF16 data type natively, but support can be added simply by importing the tensorstore library (`import tensorstore`).
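
For example, a BF16 checkpoint array could be read and upcast roughly like this (a sketch, assuming a BF16 checkpoint and reusing the hypothetical `/llm_checkpoint` path from the snippet above):

```python
import numpy as np
import tensorstore  # noqa: F401 -- the import itself registers a bfloat16 dtype with NumPy
import zarr

# Hypothetical path to a BF16 checkpoint parameter; upcast to FP32 for inspection in plain NumPy.
pos_emb = zarr.open('/llm_checkpoint/language_model.embedding.position_embeddings.weight')[:]
pos_emb_fp32 = pos_emb.astype(np.float32)
print(pos_emb_fp32.dtype, pos_emb_fp32.shape)
```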

### How to run
To load an external Megatron-format checkpoint (in this case, a PAXML checkpoint converted to Megatron format) before training, set the following env variables:
- `EXTERNAL_MODEL_CHECKPOINT_DIR` pointing to the checkpoint directory
- `EXTERNAL_TRAINING_ITERATIONS` to the number of iterations the external checkpoint was trained for (default: 4000)
- `EXTERNAL_GBS` to the global batch size the external checkpoint was trained with, used to determine the number of samples already consumed (default: 1536)

Note that using an external checkpoint is needed only when training from a checkpoint that was not generated during the current training process in the benchmarking region. When _resuming_ Megatron training (e.g. after hitting a preset node time limit), `EXTERNAL_MODEL_CHECKPOINT_DIR` should not be set.

Set the `USE_BF16` env variable to true for BF16 training.

# 5. Quality

### Quality metric
Log Perplexity

### Quality target
2.69

### Evaluation frequency
Evaluate after every 24576 samples (24576 sequences × 2048 tokens ≈ 50.33M tokens)

### Evaluation thoroughness
Evaluation on the validation subset that consists of 24567 examples.

large_language_model/megatron-lm/checksums/dataset_checksum.log

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
317a1c1b0b17fbd658e3e0b09d118ce9 c4_en_6_c4_spm_text_document.bin
5c8cfe37a26f919fb3998e84d1d07d8e c4_en_6_c4_spm_text_document.idx
5a84af04d55765993ecb5461af56b718 c4_en_7_c4_spm_text_document.bin
35b23332069840094e1a75332cdeab62 c4_en_7_c4_spm_text_document.idx
20d868f6cb865ce616ce7b9cf8312be0 c4_en_validation_subset_c4_spm_text_document.bin
f76050809d0b42611eeef31d67d04224 c4_en_validation_subset_c4_spm_text_document.idx
