
Commit 7caf18c

jstjohn and moradza authored
Evo2 Megatron Bridge Recipe Prototype (#1357)
### Description

#### Usage

Go to the recipe:

```bash
cd bionemo-recipes/recipes/evo2_megatron
```

Build the image and start a container:

```bash
docker build -t evo2_megatron .
docker run --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all -it evo2_megatron
```

NOTE: some users on 2x A6000 workstations have seen both GPUs freeze during torchrun. If this happens, disable peer-to-peer transfers:

```bash
export NCCL_P2P_DISABLE=1
```

Then execute the mock-data example with:

```bash
torchrun --nproc-per-node 1 --no-python \
    train_evo2 \
    --hf-tokenizer-model-path tokenizers/nucleotide_fast_tokenizer_256 \
    --model-size striped_hyena_1b_nv_parallel --max-steps 22 --eval-interval 10 \
    --eval-iters 3 --mock-data --micro-batch-size 32 --global-batch-size 256 \
    --seq-length 1024 --tensor-model-parallel 1 --use-precision-aware-optimizer \
    --dataset-seed 33 --seed 41 --ckpt-async-save --spike-no-more-embedding-init \
    --no-weight-decay-embeddings --cross-entropy-loss-fusion --align-param-gather \
    --overlap-param-gather --grad-reduce-in-fp32 --decay-steps 100 --warmup-steps 10 \
    --mixed-precision-recipe bf16-mixed --no-fp32-residual-connection \
    --activation-checkpoint-recompute-num-layers 1 --attention-dropout 0.001 \
    --hidden-dropout 0.001 --eod-pad-in-loss-mask --enable-preemption \
    --log-interval 5 --debug-ddp-parity-freq 10 \
    --wandb-project evo2-recipes-verification-tmp \
    --wandb-run-name tmp_workstation_run_mock_data \
    --result-dir tmp --no-renormalize-loss
```

That should give output like the following:

```
----------------------------------
Setting rerun_state_machine.current_iteration to 0...
Starting training loop at iteration 0
/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py:4879: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning.
  warnings.warn( # warn only once
Step Time : 7.04s GPU utilization: 24.7 MODEL_TFLOP/s/GPU
Number of parameters in transformer layers in billions: 0.86
[2025-12-05 00:25:41] iteration 1/ 10 | consumed samples: 128 | elapsed time per iteration (ms): 7037.3 | learning rate: 3.000000E-05 | global batch size: 128 | lm loss: 6.700717E+00 | loss scale: 1.0 | grad norm: 117.718 | number of skipped iterations: 0 | number of nan iterations: 0 |
Number of parameters in embedding layers in billions: 0.00
Total number of parameters in billions: 0.86
Number of parameters in most loaded shard in billions: 0.8614
Theoretical memory footprints: weight and optimizer=9857.42 MB
[Rank 0] (after 1 iterations) memory (GB) | mem-allocated-gigabytes: 13.367 | mem-active-gigabytes: 13.367 | mem-inactive-gigabytes: 0.42641 | mem-reserved-gigabytes: 14.259 | mem-max-allocated-gigabytes: 13.367 | mem-max-active-gigabytes: 13.367 | mem-max-inactive-gigabytes: 0.43855 | mem-max-reserved-gigabytes: 14.259 | mem-alloc-retires: 0 | mem-allocated-count: 284
Step Time : 5.96s GPU utilization: 29.2 MODEL_TFLOP/s/GPU
[2025-12-05 00:25:47] iteration 2/ 10 | consumed samples: 256 | elapsed time per iteration (ms): 5959.8 | learning rate: 6.000000E-05 | global batch size: 128 | lm loss: 6.705625E+00 | loss scale: 1.0 | grad norm: 117.618 | number of skipped iterations: 0 | number of nan iterations: 0 |
Step Time : 5.95s GPU utilization: 29.2 MODEL_TFLOP/s/GPU
[2025-12-05 00:25:53] iteration 3/ 10 | consumed samples: 384 | elapsed time per iteration (ms): 5954.5 | learning rate: 9.000000E-05 | global batch size: 128 | lm loss: 8.673918E-02 | loss scale: 1.0 | grad norm: 5.777 | number of skipped iterations: 0 | number of nan iterations: 0 |
Step Time : 5.97s GPU utilization: 29.1 MODEL_TFLOP/s/GPU
[2025-12-05 00:25:59] iteration 4/ 10 | consumed samples: 512 | elapsed time per iteration (ms): 5974.1 | learning rate: 1.200000E-04 | global batch size: 128 | lm loss: 7.124253E-03 | loss scale: 1.0 | grad norm: 0.827 | number of skipped iterations: 0 | number of nan iterations: 0 |
Step Time : 6.00s GPU utilization: 29.0 MODEL_TFLOP/s/GPU
[2025-12-05 00:26:05] iteration 5/ 10 | consumed samples: 640 | elapsed time per iteration (ms): 5996.1 | learning rate: 1.500000E-04 | global batch size: 128 | lm loss: 1.208314E-03 | loss scale: 1.0 | grad norm: 0.130 | number of skipped iterations: 0 | number of nan iterations: 0 |
Step Time : 6.02s GPU utilization: 28.9 MODEL_TFLOP/s/GPU
[2025-12-05 00:26:11] iteration 6/ 10 | consumed samples: 768 | elapsed time per iteration (ms): 6017.2 | learning rate: 1.800000E-04 | global batch size: 128 | lm loss: 4.079018E-04 | loss scale: 1.0 | grad norm: 0.041 | number of skipped iterations: 0 | number of nan iterations: 0 |
Step Time : 6.03s GPU utilization: 28.9 MODEL_TFLOP/s/GPU
[2025-12-05 00:26:17] iteration 7/ 10 | consumed samples: 896 | elapsed time per iteration (ms): 6034.3 | learning rate: 2.100000E-04 | global batch size: 128 | lm loss: 1.331343E-04 | loss scale: 1.0 | grad norm: 0.010 | number of skipped iterations: 0 | number of nan iterations: 0 |
Step Time : 6.05s GPU utilization: 28.8 MODEL_TFLOP/s/GPU
[2025-12-05 00:26:23] iteration 8/ 10 | consumed samples: 1024 | elapsed time per iteration (ms): 6045.5 | learning rate: 2.400000E-04 | global batch size: 128 | lm loss: 8.991118E-05 | loss scale: 1.0 | grad norm: 0.006 | number of skipped iterations: 0 | number of nan iterations: 0 |
Step Time : 6.07s GPU utilization: 28.7 MODEL_TFLOP/s/GPU
[2025-12-05 00:26:29] iteration 9/ 10 | consumed samples: 1152 | elapsed time per iteration (ms): 6065.8 | learning rate: 2.700000E-04 | global batch size: 128 | lm loss: 7.406142E-05 | loss scale: 1.0 | grad norm: 0.005 | number of skipped iterations: 0 | number of nan iterations: 0 |
Step Time : 6.08s GPU utilization: 28.7 MODEL_TFLOP/s/GPU
[2025-12-05 00:26:35] iteration 10/ 10 | consumed samples: 1280 | elapsed time per iteration (ms): 6078.1 | learning rate: 3.000000E-04 | global batch size: 128 | lm loss: 5.936620E-05 | loss scale: 1.0 | grad norm: 0.004 | number of skipped iterations: 0 | number of nan iterations: 0 |
[after training is done] datetime: 2025-12-05 00:26:35
```

#### Accuracy evaluation

BF16 in this recipe is on par with the previous FP8 runs. However, a bug currently causes this recipe's FP8 path to underperform. This is in addition to the following two issues, which also block FP8 use in practice at the moment: NVIDIA-NeMo/Megatron-Bridge#1730 and NVIDIA-NeMo/Megatron-Bridge#1707.

<img width="576" height="455" alt="training_loss_comparison" src="https://github.com/user-attachments/assets/1952dce3-1ffb-4154-8f74-2a5694aa0794" />

#### Performance Comparison

Both BF16 and FP8 precision outperform the previous FP8 runs in NeMo2.

| Evo2 1B Run | Seconds per step (lower is better) | Tokens/sec/GPU | Global Batch Size | Number of GPUs | Vocab Size |
| --- | --- | --- | --- | --- | --- |
| MBridge BF16 | 6.10 | 26,859 | 960 | 48 | 256 |
| MBridge FP8 (delayed) | 5.38 | 30,453 | 960 | 48 | 256 |
| MBridge FP8 (delayed) | 5.39 | 30,397 | 960 | 48 | 512 |
| NeMo2 FP8 (delayed) | 6.18 | 26,511 | 960 | 48 | 512 |
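As a quick sanity check on this table, tokens/sec/GPU is just global batch × sequence length / (seconds per step × number of GPUs). The reported figures are consistent with a sequence length of 8192; that length is not stated in the table, so treat it as an inferred assumption. A minimal sketch:

```python
# Sanity-check the throughput column above.
# ASSUMPTION: sequence length 8192 (not stated in the table, but it
# reproduces the reported tokens/sec/GPU figures).
SEQ_LEN = 8192

def tokens_per_sec_per_gpu(sec_per_step: float, global_batch: int, n_gpus: int) -> float:
    """Tokens processed per second per GPU at a given step time."""
    return global_batch * SEQ_LEN / (sec_per_step * n_gpus)

print(f"{tokens_per_sec_per_gpu(6.10, 960, 48):,.0f}")  # 26,859  (MBridge BF16)
print(f"{tokens_per_sec_per_gpu(5.38, 960, 48):,.0f}")  # ~30,454 (table reports 30,453 for MBridge FP8)
```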
### Type of changes

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Refactor
- [ ] Documentation update
- [ ] Other (please describe):

### CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

- [ciflow:skip](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:skip) - Skip all CI tests for this PR
- [ciflow:notebooks](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:notebooks) - Run Jupyter notebook execution tests for bionemo2
- [ciflow:slow](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:slow) - Run slow single-GPU integration tests marked as `@pytest.mark.slow` for bionemo2
- [ciflow:all](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:all) - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. This label can be used to enforce running all bionemo2 tests.
- [ciflow:all-recipes](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:all-recipes) - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.

Unit tests marked as `@pytest.mark.multi_gpu` or `@pytest.mark.distributed` are not run in the PR pipeline. For more details, see [CONTRIBUTING](CONTRIBUTING.md).

> [!NOTE]
> By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.

#### Authorizing CI Runs

We use [copy-pr-bot](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/#automation) to manage authorization of CI runs on NVIDIA's compute resources.

- If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will automatically be copied to a `pull-request/` prefixed branch in the source repository (e.g. `pull-request/123`).
- If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an `/ok to test` comment on the pull request to trigger CI. This must be done for each new commit.

### Pre-submit Checklist

- [ ] I have tested these changes locally
- [ ] I have updated the documentation accordingly
- [ ] I have added/updated tests as needed
- [ ] All existing tests pass successfully

---

Signed-off-by: John St John <[email protected]>
Co-authored-by: amoradzadeh <[email protected]>
1 parent 3c68452 · commit 7caf18c

File tree: 112 files changed, +25,312 −2 lines. Some content is hidden by default for large commits; several of the new files below are therefore shown without their paths.

.devcontainer/recipes/Dockerfile

Lines changed: 5 additions & 0 deletions

```diff
@@ -1,6 +1,11 @@
 # Uncomment to use the latest TE from the NGC registry for debugging changes with latest TE.
 # FROM gitlab-master.nvidia.com/dl/transformerengine/transformerengine:main-pytorch-py3-base
 FROM nvcr.io/nvidia/pytorch:25.11-py3
+
+# FIXME: Fix for "No such file or directory: /workspace/TransformerEngine"
+# Remove once bug has been addressed in the nvidia/pytorch container.
+RUN rm -f /usr/local/lib/python*/dist-packages/transformer_engine-*.dist-info/direct_url.json
+
 RUN --mount=type=cache,target=/root/.cache/pip \
     --mount=type=bind,source=requirements.txt,target=/workspace/requirements.txt \
     PIP_CONSTRAINT= pip install -r /workspace/requirements.txt
```
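Removing `direct_url.json` only strips the stale reference to the `/workspace/TransformerEngine` source checkout from the package metadata; the installed Transformer Engine itself is untouched. A quick way to confirm the metadata still resolves after the fix (a verification sketch, not part of the commit):

```python
# Sketch: confirm Transformer Engine metadata resolves without the
# direct_url.json that pointed at the deleted source tree.
from importlib.metadata import version

print(version("transformer_engine"))  # should print the container's TE version
```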

.github/workflows/unit-tests-recipes.yml

Lines changed: 9 additions & 2 deletions

```diff
@@ -158,7 +158,9 @@ jobs:
       - name: Install dependencies
         working-directory: ${{ matrix.recipe.dir }}
         run: |
-          if [ -f pyproject.toml ] || [ -f setup.py ]; then
+          if [ -f .ci_build.sh ]; then
+            bash .ci_build.sh
+          elif [ -f pyproject.toml ] || [ -f setup.py ]; then
             PIP_CONSTRAINT= pip install -e .
             echo "Installed ${{ matrix.recipe.dir }} as editable package"
           elif [ -f requirements.txt ]; then
@@ -171,7 +173,12 @@ jobs:

       - name: Run tests
         working-directory: ${{ matrix.recipe.dir }}
-        run: pytest -v .
+        run: |
+          if [ -f .ci_test_env.sh ]; then
+            source .ci_test_env.sh
+          fi
+          pytest -v .
+

   verify-recipe-tests:
     # This job checks the status of the unit-tests matrix and fails if any matrix job failed or was cancelled.
```
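Taken together, the two hooks let a recipe replace the default editable install and inject environment setup before its tests. For a recipe that ships both files, CI now effectively runs the following sequence (a sketch mirroring the workflow above; `RECIPE_DIR` is a stand-in for `matrix.recipe.dir`):

```bash
cd "${RECIPE_DIR}"        # e.g. bionemo-recipes/recipes/evo2_megatron
bash .ci_build.sh         # recipe-specific build, replacing the default `pip install -e .`
source .ci_test_env.sh    # test-time environment setup (e.g. activating the recipe venv)
pytest -v .
```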
Lines changed: 16 additions & 0 deletions (file path hidden; from its content this appears to be the recipe's `.ci_build.sh` invoked by the workflow change above)

```diff
@@ -0,0 +1,16 @@
+#!/bin/bash -x
+
+# FIXME: Fix for "No such file or directory: /workspace/TransformerEngine"
+# Remove once bug has been addressed in the nvidia/pytorch container.
+rm -f /usr/local/lib/python*/dist-packages/transformer_engine-*.dist-info/direct_url.json
+
+export UV_LINK_MODE=copy
+uv venv --system-site-packages
+
+# 2. Activate the environment
+source .venv/bin/activate
+
+# 3. Install dependencies and ensure that constraints are not violated
+pip freeze | grep transformer_engine > pip-constraints.txt
+uv pip install -r build_requirements.txt --no-build-isolation # some extra requirements are needed for building
+uv pip install -c pip-constraints.txt -e . --no-build-isolation
```
Lines changed: 2 additions & 0 deletions (file path hidden; from its content this appears to be the matching `.ci_test_env.sh`)

```diff
@@ -0,0 +1,2 @@
+
+source .venv/bin/activate
```
Lines changed: 15 additions & 0 deletions (file path hidden)

```diff
@@ -0,0 +1,15 @@
+Dockerfile
+README.md
+checkpoint_export
+outputs
+tmp*
+*.egg-info
+.ruff_cache
+__pycache__
+.pytest_cache
+.ruff.toml
+.dockerignore
+.venv
+.ruff_cache
+nemo_experiments
+*.sqsh
```
Lines changed: 12 additions & 0 deletions (file path hidden)

```diff
@@ -0,0 +1,12 @@
+pip-constraints.txt
+tmp*
+*.egg-info/
+.ruff_cache
+__pycache__
+.pytest_cache
+.ruff.toml
+.venv
+.ruff_cache/
+nemo_experiments
+wandb
+*.sqsh
```
Lines changed: 4 additions & 0 deletions (file path hidden)

```diff
@@ -0,0 +1,4 @@
+extend = "../.ruff.toml"
+[lint]
+per-file-ignores = { "tokenizer_auto" = ["ALL"] }
+ignore = ["E731", "RUF005","C901"]
```
Lines changed: 33 additions & 0 deletions (file path hidden)

```diff
@@ -0,0 +1,33 @@
+# syntax=docker/dockerfile:1.4
+FROM nvcr.io/nvidia/pytorch:25.11-py3
+
+# 1. Install uv (Method: COPY from official image is cleanest)
+#COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
+
+WORKDIR /workspace/bionemo
+COPY . .
+
+# 2. Fix for "No such file or directory: /workspace/TransformerEngine"
+# This removes the "direct_url" reference that confuses tools when TE was installed from source in the base image.
+RUN rm -f /usr/local/lib/python*/dist-packages/transformer_engine-*.dist-info/direct_url.json
+
+ENV UV_LINK_MODE=copy
+# Ensure we use the venv by default for all future commands
+ENV VIRTUAL_ENV=/workspace/.venv
+ENV PATH="$VIRTUAL_ENV/bin:$PATH"
+
+# 3. Create the venv with access to system packages (Torch, TE, etc.)
+# We create it one level up (/workspace/.venv) to keep it out of the source dir
+RUN uv venv --system-site-packages --seed $VIRTUAL_ENV
+
+# 4. Create a robust constraints file
+# It is safer to freeze ALL system packages to prevent uv from trying to upgrade them
+# accidentally, though your pyproject.toml overrides handle the critical ones.
+RUN pip freeze | grep transformer_engine > pip-constraints.txt
+
+# 5. Install package and dependencies
+RUN --mount=type=secret,id=netrc,target=/root/.netrc \
+    --mount=type=cache,target=/root/.cache/uv \
+    --mount=type=cache,target=/root/.cache/pip \
+    uv pip install -r build_requirements.txt --no-build-isolation && \
+    uv pip install -c pip-constraints.txt -e . --no-build-isolation
```
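Since the final `RUN` mounts a netrc secret, a BuildKit secret can be supplied at build time. Building and running then looks like the following (a sketch: the `--secret` flag is standard BuildKit usage, not spelled out in the commit, and the run flags are copied from the PR description above):

```bash
# Pass ~/.netrc as the "netrc" secret consumed by the final RUN.
# Secret mounts are optional by default, so this can be omitted if no
# authenticated package index is needed.
docker build --secret id=netrc,src=$HOME/.netrc -t evo2_megatron .
docker run --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all -it evo2_megatron
```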
Lines changed: 202 additions & 0 deletions (file path hidden)

A new file containing the full text of the Apache License, Version 2.0:

```
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!) The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
```
Lines changed: 53 additions & 0 deletions (file path hidden; from its content this appears to be the recipe README)

````diff
@@ -0,0 +1,53 @@
+# Evo2 Recipe
+
+This recipe is a work-in-progress rewrite of the NeMo2-based bionemo/evo2 package into a self-contained
+training repository built on Megatron Bridge.
+
+## Installation
+
+```
+# 1. Create venv (CRITICAL: include system packages so it sees the container's PyTorch)
+export UV_LINK_MODE=copy
+uv venv --system-site-packages --seed /workspace/.venv
+
+# 2. Activate the environment
+source /workspace/.venv/bin/activate
+pip freeze | grep transformer_engine > pip-constraints.txt
+uv pip install -r build_requirements.txt --no-build-isolation # some extra requirements are needed for building
+uv pip install -c pip-constraints.txt -e . --no-build-isolation
+```
+
+## Usage
+
+```
+# Run an example job.
+# 1. If on A6000s, you may need to disable P2P to avoid crashes:
+export NCCL_P2P_DISABLE=1
+# 2. Run the job:
+torchrun --nproc-per-node 8 --no-python \
+    train_evo2 \
+    --hf-tokenizer-model-path tokenizers/nucleotide_fast_tokenizer_256 \
+    --model-size striped_hyena_1b_nv_parallel --max-steps 12 --eval-interval 10 \
+    --eval-iters 3 --mock-data \
+    --micro-batch-size 32 --global-batch-size 256 --seq-length 1024 \
+    --tensor-model-parallel 1 \
+    --use-precision-aware-optimizer --dataset-seed 33 \
+    --seed 41 --ckpt-async-save --spike-no-more-embedding-init \
+    --no-weight-decay-embeddings --cross-entropy-loss-fusion \
+    --align-param-gather --overlap-param-gather --grad-reduce-in-fp32 \
+    --decay-steps 100 --warmup-steps 10 \
+    --mixed-precision-recipe bf16-mixed \
+    --no-fp32-residual-connection --activation-checkpoint-recompute-num-layers 1 \
+    --attention-dropout 0.001 --hidden-dropout 0.001 \
+    --eod-pad-in-loss-mask --enable-preemption \
+    --log-interval 5 --debug-ddp-parity-freq 10 \
+    --wandb-project evo2-recipes-verification-tmp \
+    --wandb-run-name tmp_workstation_run_mock_data \
+    --result-dir tmpbf16 --no-renormalize-loss
+```
+
+## Docker build
+
+```
+docker build -t evo2_megatron_recipe-$(git rev-parse --short HEAD) .
+```
````