[usability] Accelerate Support #936


Merged · 63 commits · Jul 11, 2025
Commits
beba6ef
[usability] accelerate support initial commit
wheresmyhair Feb 22, 2025
ef083b6
[usability] accelerate support - scripts update
wheresmyhair Feb 27, 2025
f7ab42b
change ruff target ver
wheresmyhair Mar 4, 2025
19586d1
separate lisa and custom optims
wheresmyhair Mar 4, 2025
e12e25a
add decorator for base methods
wheresmyhair Mar 4, 2025
edc9236
encoder_decoder model temporarily disabled
wheresmyhair Mar 4, 2025
05e26a9
future support for trainer
wheresmyhair Mar 4, 2025
593f44b
fix fsdp cfg warp policy
wheresmyhair Mar 4, 2025
16eefe0
add note for single gpu
wheresmyhair Mar 4, 2025
29f10d8
add accelerate singlegpu config
wheresmyhair Mar 4, 2025
8194d36
fix qlora script
wheresmyhair Mar 6, 2025
0faa453
[usability] simplify shell scripts
wheresmyhair Mar 6, 2025
43ff307
[usability] add hint for flash_attn
wheresmyhair Mar 6, 2025
0ba9b60
[fix] fix create_customized_optimizer arg
wheresmyhair Mar 6, 2025
16d7af6
[usability] simplify dpov2 script
wheresmyhair Mar 6, 2025
c78ac26
change fsdp default wrap policy to transformer based
wheresmyhair Mar 11, 2025
e7712f1
streamlining
wheresmyhair Mar 11, 2025
71c6664
[fix] qlora+fsdp
wheresmyhair Mar 11, 2025
cdc6702
[usability] scripts default log dir update
wheresmyhair Mar 11, 2025
9d9ea68
temporarily remove raft from autopipeline test
wheresmyhair Mar 12, 2025
21e7156
fix broken test
wheresmyhair Mar 12, 2025
b65b80e
[fix] add condition for tests
wheresmyhair Mar 12, 2025
496eca5
[dev] add test for finetuner
wheresmyhair Mar 12, 2025
54e43fe
[usability] `lora_rank` args are now consistent with peft package `r`
wheresmyhair Mar 14, 2025
fb0e7de
[dev] add tests for finetuner
wheresmyhair Mar 14, 2025
afccd0b
[usability] accelerate support initial commit
wheresmyhair Feb 22, 2025
92e9182
[usability] accelerate support - scripts update
wheresmyhair Feb 27, 2025
ea6dda7
change ruff target ver
wheresmyhair Mar 4, 2025
e11b429
separate lisa and custom optims
wheresmyhair Mar 4, 2025
f68e6e5
add decorator for base methods
wheresmyhair Mar 4, 2025
d28af83
encoder_decoder model temporarily disabled
wheresmyhair Mar 4, 2025
0bff86a
future support for trainer
wheresmyhair Mar 4, 2025
4b6b9ef
fix fsdp cfg warp policy
wheresmyhair Mar 4, 2025
03f8999
add note for single gpu
wheresmyhair Mar 4, 2025
e71c93c
add accelerate singlegpu config
wheresmyhair Mar 4, 2025
c824761
fix qlora script
wheresmyhair Mar 6, 2025
4293034
[usability] simplify shell scripts
wheresmyhair Mar 6, 2025
02dc4ef
[usability] add hint for flash_attn
wheresmyhair Mar 6, 2025
7a55591
[fix] fix create_customized_optimizer arg
wheresmyhair Mar 6, 2025
4331f46
[usability] simplify dpov2 script
wheresmyhair Mar 6, 2025
d560851
change fsdp default wrap policy to transformer based
wheresmyhair Mar 11, 2025
13c013f
streamlining
wheresmyhair Mar 11, 2025
83f7ad8
[fix] qlora+fsdp
wheresmyhair Mar 11, 2025
79b32e6
[usability] scripts default log dir update
wheresmyhair Mar 11, 2025
8d7b4ef
temporarily remove raft from autopipeline test
wheresmyhair Mar 12, 2025
c8c71f2
fix broken test
wheresmyhair Mar 12, 2025
f49672e
[fix] add condition for tests
wheresmyhair Mar 12, 2025
6f4740f
[dev] add test for finetuner
wheresmyhair Mar 12, 2025
82e0467
[usability] `lora_rank` args are now consistent with peft package `r`
wheresmyhair Mar 14, 2025
a80d291
[dev] add tests for finetuner
wheresmyhair Mar 14, 2025
324ca09
Merge branch 'lmflow-nightly' of https://github.com/OptimalScale/LMFl…
wheresmyhair Mar 18, 2025
d7b2f55
[doc] add diagram for finetuner
wheresmyhair Mar 28, 2025
ce21c23
Merge branch 'main' into lmflow-nightly
wheresmyhair Apr 16, 2025
19c9ef5
qwen3 gemma3 support and estimated gpu mem
wheresmyhair May 13, 2025
402ea07
support load from lora checkpoint
wheresmyhair May 13, 2025
956c0fd
[feature] support count number of tokens
wheresmyhair May 13, 2025
c3f1524
[doc] readme update
wheresmyhair May 15, 2025
63bc859
doc files and assets restructure
wheresmyhair Jul 8, 2025
a6b97aa
[ci] lint
wheresmyhair Jul 8, 2025
3c365ba
[doc] ref old versions
wheresmyhair Jul 9, 2025
acfc334
[doc] setup update
wheresmyhair Jul 9, 2025
6f573fc
[doc] highlight update info
wheresmyhair Jul 9, 2025
9dd1bc7
[doc] news update
wheresmyhair Jul 9, 2025
2 changes: 1 addition & 1 deletion .gitattributes
@@ -3,4 +3,4 @@
*.ipynb linguist-detectable=false
*RAFT.pdf filter=lfs diff=lfs merge=lfs -text
*.gif filter=lfs diff=lfs merge=lfs -text
-assets/*.gif filter=lfs diff=lfs merge=lfs -text
+docs/figs/*.gif filter=lfs diff=lfs merge=lfs -text
6 changes: 2 additions & 4 deletions .gitignore
@@ -18,12 +18,13 @@ log/
regression_test/*/new_output_models
regression_test/*/new_log
output_dir/
+tests_out

# data files
data/

# output models
-output_models/
+output_models
adapter_model/

# Distribution / packaging
@@ -168,9 +169,6 @@ debug.env
#ctags
tags

-# pre-commit
-.pre-commit*

# .lock
*.lock

8 changes: 8 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,8 @@
repos:
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: "v0.11.4"
hooks:
- id: ruff
args: ["--fix", "--show-fixes", "--output-format=full"]
exclude: ^.*\.(ipynb)$
- id: ruff-format
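
For contributors who haven't used pre-commit before, a typical local workflow against this config might look like the following (a sketch: the PR adds only the config file, so installing `pre-commit` itself via pip is an assumption):

```bash
# One-time setup: install pre-commit and register the git hook
pip install pre-commit
pre-commit install

# Run the ruff lint and format hooks over the whole tree, as CI would
pre-commit run --all-files
```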
59 changes: 34 additions & 25 deletions README.md
@@ -1,5 +1,5 @@
<p align="center" width="50%">
<img src="assets/logo.png" alt="LMFlow" style="width: 50%; min-width: 200px; display: block; margin: auto; background-color: transparent;">
<img src="docs/assets/logo.png" alt="LMFlow" style="width: 50%; min-width: 200px; display: block; margin: auto; background-color: transparent;">
</p>

# LMFlow
@@ -26,25 +26,26 @@
An extensible, convenient, and efficient toolbox for finetuning large machine learning models, designed to be user-friendly, speedy and reliable, and accessible to the entire community.

<p align="center" width="100%">
<img src="assets/features.png" alt="LMFlow-features" style="width: 100%; min-width: 300px; display: block; margin: auto;">
<img src="docs/assets/features.png" alt="LMFlow-features" style="width: 100%; min-width: 300px; display: block; margin: auto;">
</p>

## Latest News
> [!IMPORTANT]
> * :exclamation: [2025-07-09] We have a major update to LMFlow with full Accelerate support and extensive streamlining. If you're looking for the previous version, please use `git checkout v0.0.10`, or check out the [v0.0.10 branch](https://github.com/OptimalScale/LMFlow/tree/v0.0.10). View all releases [here](https://github.com/OptimalScale/LMFlow/tags).

* [2025-03-18] With full support for Accelerate and lots of streamlining, LMFlow-nightly is now available! Feel free to try out the latest features and improvements by `git checkout lmflow-nightly`.
* [2024-12-02] Support [Hymba](https://github.com/NVlabs/hymba), a new family of small language models featuring a hybrid-head parallel architecture. Check out [Post-training Hymba](https://github.com/OptimalScale/LMFlow/tree/main/experimental/Hymba) for more details.
* [2024-07-01] 🏆 LMFlow receives the [**Best Demo Paper Award**](https://docs.google.com/presentation/d/1TVDooAZqkNObz5ysVhDFtqnnVHR-u8wqYvgix-gzPMs/edit#slide=id.g2e55907bbcc_0_70) at **NAACL 2024**! 🎉
* [2024-06-30] Expanding Optimization Options! We now support custom optimizer training with a variety of optimizers. Dive into the details and try out the new features with our updated script at [custom_optimizers](https://github.com/OptimalScale/LMFlow/blob/main/scripts/run_finetune_with_custom_optim.sh).
* [2024-04-25] :rocket: Support conversation template! We've preset the latest [Llama-3](https://huggingface.co/meta-llama/Meta-Llama-3-70B) and [Phi-3](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct) conversation templates as well as some frequently used templates such as `chatml` (see all templates [here](https://optimalscale.github.io/LMFlow/examples/DATASETS.html#conversation-template)), and we are working on adding more preset templates. Add the corresponding `--conversation_template` to the shell script and you are all set! :rocket:
* [2024-03-27] Support [LISA](https://arxiv.org/abs/2403.17919), enabling 7B training in 24G memory without offloading!
* [2023-09-11] Support [speculative decoding](https://arxiv.org/abs/2211.17192). Check out [speculative_decoding](https://github.com/OptimalScale/LMFlow/blob/main/scripts/speculative_decoding/README.md) for the usage and acceleration details.
* [2023-08-14] Support long context inference with position interpolation (Linear & NTK scaling) for LLaMA models. Check out [postion_interpolation](https://github.com/OptimalScale/LMFlow/blob/main/readme/Position_Interpolation.md) for more details.

<details> <summary>More news...</summary>

* [2024-03-27] Support [LISA](https://arxiv.org/abs/2403.17919), enabling 7B training in 24G memory without offloading!
* [2023-09-11] Support [speculative decoding](https://arxiv.org/abs/2211.17192). Check out [speculative_decoding](https://github.com/OptimalScale/LMFlow/blob/main/scripts/speculative_decoding/README.md) for the usage and acceleration details.
* [2023-08-14] Support long context inference with position interpolation (Linear & NTK scaling) for LLaMA models. Check out [postion_interpolation](https://github.com/OptimalScale/LMFlow/blob/main/readme/Position_Interpolation.md) for more details.
* [2023-08-07] Support [Flash Attention-2](https://crfm.stanford.edu/2023/07/17/flash2.html). Check out [flash_attention](https://github.com/OptimalScale/LMFlow/blob/main/readme/flash_attn2.md) for more details.
* [2023-08-02] Support [Llama2](https://ai.meta.com/llama/), [ChatGLM2](https://huggingface.co/THUDM/chatglm2-6b), and [Baichuan](https://huggingface.co/baichuan-inc/Baichuan-7B) models.
-* [2023-07-23] [LMFlow multimodal chatbot](https://github.com/OptimalScale/LMFlow/blob/main/scripts/run_vis_chatbot_gradio_minigpt4.sh) is now available! It supports multimodal inputs of images and text. An [Online Demo](http://multimodal.lmflow.online) is also provided (we host the service on a single GPU, so you may see "queuing" or "application busy" when multiple users access it at the same time; please wait and try again later).![image](https://github.com/OptimalScale/LMFlow/blob/rpan-vision-encoder/assets/multimodal-chatbot-demo.gif)
+* [2023-07-23] [LMFlow multimodal chatbot](https://github.com/OptimalScale/LMFlow/blob/main/scripts/run_vis_chatbot_gradio_minigpt4.sh) is now available! It supports multimodal inputs of images and text. An [Online Demo](http://multimodal.lmflow.online) is also provided (we host the service on a single GPU, so you may see "queuing" or "application busy" when multiple users access it at the same time; please wait and try again later).![image](https://github.com/OptimalScale/LMFlow/blob/rpan-vision-encoder/docs/assets/multimodal-chatbot-demo.gif)
* [2023-06-22] [LMFlow paper](https://arxiv.org/abs/2306.12420) is out! Check out our implementation details at https://arxiv.org/abs/2306.12420
* [2023-06-16] Our finetuned Robin-33B-V2 scored an impressive 64.1 on the Hugging Face LLM leaderboard in our offline evaluation, outperforming major open-source LLMs! All checkpoints (7B, 13B, 33B, and 65B) are [released](https://huggingface.co/OptimalScale)! Check out the performance [here](https://medium.com/@hkust.ml/robin-v2-launches-achieves-unparalleled-performance-on-openllm-4f6886e822c1).
* [2023-06-07] LMFlow is now officially available on PyPI! Install it with `pip install lmflow-finetune`!
@@ -69,11 +70,11 @@ An extensible, convenient, and efficient toolbox for finetuning large machine le
- [LMFlow](#lmflow)
- [Latest News](#latest-news)
- [Table of Contents](#table-of-contents)
- [Supported Models](#supported-models)
- [Quick Start](#quick-start)
- [Setup](#setup)
- [Prepare Dataset](#prepare-dataset)
- [Finetuning](#finetuning)
- [Estimated Hardware Requirement](#estimated-hardware-requirement)
- [Full Finetuning](#full-finetuning)
- [LISA](#lisa)
- [LoRA](#lora)
@@ -85,21 +86,6 @@ An extensible, convenient, and efficient toolbox for finetuning large machine le
- [License](#license)
- [Citation](#citation)

## Supported Models

See all conversation template details [here](https://optimalscale.github.io/LMFlow/examples/supported_conversation_template.html).

| Model | Conversation Template |
| :---: | :-------------------: |
| DeepSeek | `deepseek` <br> `deepseek_v2` <br> `deepseek_r1` <br> `deepseek_r1_distill` <br> `deepseek_v3` |
| Gemma | `gemma` |
| Hymba | `hymba` |
| InternLM2 | `internlm2` |
| LLaMA | `llama2` <br> `llama3` <br> `llama3_for_tool`|
| Phi | `phi3` |
| Qwen | `qwen2` <br> `qwen2_for_tool` <br> `qwen2_5` <br> `qwen2_5_1m` <br> `qwen2_5_math` <br> `qwen_qwq` |
| Yi | `yi` <br> `yi1_5` |
| Zephyr | `zephyr` |

## Quick Start

@@ -108,15 +94,28 @@ See all conversation template details [here](https://optimalscale.github.io/LMFl
Our package has been tested on Linux OS (Ubuntu 20.04). Other OS platforms (macOS, Windows) are not fully tested, and you may encounter unexpected errors. If you are using LMFlow for the first time, we recommend trying it on a Linux machine or Google Colab.

```bash
-git clone -b v0.0.9 https://github.com/OptimalScale/LMFlow.git
+git clone -b v1.0.0 https://github.com/OptimalScale/LMFlow.git
cd LMFlow
conda create -n lmflow python=3.9 -y
conda activate lmflow
conda install mpi4py
pip install -e .
```
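
A quick post-install sanity check can catch environment problems early (a sketch; it assumes the editable install above succeeded and that `accelerate` was pulled in as a dependency):

```bash
# Verify the package imports and the Accelerate CLI is on PATH
python -c "import lmflow; print('lmflow import OK')"
accelerate --version
```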

<details><summary> Looking for a previous version? </summary>

```bash
git clone -b v0.0.10 https://github.com/OptimalScale/LMFlow.git
cd LMFlow
conda create -n lmflow python=3.9 -y
conda activate lmflow
conda install mpi4py
pip install -e .
```

</details>

<details><summary> For CUDA versions 10.3-11.7 </summary>

```bash
git clone -b v0.0.5 https://github.com/OptimalScale/LMFlow.git
@@ -162,6 +161,16 @@ Please refer to our [doc](https://optimalscale.github.io/LMFlow/examples/DATASET

### Finetuning

#### Estimated Hardware Requirement

| Method              | 0.5B  | 3B    | 7B    | 14B   | 30B   | 70B    | `x`B    |
| ------------------- | ----- | ----- | ----- | ----- | ----- | ------ | ------- |
| Full `bf16`/`fp16`  | 9GB   | 55GB  | 120GB | 240GB | 600GB | 1200GB | `18x`GB |
| LoRA                | 1GB   | 6GB   | 16GB  | 32GB  | 64GB  | 160GB  | `2x`GB  |
| QLoRA `quant_bit=8` | 0.7GB | 3GB   | 10GB  | 20GB  | 40GB  | 80GB   | `x`GB   |
| QLoRA `quant_bit=4` | 0.4GB | 1.5GB | 6GB   | 12GB  | 24GB  | 48GB   | `x/2`GB |
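
The `x`B column is a rough rule of thumb; as a sanity check it can be evaluated directly (a sketch using the multipliers read off the table above, which are estimates rather than measurements):

```bash
# Back-of-envelope GPU memory estimates for a model with MODEL_B billion parameters
MODEL_B=7
echo "Full bf16/fp16: ~$((18 * MODEL_B)) GB"  # ~126GB; the table lists 120GB for 7B
echo "LoRA:           ~$((2 * MODEL_B)) GB"   # ~14GB; the table lists 16GB for 7B
```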


#### Full Finetuning

Full training updates all the parameters to finetune a language model.
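
The diff is truncated at this point, so purely as a hypothetical illustration: `scripts/run_finetune.sh` exists in the repository, but the exact flags below are assumptions and should be checked against the script itself.

```bash
# Hypothetical full-finetune launch; confirm the flags against scripts/run_finetune.sh
./scripts/run_finetune.sh \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_path data/alpaca/train_conversation \
  --output_model_path output_models/finetuned_model
```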
29 changes: 29 additions & 0 deletions configs/accelerate_fsdp_config.yaml
@@ -0,0 +1,29 @@
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP

fsdp_config:
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_min_num_params: 1000000
fsdp_backward_prefetch: BACKWARD_PRE
fsdp_forward_prefetch: false
fsdp_cpu_ram_efficient_loading: true
fsdp_offload_params: false
fsdp_sharding_strategy: FULL_SHARD
fsdp_state_dict_type: FULL_STATE_DICT
fsdp_sync_module_states: true
fsdp_use_orig_params: true

downcast_bf16: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8 # NOTE: distributed_type should be `NO` if you're training on a single GPU
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_port: 1204
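
With this file saved as `configs/accelerate_fsdp_config.yaml`, training would typically be launched through the Accelerate CLI (a sketch; `examples/finetune.py` is an assumed entry point, substitute your actual training script):

```bash
# Launch an 8-process FSDP run using the config above.
# NOTE: examples/finetune.py is an assumed entry point, not confirmed by this PR.
accelerate launch --config_file configs/accelerate_fsdp_config.yaml \
  examples/finetune.py \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_path data/alpaca/train_conversation
```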
29 changes: 29 additions & 0 deletions configs/accelerate_singlegpu_config.yaml
@@ -0,0 +1,29 @@
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'

fsdp_config:
fsdp_auto_wrap_policy: SIZE_BASED_WRAP
fsdp_min_num_params: 1000000
fsdp_backward_prefetch: BACKWARD_PRE
fsdp_forward_prefetch: false
fsdp_cpu_ram_efficient_loading: true
fsdp_offload_params: false
fsdp_sharding_strategy: 'NO_SHARD'
fsdp_state_dict_type: FULL_STATE_DICT
fsdp_sync_module_states: true
fsdp_use_orig_params: true

downcast_bf16: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_port: 1204
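
The single-GPU variant is launched the same way; only the config file changes (same caveat: the entry point is an assumption):

```bash
# Single-GPU run: distributed_type 'NO' with num_processes 1 disables sharding
accelerate launch --config_file configs/accelerate_singlegpu_config.yaml \
  examples/finetune.py --model_name_or_path Qwen/Qwen2.5-0.5B
```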
4 files renamed without changes.
1 change: 0 additions & 1 deletion configs/iterative_dpo.yaml
@@ -17,7 +17,6 @@ preprocessing_num_workers: 16
output_dir: ./output_models/iterative_dpo
run_name: iterative_dpo
random_seed: 42
-use_accelerator: True
enable_distributed_inference: True
distributed_inference_num_instances: 8
initial_iter_idx: 0 # 0 refers to the first dataset in dataset_path_list