Commit 8d96d6c

janeyx99 and ebsmothers authored
Remove nonexistent flag for acc offloading in memory_optimizations.rst (#1772)
Co-authored-by: ebsmothers <[email protected]>
1 parent 27b0fcc commit 8d96d6c

File tree

1 file changed: +5, -7 lines changed


docs/source/tutorials/memory_optimizations.rst

Lines changed: 5 additions & 7 deletions
@@ -98,19 +98,17 @@ See `PyTorch autograd hook tutorial <https://pytorch.org/tutorials/intermediate/
 for more details about how this is implemented through saved_tensors_hooks.
 
 This setting is especially helpful for larger batch sizes, or longer context lengths when you're memory constrained.
-However, these savings in memory can come at the cost of training speed (i.e. tokens per-second), as it takes runtime
-and resources to move Tensors from GPU to CPU and back. The implementation in torchtune has the ``offload_with_streams``
-option to use multiple CUDA streams in order to overlap the extra communication with the computation to hide the extra
-runtime. As the communication workload is variable depending on the number and size of tensors being offloaded, it is
-common to not offload every single activation. In fact, once can use offloading in conjunction with activations
+While of course it takes runtime and resources to move Tensors from GPU to CPU and back, the implementation in
+torchtune uses multiple CUDA streams (when available) in order to overlap the extra communication with the computation
+to hide the extra runtime. As the communication workload is variable depending on the number and size of tensors being
+offloaded, it is common to not offload every single activation. In fact, one can use offloading in conjunction with activations
 checkpointing, where all activations will either be recomputed later in the backward or brought back from the CPU.
 
 *Sounds great! How do I use it?*
 
 To enable activation offloading, use the ``enable_activation_offloading`` config entry or flag
 in our lora finetuning single device recipe, e.g. ``enable_activation_offloading=True``. To allow
-usage of streams, make sure you are on a torch version later than PyTorch 2.5.0.dev20240907 and
-specify ``offload_with_streams=True``.
+usage of streams, make sure you are on a torch version later than PyTorch 2.5.0.dev20240907.
 
 .. _glossary_grad_accm:
 
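For context on the mechanism the updated text refers to: activation offloading is built on PyTorch's saved_tensors_hooks. The snippet below is a minimal, generic sketch of that idea in plain PyTorch with a placeholder toy model; it is not torchtune's implementation, which additionally uses pinned memory and, when available, multiple CUDA streams to overlap the copies with compute.

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

def pack_to_cpu(tensor):
    # Forward pass: stash the saved activation on CPU, remembering its device.
    return tensor.device, tensor.to("cpu", non_blocking=True)

def unpack_from_cpu(packed):
    # Backward pass: copy the activation back to the device it came from.
    original_device, cpu_tensor = packed
    return cpu_tensor.to(original_device, non_blocking=True)

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024)).to(device)
x = torch.randn(8, 1024, device=device, requires_grad=True)

with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
    loss = model(x).sum()   # tensors saved for backward are offloaded here
loss.backward()             # ...and brought back as the backward pass needs them

In torchtune itself, the supported switch is the recipe-level ``enable_activation_offloading=True`` entry described above; after this commit there is no separate ``offload_with_streams`` flag.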
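The updated paragraph also notes that offloading can be combined with activation checkpointing, so each activation is either recomputed during the backward pass or brought back from CPU. The sketch below illustrates that combination with generic PyTorch APIs (torch.utils.checkpoint and the built-in save_on_cpu hook pair); it is not torchtune's code, and the block and head modules are placeholders.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

device = "cuda" if torch.cuda.is_available() else "cpu"
block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).to(device)
head = nn.Linear(1024, 1024).to(device)
x = torch.randn(8, 1024, device=device, requires_grad=True)

# The checkpointed block saves no intermediate activations; they are recomputed
# during backward. The head's saved tensors are instead offloaded to CPU.
h = checkpoint(block, x, use_reentrant=False)
with torch.autograd.graph.save_on_cpu(pin_memory=torch.cuda.is_available()):
    loss = head(h).sum()
loss.backward()

Checkpointing the large block trades extra compute for memory, while offloading the remaining saved tensors trades transfer time for memory, mirroring the trade-off described in the edited docs.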