@@ -98,19 +98,17 @@ See `PyTorch autograd hook tutorial <https://pytorch.org/tutorials/intermediate/
 for more details about how this is implemented through saved_tensors_hooks.
 
 This setting is especially helpful for larger batch sizes, or longer context lengths when you're memory constrained.
-However, these savings in memory can come at the cost of training speed (i.e. tokens per-second), as it takes runtime
-and resources to move Tensors from GPU to CPU and back. The implementation in torchtune has the ``offload_with_streams``
-option to use multiple CUDA streams in order to overlap the extra communication with the computation to hide the extra
-runtime. As the communication workload is variable depending on the number and size of tensors being offloaded, it is
-common to not offload every single activation. In fact, once can use offloading in conjunction with activations
+While of course it takes runtime and resources to move Tensors from GPU to CPU and back, the implementation in
+torchtune uses multiple CUDA streams (when available) in order to overlap the extra communication with the computation
+to hide the extra runtime. As the communication workload is variable depending on the number and size of tensors being
+offloaded, it is common to not offload every single activation. In fact, one can use offloading in conjunction with activations
 checkpointing, where all activations will either be recomputed later in the backward or brought back from the CPU.
 
 *Sounds great! How do I use it?*
 
 To enable activation offloading, use the ``enable_activation_offloading`` config entry or flag
 in our lora finetuning single device recipe, e.g. ``enable_activation_offloading=True``. To allow
-usage of streams, make sure you are on a torch version later than PyTorch 2.5.0.dev20240907 and
-specify ``offload_with_streams=True``.
+usage of streams, make sure you are on a torch version later than PyTorch 2.5.0.dev20240907.
 
 .. _glossary_grad_accm:
 
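For intuition, here is a minimal sketch of the pack/unpack mechanism the updated text describes, built on ``torch.autograd.graph.saved_tensors_hooks``. It is illustrative only, not torchtune's actual implementation: the helper names ``pack_to_cpu`` and ``unpack_from_cpu`` are made up here, and the sketch skips the pinned-memory staging and side CUDA stream that would be needed to actually overlap the copies with compute.

.. code-block:: python

   import torch
   import torch.nn as nn
   from torch.autograd.graph import saved_tensors_hooks

   def pack_to_cpu(tensor):
       # Runs during forward each time autograd saves an activation for backward:
       # copy it to CPU and remember its original device. A real implementation
       # would stage the copy through pinned memory on a side CUDA stream so it
       # overlaps with compute instead of blocking it.
       return tensor.device, tensor.cpu()

   def unpack_from_cpu(packed):
       # Runs during backward when the saved activation is needed again:
       # move it back to the device it came from.
       device, cpu_tensor = packed
       return cpu_tensor.to(device)

   # Requires a CUDA device, since the point is to free GPU memory.
   model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024)).cuda()
   x = torch.randn(8, 1024, device="cuda")

   # The hooks only need to wrap the forward pass; backward automatically calls
   # unpack_from_cpu for every tensor that was packed.
   with saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
       loss = model(x).sum()
   loss.backward()

PyTorch also provides ``torch.autograd.graph.save_on_cpu(pin_memory=True)``, which packages this same pack/unpack pattern as a ready-made context manager.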