Improve Wandb experience #660
Conversation
…meta-pytorch#650) Co-authored-by: Kartikay Khandelwal <[email protected]>
ebsmothers left a comment
Thanks for working on this! Left a bunch of comments, please let me know if any of them are unclear. A couple other general things:
| )
| memory_stats = utils.memory_stats_log(device=self._device)
| log.info(f"Memory Stats:\n{memory_stats}")
| log.info(f"Model trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad)/1e6:,.2f}M")
Can we remove this line?
| )
| )
| memory_stats = utils.memory_stats_log(device=self._device)
| log.info(f"Memory Stats:\n{memory_stats}")
nit: change back to "Memory Stats after model init:" just to be explicit
| self._metric_logger.log_dict(memory_stats, step=self.total_training_steps)
| log.info(f"Memory Stats:\n{memory_stats}")
So right now we log to stdout only when WandB is not enabled, but log to both WandB and stdout when it is enabled? Don't all our metric loggers support log_dict? If so, can we just call log_dict only here?
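A minimal sketch of that suggestion, assuming every configured metric logger (including the stdout/disk ones) implements `log_dict`; this is a fragment of the recipe's train loop rather than standalone code:

```python
# Sketch of the suggestion above: log memory stats through the metric logger
# only and let each logger decide how to surface them. Assumes all loggers
# implement log_dict.
memory_stats = utils.memory_stats_log(device=self._device)
self._metric_logger.log_dict(memory_stats, step=self.total_training_steps)
```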
torchtune/utils/metric_logging.py
Outdated
| project: str = "torchtune",
| entity: Optional[str] = None,
| group: Optional[str] = None,
| log_strategy: Optional[str] = "main",
nit: could type as a literal, like we do e.g. here
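A sketch of the Literal typing the nit points at; the parameters follow the quoted diff, while the class name and elided body are assumptions for illustration:

```python
from typing import Literal, Optional


class WandBLogger:
    # Sketch only: constrain log_strategy to the accepted values with
    # typing.Literal instead of a free-form Optional[str].
    def __init__(
        self,
        project: str = "torchtune",
        entity: Optional[str] = None,
        group: Optional[str] = None,
        log_strategy: Literal["main", "node", "all"] = "main",
    ) -> None:
        ...
```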
torchtune/utils/metric_logging.py
Outdated
| log_strategy (Optional[str]): Strategy to use for logging. Options are "main", "node", "all".
|     Default: "main"
Would add more detail here explaining what each of these means
| # Training env
| device: mps
We have not thoroughly tested our recipes with mps yet... it makes sense that a lot of folks would run this config on their MacBook, but for now I would keep this as cuda or cpu (if it fits) and override it from the command line for your personal testing on Mac.
| # Memory management
| enable_activation_checkpointing: True
|
| # Reduced precision
nit: this is full and not reduced precision
| project: torchtune
| log_every_n_steps: 1
|
| # # Logging
should we remove this?
torchtune/utils/metric_logging.py
Outdated
| if (
|     (self.log_strategy == "main" and self.rank == 0)
|     or (self.log_strategy == "node" and self.local_rank == 0)
|     or self.log_strategy == "all"
I'm inclined to make this a quick private method (e.g., _should_log()) since the logic here is more complex, that way you can test this logic in an isolated way
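Something like the following sketch would match that suggestion (attribute names follow the quoted diff):

```python
# Sketch of the suggested refactor: isolate the rank logic in a small private
# method on the logger so it can be unit-tested on its own.
def _should_log(self) -> bool:
    return (
        (self.log_strategy == "main" and self.rank == 0)
        or (self.log_strategy == "node" and self.local_rank == 0)
        or self.log_strategy == "all"
    )
```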
RdoubleA left a comment
I think you'll need to run `pre-commit run --all-files` to fix some of the spacing and indent issues
Co-authored-by: ebsmothers <[email protected]>
Co-authored-by: Rafi Ayub <[email protected]>
| "Memory Stats after model init:", device=self._device | ||
| ) | ||
| ) | ||
| if self._device == torch.device("cuda"): |
This is for the CPU recipe tests?
yes, this actually won't throw; the log just doesn't print anything
| .. code-block:: bash
|
|     pip install wandb
Can you add a "tip" to run wandb login before running?
good catch
recipes/lora_finetune_distributed.py
Outdated
| if (
|     self.total_training_steps % self._log_peak_memory_every_n_steps == 0
|     and self._is_rank_zero
|     and self._device == torch.device("cuda")
Do we really need this one too? For distributed tests they should only run on GPU
mm yeah I can remove this check for distributed recipes
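Under the assumption that distributed recipe tests always run on GPU, the check could be simplified to something like the sketch below (the body shown is illustrative, mirroring the logging calls quoted earlier in this thread):

```python
# Sketch of the simplified condition discussed above; assumes distributed
# recipes only ever run on CUDA devices, so the explicit device check is
# dropped.
if (
    self.total_training_steps % self._log_peak_memory_every_n_steps == 0
    and self._is_rank_zero
):
    memory_stats = utils.memory_stats_log(device=self._device)
    self._metric_logger.log_dict(memory_stats, step=self.total_training_steps)
```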
ebsmothers left a comment
Couple questions on the device == cuda checks (especially for distributed recipes). Otherwise looks good though
The idea is to create a better W&B experience for the end user.
- The config is written to `checkpoint_dir` and the file is saved. This could also be added to other `metric_logger`s.
- We also keep track of the config in the Overview tab by updating the underlying `wandb.config`.
- We also add the run id to the config filename, so they don't automatically overwrite. The naming is `f"torchtune_config_{self._wandb.run.id}.yaml"` at this time.
- Change the `memory_stats_log` function to return a dict so we can also log that to W&B. Also, only log memory stats when training on GPU (some recipe tests use CPU).
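A rough sketch of the config handling described above, inside the W&B logger. The method name, the `config.checkpointer.checkpoint_dir` field, the `self._wandb` attribute, and the OmegaConf usage are assumptions for illustration, not the exact code from this PR:

```python
import os

from omegaconf import DictConfig, OmegaConf


# Rough sketch only; names and config fields are assumed, not taken from the PR.
def log_config(self, config: DictConfig) -> None:
    if self._wandb.run:
        # Mirror the resolved config in the run's Overview tab.
        self._wandb.config.update(OmegaConf.to_container(config, resolve=True))
        # Write the config next to the checkpoints, tagging the filename with
        # the run id so successive runs don't overwrite each other.
        output_path = os.path.join(
            config.checkpointer.checkpoint_dir,
            f"torchtune_config_{self._wandb.run.id}.yaml",
        )
        OmegaConf.save(config, output_path)
        # Also upload the saved file to the W&B run.
        self._wandb.save(output_path)
```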