
fix: restrict completion logging to rank 0 #2383

Merged: tianyu-l merged 1 commit into pytorch:main from fatih-uzlmz:fix/training-completion-logging on Feb 16, 2026

@fatih-uzlmz (Contributor)

Summary

This PR fixes a logging inconsistency where "Training completed" was printed by all ranks, causing redundant console spam at the end of distributed training runs.

Changes

  • Moved `logger.info("Training completed")` inside the existing `if torch.distributed.get_rank() == 0:` block in `train.py` (Line 711).

Impact

  • Ensures a clean exit message only from the master process, matching the behavior of the preceding "Sleeping..." log.
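A minimal before/after sketch of the change, for illustration only. It assumes the end of `train()` reduces to a rank-0 `time.sleep(2)` followed by the completion log. The function names `finish_before`/`finish_after`, the `ListHandler` and `count_messages` helpers, and passing `rank` as an argument (instead of calling `torch.distributed.get_rank()`) are hypothetical simplifications so the snippet runs without a process group.

```python
import logging
import time

logger = logging.getLogger("completion_sketch")


def finish_before(rank: int, sleep_s: float = 2.0) -> None:
    # Hypothetical pre-fix shape: only the sleep is gated on rank 0
    if rank == 0:
        time.sleep(sleep_s)
    # Sits outside the if block, so every rank logs it
    logger.info("Training completed")


def finish_after(rank: int, sleep_s: float = 2.0) -> None:
    # Hypothetical post-fix shape: the message moves into the rank-0 branch
    if rank == 0:
        time.sleep(sleep_s)
        logger.info("Training completed")


class ListHandler(logging.Handler):
    """Collects messages in memory so the ranks' output can be counted."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def count_messages(finish, world_size: int = 4) -> int:
    # Run the end-of-training hook once per simulated rank and count log lines
    handler = ListHandler()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    try:
        for rank in range(world_size):
            finish(rank, sleep_s=0)
    finally:
        logger.removeHandler(handler)
    return len(handler.messages)


print(count_messages(finish_before), count_messages(finish_after))  # 4 1
```

Simulating four ranks in one process, the pre-fix shape emits the message once per rank and the post-fix shape emits it once in total.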

@meta-cla bot added the CLA Signed label on Feb 15, 2026

@tianyu-l (Contributor) left a comment

Comment thread torchtitan/train.py
time.sleep(2)

logger.info("Training completed")
logger.info("Training completed")

Contributor

There are other log calls that run on every rank if you search for logger.info in train.py. Curious why you care about this one most?

btw, if you are using torchrun, this command helps filter printing to certain ranks: https://github.com/pytorch/torchtitan/blob/main/run_train.sh#L29
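The runner-side filter and the in-code guard can differ on multi-node jobs. Assuming the linked run_train.sh line passes torchrun's `--local-ranks-filter` option with a value of 0, that option (as its name suggests) filters console output by *local* rank, so local rank 0 on every node still prints, whereas `torch.distributed.get_rank() == 0` selects a single process in the whole job. The sketch below lists which global ranks print under each policy. The helper name `console_printers` is hypothetical, and it assumes the common static layout `global_rank = node_rank * nproc_per_node + local_rank`.

```python
def console_printers(nnodes: int, nproc_per_node: int) -> tuple:
    """Return (local-rank-0 printers, global-rank-0 printers) as global ranks.

    Assumes global_rank = node_rank * nproc_per_node + local_rank.
    """
    local_filter = []  # runner-side filter on local rank 0: one printer per node
    global_guard = []  # in-code `get_rank() == 0` check: one printer per job
    for node_rank in range(nnodes):
        for local_rank in range(nproc_per_node):
            global_rank = node_rank * nproc_per_node + local_rank
            if local_rank == 0:
                local_filter.append(global_rank)
            if global_rank == 0:
                global_guard.append(global_rank)
    return local_filter, global_guard


print(console_printers(nnodes=2, nproc_per_node=8))  # ([0, 8], [0])
```

On a single node the two policies coincide; they only diverge once there is more than one node.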

Contributor Author

Thanks for the review :)

You make a great point about torchrun handling the filtering, but my main motivation here was code consistency and intent, similar to recent cleanup work I've been doing in torchtune.

Basically, in lines 707-709 we explicitly check if rank == 0 to handle the final sleep/coordination. The logger.info("Training completed") call immediately after it looked like it was meant to be part of that same "clean exit" block, but it had ended up "orphaned" outside the indentation.

I recently addressed similar logging inconsistencies in torchtune (PR #2950) to ensure rank-zero logging is enforced at the source level rather than relying solely on the runner. I thought applying that same standard here would keep the two codebases aligned.
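Enforcing rank-zero logging "at the source level" across many call sites can be sketched with a standard `logging.Filter`. This is a hypothetical illustration, not the torchtune change from PR #2950 and not a torchtitan API: `RankZeroFilter`, `ListHandler`, and `simulate` are invented names, and reading the rank from the `RANK` environment variable (which torchrun sets for each worker) stands in for `torch.distributed.get_rank()` so the snippet runs without a process group.

```python
import logging
import os
from typing import List, Optional


class RankZeroFilter(logging.Filter):
    """Hypothetical filter: pass records only when this process is global rank 0."""

    def __init__(self, rank: Optional[int] = None) -> None:
        super().__init__()
        # torchrun exports RANK per worker; default to 0 for single-process runs
        self.rank = int(os.environ.get("RANK", "0")) if rank is None else rank

    def filter(self, record: logging.LogRecord) -> bool:
        return self.rank == 0


class ListHandler(logging.Handler):
    """Collects messages in memory so the filtered output is easy to inspect."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def simulate(world_size: int) -> List[str]:
    # One logger per simulated rank, all writing into the same in-memory sink
    sink = ListHandler()
    for rank in range(world_size):
        log = logging.getLogger(f"rank_zero_sketch.rank{rank}")
        log.setLevel(logging.INFO)
        log.propagate = False
        rank_filter = RankZeroFilter(rank)
        log.addFilter(rank_filter)
        log.addHandler(sink)
        # The call site stays unguarded; the filter decides what is emitted
        log.info("Training completed on rank %d", rank)
        log.removeHandler(sink)
        log.removeFilter(rank_filter)
    return sink.messages


print(simulate(4))  # ['Training completed on rank 0']
```

In a real run such a filter would more likely be attached to the handler than to a logger: logger-level filters only see records logged directly on that logger, while handler-level filters also see records propagated from child loggers.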

Comment thread torchtitan/train.py
@@ -707,8 +707,7 @@ def train(self):
if torch.distributed.get_rank() == 0:

Contributor

@fegin should we only sleep on rank 0?

@tianyu-l merged commit 10d8a30 into pytorch:main on Feb 16, 2026
11 checks passed
@fatih-uzlmz deleted the fix/training-completion-logging branch on February 16, 2026 04:48
TXacs pushed a commit to McmillanTAC/torchtitan that referenced this pull request Apr 13, 2026
ACharacterInASimulation pushed a commit to ACharacterInASimulation/torchtitan that referenced this pull request Apr 21, 2026
