fix: restrict completion logging to rank 0#2383
Conversation
| time.sleep(2) | ||
|
|
||
| logger.info("Training completed") | ||
| logger.info("Training completed") |
There was a problem hiding this comment.
There are other loggings happening on every rank, if you search logger.info in train.py. Curious why do you care about this one most?
btw if you are using torchrun, this command helps filter printing on certain ranks https://github.com/pytorch/torchtitan/blob/main/run_train.sh#L29
There was a problem hiding this comment.
Thanks for the review :)
You make a great point about torchrun handling the filtering, but my main motivation here was code consistency and intent, similar to recent cleanup work I've been doing in torchtune.
Basically in lines 707-709, we explicitly check if rank == 0 to handle the final sleep/coordination. It appeared that the logger.info("Training completed") immediately following it was intended to be part of that same "clean exit" block but was essentially "orphaned" outside the indentation.
I recently addressed similar logging inconsistencies in torchtune (PR #2950) to ensure rank-zero logging is enforced at the source level rather than relying solely on the runner. I thought applying that same standard here would keep the two codebases aligned.
| @@ -707,8 +707,7 @@ def train(self): | |||
| if torch.distributed.get_rank() == 0: | |||
## Summary
This PR fixes a logging inconsistency where "Training completed" was
printed by all ranks, causing redundant console spam at the end of
distributed training runs.
## Changes
- Moved `logger.info("Training completed")` inside the existing `if
torch.distributed.get_rank() == 0:` block in `train.py` (Line 711).
## Impact
- Ensures a clean exit message only from the master process, matching
the behavior of the preceding "Sleeping..." log.
## Summary
This PR fixes a logging inconsistency where "Training completed" was
printed by all ranks, causing redundant console spam at the end of
distributed training runs.
## Changes
- Moved `logger.info("Training completed")` inside the existing `if
torch.distributed.get_rank() == 0:` block in `train.py` (Line 711).
## Impact
- Ensures a clean exit message only from the master process, matching
the behavior of the preceding "Sleeping..." log.
Summary
This PR fixes a logging inconsistency where "Training completed" was printed by all ranks, causing redundant console spam at the end of distributed training runs.
Changes
logger.info("Training completed")inside the existingif torch.distributed.get_rank() == 0:block intrain.py(Line 711).Impact