🧪 Preflight Overview

Preflight is a diagnostic tool designed for large-scale cluster environments. Before starting distributed training, it benchmarks the compute performance of all GPUs, as well as intra-node and inter-node communication bandwidth and latency. Its primary goal is to help users identify underperforming nodes or network bottlenecks in the cluster, ensuring reliable and efficient training runs.

Run preflight

Torch / single node:

primus-cli preflight \
  --dump-path output/preflight \
  --report-file-name preflight_report

Slurm (multi-node example):

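# Set NUM_NODES on its own line: in `VAR=value cmd ${VAR}` the shell expands
# ${VAR} before the assignment takes effect, so -N would receive an empty value.
NUM_NODES=8
srun -N "${NUM_NODES}" --ntasks-per-node=1 --cpus-per-task=256 \
  primus-cli preflight --dump-path output/preflight --report-file-name preflight_report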

📂 Output Directory

After Preflight finishes, all test results and reports are written under the directory passed to --dump-path (output/preflight in the examples above).

The final reports are:

  • preflight_report.md – a Markdown version of the test report
  • preflight_report.pdf – a PDF version of the same report

These reports summarize GPU performance, intra-node and inter-node communication results, and help identify potential issues within the cluster.
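As a quick sanity check after a run, the presence of both report files can be verified with a small shell helper. This is a minimal sketch, not part of primus-cli: the function name check_preflight_reports is hypothetical, and it assumes the reports are named after the --report-file-name value with .md and .pdf extensions, as in the list above.

```shell
# Hypothetical helper (not provided by primus-cli).
# Usage: check_preflight_reports [DUMP_PATH] [REPORT_NAME]
# Returns 0 if both <REPORT_NAME>.md and <REPORT_NAME>.pdf exist and are
# non-empty under DUMP_PATH, and 1 otherwise.
check_preflight_reports() {
  local dump_path="${1:-output/preflight}"
  local report_name="${2:-preflight_report}"
  local missing=0
  local ext
  local report
  for ext in md pdf; do
    report="${dump_path}/${report_name}.${ext}"
    if [ -s "${report}" ]; then
      echo "found: ${report}"
    else
      echo "missing or empty: ${report}" >&2
      missing=1
    fi
  done
  return "${missing}"
}
```

Calling check_preflight_reports output/preflight preflight_report after the commands shown earlier should report both files. A non-zero exit status suggests the run did not complete or wrote its reports elsewhere.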


📁 Directory Structure

output/preflight
├── inter_node_comm
├── intra_node_comm
├── preflight_report.md
├── preflight_report.pdf
├── square_gemm_tflops
└── ...

Note: The exact contents vary depending on which tests are enabled at runtime.
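
Because the set of result directories depends on the enabled tests, it can be more robust to enumerate what a run actually produced than to hard-code names. The following is a minimal sketch under that assumption. The function name list_preflight_results is hypothetical and not part of primus-cli.

```shell
# Hypothetical helper (not provided by primus-cli).
# Usage: list_preflight_results [DUMP_PATH]
# Prints one line per top-level entry in DUMP_PATH, tagged "dir" or "file".
list_preflight_results() {
  local dump_path="${1:-output/preflight}"
  local entry
  if [ ! -d "${dump_path}" ]; then
    echo "no such directory: ${dump_path}" >&2
    return 1
  fi
  for entry in "${dump_path}"/*; do
    # If the directory is empty the glob stays unexpanded; skip that case.
    [ -e "${entry}" ] || continue
    if [ -d "${entry}" ]; then
      echo "dir  $(basename "${entry}")"
    else
      echo "file $(basename "${entry}")"
    fi
  done
}
```

For the layout shown above, the "dir" lines would correspond to per-test result directories such as inter_node_comm, and the "file" lines to the final reports.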