Preflight is a diagnostic tool designed for large-scale cluster environments. Before starting distributed training, it benchmarks the compute performance of all GPUs, as well as intra-node and inter-node communication bandwidth and latency. Its primary goal is to help users identify underperforming nodes or network bottlenecks in the cluster, ensuring reliable and efficient training runs.
Torch / single node:
primus-cli preflight \
--dump-path output/preflight \
--report-file-name preflight_report

Slurm (multi-node example):
export NUM_NODES=8
srun -N "${NUM_NODES}" --ntasks-per-node=1 --cpus-per-task=256 \
primus-cli preflight --dump-path output/preflight --report-file-name preflight_report

After running Preflight, all test results and reports are generated under the output/preflight directory.
The final reports are:
- preflight_report.md: a Markdown version of the test report
- preflight_report.pdf: a PDF version of the same report
These reports summarize GPU performance, intra-node and inter-node communication results, and help identify potential issues within the cluster.
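Before reading the results, it can help to confirm that both reports were actually written. The following is a minimal sketch, not part of primus-cli: `check_preflight_reports` is a hypothetical helper, and it assumes the report files are named after the `--report-file-name` value with `.md` and `.pdf` extensions, as in the example above.

```shell
# Hypothetical helper (not provided by primus-cli): verify that both final
# reports exist and are non-empty in a Preflight dump directory.
check_preflight_reports() {
    local dump_dir="${1:-output/preflight}"   # assumed to match --dump-path
    local name="${2:-preflight_report}"       # assumed to match --report-file-name
    local status=0
    local ext report
    for ext in md pdf; do
        report="${dump_dir}/${name}.${ext}"
        if [ -s "${report}" ]; then
            echo "found: ${report}"
        else
            # Report problems on stderr and remember the failure.
            echo "missing or empty: ${report}" >&2
            status=1
        fi
    done
    return "${status}"
}
```

Calling `check_preflight_reports output/preflight preflight_report` after a run returns a non-zero status if either report is missing or empty, so it can be used as a guard in a job script before any later step that depends on the reports.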
output/preflight
├── inter_node_comm
├── intra_node_comm
├── preflight_report.md
├── preflight_report.pdf
├── square_gemm_tflops
└── ...

Note: The exact contents may vary depending on the tests enabled during runtime.
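Because the set of result directories depends on which tests were enabled, a quick listing shows what a given run produced. This is a rough sketch under the assumption that each test writes its results into its own subdirectory of the dump path, as in the layout above; `list_preflight_results` is a hypothetical helper, not a primus-cli command.

```shell
# Hypothetical helper (not provided by primus-cli): print the names of the
# result subdirectories present in a Preflight dump directory, one per line.
list_preflight_results() {
    local dump_dir="${1:-output/preflight}"   # assumed to match --dump-path
    local entry
    for entry in "${dump_dir}"/*/; do
        # If the glob matched nothing, the literal pattern remains; skip it.
        [ -d "${entry}" ] || continue
        basename "${entry}"
    done
}
```

For a run that enabled all of the tests shown in the layout above, `list_preflight_results output/preflight` would be expected to list names such as `inter_node_comm`, `intra_node_comm`, and `square_gemm_tflops`. A missing name suggests that the corresponding test was not enabled or did not complete.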