Background information
While testing Open MPI 5 with OMB, I observed segfaults when running some collective benchmarks with CUDA buffers.
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
Open MPI 5: https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.0.tar.bz2
OMB: http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-7.3.tar.gz
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Configure Open MPI
$ ./configure --enable-debug --with-cuda=/usr/local/cuda --with-cuda-libdir=/lib64
Configure OMB
$ ./configure --with-cuda=/usr/local/cuda --enable-cuda CC=/path/to/ompi5/bin/mpicc CXX=/path/to/ompi5/bin/mpicxx
$ PATH=/usr/local/cuda/bin:$PATH make -j install
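(Not part of the original report: CUDA support in the resulting Open MPI build can be confirmed with the standard check from the Open MPI documentation.)
$ ompi_info --parsable --all | grep mpi_built_with_cuda_support:value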
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status
N/A (installed from the release tarball above, not a git clone).
Please describe the system on which you are running
- Operating system/version: Amazon Linux 2 (also reproducible on Ubuntu 22.04), with CUDA 12.2 and the 535 driver installed.
- Computer hardware: p4d.24xlarge instance with A100 GPUs
- Network type: EFA. Can also reproduce with --mca pml ob1.
Details of the problem
Here is an example with osu_ireduce on 4 ranks on a single node.
$ mpirun -n 4 --mca pml ob1 --mca coll_base_verbose 1 osu-micro-benchmarks/mpi/collective/osu_ireduce -d cuda
[ip-172-31-31-62.us-west-2.compute.internal:07270] coll:tuned:component_open: done!
[ip-172-31-31-62.us-west-2.compute.internal:07273] coll:tuned:component_open: done!
[ip-172-31-31-62.us-west-2.compute.internal:07271] coll:tuned:component_open: done!
[ip-172-31-31-62.us-west-2.compute.internal:07272] coll:tuned:component_open: done!
[ip-172-31-31-62.us-west-2.compute.internal:07273] coll:tuned:module_tuned query called
[ip-172-31-31-62.us-west-2.compute.internal:07273] (0/MPI_COMM_WORLD): no underlying reduce; disqualifying myself
[ip-172-31-31-62.us-west-2.compute.internal:07273] coll:tuned:module_init called.
[ip-172-31-31-62.us-west-2.compute.internal:07273] coll:tuned:module_init Tuned is in use
[ip-172-31-31-62.us-west-2.compute.internal:07273] coll:tuned:module_tuned query called
[ip-172-31-31-62.us-west-2.compute.internal:07272] coll:tuned:module_tuned query called
[ip-172-31-31-62.us-west-2.compute.internal:07271] coll:tuned:module_tuned query called
[ip-172-31-31-62.us-west-2.compute.internal:07270] coll:tuned:module_tuned query called
[ip-172-31-31-62.us-west-2.compute.internal:07272] (0/MPI_COMM_WORLD): no underlying reduce; disqualifying myself
[ip-172-31-31-62.us-west-2.compute.internal:07272] coll:tuned:module_init called.
[ip-172-31-31-62.us-west-2.compute.internal:07272] coll:tuned:module_init Tuned is in use
[ip-172-31-31-62.us-west-2.compute.internal:07272] coll:tuned:module_tuned query called
[ip-172-31-31-62.us-west-2.compute.internal:07271] (0/MPI_COMM_WORLD): no underlying reduce; disqualifying myself
[ip-172-31-31-62.us-west-2.compute.internal:07271] coll:tuned:module_init called.
[ip-172-31-31-62.us-west-2.compute.internal:07271] coll:tuned:module_init Tuned is in use
[ip-172-31-31-62.us-west-2.compute.internal:07270] (0/MPI_COMM_WORLD): no underlying reduce; disqualifying myself
[ip-172-31-31-62.us-west-2.compute.internal:07270] coll:tuned:module_init called.
[ip-172-31-31-62.us-west-2.compute.internal:07270] coll:tuned:module_init Tuned is in use
[ip-172-31-31-62.us-west-2.compute.internal:07271] coll:tuned:module_tuned query called
[ip-172-31-31-62.us-west-2.compute.internal:07270] coll:tuned:module_tuned query called
[ip-172-31-31-62.us-west-2.compute.internal:07273] ompi_coll_tuned_barrier_intra_dec_fixed com_size 4
[ip-172-31-31-62.us-west-2.compute.internal:07273] coll:tuned:barrier_intra_do_this selected algorithm 1 topo fanin/out0
[ip-172-31-31-62.us-west-2.compute.internal:07272] ompi_coll_tuned_barrier_intra_dec_fixed com_size 4
[ip-172-31-31-62.us-west-2.compute.internal:07272] coll:tuned:barrier_intra_do_this selected algorithm 1 topo fanin/out0
[ip-172-31-31-62.us-west-2.compute.internal:07271] ompi_coll_tuned_barrier_intra_dec_fixed com_size 4
[ip-172-31-31-62.us-west-2.compute.internal:07271] coll:tuned:barrier_intra_do_this selected algorithm 1 topo fanin/out0
[ip-172-31-31-62.us-west-2.compute.internal:07270] ompi_coll_tuned_barrier_intra_dec_fixed com_size 4
[ip-172-31-31-62.us-west-2.compute.internal:07270] coll:tuned:barrier_intra_do_this selected algorithm 1 topo fanin/out0
# OSU MPI-CUDA Non-blocking Reduce Latency Test
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait
# Size Overall(us) Compute(us) Pure Comm.(us) Overlap(%)
[ip-172-31-31-62.us-west-2.compute.internal:07271] ompi_coll_tuned_barrier_intra_dec_fixed com_size 4
[ip-172-31-31-62.us-west-2.compute.internal:07271] coll:tuned:barrier_intra_do_this selected algorithm 1 topo fanin/out0
[ip-172-31-31-62.us-west-2.compute.internal:07273] ompi_coll_tuned_barrier_intra_dec_fixed com_size 4
[ip-172-31-31-62.us-west-2.compute.internal:07273] coll:tuned:barrier_intra_do_this selected algorithm 1 topo fanin/out0
[ip-172-31-31-62:07270] *** Process received signal ***
[ip-172-31-31-62:07270] Signal: Segmentation fault (11)
[ip-172-31-31-62:07270] Signal code: Invalid permissions (2)
[ip-172-31-31-62:07270] Failing at address: 0x7fb321200000
[ip-172-31-31-62:07272] *** Process received signal ***
[ip-172-31-31-62:07272] Signal: Segmentation fault (11)
[ip-172-31-31-62:07272] Signal code: Invalid permissions (2)
[ip-172-31-31-62:07272] Failing at address: 0x7fe881200000
[ip-172-31-31-62:07270] [ 0] /usr/lib/habanalabs/libhl_logger.so(_Z13signalHandleriP9siginfo_tPv+0x18e)[0x7fb32c69c7be]
[ip-172-31-31-62:07270] [ 1] /lib64/libpthread.so.0(+0x118e0)[0x7fb356ed08e0]
[ip-172-31-31-62:07270] [ 2] [ip-172-31-31-62:07272] [ 0] /usr/lib/habanalabs/libhl_logger.so(_Z13signalHandleriP9siginfo_tPv+0x18e)[0x7fe8687417be]
[ip-172-31-31-62:07272] [ 1] /lib64/libpthread.so.0(+0x118e0)[0x7fe8b60b58e0]
[ip-172-31-31-62:07272] [ 2] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(+0x389c1c)[0x7fb35767cc1c]
[ip-172-31-31-62:07270] [ 3] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(+0x389c1c)[0x7fe8b6861c1c]
[ip-172-31-31-62:07272] [ 3] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(+0x1d3411)[0x7fb3574c6411]
[ip-172-31-31-62:07270] [ 4] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(+0x1d3411)[0x7fe8b66ab411]
[ip-172-31-31-62:07272] [ 4] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(+0x1d4e89)[0x7fb3574c7e89]
[ip-172-31-31-62:07270] [ 5] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(+0x1d4e89)[0x7fe8b66ace89]
[ip-172-31-31-62:07272] [ 5] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(NBC_Progress+0x3bc)[0x7fb3574c77eb]
[ip-172-31-31-62:07270] [ 6] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(NBC_Progress+0x3bc)[0x7fe8b66ac7eb]
[ip-172-31-31-62:07272] [ 6] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(ompi_coll_libnbc_progress+0xc3)[0x7fb3574c508d]
[ip-172-31-31-62:07270] [ 7] /home/ec2-user/openmpi-5.0.0/install/lib/libopen-pal.so.80(opal_progress+0x30)[0x7fb3563cfcc6]
[ip-172-31-31-62:07270] [ 8] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(ompi_coll_libnbc_progress+0xc3)[0x7fe8b66aa08d]
[ip-172-31-31-62:07272] [ 7] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(+0xa335b)[0x7fb35739635b]
[ip-172-31-31-62:07270] [ 9] /home/ec2-user/openmpi-5.0.0/install/lib/libopen-pal.so.80(opal_progress+0x30)[0x7fe8b55b4cc6]
[ip-172-31-31-62:07272] [ 8] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(+0xa335b)[0x7fe8b657b35b]
[ip-172-31-31-62:07272] [ 9] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(ompi_request_default_wait+0x27)[0x7fb3573963c4]
[ip-172-31-31-62:07270] [10] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(ompi_request_default_wait+0x27)[0x7fe8b657b3c4]
[ip-172-31-31-62:07272] [10] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(MPI_Wait+0x138)[0x7fb3574355f1]
[ip-172-31-31-62:07270] [11] /home/ec2-user/osu-micro-benchmarks/mpi/collective/osu_ireduce[0x402a8c]
[ip-172-31-31-62:07270] [12] /lib64/libc.so.6(__libc_start_main+0xea)[0x7fb356b3313a]
[ip-172-31-31-62:07270] [13] /home/ec2-user/osu-micro-benchmarks/mpi/collective/osu_ireduce[0x40332a]
/home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(MPI_Wait+0x138)[0x7fe8b661a5f1]
[ip-172-31-31-62:07272] [11] /home/ec2-user/osu-micro-benchmarks/mpi/collective/osu_ireduce[0x402a8c]
[ip-172-31-31-62:07272] [12] [ip-172-31-31-62:07270] *** End of error message ***
Backtrace:
#0 0x00007fb35767cc1c in ompi_op_avx_2buff_add_float_avx512 (_in=0x7fb321200000, _out=0x254fbf0, count=0x7ffcb82a3fc4, dtype=0x7ffcb82a3f88, module=0x181ff30)
at op_avx_functions.c:680
#1 0x00007fb3574c6411 in ompi_op_reduce (op=0x62c760 <ompi_mpi_op_sum>, source=0x7fb321200000, target=0x254fbf0, full_count=1, dtype=0x62e3a0 <ompi_mpi_float>)
at ../../../../ompi/op/op.h:572
#2 0x00007fb3574c7e89 in NBC_Start_round (handle=0x25540e8) at nbc.c:539
#3 0x00007fb3574c77eb in NBC_Progress (handle=0x25540e8) at nbc.c:419
#4 0x00007fb3574c508d in ompi_coll_libnbc_progress () at coll_libnbc_component.c:445
#5 0x00007fb3563cfcc6 in opal_progress () at runtime/opal_progress.c:224
#6 0x00007fb35739635b in ompi_request_wait_completion (req=0x25540e8) at ../ompi/request/request.h:492
#7 0x00007fb3573963c4 in ompi_request_default_wait (req_ptr=0x7ffcb82a43c8, status=0x7ffcb82a43f0) at request/req_wait.c:40
#8 0x00007fb3574355f1 in PMPI_Wait (request=0x7ffcb82a43c8, status=0x7ffcb82a43f0) at wait.c:72
#9 0x0000000000402a8c in main (argc=<optimized out>, argv=<optimized out>) at osu_ireduce.c:136
It appears to be an invalid temporary buffer in libnbc: note target=0x254fbf0 in frame #0, a host heap address used as the reduction target, while the faulting address 0x7fb321200000 is the CUDA device source buffer (_in).
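For completeness, the OMB harness shouldn't be needed to hit this. A minimal reproducer along the following lines (a sketch I have not run; the file name is made up and error checking is omitted) exercises the same MPI_Ireduce-on-device-buffers path, with the same count=1 MPI_FLOAT seen in frame #1 of the backtrace above:

/* repro_ireduce.c: hypothetical minimal reproducer mirroring
 * "osu_ireduce -d cuda". Error checking omitted for brevity. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Device buffers, as OMB allocates with -d cuda. */
    float *d_send, *d_recv;
    cudaMalloc((void **)&d_send, sizeof(float));
    cudaMalloc((void **)&d_recv, sizeof(float));
    float one = 1.0f;
    cudaMemcpy(d_send, &one, sizeof(float), cudaMemcpyHostToDevice);

    MPI_Request req;
    MPI_Ireduce(d_send, d_recv, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD, &req);
    /* In the OMB run the segfault fires while this request is progressed
     * (NBC_Progress -> NBC_Start_round -> ompi_op_reduce). */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (rank == 0) {
        float sum = 0.0f;
        cudaMemcpy(&sum, d_recv, sizeof(float), cudaMemcpyDeviceToHost);
        printf("sum = %f (expected %d)\n", sum, size);
    }

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}

Built with the same mpicc as above and run the same way:
$ mpicc repro_ireduce.c -o repro_ireduce -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart
$ mpirun -n 4 --mca pml ob1 ./repro_ireduce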