One of our users was having problems with their hybrid MPI/threaded Julia code segfaulting on our clusters.
OS: Linux (CentOS 7)
Julia: 1.3.0
OpenMPI: 3.1.2
I simplified their code down to the following demo:
```julia
using MPI

function main()
    MPI.Init()
    Threads.@threads for i in 1:100
        A = rand(1000, 1000)
        A1 = inv(A)
        # Invalid (non-integer) index: throws an exception on every iteration
        oops = A1[1.6]
    end
    MPI.Finalize()
end

main()
```
Exceptions sometimes turn into segmentation faults inside of `@threads for` loops; for a reliable segmentation fault you need to perform a reasonable amount of work in the loop.
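(As noted in the edit at the bottom, the exception is not strictly required to reproduce the crash. A sketch of an exception-free variant that still does enough allocation-heavy work in the loop would look something like this:)

```julia
using MPI

function main()
    MPI.Init()
    # Allocation-heavy threaded loop; no exception is thrown here,
    # but each iteration still allocates enough to trigger the GC.
    Threads.@threads for i in 1:100
        A = rand(1000, 1000)
        A1 = inv(A)
    end
    MPI.Finalize()
end

main()
```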
```
$ export JULIA_NUM_THREADS=2
$ mpirun -n 2 example.jl
[1578695090.632505] [gra797:13151:0] parser.c:1369 UCX WARN unused env variable: UCX_MEM_MMAP_RELOC (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1578695091.181319] [gra797:13152:0] parser.c:1369 UCX WARN unused env variable: UCX_MEM_MMAP_RELOC (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[gra797:13151:1:13155] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x2b6ccc961008)
[gra797:13152:1:13156] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x2b5877538008)
==== backtrace ====
 0 0x0000000000010e90 __funlockfile() ???:0
 1 0x00000000000a9904 maybe_collect() /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia_threads.h:283
 2 0x00000000000a9904 jl_gc_managed_malloc() /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/gc.c:3116
 3 0x000000000007a160 _new_array_() /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/array.c:109
 4 0x000000000007db3e jl_array_copy() /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/array.c:1135
 5 0x000000000005d22c _jl_invoke() /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/gf.c:2135
 6 0x0000000000078e19 jl_apply() /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia.h:1631
===================
==== backtrace ====
 0 0x0000000000010e90 __funlockfile() ???:0
 1 0x00000000000a8d74 maybe_collect() /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia_threads.h:283
 2 0x00000000000a8d74 jl_gc_pool_alloc() /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/gc.c:1096
 3 0x000000000005d22c _jl_invoke() /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/gf.c:2135
 4 0x0000000000078e19 jl_apply() /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia.h:1631
===================
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 13151 on node gra797 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
```
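For what it's worth, here is a minimal sketch (not part of the user's code) for sanity-checking that each rank actually sees the expected number of Julia threads under this launch configuration:

```julia
using MPI

MPI.Init()
rank = MPI.Comm_rank(MPI.COMM_WORLD)
# With JULIA_NUM_THREADS=2 and mpirun -n 2, this should report
# 2 threads from ranks 0 and 1.
println("rank $rank sees $(Threads.nthreads()) Julia threads")
MPI.Finalize()
```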
Here is some possibly relevant info from `ompi_info` as well:
```
...
Configure command line: '--prefix=/cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc7.3/openmpi/3.1.2'
                        '--build=x86_64-pc-linux-gnu'
                        '--host=x86_64-pc-linux-gnu' '--enable-shared'
                        '--with-verbs' '--enable-mpirun-prefix-by-default'
                        '--with-hwloc=external' '--without-usnic'
                        '--with-ucx' '--disable-wrapper-runpath'
                        '--disable-wrapper-rpath' '--with-munge'
                        '--with-slurm' '--with-pmi=/opt/software/slurm'
                        '--enable-mpi-cxx' '--with-hcoll'
                        '--disable-show-load-errors-by-default'
                        '--enable-mca-dso=common-libfabric,common-ofi,common-verbs,atomic-mxm,btl-openib,btl-scif,coll-fca,coll-hcoll,ess-tm,fs-lustre,mtl-mxm,mtl-ofi,mtl-psm,mtl-psm2,osc-ucx,oob-ud,plm-tm,pmix-s1,pmix-s2,pml-ucx,pml-yalla,pnet-opa,psec-munge,ras-tm,rml-ofi,scoll-mca,sec-munge,spml-ikrit,'
...
         C++ exceptions: no
         Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
                         OMPI progress: no, ORTE progress: yes, Event lib:
                         yes)
...
```
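Since the build reports `MPI_THREAD_MULTIPLE` support, one thing worth checking is the thread level that the library actually grants to Julia. A sketch, assuming the `MPI.Init_thread` / `MPI.THREAD_MULTIPLE` API that MPI.jl exposed around this time (newer releases use a `threadlevel` keyword to `MPI.Init` instead):

```julia
using MPI

# Request full multithreading support from the MPI library and report
# what it actually grants.
provided = MPI.Init_thread(MPI.THREAD_MULTIPLE)
println("requested THREAD_MULTIPLE, provided: ", provided)
MPI.Finalize()
```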
EDIT: Removed the exception bit since, as noted below, it isn't required.