Segmentation faults when combined with @threads and memory allocation #337

@twhitehead

Description

One of our users was having problems with their hybrid MPI/threaded Julia code segfaulting on our clusters.

OS: Linux (CentOS 7)
Julia: 1.3.0
OpenMPI: 3.1.2

I simplified their code down to the following demo:

using MPI

function main()
    MPI.Init()

    Threads.@threads for i in 1:100
        A = rand(1000, 1000)  # allocation-heavy work to keep the GC busy
        A1 = inv(A)
        oops = A1[1.6]        # non-integer index: deliberately throws an exception
    end

    MPI.Finalize()
end

main()

  • Exceptions sometimes turn into segmentation faults inside @threads for loops (although, per the edit at the bottom, the exception isn't actually required; a variant without it is sketched after this list).
  • For a reliable segmentation fault you need to perform a reasonable amount of work in the loop.
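
Since the exception turned out not to be required (see the edit at the bottom), here is a minimal sketch of the same demo with the exception line dropped; the loop body is just allocation-heavy work, which by itself was enough to trigger the crash:

using MPI

function main()
    MPI.Init()

    # Purely allocation-heavy threaded work; no exception is thrown.
    Threads.@threads for i in 1:100
        A = rand(1000, 1000)
        A1 = inv(A)
    end

    MPI.Finalize()
end

main()

Running the demo with two MPI ranks and two threads per rank gives: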
$ export JULIA_NUM_THREADS=2
$ mpirun -n 2 julia example.jl
[1578695090.632505] [gra797:13151:0]         parser.c:1369 UCX  WARN  unused env variable: UCX_MEM_MMAP_RELOC (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1578695091.181319] [gra797:13152:0]         parser.c:1369 UCX  WARN  unused env variable: UCX_MEM_MMAP_RELOC (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[gra797:13151:1:13155] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x2b6ccc961008)
[gra797:13152:1:13156] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x2b5877538008)
==== backtrace ====
 0 0x0000000000010e90 __funlockfile()  ???:0
 1 0x00000000000a9904 maybe_collect()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia_threads.h:283
 2 0x00000000000a9904 jl_gc_managed_malloc()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/gc.c:3116
 3 0x000000000007a160 _new_array_()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/array.c:109
 4 0x000000000007db3e jl_array_copy()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/array.c:1135
 5 0x000000000005d22c _jl_invoke()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/gf.c:2135
 6 0x0000000000078e19 jl_apply()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia.h:1631
===================
==== backtrace ====
 0 0x0000000000010e90 __funlockfile()  ???:0
 1 0x00000000000a8d74 maybe_collect()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia_threads.h:283
 2 0x00000000000a8d74 jl_gc_pool_alloc()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/gc.c:1096
 3 0x000000000005d22c _jl_invoke()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/gf.c:2135
 4 0x0000000000078e19 jl_apply()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia.h:1631
===================
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 13151 on node gra797 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
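
Both backtraces end in maybe_collect, i.e. the threads fault while reaching a GC safepoint. Julia's GC intentionally keeps its safepoint page unreadable so that threads segfault into Julia's own signal handler, and the "Caught signal 11" lines above are printed by UCX's error handler instead, which suggests UCX may be intercepting that intentional fault before Julia can field it. If that is the mechanism, telling UCX not to trap SIGSEGV seems worth trying (UCX_ERROR_SIGNALS is a standard UCX environment variable; that it resolves this particular crash is my guess, not something we have confirmed):

$ export UCX_ERROR_SIGNALS=""
$ export JULIA_NUM_THREADS=2
$ mpirun -n 2 julia example.jl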

Here is some possibly relevant info from ompi_info as well:

...
  Configure command line: '--prefix=/cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc7.3/openmpi/3.1.2'
                          '--build=x86_64-pc-linux-gnu'
                          '--host=x86_64-pc-linux-gnu' '--enable-shared'
                          '--with-verbs' '--enable-mpirun-prefix-by-default'
                          '--with-hwloc=external' '--without-usnic'
                          '--with-ucx' '--disable-wrapper-runpath'
                          '--disable-wrapper-rpath' '--with-munge'
                          '--with-slurm' '--with-pmi=/opt/software/slurm'
                          '--enable-mpi-cxx' '--with-hcoll'
                          '--disable-show-load-errors-by-default'
                          '--enable-mca-dso=common-libfabric,common-ofi,common-verbs,atomic-mxm,btl-openib,btl-scif,coll-fca,coll-hcoll,ess-tm,fs-lustre,mtl-mxm,mtl-ofi,mtl-psm,mtl-psm2,osc-ucx,oob-ud,plm-tm,pmix-s1,pmix-s2,pml-ucx,pml-yalla,pnet-opa,psec-munge,ras-tm,rml-ofi,scoll-mca,sec-munge,spml-ikrit,'
...
          C++ exceptions: no
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
                          OMPI progress: no, ORTE progress: yes, Event lib:
                          yes)
...
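
The build reports MPI_THREAD_MULTIPLE support, but plain MPI.Init() typically requests only the default thread level. Since the demo only calls MPI from the main thread this may well be irrelevant, but explicitly requesting and checking the thread level is cheap to rule out. A sketch, assuming an MPI.jl release that exposes MPI_Init_thread (the spelling varies by version; recent MPI.jl takes MPI.Init(threadlevel=:multiple) instead):

using MPI

# Request full multithreading support from the MPI library and
# verify what was actually granted before starting threaded work.
provided = MPI.Init_thread(MPI.THREAD_MULTIPLE)
provided == MPI.THREAD_MULTIPLE ||
    error("MPI library only provides thread level $provided")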

EDIT: As noted below, the exception bit isn't actually required to reproduce the crash.
