@domcharrier
Contributor

@domcharrier domcharrier commented Oct 1, 2025

Note

  • I will make every commit per-file and as atomic as possible. I will also revert previous versions explicitly, so that all versions can be inspected or restored.
  • I am assuming that this PR will eventually be squashed before it is merged. In case it is rebased, I can remove intermediate states and revert commits before the rebase & merge (if that's requested). (Code owners clarified that git rebase should not be used as they track the ids of individual commits.)

Changes made by this PR

  • We explicitly control the visibility of API vs non-API function symbols for the case that a non-MSC compiler is used.
    • This prevents libllvmlite.so from borrowing identically named (typically LLVM) symbols from previously loaded shared objects (interposition), and vice versa.
  • UNIX build with statically-linked LLVM: We further link with -Bsymbolic as an additional measure against interposition.

Note

This blog post is a helpful reference for understanding the terminology used in this PR: https://maskray.me/blog/2021-05-16-elf-interposition-and-bsymbolic.

Problem description

We observe a segmentation fault when a libLLVM*.so* is preloaded indirectly via a Python import statement that appears before another Python statement that imports numba. The segmentation fault happens during a numba.jit compilation process. We experience this segmentation fault with both the numba::llvmlite Conda package and the Numba Python wheel (from PyPI).

Abstracted Python script:

import package_that_loads_libllvm_so

from numba import jit

# Define a function with Numba's JIT decorator
@jit(nopython=True)
def sum_of_squares(n):
    total = 0
    for i in range(n):
        total += i ** 2
    return total

# Test the function
n = 1000000
result = sum_of_squares(n)
print(f"Sum of squares up to {n}: {result}")

Typical trace:

#0  __pthread_kill_implementation (no_tid=0, signo=11, threadid=23414046553664) at ./nptl/pthread_kill.c:44
[Current thread is 1 (Thread 0x154b81ed7640 (LWP 3749084))]
*** bt
#0  __pthread_kill_implementation (no_tid=0, signo=11, threadid=23414046553664) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=11, threadid=23414046553664) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=23414046553664, signo=signo@entry=11) at ./nptl/pthread_kill.c:89
#3  0x0000155555042476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#4  <signal handler called>
#5  0x0000155530e01bca in llvm::ValueHandleBase::AddToUseList() () from [=[REDACTED]=]/.local/lib/python3.11/site-packages/[=[REDACTED]=]/lib/llvm/lib/libLLVM.so.20.0git

Important

Below we call into libLLVM.so.20.0git from libllvmlite.so. This is the issue this PR aims to prevent:

#6  0x0000155532985cfe in llvm::BranchProbabilityInfo::eraseBlock(llvm::BasicBlock const*) () from [=[REDACTED]=]/.local/lib/python3.11/site-packages/[=[REDACTED]=]/lib/llvm/lib/libLLVM.so.20.0git
#7  0x0000154c77ab39a2 in llvm::ValueHandleBase::ValueIsDeleted(llvm::Value*) [clone .localalias] () from [=[REDACTED]=]/conda-envs/[=[REDACTED]=]/lib/python3.11/site-packages/llvmlite/binding/libllvmlite.so
#8  0x0000154c77ab410d in llvm::Value::~Value() [clone .localalias] () from [=[REDACTED]=]/conda-envs/[=[REDACTED]=]/lib/python3.11/site-packages/llvmlite/binding/libllvmlite.so
#9  0x0000154c77977e53 in llvm::BasicBlock::eraseFromParent() () from [=[REDACTED]=]/conda-envs/[=[REDACTED]=]/lib/python3.11/site-packages/llvmlite/binding/libllvmlite.so
#10 0x0000154c77fe030d in llvm::DeleteDeadBlocks(llvm::ArrayRef<llvm::BasicBlock*>, llvm::DomTreeUpdater*, bool) [clone .localalias] () from [=[REDACTED]=]/conda-envs/[=[REDACTED]=]/lib/python3.11/site-packages/llvmlite/binding/libllvmlite.so
#11 0x0000154c77fe32a1 in llvm::MergeBlockIntoPredecessor(llvm::BasicBlock*, llvm::DomTreeUpdater*, llvm::LoopInfo*, llvm::MemorySSAUpdater*, llvm::MemoryDependenceResults*, bool) [clone .localalias] () from [=[REDACTED]=]/conda-envs/[=[REDACTED]=]/lib/python3.11/site-packages/llvmlite/binding/libllvmlite.so
#12 0x0000154c780f3cc0 in llvm::simplifyCFG(llvm::BasicBlock*, llvm::TargetTransformInfo const&, llvm::DomTreeUpdater*, llvm::SimplifyCFGOptions const&, llvm::ArrayRef<llvm::WeakVH>) () from [=[REDACTED]=]/conda-envs/[=[REDACTED]=]/lib/python3.11/site-packages/llvmlite/binding/libllvmlite.so
#13 0x0000154c784b8e8b in iterativelySimplifyCFG(llvm::Function&, llvm::TargetTransformInfo const&, llvm::DomTreeUpdater*, llvm::SimplifyCFGOptions const&) () from [=[REDACTED]=]/conda-envs/[=[REDACTED]=]/lib/python3.11/site-packages/llvmlite/binding/libllvmlite.so
#14 0x0000154c784b9c25 in simplifyFunctionCFGImpl(llvm::Function&, llvm::TargetTransformInfo const&, llvm::DominatorTree*, llvm::SimplifyCFGOptions const&) () from [=[REDACTED]=]/conda-envs/[=[REDACTED]=]/lib/python3.11/site-packages/llvmlite/binding/libllvmlite.so
#15 0x0000154c77a47c0f in llvm::FPPassManager::runOnFunction(llvm::Function&) [clone .localalias] () from [=[REDACTED]=]/conda-envs/[=[REDACTED]=]/lib/python3.11/site-packages/llvmlite/binding/libllvmlite.so
#16 0x0000154c77a47d4c in llvm::FPPassManager::runOnModule(llvm::Module&) [clone .localalias] () from [=[REDACTED]=]/conda-envs/[=[REDACTED]=]/lib/python3.11/site-packages/llvmlite/binding/libllvmlite.so
#17 0x0000154c77a486f2 in llvm::legacy::PassManagerImpl::run(llvm::Module&) [clone .localalias] () from [=[REDACTED]=]/conda-envs/[=[REDACTED]=]/lib/python3.11/site-packages/llvmlite/binding/libllvmlite.so
#18 0x0000154c779a3faa in LLVMRunPassManager () from [=[REDACTED]=]/conda-envs/[=[REDACTED]=]/lib/python3.11/site-packages/llvmlite/binding/libllvmlite.so
[...]
[...]

Reproduce

Step 1) Install and activate the below conda environment:

Important

No GPU is required to reproduce the issue.

name: pr1314-env
channels:
  - conda-forge
dependencies:
  - python=3.12
  - pip
  - gdb # note: to see a trace
  - numba::llvmlite # note: must come first, otherwise conda-forge may be used as channel
  - numba::numba
  - pip:
    - lief # optional, for patching libllvmlite.so
    - --pre
    - --extra-index-url https://rocm.nightlies.amd.com/v2/gfx94X-dcgpu/
    - rocm[libraries,devel]==7.9.0rc20250925
    - torch==2.7.1+rocm7.9.0rc20250925

Step 2) Run the below reproducer via gdb -ex=run --args python3 reproducer.py:

# The below import order causes a segmentation fault
# NOTE: Workaround 1: Invert the order and the script runs fine.
# NOTE: Workaround 2: Patch the libllvmlite.so via the script in the appendix,
#            and any import order works fine.
import torch
from numba import jit

# Define a function with Numba's JIT decorator
@jit(nopython=True)
def sum_of_squares(n):
    total = 0
    for i in range(n):
        total += i ** 2
    return total

# Test the function
n = 1000000
result = sum_of_squares(n)
print(f"Sum of squares up to {n}: {result}")

Step 3) Enter bt when the debugger stops at the segmentation fault and observe that a libllvmlite.so function eventually calls into _rocm_sdk_core/lib/llvm/lib/libLLVM.so.20.0git.

This gives us a slightly different trace than in the problem description, but the same behavior and a similar segmentation fault are observed:

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
0x00007fffed77cb1a in llvm::AnalysisManager<llvm::Function>::getResultImpl(llvm::AnalysisKey*, llvm::Function&) ()
   from /home/AMD/docharri/conda/envs/pr1314-env/lib/python3.12/site-packages/_rocm_sdk_core/lib/llvm/lib/libLLVM.so.20.0git
(gdb) bt
#0  0x00007fffed77cb1a in llvm::AnalysisManager<llvm::Function>::getResultImpl(llvm::AnalysisKey*, llvm::Function&) ()
   from /home/AMD/docharri/conda/envs/pr1314-env/lib/python3.12/site-packages/_rocm_sdk_core/lib/llvm/lib/libLLVM.so.20.0git
#1  0x00007fffeed658fc in llvm::SROAPass::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) ()
   from /home/AMD/docharri/conda/envs/pr1314-env/lib/python3.12/site-packages/_rocm_sdk_core/lib/llvm/lib/libLLVM.so.20.0git
#2  0x00007ffff0cacf5d in llvm::detail::PassModel<llvm::Function, llvm::SROAPass, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) ()
   from /home/AMD/docharri/conda/envs/pr1314-env/lib/python3.12/site-packages/_rocm_sdk_core/lib/llvm/lib/libLLVM.so.20.0git

Important

Call into libLLVM.so happens here

#3  0x00007ffe24df1a1d in llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) () from /home/AMD/docharri/conda/envs/pr1314-env/lib/python3.12/site-packages/llvmlite/binding/libllvmlite.so
#4  0x00007ffe2074fe29 in LLVMPY_RunNewFunctionPassManager ()
   from /home/AMD/docharri/conda/envs/pr1314-env/lib/python3.12/site-packages/llvmlite/binding/libllvmlite.so
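
To confirm that more than one LLVM-carrying shared object is mapped into the same process, as the trace above suggests, a helper along the following lines can be used on Linux. This is only a sketch: the function name `find_llvm_objects` and the file-name markers are illustrative assumptions, not part of llvmlite or Numba.

```python
import os


def find_llvm_objects(maps_text, markers=("libLLVM", "libllvmlite")):
    """Return the sorted, de-duplicated paths found in /proc/<pid>/maps-style
    text whose file name contains one of the markers."""
    found = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping without a path
        path = fields[5].strip()
        if any(marker in os.path.basename(path) for marker in markers):
            found.add(path)
    return sorted(found)


def loaded_llvm_objects():
    """Inspect the current process (Linux only, reads /proc/self/maps)."""
    with open("/proc/self/maps") as maps_file:
        return find_llvm_objects(maps_file.read())
```

Calling `loaded_llvm_objects()` after the two imports of the reproducer should list both the libLLVM.so of the other package and libllvmlite.so if both are mapped. Note that this only shows what is loaded, not which symbols are bound where.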

Optional Step 4) Modify the environment as follows and observe the same issue:

conda uninstall numba::numba numba::llvmlite
pip install numba

Proposed fix: Use CMake visibility controls and add -Bsymbolic linker option

  • ffi/core.h: Add function annotation __attribute__((visibility("default"))) to
    API_EXPORT macro in non-MSC case. (No longer guided by CMakeLists.txt).

  • ffi/CMakeLists.txt:

    1. Preset function visibility to hidden, as API functions are explicitly marked to use 'default' visibility (MSC already uses __declspec( dllexport )).
    set(CMAKE_CXX_VISIBILITY_PRESET "hidden")
    set(CMAKE_C_VISIBILITY_PRESET "hidden")
    2. Specify the -Bsymbolic linker option when building llvmlite with statically linked LLVM. This thus affects, e.g., the numba::llvmlite packages but not the conda-forge::llvmlite ones.

Note

We constrain the linker option change to the "static" build variant, as this seems to be the one built and tested by the numba/llvmlite CI.

Note on the effect of the -Bsymbolic linker option on libllvmlite.so:

The linker option will introduce a SYMBOLIC entry (value: 0x0) to the shared object's dynamic section.
Calling readelf -d <filepath> gives us:

Dynamic section at offset 0x843a058 contains 35 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libz.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libzstd.so.1]
[shortened...]
 0x000000000000000e (SONAME)             Library soname: [libllvmlite.so]
[shortened...]
 0x0000000000000010 (SYMBOLIC)           0x0
[shortened...]
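
The check for the SYMBOLIC entry can also be scripted. The following is a minimal sketch that scans the text printed by `readelf -d` for the tag; the helper names `dynamic_tags` and `has_symbolic_entry` are made up for illustration.

```python
import re

# Matches rows such as " 0x0000000000000010 (SYMBOLIC)           0x0".
_DYN_ROW = re.compile(r"^\s*0x[0-9a-fA-F]+\s+\((\w+)\)")


def dynamic_tags(readelf_output):
    """Collect the tag names listed in `readelf -d` output, in order."""
    tags = []
    for line in readelf_output.splitlines():
        match = _DYN_ROW.match(line)
        if match:
            tags.append(match.group(1))
    return tags


def has_symbolic_entry(readelf_output):
    """True if the dynamic section lists a SYMBOLIC entry."""
    return "SYMBOLIC" in dynamic_tags(readelf_output)
```

The input text can be obtained, for example, via `subprocess.run(["readelf", "-d", path], capture_output=True, text=True).stdout`, assuming readelf is installed.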

Appendix: Workaround: Patch SYMBOLIC dynamic section entry into prebuilt libllvmlite.so via lief Python package

We've applied the below script to patch a libllvmlite.so that was installed via a prebuilt numba::llvmlite Conda package.
This fixed the interposition issue experienced with the above Python snippet in our environment.

import os
import shutil
import sys

import lief

def add_symbolic_flag(filepath):
    """
    Changes the flag in-place.
    Stores a copy of the original to <filepath>.pre
    """
    # Create backup
    if not os.path.exists(filepath+".pre"):
        shutil.copyfile(filepath, filepath+".pre")

    # Parse the binary
    binary = lief.parse(filepath+".pre")

    symbolic_entry = lief.ELF.DynamicEntry(lief.ELF.DynamicEntry.TAG.SYMBOLIC, 0)
    binary += symbolic_entry

    # Write the patched binary to a new file
    binary.write(filepath)

if __name__ == "__main__":
    libllvmlite_so_path = os.path.join(
        os.environ["CONDA_PREFIX"],
        "lib",
        f"python{sys.version_info.major}.{sys.version_info.minor}",
        "site-packages",
        "llvmlite",
        "binding",
        "libllvmlite.so"
    )

    if not os.path.exists(libllvmlite_so_path):
        raise IOError(f"file '{libllvmlite_so_path}' doesn't exist")

    add_symbolic_flag(libllvmlite_so_path)

@domcharrier domcharrier marked this pull request as draft October 1, 2025 21:18
@domcharrier domcharrier force-pushed the fix/hide-non-api-symbols-gcc-clang branch from d90d2a6 to ac3d886 Compare October 1, 2025 21:29
@domcharrier domcharrier changed the title [FIX] Hide non API symbols when a compiler is used that supports GNUC style visibility controls to prevent LLVM symbol export and interposition [FIX] Hide non-API symbols when CXX compiler supports GNUC style visibility controls to prevent LLVM symbol export and interposition Oct 1, 2025
Check if the compiler supports the compiler switch `-fvisibility=hidden`
and the function annotation `__attribute__((visibility("default")))`.

If so, add the compiler definition `HAVE_ATTRIBUTE_VISIBILITY` to
the llvmlite target's compiler definitions.
Further set the target's default visibility to hidden when
compiling CXX files.

NOTE: Changes have no impact on `cmake_minimum_required(VERSION ...)` as
      module `CheckCXXSourceCompiles` and `check_cxx_source_compiles`
      are already available in CMake versions < 3.13.
Add function annotation `__attribute__((visibility("default")))` to
API_EXPORT macro.

When the compiler flag `-fvisibility=hidden` is specified, this
ensures that non-API symbols are hidden from other shared objects.
However, our main goal with these changes is to prevent
symbol interposition, i.e., to prevent non-API symbols (mainly the LLVM ones)
from being "borrowed" from other shared objects instead of using
llvmlite's own symbols. This is accomplished as well.
@domcharrier domcharrier force-pushed the fix/hide-non-api-symbols-gcc-clang branch from ac3d886 to bca4f36 Compare October 1, 2025 22:03
message(STATUS "LLVM target link libraries: ${llvm_libs}")
target_link_libraries(llvmlite ${llvm_libs})

check_cxx_compiler_supports_fvisibility_and_attribute_visibility()

Recommend removing the probing and just:

set(CMAKE_CXX_VISIBILITY_PRESET "hidden")
set(CMAKE_C_VISIBILITY_PRESET "hidden")
set(CMAKE_VISIBILITY_INLINES_HIDDEN ON)

(CMake already handles feature detection, etc. by default as part of its toolchain setup)

Contributor Author

@domcharrier domcharrier Oct 2, 2025

Thanks @stellaraccident, in this (amazing) blog post, they mention that:

The C++ specific -fvisibility-inlines-hidden is a safer subset of -fvisibility=hidden. The option just violates pointer equality for inline function definitions. As discussed above, this is usually safe.

Guess the set(CMAKE_VISIBILITY_INLINES_HIDDEN ON) can thus be dropped given the set(CMAKE_CXX_VISIBILITY_PRESET "hidden")?

Sgtm

ffi/core.h Outdated

#if defined(HAVE_DECLSPEC_DLL)
#define API_EXPORT(RTYPE) __declspec(dllexport) RTYPE
#elif defined(HAVE_ATTRIBUTE_VISIBILITY)

I've most often seen this as an unconditional else branch in such ifdefs. In practice, only Windows/COFF compilers don't support the attribute.

Here is an example from LLVM's own codebase of the cascade:

#if (defined(_WIN32) || defined(__CYGWIN__)) &&                                \
    !defined(MLIR_CAPI_ENABLE_WINDOWS_DLL_DECLSPEC)
// Visibility annotations disabled.
#define MLIR_CAPI_EXPORTED
#elif defined(_WIN32) || defined(__CYGWIN__)
// Windows visibility declarations.
#if MLIR_CAPI_BUILDING_LIBRARY
#define MLIR_CAPI_EXPORTED __declspec(dllexport)
#else
#define MLIR_CAPI_EXPORTED __declspec(dllimport)
#endif
#else
// Non-windows: use visibility attributes.
#define MLIR_CAPI_EXPORTED __attribute__((visibility("default")))
#endif

(from https://github.com/llvm/llvm-project/blob/main/mlir/include/mlir-c/Support.h#L33C1-L47C7)

There are many examples on the internet, but typically for a library like this, you always use the default visibility attribute in the else branch.

Contributor Author

@stellaraccident Just noticed that the LLVM C ABI cascade has been changed slightly 3 months ago:

/// LLVM_C_ABI is the export/visibility macro used to mark symbols declared in
/// llvm-c as exported when built as a shared library.

#if !defined(LLVM_ABI_GENERATING_ANNOTATIONS)
// TODO(https://github.com/llvm/llvm-project/issues/145406): eliminate need for
// two preprocessor definitions to gate LLVM_ABI macro definitions.
#if defined(LLVM_ENABLE_LLVM_C_EXPORT_ANNOTATIONS) &&                          \
    !defined(LLVM_BUILD_STATIC)
#if defined(_WIN32) && !defined(__MINGW32__)
#if defined(LLVM_EXPORTS)
#define LLVM_C_ABI __declspec(dllexport)
#else
#define LLVM_C_ABI __declspec(dllimport)
#endif
#elif defined(__has_attribute) && __has_attribute(visibility)
#define LLVM_C_ABI __attribute__((visibility("default")))
#endif
#endif
#if !defined(LLVM_C_ABI)
#define LLVM_C_ABI
#endif
#endif

https://github.com/llvm/llvm-project/blame/6d44b9082e42b918a152098ec70ed409c4da8c79/llvm/include/llvm-c/Visibility.h#L35

In particular, they have this in there now:

#elif defined(__has_attribute) && __has_attribute(visibility)
#define LLVM_C_ABI __attribute__((visibility("default")))
#endif

Remark: If I understand the GCC docs correctly then:

#if defined __has_attribute
# if __has_attribute (visibility)
...
#endif
#endif

should actually be preferred vs:

#if defined(__has_attribute) && __has_attribute(visibility)
...
#endif

Contributor Author

@esc Just a thought: Is there anything that speaks against having llvmlite's ffi/core.h include the LLVM llvm-c/Visibility.h header and reuse the LLVM_C_ABI macro?

Contributor

I think that would be fine, even preferable but it appears that the header only appeared in LLVM 21, and llvmlite is currently LLVM 20 based.

@domcharrier domcharrier force-pushed the fix/hide-non-api-symbols-gcc-clang branch from 38112c8 to e678e7d Compare October 2, 2025 10:18
Specify the -Bsymbolic linker option when
building llvmlite with statically linked LLVM.
This thus affects, e.g., the numba::llvmlite packages
but not the conda-forge::llvmlite ones.

(NOTE: We constrain the change to the "static" build variant
as this seems to be the one built and tested by the numba/llvmlite CI.)

Co-authored-by: stellaraccident
@domcharrier domcharrier force-pushed the fix/hide-non-api-symbols-gcc-clang branch from e678e7d to b9994b8 Compare October 2, 2025 10:18
@esc
Member

esc commented Oct 2, 2025

@domcharrier hello and thank you for opening this to help improve llvmlite. @stellaraccident thank you for your valuable feedback here. Once this PR is ready for review, please do add a comment requesting such, thank you. Once the review has commenced, please refrain from using history rewriting techniques such as git rebase as the reviewers tend to use commit-ids to keep track of the review process, thank you.

@stellaraccident

Current draft LGTM (as a fly on the wall). It covers all of the cases of LLVM symbol interposition I've seen in the field that implicate llvmlite.

@sklam
Member

sklam commented Oct 2, 2025

@domcharrier, to clarify the situation, are you encountering the issue in the conda packages in the numba channel or in the wheels on pypi?

@stellaraccident

stellaraccident commented Oct 2, 2025

We have reproduced the interposition issues with both conda and pypi. The original report was with conda, but we also did some work with the pypi wheels (enough to believe they were similarly impacted).

@domcharrier domcharrier marked this pull request as ready for review October 2, 2025 20:22
@esc esc added this to the 0.46.0rc1 milestone Oct 6, 2025
@esc
Member

esc commented Oct 6, 2025

Is there any chance you can give more information about how you reproduced this? It would be important to know that, so that a potential reviewer can validate this fix.

@stellaraccident
The exact setup can't be described well, but I'll give the mechanics. This was in a build of ROCM that uses a dynamically loaded libLLVM. That library has private symbol versioning and a private SONAME.

If that library loads first, then the dynamic linker global namespace will have symbols in it for that version of LLVM. Unfortunately, while other dynamic libraries of LLVM will not conflict with it (since they will have their own symbol versions and SONAME), for some reason static linking on Linux defaults to symbol interposition from any source of symbols in the namespace, even if defined locally, and even if the other library's symbols are versioned. So if it loads second, it will get a mix of symbols from elsewhere in the process.

I haven't checked but it should be sufficient to repro to use ctypes to load a libLLVM from elsewhere (can be anything, should just be a different version). Then import numba and attempt to hit a simple example. If the LLVM's are grossly incompatible (likely), you will get a segfault then and there. But you can also see the incorrect bindings by running the test script with LD_DEBUG=symbols,bindings.

Afaik, statically linking LLVM into a DSO should always either use LLVM with symbol hiding enabled at build time (hidden symbols cannot be interposed) and/or link with -Bsymbolic (resolves all definitions in the local DSO before the global namespace). Not doing one of these leaves you open to symbol interposition, which is something you never want in something like a python process.
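
The two defenses described above can be illustrated with a deliberately simplified model of symbol lookup. This is a toy sketch, not the real dynamic linker algorithm: it ignores symbol versioning, RTLD_LOCAL scopes, RTLD_DEEPBIND, and weak symbols, and the function `resolve` and all of its parameters are hypothetical names chosen for illustration.

```python
def resolve(symbol, requester, load_order, exported, hidden=None, bsymbolic=()):
    """Return the name of the object whose definition `requester` binds to.

    load_order -- object names in global lookup order (earliest loaded first)
    exported   -- dict: object name -> set of default-visibility symbols
    hidden     -- dict: object name -> set of hidden-visibility symbols
    bsymbolic  -- names of objects that were linked with -Bsymbolic
    """
    hidden = hidden or {}
    # A hidden definition is bound at link time and cannot be interposed.
    if symbol in hidden.get(requester, set()):
        return requester
    # With -Bsymbolic, a definition inside the requesting object wins.
    if requester in bsymbolic and symbol in exported.get(requester, set()):
        return requester
    # Otherwise the first object in global lookup order that exports it wins.
    for name in load_order:
        if symbol in exported.get(name, set()):
            return name
    return None


sym = "llvm::Value::~Value()"
order = ["libLLVM.so", "libllvmlite.so"]  # libLLVM.so was loaded first
both_export = {"libLLVM.so": {sym}, "libllvmlite.so": {sym}}

# No protection: llvmlite's reference is interposed by the earlier library.
print(resolve(sym, "libllvmlite.so", order, both_export))
# -Bsymbolic: the local definition is preferred.
print(resolve(sym, "libllvmlite.so", order, both_export,
              bsymbolic={"libllvmlite.so"}))
# Hidden visibility: the symbol is not exported at all and is bound locally.
print(resolve(sym, "libllvmlite.so", order, {"libLLVM.so": {sym}},
              hidden={"libllvmlite.so": {sym}}))
```

In this model, inverting the load order also makes the unprotected case bind locally, which matches the import-order workaround mentioned in the reproducer.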

In the environment where we root caused this, there were several LLVM's in the process and having numba randomly accept interposed symbols via llvmlite has been (we think) the root cause of a drip of hard to explain crashes over many months. It is just that when ROCM itself was implicated, we managed to isolate the cause and raise a patch.

@domcharrier
Contributor Author

domcharrier commented Oct 6, 2025

Hey @esc and @stellaraccident, I managed to create a minimal conda environment to reproduce the issue and made it part of the issue description. I want to stress that while we are installing GPU-related packages into this conda environment, no GPU is required to run the reproducer.

domcharrier added a commit to ROCm/rocm-llvm-python that referenced this pull request Oct 7, 2025
We use `-Wl,-Bsymbolic` to prevent interposition
by other preloaded LLVM shared objects that export
LLVM symbols. Such interposition can cause
difficult-to-debug segfaults.

See <numba/llvmlite#1314>
for a discussion and description of such
a scenario.
Contributor

@stuartarchibald stuartarchibald left a comment

Thanks for the patch and extensive investigation into this. We had also noticed some rare interposition-like-but-couldn't-quite-work-out-exactly-what issues in the LLVM-15 based versions of llvmlite too, the explanation presented in here may also explain that.

I've looked at the PR, the code changes are small and make sense. I've also tested this manually on a Linux system. First I used main and ran:

$ nm -C ffi/build/libllvmlite.so|grep -E ' [[:upper:]] '|grep -v "LLVMPY"|grep -v "std::"|grep -v " U "

What this does is report all the symbols, keep only the global ones (I guess the -g switch would also do this), then drop those which are exported by llvmlite (LLVMPY prefix), the C++ std:: symbols, and those which are undefined. There were quite a lot remaining, in 3 main groups:

  1. T llvm::<something that llvmlite refprune pass likely references>
  2. D vtable for llvmlite<something that llvmlite refprune pass likely references>
  3. B llvm::getTypeName<something>::name
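
For repeated checks, the grep pipeline above can be mirrored in a few lines of Python that operate on the text printed by `nm -C`. This is a sketch of the same filtering steps; the function name `leaked_symbols` is an illustrative choice, and `[A-Z]` stands in for `[[:upper:]]` under the assumption of a C locale.

```python
import re

# Same test as `grep -E ' [[:upper:]] '`: an upper-case (global) symbol type.
_UPPER_TYPE = re.compile(r" [A-Z] ")


def leaked_symbols(nm_output):
    """Return the `nm -C` lines for global, defined symbols that are neither
    llvmlite API functions (LLVMPY prefix) nor C++ std:: symbols."""
    kept = []
    for line in nm_output.splitlines():
        if not _UPPER_TYPE.search(line):
            continue  # local symbol (lower-case type) or unrelated line
        if "LLVMPY" in line or "std::" in line or " U " in line:
            continue  # API symbol, std:: symbol, or undefined reference
        kept.append(line.strip())
    return kept
```

An empty or near-empty result for a given libllvmlite.so suggests that few non-API symbols are exported; the input can be produced with `nm -C <path>`.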

I did my own independent trace of loading this before the torch example appeared, and figured it was an LLVM library loaded into the process, most probably with RTLD_GLOBAL, that was doing the interposition, and so set up a trivial example to check (this is check.py):

import ctypes
import os
llvm = ctypes.CDLL("/path/to/some/libLLVM.so", os.RTLD_GLOBAL)
from llvmlite import binding

Then doing:

$ LD_DEBUG=bindings python check.py 2>&1|c++filt|grep libllvmlite.so|grep LLVM|grep -v LLVMPY

will show the symbols listed above in nm being interposed by those from the libLLVM.so.
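
The LD_DEBUG output can also be post-processed in Python to list exactly which symbols libllvmlite.so binds to another LLVM library. The sketch below assumes the glibc line layout ``binding file A [n] to B [n]: normal symbol `name'``; the function name `interposed_bindings` and its parameters are hypothetical.

```python
import re

# Assumed glibc LD_DEBUG=bindings layout:
#   binding file <src> [0] to <dst> [0]: normal symbol `<name>' [<version>]
_BINDING = re.compile(
    r"binding file (?P<src>\S+) \[\d+\] to (?P<dst>\S+) \[\d+\]: "
    r"\w+ symbol `(?P<sym>[^']*)'"
)


def interposed_bindings(ld_debug_output, library="libllvmlite.so",
                        provider_marker="libLLVM"):
    """Return (symbol, provider_path) pairs where `library` binds a symbol
    to a different object whose path contains `provider_marker`."""
    hits = []
    for line in ld_debug_output.splitlines():
        match = _BINDING.search(line)
        if match is None:
            continue
        src, dst = match.group("src"), match.group("dst")
        if src.endswith(library) and not dst.endswith(library) \
                and provider_marker in dst:
            hits.append((match.group("sym"), dst))
    return hits
```

Feeding it the captured stderr of `LD_DEBUG=bindings python check.py` (optionally piped through c++filt first) should return an empty list for a library that is protected against interposition.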

Doing the same again with a libllvmlite.so build from this PR yields the following.

  1. From the nm check there are just symbols like llvm::LlvmliteMemoryManager* exported, which is expected.
  2. From the LD_DEBUG trace on the python check.py there's no interposition.

I do wonder about the behaviour of LLVM's TypeID under -Bsymbolic but I don't think that is something that should block this PR given the wider issue it is fixing. The llvmlite tests are passing, and also when I ran a large subset of Numba tests against this patch they also passed, both give confidence.

Thanks again for all your work on this @domcharrier and @stellaraccident, much appreciated.

@esc esc merged commit 52a272c into numba:main Oct 8, 2025
22 checks passed
@stellaraccident
Thanks for the detailed checking. Yes, in our situation, the libLLVM was coming from something with RTLD_GLOBAL scope. But the rules for how symbols/DSOs can get implicitly promoted to (effectively) RTLD_GLOBAL are complicated and I've found it generally best to make sure that one's independent libraries are armored against that, since it is a perfectly legal thing to either be done explicitly (at dlopen time) or for the dynamic linker to implicitly do based on heuristics.

I suspect that this issue has been an undiagnosed cause of crashes for many for a long time. It's often hard to discern from top level issue reports, but I know that once we isolated this, it explained quite a number of things that had been worked around over time.

Regarding TypeID, I think it should have no impact. Basically, the effect when statically linking LLVM like this into a -Bsymbolic DSO is equivalent to if you had dynamically linked with private symbol versions: all definitions of the TypeID members will be resolved locally (and that is the behavior you want since no one should be linking against libllvmlite.so as part of a larger LLVM binary). Other LLVMs in the process will all get their own. I'm less familiar with the history of LLVM type ids, but I helped design/harden the MLIR TypeIDs and at one point had enough knowledge in cache to explain this in more detail. But the details have been replaced by a sticky-note in my head that says "this will work as intended".

The fact that your TypeID symbols are BSS defs is good/correct. We had to triage/fix issues in MLIR that were the result of vague linkage of TypeID (V in nm output). That creates a more complicated linking situation that is likely ok for this specific case but activates even more dark corners of the dynamic linker which can require careful thought (I think we fixed this across LLVM many years ago, so nice to see that confirmed).

Glad to see this put to rest.

@stuartarchibald
Contributor

@stellaraccident Thanks for the explanation regarding TypeID. I think we'll assume it is working OK until proven otherwise; I imagine other folks would have hit it before us if it was not. I'm also glad to see this issue potentially all resolved!
