-
Notifications
You must be signed in to change notification settings - Fork 350
[FIX] Hide non-API symbols when CXX compiler supports GNUC style visibility controls to prevent LLVM symbol export and interposition #1314
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FIX] Hide non-API symbols when CXX compiler supports GNUC style visibility controls to prevent LLVM symbol export and interposition #1314
Conversation
d90d2a6 to
ac3d886
Compare
Check if the compiler supports compiler switch `--fvisibility=hidden`
and the function annotation `__attribute__((visibility("default")))`.
If so, add the compiler definition `HAVE_ATTRIBUTE_VISIBILITY` to
the llvmlite target's compiler definitions.
Further set the target's default visibility to hidden when
compiling CXX files.
NOTE: Changes have no impact on `cmake_minimum_required(VERSION ...)` as
module `CheckCXXSourceCompiles` and `check_cxx_source_compiles`
are coming with CMake versions < 3.13.
Add function annotation `__attribute__((visibility("default")))` to
API_EXPORT macro.
When the compiler flag `--fvisibility=hidden` is specified, this
ensures that non-API symbols will be hidden to other shared objects.
However, our main goal with these changes is to prevent
symbol interposition, i.e., to prevent that non-API symbols (mainly the LLVM ones)
are "borrowed" from other shared objects instead of using
llvmlite's own symbols. This is accomplished as well.
ac3d886 to
bca4f36
Compare
ffi/CMakeLists.txt
Outdated
| message(STATUS "LLVM target link libraries: ${llvm_libs}") | ||
| target_link_libraries(llvmlite ${llvm_libs}) | ||
|
|
||
| check_cxx_compiler_supports_fvisibility_and_attribute_visibility() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recommend removing the probing and just:
set(CMAKE_CXX_VISIBILITY_PRESET "hidden")
set(CMAKE_C_VISIBILITY_PRESET "hidden")
set(CMAKE_VISIBILITY_INLINES_HIDDEN ON)
(CMake already handles feature detection, etc by default a part of its toolchain setup)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @stellaraccident, in this (amazing) blog post, they mention that:
The C++ specific -fvisibility-inlines-hidden is a safer subset of -fvisibility=hidden. The option just violates pointer equality for inline function definitions. As discussed above, this is usually safe.
Guess the set(CMAKE_VISIBILITY_INLINES_HIDDEN ON) can thus be dropped given the set(CMAKE_CXX_VISIBILITY_PRESET "hidden")?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sgtm
ffi/core.h
Outdated
|
|
||
| #if defined(HAVE_DECLSPEC_DLL) | ||
| #define API_EXPORT(RTYPE) __declspec(dllexport) RTYPE | ||
| #elif defined(HAVE_ATTRIBUTE_VISIBILITY) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've most often seen this as an unconditional else branch in such ifdefs. In practice, only Windows/COFF compilers don't support he attribute.
Here is an example from LLVM's own codebase of the cascade:
#if (defined(_WIN32) || defined(__CYGWIN__)) && \
!defined(MLIR_CAPI_ENABLE_WINDOWS_DLL_DECLSPEC)
// Visibility annotations disabled.
#define MLIR_CAPI_EXPORTED
#elif defined(_WIN32) || defined(__CYGWIN__)
// Windows visibility declarations.
#if MLIR_CAPI_BUILDING_LIBRARY
#define MLIR_CAPI_EXPORTED __declspec(dllexport)
#else
#define MLIR_CAPI_EXPORTED __declspec(dllimport)
#endif
#else
// Non-windows: use visibility attributes.
#define MLIR_CAPI_EXPORTED __attribute__((visibility("default")))
#endif
There are many examples on the internet, but typically for a library like this, you always use the default visibility attribute in the else branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stellaraccident Just noticed that the LLVM C ABI cascade has been changed slightly 3 months ago:
/// LLVM_C_ABI is the export/visibility macro used to mark symbols declared in
/// llvm-c as exported when built as a shared library.
#if !defined(LLVM_ABI_GENERATING_ANNOTATIONS)
// TODO(https://github.com/llvm/llvm-project/issues/145406): eliminate need for
// two preprocessor definitions to gate LLVM_ABI macro definitions.
#if defined(LLVM_ENABLE_LLVM_C_EXPORT_ANNOTATIONS) && \
!defined(LLVM_BUILD_STATIC)
#if defined(_WIN32) && !defined(__MINGW32__)
#if defined(LLVM_EXPORTS)
#define LLVM_C_ABI __declspec(dllexport)
#else
#define LLVM_C_ABI __declspec(dllimport)
#endif
#elif defined(__has_attribute) && __has_attribute(visibility)
#define LLVM_C_ABI __attribute__((visibility("default")))
#endif
#endif
#if !defined(LLVM_C_ABI)
#define LLVM_C_ABI
#endif
#endifIn particular, they have this in there now:
#elif defined(__has_attribute) && __has_attribute(visibility)
#define LLVM_C_ABI __attribute__((visibility("default")))
#endifRemark: If I understand the GCC docs correctly then:
#if defined __has_attribute
# if __has_attribute (visibility)
...
#endif
#endifshould actually be preferred vs:
#if defined(__has_attribute) && __has_attribute(visibility)
...
#endifThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@esc Just a thought: Is there anything that speaks against having llvmlite's ffi/core.h include the LLVM llvm-c/Visibility.h header and reuse the LLVM_C_ABI macro?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would be fine, even preferable but it appears that the header only appeared in LLVM 21, and llvmlite is currently LLVM 20 based.
38112c8 to
e678e7d
Compare
Specify -Bsymbolic linker option when conducting LLVM-statically-linking llvmlite build. This will thus affect the numba::llvmlite packages but not the conda-forge::llvmlite ones, e.g. (NOTE: We constrain the change to the "static" build variant as this seems the one to be built and tested by the numba/llmvlite CI.) Co-authored-by: stellaraccident
e678e7d to
b9994b8
Compare
|
@domcharrier hello and thank you for opening this to help improve llvmlite. @stellaraccident thank you for your valuable feedback here. Once this PR is ready for review, please do add a comment requesting such, thank you. Once the review has commenced, please refrain from using history rewriting techniques such as |
|
Current draft LGTM (as a fly on the wall). It covers all of the cases of LLVM symbol interposition I've seen in the field that implicate llvmlite. |
|
@domcharrier, to clarify the situation, are you encountering the issue in the conda packages in the |
|
We have reproduced the interposition issues with both conda and pypi. The original report was with conda, but we also did some work with the pypi wheels (enough to believe they were similarly impacted). |
Is there any chance you can give more infromation about how your reproduced this? It would be important to know that, so that a potential reviewer can validate this fix. |
The exact setup can't be described well, by I'll give the mechanic. This was in a build of ROCM that uses a dynamically loaded libLLVM. That library has private symbol versioning and a private SONAME. If that library loads first, then the dynamic linker global namespace will have symbols in it for that version of LLVM. Unfortunately, while other dynamic libraries of LLVM will not conflict with it (since they will have their own symbol versions and SONAME), for some reason static linking on Linux defaults to symbol interposition from any source of symbols in the namespace, even if defined locally, and even if the other library's symbols are versioned. So if it loads second, it will get a mix of symbols from elsewhere in the process. I haven't checked but it should be sufficient to repro to use ctypes to load a libLLVM from elsewhere (can be anything, should just be a different version). Then import numba and attempt to hit a simple example. If the LLVM's are grossly incompatible (likely), you will get a segfault then and there. But you can also see the incorrect bindings by running the test script with Afaik, statically linking LLVM into a DSO should always either use LLVM with symbol hiding enabled at build time (hidden symbols cannot be interposed) and/or link with In the environment where we root caused this, there were several LLVM's in the process and having numba randomly accept interposed symbols via llvmlite has been (we think) the root cause of a drip of hard to explain crashes over many months. It is just that when ROCM itself was implicated, we managed to isolate the cause and raise a patch. |
|
Hey @esc and @stellaraccident, I managed to create a minimum conda environment to reproduce the issue and made it part of the issue description. I want to stress that while we are installing GPU-related packages into this conda environment no GPU is required for run the reproducer. |
We use `-Wl,-Bsymbolic` to prevent interposition by other preloaded LLVM shared objects that export LLVM symbols. This can cause difficult to debug segfaults. See <numba/llvmlite#1314> for a discussion and description of such a scenario.
stuartarchibald
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the patch and extensive investigation into this. We had also noticed some rare interposition-like-but-couldn't-quite-work-out-exactly-what issues in the LLVM-15 based versions of llvmlite too, the explanation presented in here may also explain that.
I've looked at the PR, the code changes are small and make sense. I've also tested this manually on a Linux system. First I used main and ran:
$ nm -C ffi/build/libllvmlite.so|grep -E ' [[:upper:]] '|grep -v "LLVMPY"|grep -v "std::"|grep -v " U "
what this does it report all the symbols and then filters out the global ones (I guess the -g switch would also do this) and then filters out those which are exported by llvmlite (LLVMPY prefix) and then filters out C++ std:: symbols and those which are undefined. There were quite a lot remaining, 3 main groups:
T llvm::<something that llvmlite refprune pass likely references>D vtable for llvmlite<something that llvmlite refprune pass likely references>B llvm::getTypeName<something>::name
I did my own independent trace of loading this before the torch example appeared, and figured it was a LLVM library loaded into the process most probably with RTLD_GLOBAL that was doing the interposition, and so set up a trivial example to check (this is check.py):
import ctypes
import os
llvm = ctypes.CDLL("/path/to/some/libLLVM.so", os.RTLD_GLOBAL)
from llvmlite import bindingThen doing:
$ LD_DEBUG=bindings python check.py 2>&1|c++filt|grep libllvmlite.so|grep LLVM|grep -v LLVMPY
will show the symbols listed above in nm being interposed by those from the libLLVM.so.
Doing the same again with a libllvmlite.so build from this PR yields the following.
- From the
nmcheck there are just symbols likellvm::LlvmliteMemoryManager*exported, which is expected. - From the
LD_DEBUGtrace on thepython check.pythere's no interposition.
I do wonder about the behaviour of LLVM's TypeID under -Bsymbolic but I don't think that is something that should block this PR given the wider issue it is fixing. The llvmlite tests are passing, and also when I ran a large subset of Numba tests against this patch they also passed, both give confidence.
Thanks again for all your work on this @domcharrier and @stellaraccident, much appreciated.
Thanks for the detailed checking. Yes, in our situation, the libLLVM was coming from something with RTLD_GLOBAL scope. But the rules for how symbols/DSOs can get implicitly promoted to (effectively) RTLD_GLOBAL are complicated and I've found it generally best to make sure that one's independent libraries are armored against that, since it is a perfectly legal thing to either be done explicitly (at dlopen time) or for the dynamic linker to implicitly do based on heuristics. I suspect that this issue has been an undiagnosed cause of crashes for many for a long time. It's often hard to discern from top level issue reports, but I know that once we isolated this, it explained quite a number of things that had been worked around over time. Regarding TypeID, I think it should have no impact. Basically, the effect when statically linking LLVM like this into a The fact that your TypeID symbols are BSS defs is good/correct. We had to triage/fix issues in MLIR that were the result of vague linkage of TypeID ( Glad to see this put to rest. |
|
@stellaraccident Thanks for the explanation regarding TypeID, I think we'll assume it is working ok until proven other wise, I imagine other folks would have hit it before us if it was not. I'm also glad to see this issue potentially all resolved! |
Note
I am assuming that this PR will eventually be squashed before it is merged. In case it is rebased, I can remove intermediate states and revert commits before the rebase & merge (if that's requested).(Code owners clarified that git rebase should not be used as they track the ids of individual commits.)Changes made by this PR
libllvmlite.soborrows equally named (typically LLVM) symbols from previously loaded shared objects (interposition) and vice versa.-Bsymbolicas an additional measure against interposition.Note
This blog post is a helpful reference for understanding the terminology used in this PR: https://maskray.me/blog/2021-05-16-elf-interposition-and-bsymbolic.
Problem description
We observe a segmentation fault when preloading a
libLLVM*.so*indirectly via an Python import statement that appears before another Python statement that imports numba. The segmentation fault happens during anumba.jitcompilation process. We experience this segmentation fault with both thenumba::llvmliteConda package and the Numba Python wheel (from PyPI).Abstracted Python script:
Typical trace:
Important
Below we call into
libLLVM.so.20.0.gitfromlibllvmlite.so. This is the issue this PR aims to prevent:Reproduce
Step 1) Install and activate the below conda environment:
Important
No GPU is required to reproduce the issue.
Step 2) Run the below reproducer via
gdb -ex=run --args python3 reproducer.py:Step 3) Enter
btwhen the debugger stops at the segmentation fault and observe that allvmlite.sofunction eventually calls into_rocm_sdk_core/lib/llvm/lib/libLLVM.so.20.0git.This gives us a slightly different trace than in the problem description, but the same behavior and a similar segmentation fault is observed:
Important
Call into
libLLVM.sohappens hereOptional Step 4) Modify the environment as follows and observe the same issue:
Proposed fix: Use CMake visibility controls and add
-Bsymboliclinker optionffi/core.h: Add function annotation__attribute__((visibility("default")))toAPI_EXPORT macro in non-MSC case. (No longer guided by CMakeLists.txt).
ffi/CMakeLists.txt:__declspec( dllexport )).Note
We constrain the linker option change to the "static" build variant as this seems the one to be built and tested by the numba/llmvlite CI.)
Note on
-BSymboliclinker option effect onlibllvmlite.so:The linker option will introduce a
SYMBOLICentry (value:0x0) to the shared object's dynamic section.Calling
readelf -d <filepath>gives us:Appendix: Workaround: Patch SYMBOLIC dynamic section entry into prebuilt
libllvmlite.sovialiefPython packageWe've applied the below script to patch a
libllvmlite.sothat was installed via a prebuiltnumba::llvmliteConda package.This fixed the interposition issue experienced with the above Python snippet in our environment.