-
Notifications
You must be signed in to change notification settings - Fork 301
Description
Once in a while, when the code aligns just right, binaries will segfault on Windows - usually when the native runtime is trying to traverse the stack to unwind it. You "fix" it by making more random changes to the code until it goes away.
We've been trying to debug this for years, but it's easier to kick the can down the road than to look at the stack in GDB and wonder what's amiss and why. Then the bug got cocky.
This nim-eth PR had some harmless changes for Nim-1.6 compatibility, when a test binary started crashing on Windows. I replaced "unittest" with "unittest2" and the crash went away... only to come back in a different test binary.
The bug was mocking us.
Back to Google the drawing board, then! I started reading all the articles/bug reports I could find about stack corruption related to Mingw-w64, until I landed on a promising connection: setjmp/longjmp known to be buggy - that's exactly what Nim uses, by default, for its exception system.
Nim is not the only compiler to do this, of course, so the problem also hit OCaml: ocaml/ocaml#7638
and FreeBASIC: https://www.freebasic.net/forum/viewtopic.php?f=3&t=29124
and Ruby, R, etc.
The solution people usually come up with is switching to some - undocumented until recently - GCC builtins: __builtin_setjmp() and __builtin_longjmp(). These do require a different size buffer as an argument and come up with a big caveat: you cannot use them for a local jump (in the same function).
https://gcc.gnu.org/onlinedocs/gcc/Nonlocal-Gotos.html is pretty clear about it: "Because __builtin_longjmp depends on the function return mechanism to restore the stack context, it cannot be called from the same function calling __builtin_setjmp to initialize buf. It can only be called from a function called (directly or indirectly) from the function calling __builtin_setjmp."
So it's not "undefined behaviour, but it will probably work". It won't work, period.
Now, I did try it with the nim-eth test suite and it turns out we do throw and catch exceptions in the same Nim procedure, in Chronos.
In theory, you can also end up in that no-go zone when the C compiler inlines functions behind your back. Or maybe it's smarter than I think. Anyway, these builtins are better left for benchmarking and playing.
Back to Mingw-w64 and its overly complicated "setjmp.h" wrapping: https://github.com/msys2-contrib/mingw-w64/blob/cdb052f1d4056cd510cb83197b55868427b87476/mingw-w64-headers/crt/setjmp.h
We're on 64-bit, using SEH, no fancy new UCRT, no fancy defines, so we land here:
# define setjmp(BUF) _setjmp((BUF), __builtin_frame_address (0))Why is there a second argument to this non-standard _setjmp() function? Back when Microsoft decided to embrace, extend and extinguish C, they came up with an exception system for it that matched the one in C++, so "you can ensure that resources, such as memory blocks and files, get released correctly if execution unexpectedly terminates". Meet "Structured Exception Handling" (SEH): https://docs.microsoft.com/en-us/cpp/cpp/structured-exception-handling-c-cpp?view=msvc-170
Something that nobody wanted, with an undocumented second _setjmp() parameter that only VC++ can safely set with some stack/frame pointer info to allow stack unwinding from the destination back to the source of the long jump, complete with triggering C++ class destructors for compatibility with standard C++ exceptions. Utter madness.
And what does Mingw-w64 do with this? It tries to use it, of course, by guessing what it should stuff in that second parameter. To be fair, it probably can't use the provided standard library functions, because all that stack unwinding creates havoc in anything that is not compiled by VC++, but still...
First they passed a stack pointer in there - the result of mingw_getsp(). This lead to sporadic stack corruption and segfaults. Then they thought that maybe a frame pointer is better - in comes __builtin_frame_address(0). Fewer segfaults, high-fives all around. Then somebody figures out that GCC is misaligning that second function parameter. It gets fixed, fewer segfaults, celebrations across the world.
So we just upgrade the C compiler in MSYS2 and be on our merry way, right? Turns out the gcc-11.2.0 in there already has the fix. We had two different segfaults in two different nim-eth test binaries, with a fixed Mingw-w64 header and a fixed GCC...
What do?
I tried to use Nim's --exceptions:goto, but they're too buggy. Apparently, those only get tested with ARC/ORC - where they're also buggy - so that's a no-go. "Quirky" exceptions? I won't even ask.
Some people figured out that you can just pass a NULL instead of that buffer pointer, as the second argument of _setjmp(). The thinking goes like this: if the Windows runtime sees that there's no stack/frame pointer info, it won't try to unwind the stack at all - and it seems to work.
One guy says that "on x64 this argument has not to be NULL, as otherwise function doesn't behalf as standard setjmp/longjmp would expect".
Another guy says something in the vein of "it's totally cool, bro". I trust this guy, because I dig his style.
Going bach to Mingw-w64's "setjmp.h" header, I noticed I could get that NULL if __SEH__ was not defined. I sure hope this is not one of those toolchain-build-time options, because I have no intention of switching from MSYS2's default, but the oracle stays silent on ways to achieve that from the GCC command line.
What if I call _setjmp() myself, from the Nim runtime? Yes, a compiler patch is needed, which usually takes a very long time to be upstreamed, but maybe other Windows users out there might want fewer segfaults in a language primarily developed on Windows, by Windows users for Windows users, methinks.
I present to you *the patch*: nim-lang/Nim@devel...status-im:mingw-longjmp-devel
It's not even its final form. Its final form will actually be a dead link, but until then it will get docs and tests.
What it does is extend the undocumented Nim define nimRawSetjmp - which works with the POSIX version of _setjmp(), the one with a single param - so it also works on Windows, with the two-param version.
I left there my GCC builtins experiment, in case someone wants to benchmark it, under the form of a new Nim define.
Here's the nim-eth Windows CI job, passing with my compiler branch: https://github.com/status-im/nim-eth/runs/4313117389?check_suite_focus=true