From 542eae3d38148943d2a0ce6f4e94211a6b923823 Mon Sep 17 00:00:00 2001
From: Wolfgang Müller
Date: Wed, 2 Oct 2024 18:43:28 +0200
Subject: content: Add post: "musl and a curious Rust segfault"

---
 content/23/index.md | 140 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 140 insertions(+)
 create mode 100644 content/23/index.md

diff --git a/content/23/index.md b/content/23/index.md
new file mode 100644
index 0000000..9699334
--- /dev/null
+++ b/content/23/index.md
@@ -0,0 +1,140 @@
++++
+date = 2024-10-02T18:42:07+02:00
+title = "musl and a curious Rust segfault"
+
+[taxonomies]
+tags = ["bugs"]
+
+[extra]
+related = []
++++
+
+About a week ago I noticed that [`fd(1)`](https://github.com/sharkdp/fd), a
+Rust-based alternative to [`find(1)`](https://www.gnu.org/software/findutils/),
+would suddenly segfault on my [musl](https://www.musl-libc.org/)-based server
+system. Usually a segfault is nothing particularly special to my eyes, but this
+one was different. Even just having `fd(1)` attempt to print its help text was
+enough to trigger it, and when I attempted to debug it with
+[`gdb(1)`](https://www.sourceware.org/gdb/), I saw the following:
+
+```
+(gdb) run
+Starting program: /usr/bin/fd
+
+Program received signal SIGSEGV, Segmentation fault.
+memcpy () at ../src_musl/src/string/x86_64/memcpy.s:18
+warning: 18 ../src_musl/src/string/x86_64/memcpy.s: No such file or directory
+(gdb) bt
+#0 memcpy () at ../src_musl/src/string/x86_64/memcpy.s:18
+#1 0x00007ffff7ab7177 in __copy_tls () at ../src_musl/src/env/__init_tls.c:66
+#2 0x00007ffff7ab730d in static_init_tls () at ../src_musl/src/env/__init_tls.c:149
+#3 0x00007ffff7aae89d in __init_libc () at ../src_musl/src/env/__libc_start_main.c:39
+#4 0x00007ffff7aae9c0 in __libc_start_main () at ../src_musl/src/env/__libc_start_main.c:80
+#5 0x00007ffff74107f6 in _start ()
+```
+
+So... the segfault is in musl, not in `fd`!?
+
+I immediately checked whether other basic programs on the system worked.
+*They did.* I checked when I last updated musl. *A couple of months ago, so that
+can't be it.* I checked specifically whether another Rust-based program worked.
+*It did.*
+
+`fd(1)` had been updated pretty recently, and I remembered it working correctly
+about a month ago, so maybe something specific to `fd(1)`'s usage of Rust
+triggered this segfault in musl? I wanted to make sure I could reproduce this in
+a development environment, so I cloned the `fd(1)` repository, made a debug
+build, and ran it...
+
+*It worked.* Huh!?
+
+I decided it was likely that [`portage`](https://wiki.gentoo.org/wiki/Portage),
+Gentoo's package manager, was building the program differently, so I took care
+to apply the same build flags to the development build. And what can I say:
+
+```
+error: failed to run custom build command for `crossbeam-utils v0.8.20`
+
+Caused by:
+  process didn't exit successfully: `fd/target/[...]/build-script-build`
+  (signal: 11, SIGSEGV: invalid memory reference)
+
+```
+
+... it didn't even get to build the `fd` binary proper. A segfault again, too.
+What on earth was going on? Why didn't this also happen in the `portage` build?
+
+Thankfully I now had a reproducer, so I did the only sensible thing and started
+removing random build flags until I got `fd` to build again. This was our
+culprit:
+
+```
+-Wl,-z,pack-relative-relocs
+```
+
+Already pretty out of my depth considering the fact that I couldn't fathom how
+`fd(1)` got musl to segfault on `memcpy`, I now also found that a piece of the
+puzzle required me to understand specific linker flags. *Oof.*
+
+Unsure what to do next, I decided on a whim to compare the working and the broken
+binary with `readelf(1)`. The most obvious difference was that the working
+binary had its `.rela.dyn`
+[relocation](https://en.wikipedia.org/wiki/Relocation_(computing)) section
+populated with entries whilst the broken one's didn't seem filled out. At a
+loss, I stopped and went to do something else.
+
+The story would probably have ended here had I not mentioned this conundrum to
+[my partner](https://ahti.space/~nortti/) later in the day. We decided to have
+another look at the binaries. After some discussion we determined that the
+working binary was dynamically linked whilst the broken one wasn't. The other
+working Rust-based program, [`rg(1)`](https://github.com/BurntSushi/ripgrep),
+was also dynamically linked and had been built a while ago, so **at some point
+`portage` must have stopped producing Rust executables that were dynamically
+linked**. Finally some progress!
+
+At this point we need some background. Early on, Rust decided to use the
+`x86_64-unknown-linux-musl` target to provide statically-linked binaries that
+would run on a wide range of systems. Whilst support for dynamically linked
+executables on musl systems was [added back in
+2017](https://github.com/rust-lang/rust/pull/40113), the default behaviour was
+never changed, so Gentoo has to make sure to disable static linking by passing
+the `target-feature=-crt-static` flag.
+
+It does this in a system-wide fashion by setting an environment variable in
+[`/etc/env.d`](https://wiki.gentoo.org/wiki//etc/env.d):
+
+```
+$ cat /etc/env.d/50rust-bin-1.80.1
+LDPATH="/usr/lib/rust/lib"
+MANPATH="/usr/lib/rust/man"
+CARGO_TARGET_X86_64_UNKNOWN_LINUX_MUSL_RUSTFLAGS="-C target-feature=-crt-static"
+```
+
+This setting should therefore be picked up by `portage` as well, but when I
+examined its build environment it was simply not there. So finally we come to
+the last piece of the puzzle: a [recent change](https://gitweb.gentoo.org/repo/gentoo.git/commit/eclass/cargo.eclass?id=27d469a2114b4ad0b3e682854c50c806753eb472)
+in how `RUSTFLAGS` are set within `portage`.
+Here's the important part:
+
+```bash
+local -x CARGO_TARGET_"${TRIPLE}"_RUSTFLAGS="-C strip=none -C linker=${LD_A[0]}"
+[[ ${#LD_A[@]} -gt 1 ]] && local CARGO_TARGET_"${TRIPLE}"_RUSTFLAGS+="$(printf -- ' -C link-arg=%s' "${LD_A[@]:1}")"
+local CARGO_TARGET_"${TRIPLE}"_RUSTFLAGS+=" ${RUSTFLAGS}"
+```
+
+Quoth the `bash(1)` manual:
+
+> Local variables "shadow" variables with the same name declared at previous
+> scopes. For instance, a local variable declared in a function hides a global
+> variable of the same name: references and assignments refer to the local
+> variable, leaving the global variable unmodified.
+
+Whereas the target's `RUSTFLAGS` environment variable was previously only touched
+when cross-compiling, it was now always shadowed, which dropped the `-crt-static`
+setting from `/etc/env.d`. To confirm, I edited the file in question to include
+the previous value, and both `fd(1)` and `rg(1)` worked again. Success!
+
+This whole saga was also [reported](https://bugs.gentoo.org/940197) to the
+Gentoo bug tracker and promptly fixed. A project for another day is figuring out
+exactly how a change from dynamic linking to static linking causes segfaults
+like this, because I sure would love to know the details.
--
cgit v1.2.3-2-gb3c3