+++
date = 2024-10-02T18:42:07+02:00
title = "musl and a curious Rust segfault"

[taxonomies]
tags = ["bugs"]

[extra]
related = []
+++

About a week ago I noticed that [`fd(1)`](https://github.com/sharkdp/fd), a Rust-based alternative to [`find(1)`](https://www.gnu.org/software/findutils/), would suddenly segfault on my [musl](https://www.musl-libc.org/)-based server system. Usually a segfault is nothing particularly special to my eyes, but this one was different. Even just having `fd(1)` attempt to print its help text was enough to trigger it, and when I attempted to debug it with [`gdb(1)`](https://www.sourceware.org/gdb/), I saw the following:

```
(gdb) run
Starting program: /usr/bin/fd

Program received signal SIGSEGV, Segmentation fault.
memcpy () at ../src_musl/src/string/x86_64/memcpy.s:18
warning: 18 ../src_musl/src/string/x86_64/memcpy.s: No such file or directory
(gdb) bt
#0 memcpy () at ../src_musl/src/string/x86_64/memcpy.s:18
#1 0x00007ffff7ab7177 in __copy_tls () at ../src_musl/src/env/__init_tls.c:66
#2 0x00007ffff7ab730d in static_init_tls () at ../src_musl/src/env/__init_tls.c:149
#3 0x00007ffff7aae89d in __init_libc () at ../src_musl/src/env/__libc_start_main.c:39
#4 0x00007ffff7aae9c0 in __libc_start_main () at ../src_musl/src/env/__libc_start_main.c:80
#5 0x00007ffff74107f6 in _start ()
```

So... the segfault is in musl, not in `fd`!?

I immediately checked whether other basic programs on the system worked. *They did.* I checked when I last updated musl. *A couple of months ago, so that can't be it.* I checked specifically whether another Rust-based program worked. *It did.*

`fd(1)` had been updated pretty recently, and I remembered it working correctly about a month ago, so maybe something specific to `fd(1)`'s usage of Rust triggered this segfault in musl? I wanted to make sure I could reproduce this in a development environment, so I cloned the `fd(1)` repository, built a debug release, and ran it... *It worked.* Huh!?

I decided it was likely that [`portage`](https://wiki.gentoo.org/wiki/Portage), Gentoo's package manager, was building the program differently, so I took care to apply the same build flags to the development build. And what can I say:

```
error: failed to run custom build command for `crossbeam-utils v0.8.20`

Caused by:
  process didn't exit successfully: `fd/target/[...]/build-script-build` (signal: 11, SIGSEGV: invalid memory reference)
```

... it didn't even get to build the `fd` binary proper. A segfault again, too. What on earth was going on? Why didn't this also happen in the `portage` build?

Thankfully I now had a reproducer, so I did the only sensible thing and started removing random build flags until I got `fd` to build again. This was our culprit:

```
-Wl,-z,pack-relative-relocs
```

Already pretty out of my depth, considering that I couldn't fathom how `fd(1)` got musl to segfault on `memcpy`, I now also found that a piece of the puzzle required me to understand specific linker flags. *Oof.*

Unsure what to do next, I decided on a whim to compare the working and the broken binary with `readelf(1)`. The most obvious difference was that the working binary had its `.rela.dyn` [relocation](https://en.wikipedia.org/wiki/Relocation_(computing)) section populated with entries, whilst the broken one's didn't seem to be filled out. At a loss, I stopped and went to do something else.

The story would probably have ended here had I not mentioned this conundrum to [my partner](https://ahti.space/~nortti/) later in the day. We decided to have another look at the binaries. After some discussion we determined that the working binary was dynamically linked whilst the broken one wasn't. The other working Rust-based program, [`rg(1)`](https://github.com/BurntSushi/ripgrep), was also dynamically linked and had been built a while ago, so **at some point `portage` must have stopped producing Rust executables that were dynamically linked**. Finally some progress!

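
For anyone wanting to repeat the static-versus-dynamic check, here is a minimal sketch of one way to do it with `readelf(1)`. It is not necessarily how we went about it: the `linkage` helper and its output labels are made up for illustration, and it relies on the assumption that a dynamically linked executable carries an `INTERP` program header naming its dynamic linker.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the original investigation): guess how a
# file is linked by looking for an INTERP program header.
linkage() {
    local file="$1"
    # `readelf -h` fails for non-ELF files (and if readelf is not installed).
    if ! readelf -h "$file" >/dev/null 2>&1; then
        echo "not-elf"
    # Dynamically linked executables name their dynamic linker in INTERP.
    elif readelf -lW "$file" 2>/dev/null | grep -q 'INTERP'; then
        echo "dynamic"
    else
        # No INTERP: a static executable, or possibly a shared library.
        echo "static-or-library"
    fi
}

# Classify the given paths, defaulting to whatever `sh` resolves to.
for f in "${@:-$(command -v sh)}"; do
    printf '%s: %s\n' "$f" "$(linkage "$f")"
done
```

Saved under a name of your choosing, it can be pointed at both binaries at once, for example the working `rg` and the broken `fd`.
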
At this point we need some background. Early on, Rust decided to use the `x86_64-unknown-linux-musl` target to provide statically-linked binaries that would run on a wide range of systems. Whilst support for dynamically linked executables on musl systems was [added back in 2017](https://github.com/rust-lang/rust/pull/40113), the default behaviour was never changed, so Gentoo has to make sure to disable static linking by passing the `target-feature=-crt-static` flag.

It does this in a system-wide fashion by setting an environment variable in [`/etc/env.d`](https://wiki.gentoo.org/wiki//etc/env.d):

```
$ cat /etc/env.d/50rust-bin-1.80.1
LDPATH="/usr/lib/rust/lib"
MANPATH="/usr/lib/rust/man"
CARGO_TARGET_X86_64_UNKNOWN_LINUX_MUSL_RUSTFLAGS="-C target-feature=-crt-static"
```

This setting should therefore be picked up by `portage` as well, but when I examined its build environment it was simply not there.

So finally we come to the last piece of the puzzle: a [recent change](https://gitweb.gentoo.org/repo/gentoo.git/commit/eclass/cargo.eclass?id=27d469a2114b4ad0b3e682854c50c806753eb472) in how `RUSTFLAGS` are set within `portage`. Here's the important part:

```bash
local -x CARGO_TARGET_"${TRIPLE}"_RUSTFLAGS="-C strip=none -C linker=${LD_A[0]}"
[[ ${#LD_A[@]} -gt 1 ]] && local CARGO_TARGET_"${TRIPLE}"_RUSTFLAGS+="$(printf -- ' -C link-arg=%s' "${LD_A[@]:1}")"
local CARGO_TARGET_"${TRIPLE}"_RUSTFLAGS+=" ${RUSTFLAGS}"
```

Quoth the `bash(1)` manual:

> Local variables "shadow" variables with the same name declared at previous
> scopes. For instance, a local variable declared in a function hides a global
> variable of the same name: references and assignments refer to the local
> variable, leaving the global variable unmodified.

Whereas previously the target-specific `RUSTFLAGS` environment variable was only touched when cross-compiling, it was now overridden. To confirm, I edited the file in question to include the previous value, and both `fd(1)` and `rg(1)` worked again. Success!

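
The shadowing is easy to reproduce outside of `portage`. What follows is a minimal sketch, not the eclass code: `DEMO_RUSTFLAGS` and both function names are invented for illustration, with `DEMO_RUSTFLAGS` standing in for the variable set via `/etc/env.d`. It shows that a child process started inside the function only sees the `local -x` value, unless the previous value is explicitly carried over.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for the system-wide variable from /etc/env.d.
export DEMO_RUSTFLAGS="-C target-feature=-crt-static"

build_shadowing() {
    # `local -x` declares a new exported variable that hides the global one.
    local -x DEMO_RUSTFLAGS="-C strip=none"
    # The child process inherits the local value only.
    bash -c 'printf "%s\n" "$DEMO_RUSTFLAGS"'
}

build_preserving() {
    # The right-hand side is expanded before the local is created,
    # so the global value can be carried over.
    local -x DEMO_RUSTFLAGS="-C strip=none ${DEMO_RUSTFLAGS}"
    bash -c 'printf "%s\n" "$DEMO_RUSTFLAGS"'
}

shadowed=$(build_shadowing)
preserved=$(build_preserving)

printf 'shadowing:  %s\n' "$shadowed"
printf 'preserving: %s\n' "$preserved"
printf 'global:     %s\n' "$DEMO_RUSTFLAGS"
```

In the shadowing case the `-crt-static` flag never reaches the child, which matches what I saw in the `portage` build environment. The global variable itself is left untouched in both cases.
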
This whole saga was also [reported](https://bugs.gentoo.org/940197) to the Gentoo bug tracker and promptly fixed. A project for another day is figuring out exactly how a change from static linking to dynamic linking causes segfaults like this, because I sure would love to know the details.