Diffstat (limited to 'content/23/index.md')
-rw-r--r-- content/23/index.md | 140
1 files changed, 140 insertions, 0 deletions
diff --git a/content/23/index.md b/content/23/index.md
new file mode 100644
index 0000000..442f383
--- /dev/null
+++ b/content/23/index.md
@@ -0,0 +1,140 @@
++++
+date = 2024-10-02T18:42:07+02:00
+title = "musl and a curious Rust segfault"
+
+[taxonomies]
+tags = ["bugs"]
+
+[extra]
+related = []
++++
+
+About a week ago I noticed that [`fd(1)`](https://github.com/sharkdp/fd), a
+Rust-based alternative to [`find(1)`](https://www.gnu.org/software/findutils/),
+would suddenly segfault on my [musl](https://www.musl-libc.org/)-based server
+system. Usually a segfault is nothing particularly special to my eyes, but this
+one was different. Even just having `fd(1)` attempt to print its help text was
+enough to trigger it, and when I attempted to debug it with
+[`gdb(1)`](https://www.sourceware.org/gdb/), I saw the following:
+
+```
+(gdb) run
+Starting program: /usr/bin/fd
+
+Program received signal SIGSEGV, Segmentation fault.
+memcpy () at ../src_musl/src/string/x86_64/memcpy.s:18
+warning: 18 ../src_musl/src/string/x86_64/memcpy.s: No such file or directory
+(gdb) bt
+#0 memcpy () at ../src_musl/src/string/x86_64/memcpy.s:18
+#1 0x00007ffff7ab7177 in __copy_tls () at ../src_musl/src/env/__init_tls.c:66
+#2 0x00007ffff7ab730d in static_init_tls () at ../src_musl/src/env/__init_tls.c:149
+#3 0x00007ffff7aae89d in __init_libc () at ../src_musl/src/env/__libc_start_main.c:39
+#4 0x00007ffff7aae9c0 in __libc_start_main () at ../src_musl/src/env/__libc_start_main.c:80
+#5 0x00007ffff74107f6 in _start ()
+```
+
+So... the segfault is in musl, not in `fd`!?
+
+I immediately checked whether other basic programs on the system worked. *They
+did.* I checked when I last updated musl. *A couple of months ago, so that can't
+be it.* I checked specifically whether another Rust-based program worked. *It
+did.*
+
+`fd(1)` had been updated pretty recently, and I remembered it working correctly
+about a month ago, so maybe something specific to `fd(1)`'s usage of Rust
+triggered this segfault in musl? I wanted to make sure I could reproduce this in
+a development environment, so I cloned the `fd(1)` repository, built a debug
+binary, and ran it...
+
+*It worked.* Huh!?
+
+I decided it was likely that [`portage`](https://wiki.gentoo.org/wiki/Portage),
+Gentoo's package manager, was building the program differently, so I took care
+to apply the same build flags to the development build. And what can I say:
+
+```
+error: failed to run custom build command for `crossbeam-utils v0.8.20`
+
+Caused by:
+ process didn't exit successfully: `fd/target/[...]/build-script-build`
+ (signal: 11, SIGSEGV: invalid memory reference)
+
+```
+
+... it didn't even get to build the `fd` binary proper. A segfault again, too.
+What on earth was going on? Why didn't this also happen in the `portage` build?
+
+Thankfully I now had a reproducer, so I did the only sensible thing and started
+removing random build flags until I got `fd` to build again. This was our
+culprit:
+
+```
+-Wl,-z,pack-relative-relocs
+```
+
+Already pretty out of my depth considering the fact that I couldn't fathom how
+`fd(1)` got musl to segfault on `memcpy`, I now also found that a piece of the
+puzzle required me to understand specific linker flags. *Oof.*
+
+Unsure what to do next, I decided on a whim to compare the working and the
+broken binary with `readelf(1)`. The most obvious difference was that the
+working binary had its `.rela.dyn`
+[relocation](https://en.wikipedia.org/wiki/Relocation_(computing)) section
+populated with entries whilst the broken one was missing `.rela.dyn` but had
+`.relr.dyn` instead. At a loss, I stopped and went to do something else.
+
+The story would probably have ended here had I not mentioned this conundrum to
+[my partner](https://ahti.space/~nortti/) later in the day. We decided to have
+another look at the binaries. After some discussion we determined that the
+working binary was dynamically linked whilst the broken one wasn't. The other
+working Rust-based program, [`rg(1)`](https://github.com/BurntSushi/ripgrep),
+was also dynamically linked and had been built a while ago, so **at some point
+`portage` must have stopped producing Rust executables that were dynamically
+linked**. Finally some progress!
+
+At this point we need some background. Early on, Rust decided to use the
+`x86_64-unknown-linux-musl` target to provide statically-linked binaries that
+would run on a wide range of systems. Whilst support for dynamically linked
+executables on musl systems was [added back in
+2017](https://github.com/rust-lang/rust/pull/40113), the default behaviour was
+never changed, so Gentoo has to make sure to disable static linking by passing
+the `target-feature=-crt-static` flag.
+
+It does this in a system-wide fashion by setting an environment variable in
+[`/etc/env.d`](https://wiki.gentoo.org/wiki//etc/env.d):
+
+```
+$ cat /etc/env.d/50rust-bin-1.80.1
+LDPATH="/usr/lib/rust/lib"
+MANPATH="/usr/lib/rust/man"
+CARGO_TARGET_X86_64_UNKNOWN_LINUX_MUSL_RUSTFLAGS="-C target-feature=-crt-static"
+```
+
+This setting should therefore be picked up by `portage` as well, but when I
+examined its build environment it was simply not there. So finally we come to
+the last piece of the puzzle: a [recent
+change](https://gitweb.gentoo.org/repo/gentoo.git/commit/eclass/cargo.eclass?id=27d469a2114b4ad0b3e682854c50c806753eb472)
+in how `RUSTFLAGS` are set within `portage`. Here's the important part:
+
+```bash
+local -x CARGO_TARGET_"${TRIPLE}"_RUSTFLAGS="-C strip=none -C linker=${LD_A[0]}"
+[[ ${#LD_A[@]} -gt 1 ]] && local CARGO_TARGET_"${TRIPLE}"_RUSTFLAGS+="$(printf -- ' -C link-arg=%s' "${LD_A[@]:1}")"
+local CARGO_TARGET_"${TRIPLE}"_RUSTFLAGS+=" ${RUSTFLAGS}"
+```
+
+Quoth the `bash(1)` manual:
+
+> Local variables "shadow" variables with the same name declared at previous
+> scopes. For instance, a local variable declared in a function hides a global
+> variable of the same name: references and assignments refer to the local
+> variable, leaving the global variable unmodified.
+
+Whereas previously the target-specific `RUSTFLAGS` environment variable was
+only touched when cross-compiling, it was now overridden, which dropped the
+setting from `/etc/env.d`. To confirm, I edited the file in question to include
+the previous value, and both `fd(1)` and `rg(1)` worked again. Success!
+
+This whole saga was also [reported](https://bugs.gentoo.org/940197) to the
+Gentoo bug tracker and promptly fixed. A project for another day is figuring out
+exactly how a change from dynamic linking to static linking causes segfaults
+like this, because I sure would love to know the details.
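The shadowing behaviour quoted from the `bash(1)` manual in the post can be reproduced in isolation. The following is a minimal sketch, not the eclass code and not necessarily how the Gentoo fix works: `DEMO_RUSTFLAGS`, `configure_shadowed`, `configure_preserving` and the `linker=cc` value are all hypothetical names chosen for illustration. It assumes only that an exported variable is already set in the environment, as `/etc/env.d` does for the real one.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for the variable that /etc/env.d exports system-wide.
export DEMO_RUSTFLAGS="-C target-feature=-crt-static"

# Mimics the pattern from the post: a local, exported variable with the same
# name is declared with a fresh value, so the inherited value is hidden.
configure_shadowed() {
    local -x DEMO_RUSTFLAGS="-C strip=none"
    DEMO_RUSTFLAGS+=" -C linker=cc"   # appends to the local, not the global
    printf '%s\n' "${DEMO_RUSTFLAGS}"
}

# One possible way to keep the inherited value (an assumption, not the actual
# fix): read the outer variable before declaring the local that shadows it.
configure_preserving() {
    local inherited="${DEMO_RUSTFLAGS}"
    local -x DEMO_RUSTFLAGS="-C strip=none -C linker=cc"
    DEMO_RUSTFLAGS+=" ${inherited}"
    printf '%s\n' "${DEMO_RUSTFLAGS}"
}

configure_shadowed      # the -crt-static setting is missing here
configure_preserving    # the -crt-static setting is appended here
printf '%s\n' "${DEMO_RUSTFLAGS}"   # the global itself is never modified
```

In the first function, anything spawned from inside it would see only the local value, which matches the symptom of the setting being absent from the build environment.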