+++
date = 2024-10-02T18:42:07+02:00
title = "musl and a curious Rust segfault"
[taxonomies]
tags = ["bugs"]
[extra]
related = []
+++
About a week ago I noticed that [`fd(1)`](https://github.com/sharkdp/fd), a
Rust-based alternative to [`find(1)`](https://www.gnu.org/software/findutils/),
would suddenly segfault on my [musl](https://www.musl-libc.org/)-based server
system. Usually a segfault is nothing particularly special to my eyes, but this
one was different. Even just having `fd(1)` attempt to print its help text was
enough to trigger it, and when I attempted to debug it with
[`gdb(1)`](https://www.sourceware.org/gdb/), I saw the following:
```
(gdb) run
Starting program: /usr/bin/fd
Program received signal SIGSEGV, Segmentation fault.
memcpy () at ../src_musl/src/string/x86_64/memcpy.s:18
warning: 18 ../src_musl/src/string/x86_64/memcpy.s: No such file or directory
(gdb) bt
#0 memcpy () at ../src_musl/src/string/x86_64/memcpy.s:18
#1 0x00007ffff7ab7177 in __copy_tls () at ../src_musl/src/env/__init_tls.c:66
#2 0x00007ffff7ab730d in static_init_tls () at ../src_musl/src/env/__init_tls.c:149
#3 0x00007ffff7aae89d in __init_libc () at ../src_musl/src/env/__libc_start_main.c:39
#4 0x00007ffff7aae9c0 in __libc_start_main () at ../src_musl/src/env/__libc_start_main.c:80
#5 0x00007ffff74107f6 in _start ()
```
So... the segfault is in musl, not in `fd`!?

I immediately checked whether other basic programs on the system worked. *They
did.* I checked when I last updated musl. *A couple of months ago, so that can't
be it.* I checked specifically whether another Rust-based program worked. *It
did.*

`fd(1)` had been updated pretty recently, and I remembered it working correctly
about a month ago, so maybe something specific to `fd(1)`'s usage of Rust
triggered this segfault in musl? I wanted to make sure I could reproduce this in
a development environment, so I cloned the `fd(1)` repository, made a debug
build, and ran it...

*It worked.* Huh!?

I decided it was likely that [`portage`](https://wiki.gentoo.org/wiki/Portage),
Gentoo's package manager, was building the program differently, so I took care
to apply the same build flags to the development build. And what can I say:
```
error: failed to run custom build command for `crossbeam-utils v0.8.20`
Caused by:
process didn't exit successfully: `fd/target/[...]/build-script-build`
(signal: 11, SIGSEGV: invalid memory reference)
```
... it didn't even get to build the `fd` binary proper. A segfault again, too.
What on earth was going on? Why didn't this also happen in the `portage` build?
Thankfully I now had a reproducer, so I did the only sensible thing and started
removing random build flags until I got `fd` to build again. This was our
culprit:
```
-Wl,-z,pack-relative-relocs
```
Already pretty out of my depth considering the fact that I couldn't fathom how
`fd(1)` got musl to segfault on `memcpy`, I now also found that a piece of the
puzzle required me to understand specific linker flags. *Oof.*

Unsure what to do next, I decided on a whim to compare the working and the
broken binary with `readelf(1)`. The most obvious difference was that the
working binary had its `.rela.dyn`
[relocation](https://en.wikipedia.org/wiki/Relocation_(computing)) section
populated with entries whilst the broken one was missing `.rela.dyn` but had
`.relr.dyn` instead. At a loss, I stopped and went to do something else.
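
For anyone wanting to repeat that comparison, here is a minimal sketch of how the relevant lines can be pulled out of `readelf -S` output. The helper name `reloc_sections` and the binary paths in the usage comment are made up for illustration and are not taken from the original investigation.

```shell
#!/bin/sh
# Reads a `readelf -S` section listing on stdin and prints each dynamic
# relocation section name it finds (.rel.dyn, .rela.dyn or .relr.dyn) once.
reloc_sections() {
    grep -oE '\.rel[ar]?\.dyn' | sort -u
}

# Hypothetical usage (placeholder paths):
#   readelf -S -W ./fd-working | reloc_sections
#   readelf -S -W ./fd-broken  | reloc_sections
```

Feeding both binaries through the same filter makes the `.rela.dyn` versus `.relr.dyn` difference stand out without wading through the full section table.
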
The story would probably have ended here had I not mentioned this conundrum to
[my partner](https://ahti.space/~nortti/) later in the day. We decided to have
another look at the binaries. After some discussion we determined that the
working binary was dynamically linked whilst the broken one wasn't. The other
working Rust-based program, [`rg(1)`](https://github.com/BurntSushi/ripgrep),
was also dynamically linked and had been built a while ago, so **at some point
`portage` must have stopped producing Rust executables that were dynamically
linked**. Finally some progress!
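
One common way to tell the two kinds of binary apart is to look for an `INTERP` program header in `readelf -l` output. The sketch below assumes that heuristic: a binary with an `INTERP` header asks for a dynamic loader, and one without it (including static-pie executables) does not. The helper name `linkage_kind` is hypothetical, and this is not necessarily how we checked at the time.

```shell
#!/bin/sh
# Reads a `readelf -l` program-header listing on stdin and classifies it.
# Assumption: an INTERP header means "dynamically linked"; anything else
# is reported as static.
linkage_kind() {
    if grep -qw 'INTERP'; then
        echo dynamic
    else
        echo static
    fi
}

# Hypothetical usage (placeholder path):
#   readelf -l -W /usr/bin/fd | linkage_kind
```
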
At this point we need some background. Early on, Rust decided to use the
`x86_64-unknown-linux-musl` target to provide statically-linked binaries that
would run on a wide range of systems. Whilst support for dynamically linked
executables on musl systems was [added back in
2017](https://github.com/rust-lang/rust/pull/40113), the default behaviour was
never changed, so Gentoo has to make sure to disable static linking by passing
the `target-feature=-crt-static` flag.

It does this in a system-wide fashion by setting an environment variable in
[`/etc/env.d`](https://wiki.gentoo.org/wiki//etc/env.d):
```
$ cat /etc/env.d/50rust-bin-1.80.1
LDPATH="/usr/lib/rust/lib"
MANPATH="/usr/lib/rust/man"
CARGO_TARGET_X86_64_UNKNOWN_LINUX_MUSL_RUSTFLAGS="-C target-feature=-crt-static"
```
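
The long variable name follows cargo's convention for per-target settings: the target triple is upper-cased and its dashes become underscores. A small sketch of that mapping, with `cargo_rustflags_var` being a made-up helper name:

```shell
#!/bin/sh
# Builds the name of cargo's per-target RUSTFLAGS variable from a triple
# by upper-casing it and replacing every dash with an underscore.
cargo_rustflags_var() {
    printf 'CARGO_TARGET_%s_RUSTFLAGS\n' \
        "$(printf '%s' "$1" | tr 'a-z' 'A-Z' | sed 's/-/_/g')"
}

cargo_rustflags_var x86_64-unknown-linux-musl
# prints CARGO_TARGET_X86_64_UNKNOWN_LINUX_MUSL_RUSTFLAGS
```
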
This setting should therefore be picked up by `portage` as well, but when I
examined its build environment it was simply not there. So finally we come to
the last piece of the puzzle: a [recent
change](https://gitweb.gentoo.org/repo/gentoo.git/commit/eclass/cargo.eclass?id=27d469a2114b4ad0b3e682854c50c806753eb472)
in how `RUSTFLAGS` are set within `portage`. Here's the important part:
```bash
local -x CARGO_TARGET_"${TRIPLE}"_RUSTFLAGS="-C strip=none -C linker=${LD_A[0]}"
[[ ${#LD_A[@]} -gt 1 ]] && local CARGO_TARGET_"${TRIPLE}"_RUSTFLAGS+="$(printf -- ' -C link-arg=%s' "${LD_A[@]:1}")"
local CARGO_TARGET_"${TRIPLE}"_RUSTFLAGS+=" ${RUSTFLAGS}"
```
Quoth the `bash(1)` manual:
> Local variables "shadow" variables with the same name declared at previous
> scopes. For instance, a local variable declared in a function hides a global
> variable of the same name: references and assignments refer to the local
> variable, leaving the global variable unmodified.

Whereas previously the target-specific `RUSTFLAGS` environment variable was only
touched when cross-compiling, it was now overridden by a fresh local declaration
that discarded the value set in `/etc/env.d`. To confirm, I edited the file in
question to include the previous value, and both `fd(1)` and `rg(1)` worked
again. Success!
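
The shadowing is easy to reproduce in isolation. The sketch below is a stand-in, not the eclass itself: `DEMO_RUSTFLAGS`, `build_flags` and the flag values are placeholders chosen for illustration, and it needs `bash` rather than a plain POSIX shell.

```shell
#!/usr/bin/env bash
# Stand-in for the system-wide value exported via /etc/env.d.
export DEMO_RUSTFLAGS="-C target-feature=-crt-static"

# Mimics the pattern from the snippet above: declare a fresh local, then
# append to it. The appends extend the local, so the exported value never
# makes it into the result.
build_flags() {
    local -x DEMO_RUSTFLAGS="-C strip=none -C linker=cc"
    local DEMO_RUSTFLAGS+=" -C debuginfo=2"
    printf '%s\n' "${DEMO_RUSTFLAGS}"
}

build_flags                         # -C strip=none -C linker=cc -C debuginfo=2
printf '%s\n' "${DEMO_RUSTFLAGS}"   # -C target-feature=-crt-static
```

Expanding the outer value inside the first assignment, as in `local -x DEMO_RUSTFLAGS="-C strip=none ${DEMO_RUSTFLAGS}"`, is one way to carry it over, since the expansion happens before the local is created. Whether the actual fix took that shape is not something this sketch claims.
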
This whole saga was also [reported](https://bugs.gentoo.org/940197) to the
Gentoo bug tracker and promptly fixed. A project for another day is figuring out
exactly how a change from dynamic linking to static linking causes segfaults
like this, because I sure would love to know the details.