Jammy builds of xen segfault, but only on launchpad x86 builders

Bug #1958389 reported by Christian Ehrhardt 
Affects             Status        Importance  Assigned to  Milestone
binutils            Fix Released  Medium
launchpad-buildd    New           Undecided   Unassigned
binutils (Debian)   Fix Released  Unknown
binutils (Ubuntu)   Fix Released  Critical    Unassigned
  Jammy             Fix Released  Critical    Unassigned
xen (Ubuntu)        Invalid       Undecided   Unassigned
  Jammy             Invalid       Undecided   Unassigned

Bug Description

FTBFS in Jammy on LP infra:
https://launchpadlibrarian.net/580924961/buildlog_ubuntu-jammy-amd64.xen_4.16.0-1~ubuntu1~jammyppa4_BUILDING.txt.gz
https://launchpadlibrarian.net/581060687/buildlog_ubuntu-jammy-amd64.xen_4.16.0-1~ubuntu1~jammyppa6_BUILDING.txt.gz
Related PPA:
https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/4760/+packages

Summary:
- Build reliably fails on LP
- Build in local sbuild works reliably on my Laptop
- Build in local VM (sizing like LP builders) works (other crashes but works)
- Build on AMD server (chip more similar to LP) works reliably

Failing step:

On Launchpad build infrastructure it breaks on ld:
$ x86_64-linux-gnu-ld -mi386pep --subsystem=10 --image-base=0xffff82d040000000 --stack=0,0 --heap=0,0 --section-alignment=0x200000 --file-alignment=0x20 --major-image-version=4 --minor-image-version=16 --major-os-version=2 --minor-os-version=0 --major-subsystem-version=2 --minor-subsystem-version=0 --no-insert-timestamp --build-id=sha1 -T efi.lds -N prelink.o /<<PKGBUILDDIR>>/xen/common/symbols-dummy.o -b pe-x86-64 efi/buildid.o -o /<<PKGBUILDDIR>>/xen/.xen.efi.0xffff82d040000000.0 && :
Segmentation fault (core dumped)

---

Steps to recreate (result depends on platform)

# you can grab the package from https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/4760/+packages

sudo vim /etc/apt/sources.list
sudo apt update
sudo apt dist-upgrade -y
sudo apt build-dep xen
sudo apt install flex bison python3-dev libpython3-dev dpkg-dev devscripts apport-retrace
sudo mkdir /mnt/build
sudo chmod go+w /mnt/build
cd /mnt/build
# copy in things from host
scp xen_4.16.0-1~ubuntu1~jammyppa6.dsc xen_4.16.0-1~ubuntu1~jammyppa6.debian.tar.xz xen_4.16.0.orig.tar.bz2 ubuntu@<TODO>:/mnt/build
dpkg-source -x xen_4.16.0-1~ubuntu1~jammyppa6.dsc xen_4.16.0
cd xen_4.16.0
dpkg-buildpackage -i -us -uc -b

---

In a jammy VM 4cpu/8G I get some avx2 crashes but the build works:

Jan 19 07:41:27 j kernel: x86_64-linux-gn[130016]: segfault at 0 ip 00007f189432ef3d sp 00007ffc8e2361d8 error 4 in libc.so.6[7f18941bb000+194000]
Jan 19 07:41:27 j kernel: Code: f8 77 c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 89 f8 48 89 fa c5 f9 ef c0 25 ff 0f 00 00 3d e0 0f 00 00 0f 87 33 01 00 00 <c5> fd 74 0f c5 fd d7 c1 85 c0 74 57 f3 0f bc c0 c5 f8 77 c3 66 66

#0 __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:74
74 ../sysdeps/x86_64/multiarch/strlen-avx2.S: No such file or directory.
(gdb) bt
#0 __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:74
#1 0x00007fa98d63c2d0 in ?? () from /lib/x86_64-linux-gnu/libbfd-2.37.50-system.20220106.so
#2 0x00007fa98d6021e8 in ?? () from /lib/x86_64-linux-gnu/libbfd-2.37.50-system.20220106.so
#3 0x00007fa98d602509 in coff_write_alien_symbol () from /lib/x86_64-linux-gnu/libbfd-2.37.50-system.20220106.so
#4 0x00007fa98d6033bd in _bfd_coff_final_link () from /lib/x86_64-linux-gnu/libbfd-2.37.50-system.20220106.so
#5 0x0000562bdaaae3bf in ?? ()
#6 0x00007fa98d2e8fd0 in __libc_start_call_main (main=main@entry=0x562bdaaad5e0, argc=argc@entry=8, argv=argv@entry=0x7ffc797f2968) at ../sysdeps/nptl/libc_start_call_main.h:58
#7 0x00007fa98d2e907d in __libc_start_main_impl (main=0x562bdaaad5e0, argc=8, argv=0x7ffc797f2968, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>,
    stack_end=0x7ffc797f2958) at ../csu/libc-start.c:409
#8 0x0000562bdaaad515 in ?? ()

^^ that is a different crash than on the LP builders
! And despite those crashes the build does appear to work oO?!

I see the same crashes on my local sbuild runs; the full set from one build is:
Jan 19 07:39:02 Keschdeichel kernel: x86_64-linux-gn[4131180]: segfault at 0 ip 00007f566e8b3f3d sp 00007ffde04b75a8 error 4 in libc.so.6[7f566e740000+194000]
Jan 19 07:39:03 Keschdeichel kernel: x86_64-linux-gn[4131332]: segfault at 0 ip 00007fbba26e4f3d sp 00007fffab8a5b68 error 4 in libc.so.6[7fbba2571000+194000]
Jan 19 07:39:03 Keschdeichel kernel: x86_64-linux-gn[4131382]: segfault at 0 ip 00007fe3681b7f3d sp 00007ffcbbf16628 error 4 in libc.so.6[7fe368044000+194000]
Jan 19 07:39:42 Keschdeichel kernel: x86_64-linux-gn[4134584]: segfault at 0 ip 00007f241f455f3d sp 00007ffd05c2e7c8 error 4 in libc.so.6[7f241f2e2000+194000]
Jan 19 07:44:57 Keschdeichel kernel: x86_64-linux-gn[4171794]: segfault at 0 ip 00007fcbe1f2bf3d sp 00007fff62005aa8 error 4 in libc.so.6[7fcbe1db8000+194000]
Jan 19 07:44:57 Keschdeichel kernel: x86_64-linux-gn[4172028]: segfault at 0 ip 00007f601dfa3f3d sp 00007ffe67ca2788 error 4 in libc.so.6[7f601de30000+194000]
Jan 19 07:44:58 Keschdeichel kernel: x86_64-linux-gn[4172154]: segfault at 0 ip 00007f1bfabb7f3d sp 00007ffe5ce9dfb8 error 4 in libc.so.6[7f1bfaa44000+194000]
Jan 19 07:45:05 Keschdeichel kernel: x86_64-linux-gn[4174536]: segfault at 0 ip 00007f0f48986f3d sp 00007ffc9e72ea48 error 4 in libc.so.6[7f0f48813000+194000]
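
The kernel lines above can be cross-checked by subtracting the mapping base (the first number in the square brackets after libc.so.6) from the faulting ip. This is only a sketch: the helper name segv_offset is made up for illustration, and it assumes bash and the log format shown above.

```shell
#!/usr/bin/env bash
# Print the offset of the faulting instruction inside the mapped object,
# given one kernel "segfault at ..." log line as the first argument.
# segv_offset is a hypothetical helper name used only for this sketch.
segv_offset() {
    local line=$1 ip base
    # " ip <hex> " is the faulting instruction pointer.
    local re_ip=' ip ([0-9a-f]+) '
    # "[<hex>+<hex>]" is the mapping base and size of the object.
    local re_map='\[([0-9a-f]+)\+[0-9a-f]+\]'
    [[ $line =~ $re_ip ]] || return 1
    ip=${BASH_REMATCH[1]}
    [[ $line =~ $re_map ]] || return 1
    base=${BASH_REMATCH[1]}
    printf '0x%x\n' $(( 0x$ip - 0x$base ))
}
```

Fed with the lines quoted in this report, it prints the same offset, 0x173f3d, for the VM crash and for every laptop crash, which is consistent with all of them hitting the same instruction in libc.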

I checked: this is not in the configure stage, where such crashes are sometimes intentional.

Running in a local VM with reduced cpu features (e.g. no avx2) still triggers
the same bfd issues, but the build still succeeds.

---

The LP run is on a Rome chip, from the build env:
Model name: AMD EPYC-Rome Processor
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl xtopology cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr arat npt nrip_save umip rdpid

So I thought I might need to re-create the build on such a chip to check
if it fails there too.

Running on Riccioli, a machine from the kernel team with similar HW (AMD EPYC 7713),
works fine (like my local build does, but this time without any crashes)

---

I do not know how to continue: on my laptop, in VM guests, on AMD servers
similar to the build farm, ... the package builds everywhere I try to reproduce it.

But on launchpad it crashes with the reported error.

Is it the toolchain that needs a fix, is it the launchpad builder setup, both?
I do not know ... :-/
Filing this against xen+binutils+launchpad-buildd

Tags: fr-2001
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

FYI: I've seen some shortening of the reported crashes going on.
For example, in the journal it was "x86_64-linux-gn" and in the build log the message followed the "x86_64-linux-gnu-ld" call. The local (non-fatal) crash I eventually got was in "x86_64-linux-gnu-ld.bfd".
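
The shortening is most likely the kernel's limit on task names: the comm value printed in segfault messages holds at most 15 characters (TASK_COMM_LEN is 16, including the terminating NUL). A small bash sketch that mimics the truncation, with comm_name as a made-up helper name:

```shell
#!/usr/bin/env bash
# Mimic how the kernel shortens a program name for its "comm" field:
# take the basename of the executable and keep at most 15 characters.
# comm_name is a hypothetical helper name used only for this sketch.
comm_name() {
    local base=${1##*/}
    printf '%s\n' "${base:0:15}"
}
```

Both /usr/bin/x86_64-linux-gnu-ld and /usr/bin/x86_64-linux-gnu-ld.bfd come out as "x86_64-linux-gn", so the journal line alone cannot tell the two names apart.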

This might or might not be the same crash on LP and locally, but as I mentioned above, one fails critically while the other one is ignored.

Could someone run that build manually on LP and/or stop it before cleanup to gather the crash from there for comparison?

Revision history for this message
Colin Watson (cjwatson) wrote :

Both the failures linked here were on lgw01. Have you seen this on lcy02 as well, or only lgw01?

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi Colin,
so far I've seen it only on lgw01, but I didn't mass-submit it yet to try the other builders.
Is there a better way than re-building/re-submitting to get onto the lcy02?

Revision history for this message
Colin Watson (cjwatson) wrote :

I've tested manually on a staging lcy02 builder and it works fine there, so worst case you should be able to retry until it happens to land on lcy02.

I've reproduced the problem on a staging lgw01 builder. I need ddebs to get a useful backtrace, so I've filed https://code.launchpad.net/~cjwatson/canonical-is-firewalls/+git/firewall-configs/+merge/414372 to get access to those from the relevant builder - I should be able to pick this up again once that's merged.

Revision history for this message
Colin Watson (cjwatson) wrote :

It probably isn't useful, but the backtrace without extra debug symbols looks like this:

buildd@dogfood-lgw01-amd64-001:/build/xen-jT2uET/xen-4.16.0/xen/arch/x86$ gdb --args x86_64-linux-gnu-ld -mi386pep --subsystem=10 --image-base=0xffff82d040000000 --stack=0,0 --heap=0,0 --section-alignment=0x200000 --file-alignment=0x20 --major-image-version=4 --minor-image-version=16 --major-os-version=2 --minor-os-version=0 --major-subsystem-version=2 --minor-subsystem-version=0 --no-insert-timestamp --build-id=sha1 -T efi.lds -N prelink.o /build/xen-jT2uET/xen-4.16.0/xen/common/symbols-dummy.o -b pe-x86-64 efi/buildid.o -o /build/xen-jT2uET/xen-4.16.0/xen/.xen.efi.0xffff82d040000000.0
GNU gdb (Ubuntu 11.2-0ubuntu1) 11.2
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from x86_64-linux-gnu-ld...
(No debugging symbols found in x86_64-linux-gnu-ld)
(gdb) r
Starting program: /usr/bin/x86_64-linux-gnu-ld -mi386pep --subsystem=10 --image-base=0xffff82d040000000 --stack=0,0 --heap=0,0 --section-alignment=0x200000 --file-alignment=0x20 --major-image-version=4 --minor-image-version=16 --major-os-version=2 --minor-os-version=0 --major-subsystem-version=2 --minor-subsystem-version=0 --no-insert-timestamp --build-id=sha1 -T efi.lds -N prelink.o /build/xen-jT2uET/xen-4.16.0/xen/common/symbols-dummy.o -b pe-x86-64 efi/buildid.o -o /build/xen-jT2uET/xen-4.16.0/xen/.xen.efi.0xffff82d040000000.0
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7ca3b4a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007ffff7ca3b4a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ffff7f672d0 in ?? () from /lib/x86_64-linux-gnu/libbfd-2.37.50-system.20220106.so
#2 0x00007ffff7f2d1e8 in ?? () from /lib/x86_64-linux-gnu/libbfd-2.37.50-system.20220106.so
#3 0x00007ffff7f2d509 in coff_write_alien_symbol ()
   from /lib/x86_64-linux-gnu/libbfd-2.37.50-system.20220106.so
#4 0x00007ffff7f2e3bd in _bfd_coff_final_link ()
   from /lib/x86_64-linux-gnu/libbfd-2.37.50-system.20220106.so
#5 0x000055555559d3bf in ?? ()
#6 0x00007ffff7c13fd0 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#7 0x00007ffff7c1407d in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#8 0x000055555559c515 in ?? ()

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thanks Colin, even without debug symbols that still is probably enough to confirm it is actually the same crash that I see on my laptop.

While I do not see why it is fatal for the build on LP but not on local sbuild or dpkg-buildpackage in a local VM, it means that someone looking into this from the binutils POV can most likely reproduce it, the way I outlined in my report, on most common Intel laptops.

In use here I have:
Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz

Maybe you could add what chip the failing lgw01 builders use to try isolating the affected series?

/me is now retrying until it runs on an lcy02 builder ...

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thanks for the hint,
in case anyone needs to do the same - you can see the associated builder right when the build starts - then abort it immediately, refresh and rebuild.
That way each build try takes <1min. I got:

lgw01 018
lgw01 044
lgw01 023
lcy02 001

I let the build on the latter finish, and it indeed built successfully.

That also means the crash I got locally is actually relevant to doko or anyone else looking for this from binutils POV. Therefore I'll attach mine.

Changed in xen (Ubuntu):
status: New → Invalid
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

This build ran 10:26 -> 10:33 and the crash happened "in between". As I said, in local sbuild those are not fatal but on LP they are. Therefore attaching the build log ...

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

The build log I attached above shall help to spot the versions needed for debug symbols to read this crash correctly.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Since this is a platform-dependent (thereby rather severe, but easy to miss) crash in binutils, I feel it is time to bump the priority to get it onto the radar and avoid shipping it that way (potentially breaking more) in our next LTS.

Changed in binutils (Ubuntu):
importance: Undecided → Critical
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

For xen itself we now know how to "work around it": retry until the build runs on one of the new builders. Leaving the bug open as it clearly seems to be a real issue.
I also leave the PPA as-is so that you can grab sources from there for recreating this binutils issue.

tags: added: rls-jj-incoming
Revision history for this message
In , Matthias Klose (doko) wrote :

seen with the 2.38 branch 20220123 and 20220126,

https://bugs.debian.org/1004269
https://bugs.launchpad.net/ubuntu/+source/binutils/+bug/1958389

The Debian report contains the more useful stack trace. The Ubuntu report suggests that this is only seen on Intel hardware, not AMD hardware.

Changed in binutils:
importance: Unknown → Medium
status: Unknown → Confirmed
Changed in binutils (Debian):
status: Unknown → New
Revision history for this message
In , Hjl-tools (hjl-tools) wrote :

I can build xen-4.16.0 on Fedora 35. Please provide ALL inputs so that
I can reproduce it.

tags: added: fr-2001
tags: removed: rls-jj-incoming
Revision history for this message
In , Alan Modra (amodra-gmail) wrote :

Created attachment 13937
Likely fix

From the backtrace in https://bugs.debian.org/1004269 it is clear that the problem is triggered by commit e86fc4a5bc37, in which a new extrap field was added to the coffcode.h combined_entry_type but is not used on anything except rs6000 coff targets.

Revision history for this message
In , Alan Modra (amodra-gmail) wrote :

HJ, you likely can reproduce the failure with an asan build of binutils, or using MALLOC_PERTURB_. I haven't tested the patch yet.
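
For reference, MALLOC_PERTURB_ is a glibc environment variable: when set to a non-zero value, malloc fills newly allocated and freed memory with non-zero byte patterns derived from that value, which tends to expose reads of uninitialized heap memory. A minimal bash sketch of running a command that way follows; run_perturbed is a made-up helper name, and 165 is just an arbitrary example value.

```shell
#!/usr/bin/env bash
# Run a command with glibc's MALLOC_PERTURB_ set to the given value (1..255).
# run_perturbed is a hypothetical helper name, not part of any tool.
run_perturbed() {
    local value=$1
    shift
    # env applies the variable only to the command being started.
    env MALLOC_PERTURB_="$value" "$@"
}
```

It would be called as run_perturbed 165 followed by the failing x86_64-linux-gnu-ld command line from the bug description. Whether that is enough to trigger the crash on an otherwise unaffected machine is not established in this report.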

Revision history for this message
In , Hjl-tools (hjl-tools) wrote :

Works for me:

/export/build/gnu/tools-build/binutils-asan/build-x86_64-linux/ld/ld-new -mi386pep --subsystem=10 --image-base=0xffff82d040000000 --stack=0,0 --heap=0,0 --section-alignment=0x200000 --file-alignment=0x20 --major-image-version=4 --minor-image-version=16 --major-os-version=2 --minor-os-version=0 --major-subsystem-version=2 --minor-subsystem-version=0 --build-id=sha1 -T efi.lds -N prelink.o /export/gnu/import/git/gitlab/xen/xen/.xen.efi.1r.o /export/gnu/import/git/gitlab/xen/xen/.xen.efi.1s.o -b pe-x86-64 efi/buildid.o -o /export/gnu/import/git/gitlab/xen/xen/xen.efi

=================================================================
==1616314==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 862212 byte(s) in 133 object(s) allocated from:
    #0 0x7f3fbb6e791f in __interceptor_malloc (/lib64/libasan.so.6+0xae91f)
    #1 0xb4844b in xmalloc /export/gnu/import/git/gitlab/x86-binutils/libiberty/xmalloc.c:149

SUMMARY: AddressSanitizer: 862212 byte(s) leaked in 133 allocation(s).
[hjl@gnu-tgl-2 x86]$

Revision history for this message
In , Cvs-commit (cvs-commit) wrote :

The master branch has been updated by Alan Modra <email address hidden>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=07c9f243b3a12cc6749bc02ee7b165859979348b

commit 07c9f243b3a12cc6749bc02ee7b165859979348b
Author: Alan Modra <email address hidden>
Date: Fri Jan 28 14:29:34 2022 +1030

    PR28826 x86_64 ld segfaults building xen

    Fallout from commit e86fc4a5bc37

            PR 28826
            * coffgen.c (coff_write_alien_symbol): Init dummy to zeros.

Changed in binutils:
status: Confirmed → In Progress
Changed in binutils (Debian):
status: New → Confirmed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package binutils - 2.37.90.20220130-0ubuntu2

---------------
binutils (2.37.90.20220130-0ubuntu2) jammy; urgency=medium

  * Also ignore regressions for the cross packages.

 -- Matthias Klose <email address hidden> Sun, 30 Jan 2022 16:59:06 +0100

Changed in binutils (Ubuntu Jammy):
status: New → Fix Released
Changed in binutils (Debian):
status: Confirmed → Fix Released
Revision history for this message
In , Cvs-commit (cvs-commit) wrote :

The binutils-2_38-branch branch has been updated by Alan Modra <email address hidden>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=61ecfbda44fb8d165f01cac3d704a5e9fd321795

commit 61ecfbda44fb8d165f01cac3d704a5e9fd321795
Author: Alan Modra <email address hidden>
Date: Fri Jan 28 14:29:34 2022 +1030

    PR28826 x86_64 ld segfaults building xen

    Fallout from commit e86fc4a5bc37

            PR 28826
            * coffgen.c (coff_write_alien_symbol): Init dummy to zeros.

    (cherry picked from commit 07c9f243b3a12cc6749bc02ee7b165859979348b)

Revision history for this message
In , Alan Modra (amodra-gmail) wrote :

Fixed mainline and 2.38 branch

Changed in binutils:
status: In Progress → Fix Released