segfault in qemu-system-x86_64

Bug #1630226 reported by Brian Candler
This bug affects 1 person
Affects: qemu (Ubuntu)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

[Ubuntu 14.04 amd64 server, fully patched, xenial HWE kernel, on a 16GB Mac Mini]

I am using packer (www.packer.io) to create a VM image. Packer starts a qemu-system-x86_64 process; inside it's running an ubuntu 16.04 image doing a bunch of work including running ansible to create a bunch of lxd containers all running mysql. And then the qemu process itself segfaults :-(

I have caught a coredump but it doesn't seem all that useful:

$ gdb -c /tmp/core_qemu-system-x86.24041 /usr/bin/qemu-system-x86_64
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/qemu-system-x86_64...(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New LWP 24041]
[New LWP 26214]
[New LWP 24045]
[New LWP 26215]
[New LWP 24043]
[New LWP 26321]
[New LWP 26326]
[New LWP 26017]
[New LWP 26325]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/bin/qemu-system-x86_64 -netdev user,id=user.0,hostfwd=tcp::3234-:22 -devic'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00005648c536ad20 in ?? ()
(gdb) bt
#0 0x00005648c536ad20 in ?? ()
#1 0x00005648c536b96a in ?? ()
#2 0x00005648c536cc92 in ?? ()
#3 0x00005648c5367828 in ?? ()
#4 0x00005648c5317e77 in ?? ()
#5 0x00005648c51bfbd6 in ?? ()
#6 0x00007f4b0e1a9f45 in __libc_start_main (main=0x5648c51be640, argc=17,
    argv=0x7ffc2c0cd578, init=<optimised out>, fini=<optimised out>,
    rtld_fini=<optimised out>, stack_end=0x7ffc2c0cd568) at libc-start.c:287
#7 0x00005648c51c412c in ?? ()
(gdb) info threads
  Id Target Id Frame
  9 Thread 0x7f47777fe700 (LWP 26325) sem_timedwait ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S:101
  8 Thread 0x7f47597fa700 (LWP 26017) sem_timedwait ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S:101
  7 Thread 0x7f4b04acd700 (LWP 26326) sem_timedwait ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S:101
  6 Thread 0x7f4776ffd700 (LWP 26321) sem_timedwait ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S:101
  5 Thread 0x7f4affe1d700 (LWP 24043) 0x00007f4b0e2791e7 in ioctl ()
    at ../sysdeps/unix/syscall-template.S:81
  4 Thread 0x7f475bfff700 (LWP 26215) sem_timedwait ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S:101
  3 Thread 0x7f4afe5ff700 (LWP 24045) pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  2 Thread 0x7f4759ffb700 (LWP 26214) sem_timedwait ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S:101
* 1 Thread 0x7f4b13f24980 (LWP 24041) 0x00005648c536ad20 in ?? ()
(gdb) thread apply all bt

Thread 9 (Thread 0x7f47777fe700 (LWP 26325)):
#0 sem_timedwait ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S:101
#1 0x00005648c54ad007 in ?? ()
#2 0x00005648c536effc in ?? ()
#3 0x00007f4b0e555184 in start_thread (arg=0x7f47777fe700)
    at pthread_create.c:312
#4 0x00007f4b0e28237d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 8 (Thread 0x7f47597fa700 (LWP 26017)):
#0 sem_timedwait ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S:101
#1 0x00005648c54ad007 in ?? ()
#2 0x00005648c536effc in ?? ()
#3 0x00007f4b0e555184 in start_thread (arg=0x7f47597fa700)
    at pthread_create.c:312
#4 0x00007f4b0e28237d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 7 (Thread 0x7f4b04acd700 (LWP 26326)):
#0 sem_timedwait ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S:101
#1 0x00005648c54ad007 in ?? ()
#2 0x00005648c536effc in ?? ()
#3 0x00007f4b0e555184 in start_thread (arg=0x7f4b04acd700)
    at pthread_create.c:312
#4 0x00007f4b0e28237d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 6 (Thread 0x7f4776ffd700 (LWP 26321)):
#0 sem_timedwait ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S:101
#1 0x00005648c54ad007 in ?? ()
#2 0x00005648c536effc in ?? ()
#3 0x00007f4b0e555184 in start_thread (arg=0x7f4776ffd700)
    at pthread_create.c:312
#4 0x00007f4b0e28237d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 5 (Thread 0x7f4affe1d700 (LWP 24043)):
#0 0x00007f4b0e2791e7 in ioctl () at ../sysdeps/unix/syscall-template.S:81
#1 0x00005648c53fe584 in ?? ()
#2 0x00005648c53fe664 in ?? ()
#3 0x00005648c539e612 in ?? ()
#4 0x00007f4b0e555184 in start_thread (arg=0x7f4affe1d700)
    at pthread_create.c:312
#5 0x00007f4b0e28237d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 4 (Thread 0x7f475bfff700 (LWP 26215)):
#0 sem_timedwait ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S:101
#1 0x00005648c54ad007 in ?? ()
#2 0x00005648c536effc in ?? ()
#3 0x00007f4b0e555184 in start_thread (arg=0x7f475bfff700)
    at pthread_create.c:312
#4 0x00007f4b0e28237d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 3 (Thread 0x7f4afe5ff700 (LWP 24045)):
#0 pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00005648c54ace39 in ?? ()
#2 0x00005648c538c2c3 in ?? ()
#3 0x00005648c538c6c0 in ?? ()
#4 0x00007f4b0e555184 in start_thread (arg=0x7f4afe5ff700)
    at pthread_create.c:312
#5 0x00007f4b0e28237d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 2 (Thread 0x7f4759ffb700 (LWP 26214)):
#0 sem_timedwait ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S:101
#1 0x00005648c54ad007 in ?? ()
#2 0x00005648c536effc in ?? ()
#3 0x00007f4b0e555184 in start_thread (arg=0x7f4759ffb700)
    at pthread_create.c:312
#4 0x00007f4b0e28237d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 1 (Thread 0x7f4b13f24980 (LWP 24041)):
#0 0x00005648c536ad20 in ?? ()
#1 0x00005648c536b96a in ?? ()
#2 0x00005648c536cc92 in ?? ()
#3 0x00005648c5367828 in ?? ()
#4 0x00005648c5317e77 in ?? ()
#5 0x00005648c51bfbd6 in ?? ()
#6 0x00007f4b0e1a9f45 in __libc_start_main (main=0x5648c51be640, argc=17,
    argv=0x7ffc2c0cd578, init=<optimised out>, fini=<optimised out>,
    rtld_fini=<optimised out>, stack_end=0x7ffc2c0cd568) at libc-start.c:287
#7 0x00005648c51c412c in ?? ()
(gdb)

I am afraid my gdb-fu ends there.

Note: I *do* have the libc6-dbg package installed, so I don't know why the libc symbols aren't resolved.

The full qemu command line would be something like this (this is from a subsequent run):

/usr/bin/qemu-system-x86_64 -m 14G -drive file=output-qemu-nmm/vtp-nmm.qcow2,if=virtio,cache=writeback,discard=unmap -boot c -vnc 0.0.0.0:83 -name vtp-nmm.qcow2 -machine type=pc,accel=kvm -netdev user,id=user.0,hostfwd=tcp::2628-:22 -device virtio-net,netdev=user.0

Given the relatively old version of qemu which is included in trusty, I may just have to update this machine to xenial. There doesn't seem to be any newer qemu in trusty-backports.

=== Additional system info ===

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.5 LTS"

Linux brian 4.4.0-38-generic #57~14.04.1-Ubuntu SMP Tue Sep 6 17:20:43 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

ii ipxe-qemu 1.0.0+git-20131111.c3d1e78-2ubuntu1.1 all PXE boot firmware - ROM images for qemu
ii qemu-keymaps 2.0.0+dfsg-2ubuntu1.27 all QEMU keyboard maps
ii qemu-kvm 2.0.0+dfsg-2ubuntu1.27 amd64 QEMU Full virtualization
ii qemu-system-common 2.0.0+dfsg-2ubuntu1.27 amd64 QEMU full system emulation binaries (common files)
ii qemu-system-x86 2.0.0+dfsg-2ubuntu1.27 amd64 QEMU full system emulation binaries (x86)
ii qemu-utils 2.0.0+dfsg-2ubuntu1.27 amd64 QEMU utilities

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: qemu-system-x86 2.0.0+dfsg-2ubuntu1.27
ProcVersionSignature: Ubuntu 4.4.0-38.57~14.04.1-generic 4.4.19
Uname: Linux 4.4.0-38-generic x86_64
ApportVersion: 2.14.1-0ubuntu3.21
Architecture: amd64
Date: Tue Oct 4 11:59:00 2016
InstallationDate: Installed on 2014-07-16 (810 days ago)
InstallationMedia: Ubuntu-Server 14.04 LTS "Trusty Tahr" - Release amd64 (20140416.2)
SourcePackage: qemu
UpgradeStatus: No upgrade log present (probably fresh install)

Brian Candler (b-candler) wrote :

Attaching gdb to a running process gives the same result as I got from the core dump.

(gdb) cont
Continuing.
[Thread 0x7f2877cfe700 (LWP 10805) exited]
[Thread 0x7f2876cfc700 (LWP 10816) exited]
[Thread 0x7f28774fd700 (LWP 10815) exited]
[Thread 0x7f2c015ff700 (LWP 10735) exited]
[Thread 0x7f2c09883700 (LWP 10721) exited]
[Thread 0x7f2c00c2a700 (LWP 10739) exited]
[Thread 0x7f28764fb700 (LWP 10817) exited]
[New Thread 0x7f28764fb700 (LWP 11470)]
[New Thread 0x7f2c00c2a700 (LWP 11473)]
[New Thread 0x7f2c09883700 (LWP 11474)]
[New Thread 0x7f2c015ff700 (LWP 11475)]
[New Thread 0x7f2877fff700 (LWP 11644)]
[New Thread 0x7f28777fe700 (LWP 11665)]
[New Thread 0x7f2876ffd700 (LWP 11836)]
[New Thread 0x7f2875cfa700 (LWP 11837)]
[New Thread 0x7f28754f9700 (LWP 11858)]
[New Thread 0x7f2874cf8700 (LWP 11923)]
[New Thread 0x7f2853fff700 (LWP 11924)]
[New Thread 0x7f28537fe700 (LWP 11925)]
[New Thread 0x7f2852ffd700 (LWP 11926)]
[New Thread 0x7f28527fc700 (LWP 11927)]
[New Thread 0x7f2851ffb700 (LWP 11928)]
[New Thread 0x7f28517fa700 (LWP 11929)]
[New Thread 0x7f2850ff9700 (LWP 11930)]
[Thread 0x7f2c00c2a700 (LWP 11473) exited]
[Thread 0x7f28754f9700 (LWP 11858) exited]
[Thread 0x7f2853fff700 (LWP 11924) exited]
[Thread 0x7f2875cfa700 (LWP 11837) exited]
[Thread 0x7f2877fff700 (LWP 11644) exited]
[Thread 0x7f2874cf8700 (LWP 11923) exited]
[Thread 0x7f28537fe700 (LWP 11925) exited]
[Thread 0x7f2c09883700 (LWP 11474) exited]
[Thread 0x7f2c015ff700 (LWP 11475) exited]
[Thread 0x7f28777fe700 (LWP 11665) exited]
[Thread 0x7f2851ffb700 (LWP 11928) exited]
[Thread 0x7f2876ffd700 (LWP 11836) exited]
[Thread 0x7f2852ffd700 (LWP 11926) exited]
[Thread 0x7f28517fa700 (LWP 11929) exited]
[Thread 0x7f28527fc700 (LWP 11927) exited]
[Thread 0x7f2850ff9700 (LWP 11930) exited]
[New Thread 0x7f2850ff9700 (LWP 12898)]
[New Thread 0x7f28527fc700 (LWP 12955)]
[New Thread 0x7f28517fa700 (LWP 13039)]
[New Thread 0x7f2852ffd700 (LWP 13046)]
[New Thread 0x7f2c09883700 (LWP 13047)]
[New Thread 0x7f2c015ff700 (LWP 13048)]

Program received signal SIGSEGV, Segmentation fault.
0x000055cc8dd2fd20 in ?? ()
(gdb) bt
#0 0x000055cc8dd2fd20 in ?? ()
#1 0x000055cc8dd3096a in ?? ()
#2 0x000055cc8dd31c92 in ?? ()
#3 0x000055cc8dd2c828 in ?? ()
#4 0x000055cc8dcdce77 in ?? ()
#5 0x000055cc8db84bd6 in ?? ()
#6 0x00007f2c12f5ff45 in __libc_start_main (main=0x55cc8db83640, argc=17,
    argv=0x7fff83a85d28, init=<optimised out>, fini=<optimised out>,
    rtld_fini=<optimised out>, stack_end=0x7fff83a85d18) at libc-start.c:287
#7 0x000055cc8db8912c in ?? ()
(gdb)

Brian Candler (b-candler) wrote :

I installed some more *-dbg and *-devel packages (including libstdc++6-4.8-dbg), and now the backtrace is marginally more helpful - although possibly this is a different trace?

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f95f3fff700 (LWP 10149)]
__memcpy_sse2_unaligned ()
    at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:35
35 ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: No such file or directory.
(gdb) bt
#0 __memcpy_sse2_unaligned ()
    at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:35
#1 0x0000558db62a78c3 in ?? ()
#2 0x0000558db62a8735 in ?? ()
#3 0x0000558db641a06b in ?? ()
#4 0x00007f9997cbc184 in start_thread (arg=0x7f95f3fff700)
    at pthread_create.c:312
#5 0x00007f99979e937d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb)

A second crash was the same as the first:

Program received signal SIGSEGV, Segmentation fault.
0x0000557d8732cd20 in ?? ()
(gdb) bt
#0 0x0000557d8732cd20 in ?? ()
#1 0x0000557d8732d96a in ?? ()
#2 0x0000557d8732ec92 in ?? ()
#3 0x0000557d87329828 in ?? ()
#4 0x0000557d872d9e77 in ?? ()
#5 0x0000557d87181bd6 in ?? ()
#6 0x00007f200b9ebf45 in __libc_start_main (main=0x557d87180640, argc=17,
    argv=0x7ffca5a399c8, init=<optimised out>, fini=<optimised out>,
    rtld_fini=<optimised out>, stack_end=0x7ffca5a399b8) at libc-start.c:287
#7 0x0000557d8718612c in ?? ()
(gdb)

BTW, there has also been a single-line entry in the kernel log at the time of each segfault:

$ grep segfault /var/log/kern.log
Oct 3 10:48:20 brian kernel: [137022.004997] qemu-system-x86[13755]: segfault at 558bdc2d2e08 ip 00007f694e12cd1c sp 00007ffcfa8660e8 error 4 in libc-2.19.so[7f694e0aa000+1ba000]
Oct 3 18:05:30 brian kernel: [163253.679372] qemu-system-x86[11074]: segfault at 565334c7cc10 ip 000056542a917d20 sp 00007ffd6a51b770 error 4 in qemu-system-x86_64[56542a6ce000+4b1000]
Oct 4 06:04:35 brian kernel: [206401.615476] qemu-system-x86[5957]: segfault at 559c8d38a350 ip 0000559d8a9c2d20 sp 00007fff4b441040 error 4 in qemu-system-x86_64[559d8a779000+4b1000]
Oct 4 07:58:53 brian kernel: [213260.662734] qemu-system-x86[31953]: segfault at 55da4168ffb0 ip 000055db3f369d20 sp 00007ffcaa921340 error 4 in qemu-system-x86_64[55db3f120000+4b1000]
Oct 4 10:26:01 brian kernel: [222089.607756] qemu-system-x86[4686]: segfault at 562f1e888360 ip 000056301b98cd20 sp 00007ffc2a185260 error 4 in qemu-system-x86_64[56301b743000+4b1000]
Oct 4 11:48:09 brian kernel: [227017.723519] qemu-system-x86[24041]: segfault at 5647ca92b250 ip 00005648c536ad20 sp 00007ffc2c0cd040 error 4 in qemu-system-x86_64[5648c5121000+4b1000]
Oct 4 12:20:15 brian kernel: [228943.353808] qemu-system-x86[32644]: segfault at 55772d989c00 ip 000055782a2a8d20 sp 00007ffc05ed8340 error 4 in qemu-system-x86_64[55782a05f000+4b1000]

I'm aware that hardware errors can cause segfaults. I've not seen this in anything other than qemu, but that is probably when the system is being stressed the most.
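
The kern.log lines above are easier to compare if the mapping base (the first number inside the square brackets) is subtracted from the instruction pointer. A minimal Python sketch, assuming the usual x86 kernel message layout; the names `parse_segfault` and `describe_error` are made up for illustration:

```python
import re

# Matches "segfault at ADDR ip IP sp SP error N in OBJ[BASE+SIZE]".
# All numbers, including the error code, are printed in hex by the kernel.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<error>[0-9a-f]+) in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Return (object name, offset of faulting instruction in the mapping, error code)."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    offset = int(m.group("ip"), 16) - int(m.group("base"), 16)
    return m.group("obj"), offset, int(m.group("error"), 16)


def describe_error(code):
    """Decode the low three bits of the x86 page-fault error code."""
    present = "protection violation" if code & 1 else "page not present"
    access = "write" if code & 2 else "read"
    mode = "user mode" if code & 4 else "kernel mode"
    return f"{mode} {access}, {present}"


# Two of the lines quoted above, shortened to the part after the timestamp.
samples = [
    "qemu-system-x86[11074]: segfault at 565334c7cc10 ip 000056542a917d20 "
    "sp 00007ffd6a51b770 error 4 in qemu-system-x86_64[56542a6ce000+4b1000]",
    "qemu-system-x86[24041]: segfault at 5647ca92b250 ip 00005648c536ad20 "
    "sp 00007ffc2c0cd040 error 4 in qemu-system-x86_64[5648c5121000+4b1000]",
]
for line in samples:
    obj, offset, error = parse_segfault(line)
    print(f"{obj} +{offset:#x}: {describe_error(error)}")
```

For the six entries attributed to qemu-system-x86_64 the offset works out to 0x249d20 each time, and error code 4 decodes as a user-mode read of an unmapped page. That looks more like one faulty code path than random corruption, although it does not rule out hardware by itself.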

Serge Hallyn (serge-hallyn) wrote :

Could you build qemu from upstream git and run that under gdb to reproduce and get a full stack trace?

Brian Candler (b-candler) wrote :

It depends on a ton of libraries (literally):

$ ldd /usr/bin/qemu-system-x86_64 | wc -l
100

But using the dev packages I already had around, plus libfdt-dev which it insisted on, I have done the following:

apt-get source qemu-system-x86
cd qemu-2.0.0+dfsg
./configure --disable-strip --target-list=x86_64-softmmu,x86_64-linux-user
make
sudo make install

For some reason the binaries in pc-bios/ are missing, and "make install" barfs on this. I located as many as I could:

cp -pr /usr/share/seabios/* pc-bios/
cp /usr/share/misc/sgabios.bin pc-bios/
cp pc-bios/vgabios-isavga.bin pc-bios/vgabios.bin
cp /usr/lib/ipxe/qemu/* pc-bios/
cp -L /usr/share/qemu/* pc-bios/

Still some missing, so I took them out from INSTALL_BLOBS in Makefile

Anyway, I now have... *something* in /usr/local/bin. And it runs. And hooray, it fails in the same way and I have a backtrace!

Program received signal SIGSEGV, Segmentation fault.
tcp_output (tp=tp@entry=0x5636d9186db0) at slirp/tcp_output.c:127
127 len = min(so->so_snd.sb_cc, win) - off;
(gdb) bt
#0 tcp_output (tp=tp@entry=0x5636d9186db0) at slirp/tcp_output.c:127
#1 0x00005636d5a9067a in tcp_drop (tp=tp@entry=0x5636d9186db0,
    err=err@entry=0) at slirp/tcp_subr.c:232
#2 0x00005636d5a919a2 in tcp_timers (timer=2, tp=0x5636d9186db0)
    at slirp/tcp_timer.c:287
#3 tcp_slowtimo (slirp=slirp@entry=0x5636d824e820) at slirp/tcp_timer.c:88
#4 0x00005636d5a8c538 in slirp_pollfds_poll (pollfds=0x5636d8246a00,
    select_error=select_error@entry=0) at slirp/slirp.c:488
#5 0x00005636d5a3cc37 in main_loop_wait (nonblocking=<optimised out>)
    at main-loop.c:487
#6 0x00005636d590ff1e in main_loop () at vl.c:2051
#7 main (argc=<optimised out>, argv=<optimised out>, envp=<optimised out>)
    at vl.c:4510
(gdb)

(gdb) print so
$1 = (struct socket *) 0x5635d8489920
(gdb) print so->so_snd
Cannot access memory at address 0x5635d84899a0

There's the segfault. And it looks to be the same problem as this:

    https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg03636.html

Unfortunately that thread reached no resolution other than "use the tap netdev instead of slirp".

Brian Candler (b-candler) wrote :

For comparison I built qemu-2.5.1.1 from the release tarball at http://wiki.qemu.org/Download, using the same configure options. (I picked that one as being closest to what's in xenial)

And it crashes in exactly the same place:

Program received signal SIGSEGV, Segmentation fault.
tcp_output (tp=tp@entry=0x563e2b3ae180) at slirp/tcp_output.c:127
127 len = min(so->so_snd.sb_cc, win) - off;
(gdb) bt
#0 tcp_output (tp=tp@entry=0x563e2b3ae180) at slirp/tcp_output.c:127
#1 0x0000563e28bdce4a in tcp_drop (tp=tp@entry=0x563e2b3ae180,
    err=err@entry=0) at slirp/tcp_subr.c:232
#2 0x0000563e28bde172 in tcp_timers (timer=2, tp=0x563e2b3ae180)
    at slirp/tcp_timer.c:287
#3 tcp_slowtimo (slirp=slirp@entry=0x563e2a2bffd0) at slirp/tcp_timer.c:88
#4 0x0000563e28bd7988 in slirp_pollfds_poll (pollfds=0x563e2a2ac200,
    select_error=select_error@entry=0) at slirp/slirp.c:486
#5 0x0000563e28c11b21 in main_loop_wait (nonblocking=<optimised out>)
    at main-loop.c:506
#6 0x0000563e2897730f in main_loop () at vl.c:1923
#7 main (argc=<optimised out>, argv=<optimised out>, envp=<optimised out>)
    at vl.c:4699
(gdb)

Then I built 2.7.0, the latest release. This time the packer build ran past the point where it had been crashing before (so it looks like the fix landed somewhere on the 2.6 or 2.7 branch) and indeed almost to the end.

It then crashed with a different problem:

Program received signal SIGSEGV, Segmentation fault.
_int_malloc (av=av@entry=0x7f55445b9760 <main_arena>, bytes=bytes@entry=96)
    at malloc.c:3389
3389 malloc.c: No such file or directory.
(gdb) bt
#0 _int_malloc (av=av@entry=0x7f55445b9760 <main_arena>, bytes=bytes@entry=96)
    at malloc.c:3389
#1 0x00007f554427e1dc in __libc_calloc (n=<optimised out>,
    elem_size=<optimised out>) at malloc.c:3219
#2 0x00007f5544f50669 in g_malloc0 ()
   from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#3 0x000055ee83c2375c in handle_alloc (m=0x55ee896442d8,
    bytes=<synthetic pointer>, host_offset=<synthetic pointer>,
    guest_offset=5247094784, bs=0x55ee852b43c0) at block/qcow2-cluster.c:1219
#4 qcow2_alloc_cluster_offset (bs=bs@entry=0x55ee852b43c0,
    offset=offset@entry=5247094784, bytes=bytes@entry=0x55ee896442cc,
    host_offset=host_offset@entry=0x55ee896442d0, m=m@entry=0x55ee896442d8)
    at block/qcow2-cluster.c:1361
#5 0x000055ee83c1652f in qcow2_co_pwritev (bs=0x55ee852b43c0,
    offset=5247094784, bytes=45056, qiov=0x55ee8572ecf0, flags=<optimised out>)
    at block/qcow2.c:1589
#6 0x000055ee83c445b1 in bdrv_driver_pwritev (bs=bs@entry=0x55ee852b43c0,
    offset=offset@entry=5247094784, bytes=bytes@entry=45056,
    qiov=qiov@entry=0x55ee8572ecf0, flags=flags@entry=0) at block/io.c:856
#7 0x000055ee83c454f1 in bdrv_aligned_pwritev (bs=bs@entry=0x55ee852b43c0,
    req=req@entry=0x55ee896444d0, offset=offset@entry=5247094784,
    bytes=bytes@entry=45056, align=align@entry=1, qiov=0x55ee8572ecf0,
    flags=flags@entry=0) at block/io.c:1320
#8 0x000055ee83c46337 in bdrv_co_pwritev (child=<optimised out>,
    offset=offset@entry=5247094784, bytes=bytes@entry=45056,
    qiov=qiov@entry=0x55ee8572ecf0, flags=0) at block/io.c:1569
#9 0x000055ee83c36e3f in blk_co_pwritev (blk=0x5...

Brian Candler (b-candler) wrote :

Hmm, a different malloc-type error on next run:

Program received signal SIGABRT, Aborted.
0x00007f7b20acbc37 in __GI_raise (sig=sig@entry=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00007f7b20acbc37 in __GI_raise (sig=sig@entry=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007f7b20acf028 in __GI_abort () at abort.c:89
#2 0x00007f7b20b082a4 in __libc_message (do_abort=do_abort@entry=1,
    fmt=fmt@entry=0x7f7b20c166b0 "*** Error in `%s': %s: 0x%s ***\n")
    at ../sysdeps/posix/libc_fatal.c:175
#3 0x00007f7b20b1455e in malloc_printerr (ptr=<optimised out>,
    str=0x7f7b20c12801 "free(): invalid pointer", action=1) at malloc.c:4996
#4 _int_free (av=<optimised out>, p=<optimised out>, have_lock=0)
    at malloc.c:3840
#5 0x0000563c539742ea in coroutine_trampoline (i0=<optimised out>,
    i1=<optimised out>) at util/coroutine-ucontext.c:78
#6 0x00007f7b20ade800 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#7 0x00007fffc12967b0 in ?? ()
#8 0x0000000000000000 in ?? ()
(gdb)

I am going to see if I can run this build on some different hardware.

Brian Candler (b-candler) wrote :

I have now tried this on someone else's Mac Mini, this one running 16.04.1.

With the stock qemu (1:2.5+dfsg-5ubuntu10.5), it crashes in apparently the same way as mine was doing originally:

Oct 5 14:59:49 s1 kernel: [3982196.302758] qemu-system-x86[20590]: segfault at 55fc165caa20 ip 000055fd12d76ab7 sp 00007ffdec4cfab0 error 4 in qemu-system-x86_64[55fd1294f000+640000]

Which is good, as it shows the original problem is definitely a software problem in qemu.

Then I built qemu-2.7.0 from source on this machine. Unfortunately I don't seem able to attach gdb: using "gdb -p <pid>" I get:

Warning:
Cannot insert breakpoint -1.
Cannot access memory at address 0x202210

(gdb) 0x00007f199113ff51 in ?? ()

(gdb) cont
Continuing.
Warning:
Cannot insert breakpoint -1.
Cannot access memory at address 0x202210

Command aborted.
(gdb)

I just have to run without gdb. And this time, it ran to completion without any malloc errors.

So I can't yet conclude whether there is also a hardware issue, until I upgrade the local machine to 16.04 (at which point I'll lose the ability to debug the issue in 14.04, but then again, at that point I probably won't care :-)

Serge Hallyn (serge-hallyn) wrote :

Hi,

how can we reproduce this? Can you give a precise set of steps to download/build an image and run qemu with it?

Changed in qemu (Ubuntu):
status: New → Incomplete

Brian Candler (b-candler) wrote :

> how can we reproduce this? Can you give a precise set of steps to download/build an image and run qemu with it?

(1) The first issue - segfault in slirp/tcp_output.c - which is also this one:

https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg03636.html

You can reproduce using the project I am working on:

----
# See packer.io. "packer" builds images from ISOs.
wget https://releases.hashicorp.com/packer/0.10.2/packer_0.10.2_linux_amd64.zip
unzip packer_0.10.2_linux_amd64.zip
sudo mv packer /usr/local/bin/

# My project
git clone https://git.nsrc.org/open/vtp.git
cd vtp
./run.sh
----

There is a phase where it clones pc-master to pc1, pc2, pc3 etc. Typically it segfaults somewhere between pc12 and pc20. I saw this with both ubuntu 14.04 and 16.04 stock qemu, but not with qemu 2.7.0 from source.

It might be performance-sensitive; both machines tested are Macmini6,2 (Mac Mini Server 2012, quad core) with SSDs.

Note however: you may consider this a low-priority issue, in the sense that the kvm "slirp" functionality is not normally used in production. It is used by packer because of its built-in NAT function: the VM gets 10.0.2.15 and it sees the host as "gateway" 10.0.2.2, which gives the VM a temporary network connection without having to run iptables or dhcpd on the host.
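
To make the NAT setup concrete, here is a small Python sketch that splits a `hostfwd` rule like the one in the command line above into its fields. It assumes QEMU's documented rule layout `proto:[hostaddr]:hostport-[guestaddr]:guestport` and the default 10.0.2.0/24 user-mode network; `parse_hostfwd` is a hypothetical helper and only handles IPv4-style rules:

```python
import ipaddress

# Assumption: the default user-mode network, which matches the addresses quoted above.
SLIRP_NET = ipaddress.ip_network("10.0.2.0/24")
GUEST_ADDR = ipaddress.ip_address("10.0.2.15")  # address handed to the VM
HOST_ADDR = ipaddress.ip_address("10.0.2.2")    # the host, seen as "gateway" by the VM


def parse_hostfwd(rule):
    """Split 'proto:[hostaddr]:hostport-[guestaddr]:guestport' into a dict."""
    host_side, guest_side = rule.split("-", 1)
    proto, host_addr, host_port = host_side.split(":")
    guest_addr, guest_port = guest_side.split(":")
    return {
        "proto": proto,
        # An empty host address means the rule is not bound to one host interface.
        "host_addr": host_addr or None,
        "host_port": int(host_port),
        # An empty guest address falls back to the first DHCP address.
        "guest_addr": guest_addr or str(GUEST_ADDR),
        "guest_port": int(guest_port),
    }


# Both addresses live inside the emulated network.
assert GUEST_ADDR in SLIRP_NET and HOST_ADDR in SLIRP_NET
print(parse_hostfwd("tcp::2628-:22"))
```

With the rule from the command line, host TCP port 2628 is forwarded to port 22 on the guest, which is presumably how packer reaches the VM over SSH.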

It certainly seems not to have any attention from qemu upstream, and in any case they may not be interested in backporting the fix from 2.6 or 2.7 to 2.5.

The quickest and easiest solution I think would be to have qemu 2.7.0 in xenial-backports.

(2) The subsequent random crashes with qemu 2.7.0, i.e. SEGV in malloc(), invalid pointer in free(), on my 14.04 Mac Mini. These don't always reproduce, and I have not yet ruled out a hardware problem. But by all means see if you get them once you get past the slirp issue.

To use a different version of qemu in the build you modify packer_files/vtp.json:

...
  "builders":
  [
    {
      "type": "qemu",
      "qemu_binary": "/usr/local/bin/qemu-system-x86_64",
       ...

Brian Candler (b-candler) wrote :

I found that when I have

        [ "-smp", "8,sockets=1,cores=4,threads=2" ],

in packer_files/vtp.json, the build completes successfully on my 14.04 Mac Mini with qemu 2.7.0. It worked several times flawlessly.

But if I remove that line (so that only one CPU is emulated by KVM) then I get errors such as

Oct 5 17:13:03 brian kernel: [99660.108698] qemu-system-x86[26914]: segfault at 5596d914cfb0 ip 00007f1d1244fc5e sp 00007f1985bdc770 error 4 in libc-2.19.so[7f1d123d0000+1ba000]

During the build there is an increasing number of lxd containers running; so perhaps having so much load on a single emulated CPU is triggering this condition.

Brian Candler (b-candler) wrote :

FYI, update:

- I have upgraded my Mac Mini to 16.04 (plus qemu 2.7.0 from source)
- I have completely replaced the RAM in my Mac Mini
- I have replicated on someone else's Mac Mini with 16.04

I can still replicate the new segfault/libc problems, so I'm sure that it's not a hardware issue.

The new crashes are harder to obtain, but I can get them if I run the build process with 1 vCPU, or if I configure 8 vCPUs but run 4 instances of the build process concurrently.

Anyway, that part of this ticket can be ignored as I'll be raising this upstream with the qemu project.

The original part of this ticket is that qemu crashes in tcp_output in its SLIRP networking code. I still think the pragmatic solution would be to have qemu 2.7.0 in xenial-backports.

I personally don't plan on spending time working out where exactly the fix is and backporting it to qemu 2.5.0; and people using the SLIRP networking code in production are probably quite rare (although packer.io's qemu builder is an example of this).

Christian Ehrhardt  (paelzer) wrote :

Hi Brian,
thanks a lot for your persistence, reproductions and debugging!

I was trying to follow your reproduction steps, but your vtp repo is behind a user authentication that is neither the "normal" gitlab nor does it show any option to register a new user.

Can you share that at some place we can reach it?
Since you have a launchpad account that might just be as easy as:
git push git+ssh://<email address hidden>/~b-candler/ <your/local/branch>

Brian Candler (b-candler) wrote :

Sorry about this - the repo was open at the time I posted but is currently closed for layer 9 reasons. I am trying to get permission to release this.

Brian Candler (b-candler) wrote :

I have been working with the qemu devs, was able to reproduce the slirp networking crashes under valgrind, and they provided a fix:
http://lists.nongnu.org/archive/html/qemu-devel/2016-11/msg02411.html

The fix has also been merged upstream:

commit ea64d5f08817b5e79e17135dce516c7583107f91
Author: Samuel Thibault <email address hidden>
Date: Sun Nov 13 23:54:27 2016 +0100

    slirp: Fix access to freed memory

    if_start() goes through the slirp->if_fastq and slirp->if_batchq
    list of pending messages, and accesses ifm->ifq_so->so_nqueued of its
    elements if ifm->ifq_so != NULL. When freeing a socket, we thus need
    to make sure that any pending message for this socket does not refer
    to the socket any more.

    Signed-off-by: Samuel Thibault <email address hidden>
    Tested-by: Brian Candler <email address hidden>
    Reviewed-by: Stefan Hajnoczi <email address hidden>

So now everything is fine as long as I build qemu 2.7.0 + this patch from source.
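
The idea in the commit message can be modelled in a few lines of Python. This is only a toy sketch, not QEMU's C code: the field names `ifq_so` and `so_nqueued` and the two queue names come from the commit message, while `Socket`, `Packet`, `free_socket` and `start_output` are invented for illustration.

```python
class Socket:
    """Stand-in for slirp's socket; only the queued-packet counter is modelled."""

    def __init__(self, name):
        self.name = name
        self.so_nqueued = 0
        self.freed = False


class Packet:
    """Stand-in for a pending message; ifq_so refers to the owning socket or None."""

    def __init__(self, so):
        self.ifq_so = so
        so.so_nqueued += 1


def free_socket(so, queues):
    """Free a socket, first detaching it from every pending packet (the fix's idea)."""
    for queue in queues:
        for pkt in queue:
            if pkt.ifq_so is so:
                pkt.ifq_so = None
    so.freed = True


def start_output(queues):
    """Drain the queues, touching ifq_so only when it is still set."""
    sent = 0
    for queue in queues:
        while queue:
            pkt = queue.pop(0)
            so = pkt.ifq_so
            if so is not None:
                # Without the detach step in free_socket this could hit a freed socket.
                assert not so.freed, "access to freed socket"
                so.so_nqueued -= 1
            sent += 1
    return sent


if_fastq, if_batchq = [], []
a, b = Socket("a"), Socket("b")
if_fastq.append(Packet(a))
if_batchq.append(Packet(a))
if_batchq.append(Packet(b))
free_socket(a, [if_fastq, if_batchq])
print(start_output([if_fastq, if_batchq]))
```

In the model, skipping the loop in `free_socket` makes `start_output` trip its assertion on the first packet that belonged to the freed socket, which is the Python analogue of the stale-pointer access described in the commit.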

I'm not sure whether back-porting this to 2.5.0 would be useful. It looks like it could apply, but I believe there was a big reworking of SLIRP around 2.6 which could have fixed other problems. I can ask the question on the list if you like.

I'm planning to use 2.7 going forward since that's what I've tested heavily. Having qemu 2.7.0 + this patch in xenial-backports would be helpful for me, but I can also live with having to build from source until Ubuntu 18.04 is out.

Launchpad Janitor (janitor) wrote :

[Expired for qemu (Ubuntu) because there has been no activity for 60 days.]

Changed in qemu (Ubuntu):
status: Incomplete → Expired

Brian Candler (b-candler) wrote :

Rather than backporting slirp fixes from 2.7.0 to 2.5.0, how about qemu 2.7 or 2.8 in backports?
