Reproducible crash in slirp_remque (qemu 1.0.1)
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
QEMU | Fix Released | Undecided | Unassigned |
Bug Description
Heya
I've been testing some automated data conversion scripts with qemu 1.0.1. They work fine with qemu-kvm 0.15.1, but on qemu 1.0.1 (from the website, built from source using gcc 4.6.1, i686 host), when the script runs qemu I see qemu crash in slirp_remque a few seconds after it's launched. This crash is consistent and reproducible.
The qemu guest is SCO OpenServer 5.0.5. I'm using it for some data conversion from a legacy application. qemu is launched "-display none -monitor stdio" and controlled from a Python script that then connects to the VM over usermode port forwards to ftp data to/from the VM and send commands over telnet.
qemu is launched fine with the following command:
/usr/local/
and images:
$ for f in *.qcow2; do qemu-img info $f; echo; done
image: booksys-
file format: qcow2
virtual size: 4.0G (4294967296 bytes)
disk size: 696K
cluster_size: 65536
image: booksys.qcow2
file format: qcow2
virtual size: 4.0G (4294967296 bytes)
disk size: 140K
cluster_size: 65536
backing file: booksys-
image: sco-base-
file format: qcow2
virtual size: 512M (536870912 bytes)
disk size: 142M
cluster_size: 65536
image: sco.qcow2
file format: qcow2
virtual size: 512M (536870912 bytes)
disk size: 140K
cluster_size: 65536
backing file: sco-base-
The VM guest begins booting fine, and nothing of interest appears in the monitor log:
QEMU 1.0,1 monitor - type 'help' for more information
(qemu)
After a few seconds the controlling script begins trying to ftp into the guest over the user-mode port forward on port 2121, and it's at this point that qemu crashes with the following backtrace:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb63e46e0 (LWP 25453)]
0xb768753b in slirp_remque (a=0xb90ee408) at slirp/misc.c:39
39 ((struct quehead *)(element-
(gdb) bt
#0 0xb768753b in slirp_remque (a=0xb90ee408) at slirp/misc.c:39
#1 0xb76854ad in if_start (slirp=0xb879beb0) at slirp/if.c:189
#2 0xb76853b3 in if_output (so=0xb8eb1380, ifm=0xb90eea60) at slirp/if.c:138
#3 0xb7686bb5 in ip_output (so=0xb8eb1380, m0=0xb90eea60)
at slirp/ip_
#4 0xb768f59c in tcp_output (tp=0xb906fd48) at slirp/tcp_
#5 0xb7691b9b in tcp_timers (tp=0xb906fd48, timer=0) at slirp/tcp_
#6 0xb76918d4 in tcp_slowtimo (slirp=0xb879beb0) at slirp/tcp_
#7 0xb768965a in slirp_select_poll (readfds=
xfds=
#8 0xb763e2a0 in main_loop_wait (nonblocking=0) at main-loop.c:465
#9 0xb7633042 in main_loop () at /home/craig/
#10 0xb76388a0 in main (argc=20, argv=0xbf9e42d4, envp=0xbf9e4328)
at /home/craig/
(gdb) frame 0
#0 0xb768753b in slirp_remque (a=0xb90ee408) at slirp/misc.c:39
39 ((struct quehead *)(element-
A more detailed backtrace, as supplied by "thread apply all bt full", follows at the end of this post.
In case it matters, stdout is redirected to a logfile and stdin is attached to the Python script, which hasn't yet written anything to the stdin pipe.
I'll happily post the script, but it isn't much good without the OS image, which is about 150MB and can't be legally redistributed. I'm happy to test patches, though, or try anything that's suggested.
Host info and full backtrace follow:
$ gcc --version
gcc (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1
Copyright (C) 2011 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 11.10
Release: 11.10
Codename: oneiric
$ uname -a
Linux wallace 3.0.0-14-
(gdb) thread apply all bt full
Thread 5 (Thread 0xb31e1b70 (LWP 25631)):
#0 0xb74e4424 in __kernel_vsyscall ()
No symbol table info available.
#1 0xb7332e04 in pthread_
No locals.
#2 0xb764f38a in cond_timedwait (cond=0xb7d2e1e0, mutex=0xb7d2e1c0, ts=0xb31e135c) at posix-aio-
ret = 0
#3 0xb764fb6c in aio_thread (unused=0x0) at posix-aio-
aiocb = 0xb879dcc0
ret = 0
tv = {tv_sec = 1329889894, tv_usec = 299790}
ts = {tv_sec = 1329889904, tv_nsec = 0}
#4 0xb732ed31 in start_thread (arg=0xb31e1b70) at pthread_
__res = <optimized out>
pd = 0xb31e1b70
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {-1221328908, 0, 4001536, -1289874312, -1127561837, -449321061}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
robust = <optimized out>
pagesize_m1 = <optimized out>
sp = <optimized out>
freesize = <optimized out>
#5 0xb6d9f0ce in clone () at ../sysdeps/
No locals.
Backtrace stopped: Not enough registers or memory available to unwind further
Thread 2 (Thread 0xb1ddab70 (LWP 25455)):
#0 0xb74e4424 in __kernel_vsyscall ()
No symbol table info available.
#1 0xb7335619 in __lll_lock_wait () at ../nptl/
No locals.
#2 0xb73387a0 in _L_cond_lock_704 () from /lib/i386-
#3 0xb7338521 in __pthread_
type = 3085970432
id = 25455
#4 0xb7332b0e in pthread_
No locals.
#5 0xb766e54a in qemu_cond_wait (cond=0xb7d3eaa0, mutex=0xb7f02c00) at qemu-thread-
err = -1191216176
__func__ = "qemu_cond_wait"
#6 0xb76fc409 in qemu_tcg_
env = 0x10000
#7 0xb76fc6cf in qemu_tcg_
env = 0x0
#8 0xb732ed31 in start_thread (arg=0xb1ddab70) at pthread_
__res = <optimized out>
pd = 0xb1ddab70
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {-1221328908, 0, 4001536, -1310874504, 1001047446, -449321061}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
robust = <optimized out>
pagesize_m1 = <optimized out>
sp = <optimized out>
freesize = <optimized out>
#9 0xb6d9f0ce in clone () at ../sysdeps/
No locals.
Backtrace stopped: Not enough registers or memory available to unwind further
Thread 1 (Thread 0xb63e46e0 (LWP 25453)):
#0 0xb768753b in slirp_remque (a=0xb90ee408) at slirp/misc.c:39
element = 0xb90ee408
#1 0xb76854ad in if_start (slirp=0xb879beb0) at slirp/if.c:189
now = 182039052034397
requeued = 0
ifm = 0xb90ee408
ifqt = 0x0
#2 0xb76853b3 in if_output (so=0xb8eb1380, ifm=0xb90eea60) at slirp/if.c:138
slirp = 0xb879beb0
ifq = 0xb90ee408
on_fastq = 1
#3 0xb7686bb5 in ip_output (so=0xb8eb1380, m0=0xb90eea60) at slirp/ip_
slirp = 0xb879beb0
ip = 0xb90eeacc
m = 0xb90eea60
hlen = 20
len = -1190204832
off = -1199980740
error = 0
#4 0xb768f59c in tcp_output (tp=0xb906fd48) at slirp/tcp_
so = 0xb8eb1380
len = 0
win = 8760
off = 0
flags = 2
error = -1217987977
m = 0xb90eea60
ti = 0xb90eeacc
opt = "\002\004\
optlen = 4
hdrlen = 44
idle = 0
sendalot = 0
#5 0xb7691b9b in tcp_timers (tp=0xb906fd48, timer=0) at slirp/tcp_
rexmt = 192
#6 0xb76918d4 in tcp_slowtimo (slirp=0xb879beb0) at slirp/tcp_
ip = 0xb8eb1380
ipnxt = 0xb879c8b0
tp = 0xb906fd48
i = 0
#7 0xb768965a in slirp_select_poll (readfds=
slirp = 0xb879beb0
so = 0x0
so_next = 0x0
ret = -1080148532
#8 0xb763e2a0 in main_loop_wait (nonblocking=0) at main-loop.c:465
rfds = {fds_bits = {8, 0 <repeats 31 times>}}
wfds = {fds_bits = {0 <repeats 32 times>}}
xfds = {fds_bits = {0 <repeats 32 times>}}
ret = 1
nfds = 18
tv = {tv_sec = 0, tv_usec = 990389}
timeout = 1000
#9 0xb7633042 in main_loop () at /home/craig/
nonblocking = false
last_io = 0
#10 0xb76388a0 in main (argc=20, argv=0xbf9e42d4, envp=0xbf9e4328) at /home/craig/
gdbstub_dev = 0x0
i = 64
snapshot = 1
linux_boot = 0
ds = 0xb8b16bb8
dcl = 0x0
cyls = 0
heads = 0
secs = 0
translation = 0
hda_opts = 0x0
opts = 0xb7343000
olist = 0xbf9e4198
optind = 20
optarg = 0x0
loadvm = 0x0
machine = 0xb7921e60
cpu_model = 0x0
pid_file = 0x0
incoming = 0x0
defconfig = 1
log_mask = 0x0
log_file = 0x0
mem_trace = {malloc = 0xb7634cb1 <malloc_and_trace>, realloc = 0xb7634d0e <realloc_
trace_file = 0x0
(gdb)
$ ldd /usr/local/
linux-gate.so.1 => (0xb77d0000)
libnss3.so => /usr/lib/
libnspr4.so => /usr/lib/
libpthread.so.0 => /lib/i386-
librt.so.1 => /lib/i386-
libgthread-
libglib-2.0.so.0 => /lib/i386-
libutil.so.1 => /lib/i386-
libbluetooth.so.3 => /usr/lib/
libcurl.so.4 => /usr/lib/
libncurses.so.5 => /lib/libncurses
libtinfo.so.5 => /lib/libtinfo.so.5 (0xb6a1e000)
libbrlapi.so.0.5 => /lib/libbrlapi.
libpng12.so.0 => /lib/i386-
libjpeg.so.62 => /usr/lib/
libgnutls.so.26 => /usr/lib/
libSDL-1.2.so.0 => /usr/lib/
libX11.so.6 => /usr/lib/
libm.so.6 => /lib/i386-
libz.so.1 => /lib/i386-
libc.so.6 => /lib/i386-
libnssutil3.so => /usr/lib/
libplc4.so => /usr/lib/
libplds4.so => /usr/lib/
libdl.so.2 => /lib/i386-
/lib/ld-linux.so.2 (0xb77d1000)
libpcre.so.3 => /lib/i386-
libidn.so.11 => /usr/lib/
liblber-2.4.so.2 => /usr/lib/
libldap_r-2.4.so.2 => /usr/lib/
libgssapi_
libssl.so.1.0.0 => /lib/i386-
libcrypto.so.1.0.0 => /lib/i386-
librtmp.so.0 => /usr/lib/
libtasn1.so.3 => /usr/lib/
libgcrypt.so.11 => /lib/i386-
libpulse-
libpulse.so.0 => /usr/lib/
libxcb.so.1 => /usr/lib/
libresolv.so.2 => /lib/i386-
libsasl2.so.2 => /usr/lib/
libkrb5.so.3 => /usr/lib/
libk5crypto.so.3 => /usr/lib/
libcom_err.so.2 => /lib/i386-
libkrb5support
libgpg-error.so.0 => /lib/i386-
libpulsecommon
libjson.so.0 => /usr/lib/
libdbus-1.so.3 => /lib/i386-
libXau.so.6 => /usr/lib/
libXdmcp.so.6 => /usr/lib/
libkeyutils.so.1 => /lib/i386-
libwrap.so.0 => /lib/i386-
libsndfile.so.1 => /usr/lib/
libasyncns.so.0 => /usr/lib/
libnsl.so.1 => /lib/i386-
libFLAC.so.8 => /usr/lib/
libvorbisenc.so.2 => /usr/lib/
libvorbis.so.0 => /usr/lib/
libogg.so.0 => /usr/lib/
I have now reproduced the same segfault without the controlling script by running qemu on the command line and connecting to it with lftp. To reproduce the fault it appears to be necessary to attempt to connect to the guest before it is fully booted and ready to accept connections; if I let it "settle" for a while before attempting to connect then it doesn't crash. Even if I start hammering it as soon as it's launched I can only occasionally trigger the crash, so whatever's breaking is a short-lived state of some kind.
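For anyone trying to hit the same window without lftp, here is a rough sketch of the "hammering" described above. It only opens and closes TCP connections to the forwarded port in a tight loop; it does not speak FTP. The function name `hammer_port` and its parameters are made up for illustration, and port 2121 is simply the forward mentioned earlier in this report.

```python
import socket
import time


def hammer_port(host, port, attempts=50, delay=0.1, timeout=1.0):
    """Repeatedly open a TCP connection to host:port.

    Returns a list with one entry per attempt: "connected" on success,
    otherwise the name of the exception class that was raised.
    """
    results = []
    for _ in range(attempts):
        try:
            # create_connection performs the TCP handshake; the socket is
            # closed again as soon as the with-block exits.
            with socket.create_connection((host, port), timeout=timeout):
                results.append("connected")
        except OSError as exc:
            # Covers refused connections, timeouts and resets.
            results.append(type(exc).__name__)
        time.sleep(delay)
    return results
```

Started right after qemu is launched, something like `hammer_port("127.0.0.1", 2121)` should approximate connecting before the guest is ready. Whether it triggers the crash as reliably as lftp does is untested.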
If I make an lftp connection then immediately kill lftp, qemu receives a SIGPIPE. I'm wondering if a SIGPIPE at the wrong time is messing things up, but it's only the vaguest notion.
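To illustrate what the SIGPIPE observation amounts to, here is a small standalone sketch. It says nothing about how qemu itself handles the signal. It uses a local socket pair rather than a real FTP session, and the helper name `write_after_peer_close` is invented for this example. Python ignores SIGPIPE at startup, so the failed write shows up as an EPIPE error instead of killing the process; a program that leaves SIGPIPE at its default disposition would be terminated by the same write.

```python
import errno
import socket


def write_after_peer_close():
    """Return the errno seen when sending to a peer that has already closed."""
    ours, theirs = socket.socketpair()
    theirs.close()  # the client goes away, as when lftp is killed
    try:
        ours.send(b"220 hello\r\n")
    except OSError as exc:
        # With SIGPIPE ignored, the kernel reports the broken pipe as EPIPE.
        return exc.errno
    finally:
        ours.close()
    return None
```

On a Linux host, `write_after_peer_close() == errno.EPIPE` is expected to hold.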