On 09/17/13 11:51, Richard Jones wrote:
>> There's at least three cases here I guess (KVM + eventfd, KVM without
>> eventfd (enforceable eg. with the "ioeventfd" property for virtio
>> devices), and TCG). We're probably talking about the third case.
>
> To clarify on this point: I have reproduced this bug on two different ARM
> machines, one using KVM and one using TCG.
>
> In both cases they are ./configure'd without any special ioeventfd-related
> options, which appears to mean CONFIG_EVENTFD=y (in both cases).
>
> In both cases I'm using a single vCPU.

I think I have a theory now; it's quite convoluted. The problem is a
deadlock in ppoll() that is *masked* by unrelated file descriptor traffic
in all of the apparently working cases.

I wrote some ad-hoc debug patches, and this is the log leading up to the
hang:

  io_watch_poll_prepare: chardev:channel0 was_active:0 now_active:0
  qemu_poll_ns: timeout=4281013151888
    poll entry #0 fd 3
    poll entry #1 fd 5
    poll entry #2 fd 0
    poll entry #3 fd 11
    poll entry #4 fd 4
  trying to open virtio-serial channel '/dev/virtio-ports/org.libguestfs.channel.0'
  opened the socket, sock = 3
  udevadm settle
  libguestfs: recv_from_daemon: received GUESTFS_LAUNCH_FLAG
  libguestfs: [21734ms] appliance is up
  Guest launched OK.
  libguestfs: writing the data to the socket (size = 64)
  sent magic GUESTFS_LAUNCH_FLAG
  main_loop waiting for next request
  libguestfs: data written OK

Setup call tree for the backend (ie. the UNIX domain socket):

   1  qemu_chr_open_socket()          [qemu-char.c]
   2  unix_connect_opts()             [util/qemu-sockets.c]
   3  qemu_socket()
   4  connect()
   5  qemu_chr_open_socket_fd()       [qemu-char.c]
   6  io_channel_from_socket()
   7  g_io_channel_unix_new()
   8  tcp_chr_connect()
   9  io_add_watch_poll()
  10  g_source_new()
  11  g_source_attach()

This part connects to libguestfs's UNIX domain socket (the new socket file
descriptor, returned on line 3, is fd 6), and it registers a few callbacks.
Notably, the above doesn't try to add fd 6 to the set of polled file
descriptors.

Then, the setup call tree for the frontend (the virtio-serial port) is as
follows:

  12  virtconsole_initfn()            [hw/char/virtio-console.c]
  13  qemu_chr_add_handlers()         [qemu-char.c]

This reaches into the chardev (ie. the backend referenced by the frontend,
label "channel0"), and sets further callbacks.

The following seems to lead up to the hang:

  14  os_host_main_loop_wait()        [main-loop.c]
  15  glib_pollfds_fill()
  16  g_main_context_prepare()
  17  io_watch_poll_prepare()         [qemu-char.c]
  18  chr_can_read()                  [hw/char/virtio-console.c]
  19  virtio_serial_guest_ready()     [hw/char/virtio-serial-bus.c]
  20
  21      if (use_multiport(port->vser) && !port->guest_connected) {
  22          return 0;
  23      }
  24
  25  virtqueue_get_avail_bytes()
  26  g_io_create_watch()             // conditionally
  27  qemu_poll_ns()                  [qemu-timer.c]
  28  ppoll()

Line 15: glib_pollfds_fill() prepares the array of file descriptors for
polling. As a first step,

Line 16: it calls g_main_context_prepare(). This GLib function runs the
"prepare" callbacks for the GSources in the main context. The GSource for
fd 6 has been allocated on line 10 above, and its "prepare" callback has
been set to io_watch_poll_prepare() there. It is called on line 17.

Line 17: io_watch_poll_prepare() is a crucial function. It decides whether
fd 6 (the backend fd) will be added to the set of pollfds or not. It checks
whether the frontend has become writeable (ie. it must have been
unwriteable up to now, but it must be writeable now).
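For reference, here is a simplified, self-contained sketch of how I read
that decision (modeled on qemu-char.c, not copied verbatim; the struct is
trimmed down to the fields that matter here). The "was_active" /
"now_active" names correspond to the debug line quoted above:

  #include <glib.h>
  #include <stdbool.h>

  /* Simplified model of the IOWatchPoll source from qemu-char.c. */
  typedef struct {
      GSource parent;
      GIOChannel *channel;         /* wraps the backend fd (fd 6 here)       */
      GSource *src;                /* child watch; non-NULL <=> fd is polled */
      int (*fd_can_read)(void *);  /* frontend callback, ie. chr_can_read()  */
      GSourceFunc fn;              /* data handler for the child watch       */
      void *opaque;
  } IOWatchPoll;

  static gboolean io_watch_poll_prepare(GSource *source, gint *timeout)
  {
      IOWatchPoll *iwp = (IOWatchPoll *) source;

      /* Ask the frontend whether it could accept data right now
       * (chr_can_read() -> virtio_serial_guest_ready()). */
      bool now_active = iwp->fd_can_read(iwp->opaque) > 0;
      bool was_active = iwp->src != NULL;

      if (was_active == now_active) {
          return FALSE;            /* no change: keep the watch, or keep
                                      its absence */
      }

      if (now_active) {
          /* Frontend became writeable: create the child watch. Only this
           * step gets the backend fd into the set handed to ppoll(). */
          iwp->src = g_io_create_watch(iwp->channel,
                                       G_IO_IN | G_IO_ERR | G_IO_HUP);
          g_source_set_callback(iwp->src, iwp->fn, iwp->opaque, NULL);
          g_source_attach(iwp->src, NULL);
      } else {
          /* Frontend became unwriteable: drop the watch, and the fd with it. */
          g_source_destroy(iwp->src);
          g_source_unref(iwp->src);
          iwp->src = NULL;
      }
      return FALSE;
  }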
If the frontend has indeed become writeable, a (persistent) watch is
created (on line 26), which is the action that includes fd 6 in the set of
pollfds after all. If there is no change in the status of the frontend, the
watch is not changed.

io_watch_poll_prepare() checks for the writeability of the frontend (ie.
the virtio-serial port) via the "fd_can_read" callback. This has been set
to chr_can_read() on line 13, inside virtconsole_initfn(). So, the
frontend-writeability check happens in chr_can_read(), which simply calls:

Line 19: virtio_serial_guest_ready(). This function *normally* checks for
the available room in the virtqueue (the guest receives serial port data
from the host by submitting "receive requests" that must be filled in by
the host); see line 25. However, virtio_serial_guest_ready() first verifies
whether the guest has connected to the virtio-serial port at all. If not,
then the function reports the frontend unwriteable (lines 21-23).

Now, right before the hang, the guest hasn't yet connected to the
virtio-serial port. Therefore line 22 fires (= the virtio-serial port is
unwriteable), which in turn results in *no* watch being created for the
backend. Consequently, the UNIX domain socket (fd 6) is not added to the
set of pollfds:

  io_watch_poll_prepare: chardev:channel0 was_active:0 now_active:0
  qemu_poll_ns: timeout=4281013151888
    poll entry #0 fd 3
    poll entry #1 fd 5
    poll entry #2 fd 0
    poll entry #3 fd 11
    poll entry #4 fd 4

At this point the IO thread is blocked in ppoll(). Then, the guest connects
to the serial port, and sends data:

  trying to open virtio-serial channel '/dev/virtio-ports/org.libguestfs.channel.0'
  opened the socket, sock = 3
  udevadm settle

As discussed before, this guest-to-host transfer is handled by the VCPU
thread, and the data is written to fd 6 (the UNIX domain socket). The
host-side libguestfs component reads it, and answers:

  libguestfs: recv_from_daemon: received GUESTFS_LAUNCH_FLAG
  libguestfs: [21734ms] appliance is up
  Guest launched OK.
  libguestfs: writing the data to the socket (size = 64)
  sent magic GUESTFS_LAUNCH_FLAG
  main_loop waiting for next request      # note, this is a libguestfs message!
  libguestfs: data written OK

Unfortunately, ppoll() is not watching fd 6 at all, hence the deadlock.

What about the successful cases though? A good proportion of the attempts
succeed. This is explained by the fact that *other* file descriptors can
break the IO thread out of ppoll().

- The most common example is KVM *with* eventfd support. The KVM eventfd
  (the host notifier) is part of the pollfd set, and whenever the guest
  sends some data, KVM kicks the eventfd, and ppoll() returns. This masks
  the problem universally.

- In the other two cases, we either have KVM *without* eventfd support, or
  TCG. Rich reproduced the hang under both, and he has seen successful
  (actually: masked deadlock) cases as well, on both. In these setups the
  file descriptor traffic that masks the problem does not come from a KVM
  eventfd, hence the wakeup is quite random. There is sometimes a
  perceivable pause between ppoll() going to sleep and waking up. At other
  times there's no other fd traffic, and the deadlock persists.

In my testing on Rich's ARM box, the unrelated fd that breaks the IO thread
out of ppoll() is the eventfd that belongs to the AIO thread pool.
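To make the masking concrete: any fd that is already in the pollfd set and
becomes ready wakes the blocked ppoll(), and only then do the "prepare"
callbacks get another chance to rebuild the set. A tiny standalone demo of
that wake-up (not QEMU code; all names in it are made up for illustration):

  #define _GNU_SOURCE
  #include <poll.h>
  #include <pthread.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <sys/eventfd.h>
  #include <unistd.h>

  /* Stand-in for "some unrelated thread kicks its eventfd". */
  static void *kicker(void *arg)
  {
      int efd = *(int *) arg;
      uint64_t one = 1;

      sleep(1);                      /* pretend work happens elsewhere */
      write(efd, &one, sizeof one);  /* the kick */
      return NULL;
  }

  int main(void)
  {
      int efd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
      struct pollfd pfd = { .fd = efd, .events = POLLIN };
      pthread_t t;

      pthread_create(&t, NULL, kicker, &efd);

      /* Blocks with no timeout until *some* fd in the set becomes ready,
       * analogous to the IO thread's ppoll() inside qemu_poll_ns(). */
      int ret = ppoll(&pfd, 1, NULL, NULL);
      printf("ppoll returned %d, revents=0x%x\n", ret, pfd.revents);

      pthread_join(t, NULL);
      close(efd);
      return 0;
  }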
This thread-pool eventfd is fd 11 in the hanging run, and it is allocated
here:

  0x00512d3c in event_notifier_init (e=0x1f4ad80, active=0)
      at util/event_notifier-posix.c:34
  34          ret = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
  (gdb) n
  39          if (ret >= 0) {
  (gdb) print ret
  $2 = 11
  (gdb) where
  #0  event_notifier_init (e=0x1f4ad80, active=0) at util/event_notifier-posix.c:39
  #1  0x00300b4c in thread_pool_init_one (pool=0x1f4ad80, ctx=0x1f39e18) at thread-pool.c:295
  #2  0x00300c6c in thread_pool_new (ctx=0x1f39e18) at thread-pool.c:313
  #3  0x0001686c in aio_get_thread_pool (ctx=0x1f39e18) at async.c:238
  #4  0x0006f9d8 in paio_submit (bs=0x1f4a020, fd=10, sector_num=0, qiov=0xbe88c164, nb_sectors=4, cb=0x389d8 , opaque=0xb6505ec4, type=1) at block/raw-posix.c:799
  #5  0x0006fb84 in raw_aio_submit (bs=0x1f4a020, sector_num=0, qiov=0xbe88c164, nb_sectors=4, cb=0x389d8 , opaque=0xb6505ec4, type=1) at block/raw-posix.c:828
  #6  0x0006fc28 in raw_aio_readv (bs=0x1f4a020, sector_num=0, qiov=0xbe88c164, nb_sectors=4, cb=0x389d8 , opaque=0xb6505ec4) at block/raw-posix.c:836
  #7  0x00038b2c in bdrv_co_io_em (bs=0x1f4a020, sector_num=0, nb_sectors=4, iov=0xbe88c164, is_write=false) at block.c:3957
  #8  0x00038bf0 in bdrv_co_readv_em (bs=0x1f4a020, sector_num=0, nb_sectors=4, iov=0xbe88c164) at block.c:3974
  #9  0x000349d4 in bdrv_co_do_readv (bs=0x1f4a020, sector_num=0, nb_sectors=4, qiov=0xbe88c164, flags=(unknown: 0)) at block.c:2619
  #10 0x00033804 in bdrv_rw_co_entry (opaque=0xbe88c0e8) at block.c:2236
  #11 0x0009ba54 in coroutine_trampoline (i0=32811528, i1=0) at coroutine-ucontext.c:118
  #12 0x492fd160 in setcontext () from /usr/lib/libc.so.6
  #13 0x492fd160 in setcontext () from /usr/lib/libc.so.6
  Backtrace stopped: previous frame identical to this frame (corrupt stack?)

And the transitory hang looks like:

  io_watch_poll_prepare: chardev:channel0 was_active:0 now_active:0
  qemu_poll_ns: timeout=4281193192443
    poll entry #0 fd 3
    poll entry #1 fd 5
    poll entry #2 fd 0
    poll entry #3 fd 11
    poll entry #4 fd 4

Again, at this point the IO thread is blocked in ppoll(),

  trying to open virtio-serial channel '/dev/virtio-ports/org.libguestfs.channel.0'
  opened the socket, sock = 3
  udevadm settle

the guest has transferred out some data,

  libguestfs: recv_from_daemon: received GUESTFS_LAUNCH_FLAG
  libguestfs: [20921ms] appliance is up
  Guest launched OK.
  libguestfs: writing the data to the socket (size = 64)
  sent magic GUESTFS_LAUNCH_FLAG
  main_loop waiting for next request
  libguestfs: data written OK

and the host-side libguestfs has responded.

The IO thread is guaranteed to be blocked in ppoll() at this point, and it
doesn't notice that fd 6 is ready for reading. However, the (completely
unrelated) AIO thread-pool eventfd is kicked just then, and poll returns:

  ppoll(): 1, errno=Success
    poll entry #0 fd 3  events 25 revents 0
    poll entry #1 fd 5  events 1  revents 0
    poll entry #2 fd 0  events 1  revents 0
    poll entry #3 fd 11 events 1  revents 1
    poll entry #4 fd 4  events 1  revents 0

This in turn allows the IO thread to run os_host_main_loop_wait() again,
and *now* we see the activation of fd 6 (its frontend, the virtio-serial
port, has been connected by the guest in the meantime and is now
writeable):

  io_watch_poll_prepare: chardev:channel0 was_active:0 now_active:1
  qemu_poll_ns: timeout=0
    poll entry #0 fd 3
    poll entry #1 fd 5
    poll entry #2 fd 0
    poll entry #3 fd 11
    poll entry #4 fd 4
    poll entry #5 fd 6

And stuff works as expected from here on.

The VCPU thread needs to interrupt the IO thread's ppoll() call explicitly.
Basically, when the chardev's attached frontend (in this case, the
virtio-serial port) experiences a change that would cause it to report
writeability in io_watch_poll_prepare() -- lines 17-18 --, it must
interrupt ppoll().

The following call chain seems relevant, but I'm not sure whether it would
be the appropriate place to hook. When the guest message

  trying to open virtio-serial channel '/dev/virtio-ports/org.libguestfs.channel.0'

is printed, the following call chain is executed in the VCPU thread:

  #0  qemu_chr_fe_set_open (chr=0xf22190, fe_open=1) at qemu-char.c:3404
  #1  0x001079dc in set_guest_connected (port=0x1134f00, guest_connected=1) at hw/char/virtio-console.c:83
  #2  0x003cfd94 in handle_control_message (vser=0x1124360, buf=0xb50005f8, len=8) at /home/rjones/d/qemu/hw/char/virtio-serial-bus.c:379
  #3  0x003d0020 in control_out (vdev=0x1124360, vq=0x11246b0) at /home/rjones/d/qemu/hw/char/virtio-serial-bus.c:416
  #4  0x0044afe4 in virtio_queue_notify_vq (vq=0x11246b0) at /home/rjones/d/qemu/hw/virtio/virtio.c:720
  #5  0x0044b054 in virtio_queue_notify (vdev=0x1124360, n=3) at /home/rjones/d/qemu/hw/virtio/virtio.c:726
  #6  0x00271f30 in virtio_mmio_write (opaque=0x11278c8, offset=80, value=3, size=4) at hw/virtio/virtio-mmio.c:264
  #7  0x00456aac in memory_region_write_accessor (mr=0x1128ba8, addr=80, value=0xb5972b08, size=4, shift=0, mask=4294967295) at /home/rjones/d/qemu/memory.c:440
  #8  0x00456c90 in access_with_adjusted_size (addr=80, value=0xb5972b08, size=4, access_size_min=1, access_size_max=4, access=0x4569d0 , mr=0x1128ba8) at /home/rjones/d/qemu/memory.c:477
  #9  0x0045955c in memory_region_dispatch_write (mr=0x1128ba8, addr=80, data=3, size=4) at /home/rjones/d/qemu/memory.c:984
  #10 0x0045cee0 in io_mem_write (mr=0x1128ba8, addr=80, val=3, size=4) at /home/rjones/d/qemu/memory.c:1748
  #11 0x0035d8dc in address_space_rw (as=0xa982f8 , addr=471008336, buf=0xb6f3d028 "\003", len=4, is_write=true) at /home/rjones/d/qemu/exec.c:1954
  #12 0x0035ddf0 in cpu_physical_memory_rw (addr=471008336, buf=0xb6f3d028 "\003", len=4, is_write=1) at /home/rjones/d/qemu/exec.c:2033
  #13 0x00453000 in kvm_cpu_exec (cpu=0x1097020) at /home/rjones/d/qemu/kvm-all.c:1665
  #14 0x0034ca94 in qemu_kvm_cpu_thread_fn (arg=0x1097020) at /home/rjones/d/qemu/cpus.c:802
  #15 0x494c6bc0 in start_thread () from /usr/lib/libpthread.so.0

Unfortunately, the leaf (ie. qemu_chr_fe_set_open()) doesn't do anything
here; the only chardev that sets the "chr_set_fe_open" callback is the
spicevmc backend.

I think the "socket" chardev might want to implement "chr_set_fe_open",
kicking a (new) global eventfd, sending some signal to the IO thread, or
interrupting ppoll() in some other way. A new global eventfd just for this
purpose seems quite the kludge, but it shouldn't be hard to implement; it
needs no handler at all. (A rough, untested sketch of such a hook is
appended below my signature.)

Thanks
Laszlo
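Appendix: a rough, untested sketch of what I mean, reusing the existing
qemu_notify_event() kick instead of introducing a brand new global eventfd
(qemu_notify_event() already exists to interrupt the main loop's poll, so
it may be the simpler option). The hook name below is my own invention; it
would be wired up in qemu_chr_open_socket_fd() (line 5 of the backend call
tree above), and presumably needs #include "qemu/main-loop.h":

  /* Sketch only. The idea: when the frontend reports "open", kick the
   * main loop so that the in-flight ppoll() returns and
   * io_watch_poll_prepare() runs again -- this time it sees a writeable
   * frontend and adds the backend fd to the pollfd set. */
  static void tcp_chr_set_fe_open(struct CharDriverState *chr, int fe_open)
  {
      if (fe_open) {
          qemu_notify_event();  /* interrupts the ppoll() in qemu_poll_ns() */
      }
  }

  /* ... and in qemu_chr_open_socket_fd(): */
      chr->chr_set_fe_open = tcp_chr_set_fe_open;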