Comment 3 for bug 1673483

Kashyap Chamarthy (kashyapc) wrote :

(From an IRC interaction with Dan Berrangé [danpb] and Matthew Booth
[mdbooth].)

Dan did some log analysis. From the following message in the libvirt
debug logs, he could deduce some clues about the root cause:

   [...]
   2017-03-16 01:01:33.096+0000: 26064: debug : virNetlinkStartup:138 : Running global netlink initialization
   [...]

[danpb]: "The above is showing libvirtd has been restarted, so
presumably it crashed, causing Nova to see the EOF. We should probably
make sure OpenStack CI does *not* restart things, as it just hides the
obvious failure!"

And after looking at the error from 'syslog':

    [...]
    Mar 16 01:01:32 ubuntu-xenial-infracloud-vanilla-7903827 libvirtd[17211]: *** Error in `/usr/sbin/libvirtd': malloc(): memory corruption: 0x0000562d21527f50 ***
    [...]

[danpb]: "So it's a bug in libvirt in Ubuntu. Which is probably fixed
sometime after 1.3.1."
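
NOTE (sketch): The two log excerpts above can also be located
mechanically when triaging similar CI failures. Below is a minimal
Python sketch, not something Dan ran; the function names
(find_restarts, find_crashes) are hypothetical, and the patterns are
derived only from the two lines quoted above, so they may well need
adjusting for other libvirt or glibc versions.

```python
import re

# Both patterns are modelled on the two excerpts quoted in this comment.
RESTART_RE = re.compile(
    r"virNetlinkStartup:\d+ : Running global netlink initialization")
CRASH_RE = re.compile(
    r"libvirtd\[(\d+)\]: \*\*\* Error in `[^']+': (.+?): 0x[0-9a-f]+ \*\*\*")


def find_restarts(debug_log_lines):
    """Return the timestamp of every line showing libvirtd starting up."""
    stamps = []
    for line in debug_log_lines:
        if RESTART_RE.search(line):
            # Debug lines begin with 'YYYY-MM-DD HH:MM:SS.mmm+0000: <pid>: ...'
            stamps.append(line.split(": ", 1)[0].strip())
    return stamps


def find_crashes(syslog_lines):
    """Return (pid, reason) for each glibc abort message from libvirtd."""
    hits = []
    for line in syslog_lines:
        match = CRASH_RE.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits
```

A startup timestamp in the middle of a test run, shortly after a crash
entry in syslog, is the pattern Dan spotted by hand.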

NOTE (kashyap): There is an existing bug for the above memory corruption
here:

    https://bugs.launchpad.net/nova/+bug/1643911/ -- 'libvirt randomly
    crashes on xenial nodes with "*** Error in `/usr/sbin/libvirtd':
    malloc(): memory corruption:"'

After some code inspection, Dan pointed out two libvirt commits that
are likely candidates for having fixed the issue:

(1) One possible candidate in 1.3.2 would be:

    'qemu: Process monitor EOF in a job' --
    https://libvirt.org/git/?p=libvirt.git;a=commit;h=8c9ff99

(2) This might be relevant, too, but less likely:

    'qemu: Avoid calling qemuProcessStop without a job' --
    https://libvirt.org/git/?p=libvirt.git;a=commit;h=81f50cb

He notes further: "the first [libvirt] commit hash noted above (8c9ff99)
is something you'd want to encourage Ubuntu maintainers to add as I
don't see it in their patches right now."