libvirt boot race on xen hypervisor

Bug #922486 reported by Stefan Bader on 2012-01-27
Affects: libvirt (Ubuntu)
Importance: Undecided
Assigned to: Unassigned

Bug Description

Release: Precise
Architecture: amd64

Right after boot, the list of running instances is broken when queried via xen+ssh:

 smb@arcturus:~$ virsh -c xen+ssh://smb@argabuthon list
 Id Name State
----------------------------------

(The same happens when connecting locally as a non-root user via xen:///.) Running locally as root does work:

root@argabuthon:~# virsh -c xen:/// list
2012-01-27 09:25:46.423+0000: 4448: info : libvirt version: 0.9.8
2012-01-27 09:25:46.423+0000: 4448: warning : xenHypervisorMakeCapabilities:2748 : Failed to get host power management capabilities
 Id Name State
----------------------------------
  0 Domain-0 running
  1 p-t1-micro64 idle

After stopping and starting the libvirt-bin service, this also works in the xen+ssh case:

root@argabuthon:~# service libvirt-bin stop
libvirt-bin stop/waiting
root@argabuthon:~# service libvirt-bin start
libvirt-bin start/running, process 4473

smb@arcturus:~$ virsh -c xen+ssh://smb@argabuthon list
 Id Name State
----------------------------------
  0 Domain-0 running
  1 p-t1-micro64 idle

Stefan Bader (smb) wrote :

So far it seems possible to see which RPC calls the daemon handles. Unfortunately, I am not sure how, or whether it is possible, to see the results of those calls.
In both cases the sequence seems to start with:

debug : virNetServerClientSendMessage:1116 : RPC_SERVER_CLIENT_MSG_TX_QUEUE: client=0x223e000 len=32 prog=536903814 vers=1 proc=51(remoteDispatchNumOfDomainsHelper) type=1 status=0 serial=2
debug : virNetServerProgramDispatch:269 : prog=536903814 ver=1 type=0 status=0 serial=2 proc=51
debug : remoteDispatchNumOfDomainsHelper:9398 : server=0x2280810 client=0x7f8380001b90 msg=0x7f8380001f30 rerr=0x7f838d06ec70 args=0x2294c50 ret=0x229f1d0
debug : virConnectNumOfDomains:1874 : conn=0x2294fa0
debug : virNetServerClientSendMessage:1106 : msg=0x7f8380001f30 proc=51 len=32 offset=0
debug : virNetServerClientSendMessage:1116 : RPC_SERVER_CLIENT_MSG_TX_QUEUE: client=0x7f8380001b90 len=32 prog=536903814 vers=1 proc=51 type=1 status=0 serial=2
...
debug : virNetMessageDecodeLength:149 : Got length, now need 32[28] total (28[24] more)*
*numbers in [] in the non-working case
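As a side note on reading these numbers: libvirt's RPC frames are length-prefixed, with the leading four bytes (in network byte order) giving the total frame size including themselves, which is why "need 32 total" pairs with "28 more". A minimal, hypothetical sketch of that decode step (not libvirt's actual code):

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the framing behind "Got length, now need
 * 32 total (28 more)": the first 4 bytes of each RPC frame hold the
 * total frame length in network byte order, including the length
 * word itself, so after reading it, total - 4 bytes remain. */
static uint32_t frame_total(const unsigned char hdr[4])
{
    uint32_t len;
    memcpy(&len, hdr, sizeof(len));
    return ntohl(len);
}

static uint32_t frame_remaining(const unsigned char hdr[4])
{
    return frame_total(hdr) - 4;  /* the length word was already read */
}
```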

After this, there are a few messages about handling the message, but in neither case anything that looks like an answer. Then, in the working case, there is a:

debug : virNetServerClientDispatchRead:886 : RPC_SERVER_CLIENT_MSG_RX: client=0x7f8380001b90 len=32 prog=536903814 vers=1 proc=37(remoteDispatchListDomainsHelper) type=0 status=0 serial=3
debug : virNetServerProgramDispatch:269 : prog=536903814 ver=1 type=0 status=0 serial=3 proc=37
debug : remoteDispatchListDomainsHelper:7213 : server=0x2280810 client=0x7f8380001b90 msg=0x7f8380041fa0 rerr=0x7f838c86dc70 args=0x229f1d0 ret=0x23704e0
debug : virConnectListDomains:1835 : conn=0x2294fa0, ids=0x2370260, maxids=1
debug : virNetServerClientSendMessage:1116 : RPC_SERVER_CLIENT_MSG_TX_QUEUE: client=0x7f8380001b90 len=36 prog=536903814 vers=1 proc=37 type=1 status=0 serial=3

So both traces seem to continue with the same RPC call next. Right now, I do not understand why the difference happens. A first thought would be that the very first call yields a different result, but it does not seem to differ between the two traces (except for the difference in the message about need and total). So the common next call looks like:

debug : virNetServerClientDispatchRead:886 : RPC_SERVER_CLIENT_MSG_RX: client=0x7f8380001b90 len=28 prog=536903814 vers=1 proc=25(remoteDispatchNumOfDefinedDomainsHelper) type=0 status=0 serial=4 (3 in the non-working case)
debug : virNetServerProgramDispatch:269 : prog=536903814 ver=1 type=0 status=0 serial=4 proc=25
debug : remoteDispatchNumOfDefinedDomainsHelper:9206 : server=0x2280810 client=0x7f8380001b90 msg=0x7f8380001f30 rerr=0x7f838c06cc70 args=0x23704e0 ret=0x2370920
debug : virConnectNumOfDefinedDomains:7528 : conn=0x2294fa0
debug : virNetServerClientSendMessage:1106 : msg=0x7f8380001f30 proc=25 len=32 offset=0
debug : virNetServerClientSendMessage:1116 : RPC_SERVER_CLIENT_MSG_TX_QUEUE: client=0x7f8380001b90 len=32 prog=536903814 vers=1 proc=25 type=1 status=0 serial=4

Stefan Bader (smb) wrote :

Adding a little debug to libvirt it becomes clearer:

debug : xenUnifiedNumOfDomains:595 : NumOfDomains from XS: 0 (right after boot)

The value changes to 1+ (dom0 + running domUs) after libvirtd has been restarted. The XS subdriver is used for the query in both cases.

Stefan Bader (smb) wrote :

Next step down: XS does not list the domains because a cross-check with the hypervisor fails. And that fails because the hypervisor driver does not seem to be initialized correctly (without any error message): at least the hypervisor version is still at -1.

Stefan Bader (smb) wrote :

So the problem is that when libvirtd is started, it does a first init of the Xen hypervisor driver. At that point the xend start script has not yet run; that script loads xenfs, which in turn creates /proc/xen/privcmd. This file is exactly what the init function of the hypervisor driver checks first, and on not finding it, it decides that there is no hypervisor.
What is a bit odd is that there is no real error in the libvirtd log. Whenever attempts to list domains are made later on, all the Xen drivers appear to load successfully.
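To make the failure mode concrete, here is a minimal, hypothetical model of the startup probe (the path is the one discussed in this bug; the real libvirt code differs in detail): when /proc/xen/privcmd cannot be opened because xenfs is not mounted yet, the result looks identical to a host without Xen at all.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical model of the hypervisor probe at libvirtd startup.
 * If xenfs has not been loaded yet, /proc/xen/privcmd does not
 * exist, open() fails, and the probe wrongly concludes "no
 * hypervisor" even though one is running. */
static int xen_probe(const char *privcmd_path)
{
    int fd = open(privcmd_path, O_RDWR);
    if (fd < 0)
        return -1;  /* indistinguishable from a host without Xen */
    close(fd);
    return 0;
}
```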

Stefan Bader (smb) wrote :

I tried the following change, which simply resets the internal status back to uninitialized when the init call fails due to a failed open on the socket file. This of course means the init is retried every time someone connects to libvirt via a Xen URI, but it seems to do the right thing overall, with or without a Xen hypervisor present.

Of course, the other option is to build the xenfs driver into the kernel rather than as a module, but it feels a bit dumb to make the kernel bigger just because of this race. Or we could change the xend startup to stop and start libvirtd if present, which also seems a bit wrong.
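A simplified model of the first option, the one tried above (not the actual libvirt patch): a failed open of the privcmd file resets the cached state back to uninitialized, so a later connect retries the init once xenfs has been loaded, instead of permanently caching "no hypervisor".

```c
#include <unistd.h>

/* Simplified model of the proposed fix: a failed open of the privcmd
 * file resets the cached state, so a later connect retries the init
 * once xenfs has created the file. */
static int xen_initialized;  /* 0 = not tried yet, 1 = done */

int xen_init(const char *privcmd_path)
{
    if (xen_initialized)
        return 0;
    xen_initialized = 1;
    if (access(privcmd_path, F_OK) != 0) {
        xen_initialized = 0;  /* transient failure: allow a retry */
        return -1;
    }
    return 0;
}
```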

Stefan Bader (smb) wrote :

One other thing I stumbled over is the snippet below. I just wonder whether the handle should not be set to -1 before bailing out when init fails.

virDrvOpenStatus
xenHypervisorOpen(virConnectPtr conn,
                  virConnectAuthPtr auth ATTRIBUTE_UNUSED,
                  unsigned int flags)
{
    int ret;
    xenUnifiedPrivatePtr priv = (xenUnifiedPrivatePtr) conn->privateData;

    virCheckFlags(VIR_CONNECT_RO, VIR_DRV_OPEN_ERROR);

    if (initialized == 0)
        if (xenHypervisorInit(NULL) == -1)
            return -1;

    priv->handle = -1;
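Following up on that question, a reordered sketch (with simplified stand-in types, not the real libvirt structs) shows the idea: assign the handle its safe default before the init check, so a failed init cannot leave it in an undefined state.

```c
/* Simplified sketch of the reordering suggested above: priv->handle
 * gets its safe default of -1 before the early return, so callers
 * never see an uninitialized handle after a failed init. The types
 * are stand-ins for the real libvirt structs. */
typedef struct {
    int handle;  /* fd of /proc/xen/privcmd, or -1 */
} xen_unified_private;

static int sketch_initialized;               /* 0 until init succeeds */
static int sketch_init(void) { return -1; }  /* simulate init failure */

int xen_hypervisor_open_sketch(xen_unified_private *priv)
{
    priv->handle = -1;  /* moved before the init check */

    if (sketch_initialized == 0)
        if (sketch_init() == -1)
            return -1;  /* handle is already the safe -1 here */

    /* ... would open /proc/xen/privcmd and store the fd ... */
    return 0;
}
```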

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 0.9.8-2ubuntu9

---------------
libvirt (0.9.8-2ubuntu9) precise; urgency=low

  [ Stefan Bader ]
  * xen_hypervisor: libvirtd can be started before xenfs has been loaded
    as a module. A missing privcmd file is not necessarily a permanent
    error. (LP: #922486)

  [ Serge Hallyn ]
  * debian/libvirt-bin.upstart: start on just 'runlevel [2345]'
 -- Serge Hallyn <email address hidden> Wed, 08 Feb 2012 11:20:35 -0600

Changed in libvirt (Ubuntu):
status: New → Fix Released