libvirt boot race on xen hypervisor

Bug #922486 reported by Stefan Bader on 2012-01-27
Affects: libvirt (Ubuntu)
Importance: Undecided
Assigned to: Unassigned

Bug Description

Release: Precise
Architecture: amd64

Right after boot, the list of running instances is broken when queried via xen+ssh:

 smb@arcturus:~$ virsh -c xen+ssh://smb@argabuthon list
 Id Name State
----------------------------------

(The same happens when connecting locally as a non-root user via xen:///.) Running locally as root does work:

root@argabuthon:~# virsh -c xen:/// list
2012-01-27 09:25:46.423+0000: 4448: info : libvirt version: 0.9.8
2012-01-27 09:25:46.423+0000: 4448: warning : xenHypervisorMakeCapabilities:2748 : Failed to get host power management capabilities
 Id Name State
----------------------------------
  0 Domain-0 running
  1 p-t1-micro64 idle

After stopping and starting the libvirt-bin service, this also works in the xen+ssh case:

root@argabuthon:~# service libvirt-bin stop
libvirt-bin stop/waiting
root@argabuthon:~# service libvirt-bin start
libvirt-bin start/running, process 4473

smb@arcturus:~$ virsh -c xen+ssh://smb@argabuthon list
 Id Name State
----------------------------------
  0 Domain-0 running
  1 p-t1-micro64 idle

Stefan Bader (smb) wrote :

So far it seems possible to see which RPC calls the daemon handles. Unfortunately, I am not sure how, or whether it is possible, to see the results of those calls.
In both cases the sequence seems to start with:

debug : virNetServerClientSendMessage:1116 : RPC_SERVER_CLIENT_MSG_TX_QUEUE: client=0x223e000 len=32 prog=536903814 vers=1 proc=51(remoteDispatchNumOfDomainsHelper) type=1 status=0 serial=2
debug : virNetServerProgramDispatch:269 : prog=536903814 ver=1 type=0 status=0 serial=2 proc=51
debug : remoteDispatchNumOfDomainsHelper:9398 : server=0x2280810 client=0x7f8380001b90 msg=0x7f8380001f30 rerr=0x7f838d06ec70 args=0x2294c50 ret=0x229f1d0
debug : virConnectNumOfDomains:1874 : conn=0x2294fa0
debug : virNetServerClientSendMessage:1106 : msg=0x7f8380001f30 proc=51 len=32 offset=0
debug : virNetServerClientSendMessage:1116 : RPC_SERVER_CLIENT_MSG_TX_QUEUE: client=0x7f8380001b90 len=32 prog=536903814 vers=1 proc=51 type=1 status=0 serial=2
...
debug : virNetMessageDecodeLength:149 : Got length, now need 32[28] total (28[24] more)*
*numbers in [] in the non-working case
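As a side note on reading these numbers: libvirt's RPC frames are length-prefixed, with the leading four bytes (in network byte order) giving the total frame size including themselves, which is why "need 32 total" pairs with "28 more". A minimal, hypothetical sketch of that decode step (not libvirt's actual code):

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the framing behind "Got length, now need
 * 32 total (28 more)": the first 4 bytes of each RPC frame hold the
 * total frame length in network byte order, including the length
 * word itself, so after reading it, total - 4 bytes remain. */
static uint32_t frame_total(const unsigned char hdr[4])
{
    uint32_t len;
    memcpy(&len, hdr, sizeof(len));
    return ntohl(len);
}

static uint32_t frame_remaining(const unsigned char hdr[4])
{
    return frame_total(hdr) - 4;  /* the length word was already read */
}
```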

After this, there are a few messages about handling the message, but in neither case anything that looks like an answer. Then, in the working case, there is a:

debug : virNetServerClientDispatchRead:886 : RPC_SERVER_CLIENT_MSG_RX: client=0x7f8380001b90 len=32 prog=536903814 vers=1 proc=37(remoteDispatchListDomainsHelper) type=0 status=0 serial=3
debug : virNetServerProgramDispatch:269 : prog=536903814 ver=1 type=0 status=0 serial=3 proc=37
debug : remoteDispatchListDomainsHelper:7213 : server=0x2280810 client=0x7f8380001b90 msg=0x7f8380041fa0 rerr=0x7f838c86dc70 args=0x229f1d0 ret=0x23704e0
debug : virConnectListDomains:1835 : conn=0x2294fa0, ids=0x2370260, maxids=1
debug : virNetServerClientSendMessage:1116 : RPC_SERVER_CLIENT_MSG_TX_QUEUE: client=0x7f8380001b90 len=36 prog=536903814 vers=1 proc=37 type=1 status=0 serial=3

So both traces seem to continue with the same RPC call next. Right now, I do not understand why the difference happens. A first thought would be that the very first call yields a different result, but it does not seem to differ between the two traces (except for the difference in the message about need and total). So the common next call looks like:

debug : virNetServerClientDispatchRead:886 : RPC_SERVER_CLIENT_MSG_RX: client=0x7f8380001b90 len=28 prog=536903814 vers=1 proc=25(remoteDispatchNumOfDefinedDomainsHelper) type=0 status=0 serial=4 (3 in the non-working case)
debug : virNetServerProgramDispatch:269 : prog=536903814 ver=1 type=0 status=0 serial=4 proc=25
debug : remoteDispatchNumOfDefinedDomainsHelper:9206 : server=0x2280810 client=0x7f8380001b90 msg=0x7f8380001f30 rerr=0x7f838c06cc70 args=0x23704e0 ret=0x2370920
debug : virConnectNumOfDefinedDomains:7528 : conn=0x2294fa0
debug : virNetServerClientSendMessage:1106 : msg=0x7f8380001f30 proc=25 len=32 offset=0
debug : virNetServerClientSendMessage:1116 : RPC_SERVER_CLIENT_MSG_TX_QUEUE: client=0x7f8380001b90 len=32 prog=536903814 vers=1 proc=25 type=1 status=0 serial=4

Stefan Bader (smb) wrote :

Adding a little debug to libvirt it becomes clearer:

debug : xenUnifiedNumOfDomains:595 : NumOfDomains from XS: 0 (right after boot)

The value changes to 1+ (dom0 + running domUs) after libvirtd has been restarted. The XS subdriver is used for the query in both cases.

Stefan Bader (smb) wrote :

Next step down: XS does not list the domains because a cross-check with the hypervisor fails. And that fails because the hypervisor driver does not seem to be initialized correctly (without any error message): at least the hypervisor version is still at -1.

Stefan Bader (smb) wrote :

So the problem is that when libvirtd is started, it does a first init of the Xen hypervisor driver. At that point the xend start script has not yet run; that script loads xenfs, which in turn creates /proc/xen/privcmd. This file is exactly what the init function of the hypervisor driver checks first, and on not finding it, it decides that there is no hypervisor.
What is a bit odd is that there is no real error in the libvirtd log. Whenever attempts to list domains are made later on, all the Xen drivers appear to load successfully.
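To make the failure mode concrete, here is a minimal, hypothetical model of the startup probe (the path is the one discussed in this bug; the real libvirt code differs in detail): when /proc/xen/privcmd cannot be opened because xenfs is not mounted yet, the result looks identical to a host without Xen at all.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical model of the hypervisor probe at libvirtd startup.
 * If xenfs has not been loaded yet, /proc/xen/privcmd does not
 * exist, open() fails, and the probe wrongly concludes "no
 * hypervisor" even though one is running. */
static int xen_probe(const char *privcmd_path)
{
    int fd = open(privcmd_path, O_RDWR);
    if (fd < 0)
        return -1;  /* indistinguishable from a host without Xen */
    close(fd);
    return 0;
}
```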

Stefan Bader (smb) wrote :

I tried the following change, which simply resets the internal status back to uninitialized when the init call fails due to a failed open on the socket file. This of course means the init is retried every time someone connects to libvirt via a Xen URI, but it seems to do the right thing overall, with or without a Xen hypervisor present.

Of course, the other option is to build the xenfs driver into the kernel rather than as a module, but it feels a bit dumb to make the kernel bigger just because of this race. Or we could change the xend startup to stop and start libvirtd if present, which also seems a bit wrong.
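A simplified model of the first option, the one tried above (not the actual libvirt patch): a failed open of the privcmd file resets the cached state back to uninitialized, so a later connect retries the init once xenfs has been loaded, instead of permanently caching "no hypervisor".

```c
#include <unistd.h>

/* Simplified model of the proposed fix: a failed open of the privcmd
 * file resets the cached state, so a later connect retries the init
 * once xenfs has created the file. */
static int xen_initialized;  /* 0 = not tried yet, 1 = done */

int xen_init(const char *privcmd_path)
{
    if (xen_initialized)
        return 0;
    xen_initialized = 1;
    if (access(privcmd_path, F_OK) != 0) {
        xen_initialized = 0;  /* transient failure: allow a retry */
        return -1;
    }
    return 0;
}
```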

Stefan Bader (smb) wrote :

One other thing I stumbled over is the snippet below. I just wonder whether the handle should not be set to -1 before bailing out when init fails.

virDrvOpenStatus
xenHypervisorOpen(virConnectPtr conn,
                  virConnectAuthPtr auth ATTRIBUTE_UNUSED,
                  unsigned int flags)
{
    int ret;
    xenUnifiedPrivatePtr priv = (xenUnifiedPrivatePtr) conn->privateData;

    virCheckFlags(VIR_CONNECT_RO, VIR_DRV_OPEN_ERROR);

    if (initialized == 0)
        if (xenHypervisorInit(NULL) == -1)
            return -1;

    priv->handle = -1;
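Following up on that question, a reordered sketch (with simplified stand-in types, not the real libvirt structs) shows the idea: assign the handle its safe default before the init check, so a failed init cannot leave it in an undefined state.

```c
/* Simplified sketch of the reordering suggested above: priv->handle
 * gets its safe default of -1 before the early return, so callers
 * never see an uninitialized handle after a failed init. The types
 * are stand-ins for the real libvirt structs. */
typedef struct {
    int handle;  /* fd of /proc/xen/privcmd, or -1 */
} xen_unified_private;

static int sketch_initialized;               /* 0 until init succeeds */
static int sketch_init(void) { return -1; }  /* simulate init failure */

int xen_hypervisor_open_sketch(xen_unified_private *priv)
{
    priv->handle = -1;  /* moved before the init check */

    if (sketch_initialized == 0)
        if (sketch_init() == -1)
            return -1;  /* handle is already the safe -1 here */

    /* ... would open /proc/xen/privcmd and store the fd ... */
    return 0;
}
```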

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 0.9.8-2ubuntu9

---------------
libvirt (0.9.8-2ubuntu9) precise; urgency=low

  [ Stefan Bader ]
  * xen_hypervisor: libvirtd can be started before xenfs has been loaded
    as a module. A missing privcmd file is not necessarily a permanent
    error. (LP: #922486)

  [ Serge Hallyn ]
  * debian/libvirt-bin.upstart: start on just 'runlevel [2345]'
 -- Serge Hallyn <email address hidden> Wed, 08 Feb 2012 11:20:35 -0600

Changed in libvirt (Ubuntu):
status: New → Fix Released