Remote TLS connection to Libvirt 0.9.8 hangs (possibly a race condition and very possibly a regression)

Bug #1001798 reported by Andreas Ntaflos
This bug affects 3 people
Affects           Status        Importance  Assigned to  Milestone
libvirt (Ubuntu)  Fix Released  High        Unassigned

Bug Description

Connecting to a remote Libvirt instance on Ubuntu 12.04 (libvirt 0.9.8) via, e.g., virsh -c qemu+tls://hostname.example.com/system simply hangs. I have to terminate virsh with Ctrl+C to get my terminal back.

Connecting to the same remote Libvirt instance via unencrypted TCP (qemu+tcp) or via SSH (qemu+ssh) works. Connecting to a different Libvirt instance on Ubuntu 10.04 from the same client machine as above via qemu+tls works fine, too. So this looks like a regression to me, if that term applies here.
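
For anyone who wants to repeat this comparison without having to press Ctrl+C, here is a minimal sketch that runs the same virsh command over each transport with a timeout. The helper names (run_with_timeout, compare_transports) and the 10-second timeout are assumptions made for illustration; only the virsh -c <uri> list invocation and the qemu+tls / qemu+tcp / qemu+ssh URI schemes come from the report.

```python
import subprocess


def run_with_timeout(cmd, timeout=10.0):
    """Run cmd and classify the outcome as 'ok', 'failed', or 'hung'."""
    try:
        result = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,   # never wait for interactive input
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising, so nothing lingers.
        return "hung"
    return "ok" if result.returncode == 0 else "failed"


def compare_transports(host, transports=("tls", "tcp", "ssh"), timeout=10.0):
    """Try 'virsh -c qemu+<transport>://<host>/system list' per transport."""
    outcomes = {}
    for transport in transports:
        uri = f"qemu+{transport}://{host}/system"
        outcomes[transport] = run_with_timeout(
            ["virsh", "-c", uri, "list"], timeout
        )
    return outcomes
```

Calling compare_transports("hostname.example.com") on a client with virsh installed returns a dict keyed by transport. With the behaviour described above, one would expect "hung" for tls and "ok" for tcp and ssh.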

What is interesting is that the remote connection via qemu+tls to the problematic Libvirt instance *works* in many cases when I run strace on the daemon process AND the strace output goes to stdout. Tracing system calls and printing them to stdout noticeably slows the daemon down, so could this indicate a race condition somewhere in the Libvirt (or TLS/SSL) code?

I will attach debug logs from the problematic remote Libvirt daemon as well as strace outputs, one of each where the connection worked (thanks to the strace voodoo) and one of each where the connection hung.

In the "hang" debug logs there is a line "authentication failed: TLS handshake failed A TLS packet with unexpected length was received." This corresponds to my pressing Ctrl+C on the remote client side (terminating the hanging virsh process). In the "works" debug logs there is also one such line, which means one connection attempt failed in the described manner, but it is followed by a connection attempt that worked. As I said, the strace voodoo makes *most* connection attempts work, but not all. Again, this suggests to me some kind of race condition somewhere.
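
Since the failure is intermittent, it may help to count outcomes over many attempts rather than judge from a single run. The following is a rough sketch, not a libvirt client: it only attempts the TLS handshake against the daemon's TLS port and tallies the results. Port 16514 is libvirtd's default tls_port; the function names, attempt count and timeout are assumptions. A real libvirtd normally requires a client certificate, so the SSLContext passed in would have to be set up with load_verify_locations() and load_cert_chain() for the local CA and client key, and a modern Python may refuse the protocol versions offered by a daemon of that era.

```python
import socket
import ssl
from collections import Counter

LIBVIRT_TLS_PORT = 16514  # libvirtd's default tls_port


def probe_handshake(host, port=LIBVIRT_TLS_PORT, context=None, timeout=5.0):
    """Attempt one TLS handshake; return 'ok', 'timeout', or 'error'."""
    if context is None:
        # Assumption: system trust store, no client certificate.
        context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            # wrap_socket() performs the handshake on the connected socket
            # and inherits its timeout.
            with context.wrap_socket(raw, server_hostname=host):
                return "ok"
    except socket.timeout:
        return "timeout"  # the peer never answered: the "hang" case
    except (ssl.SSLError, OSError):
        return "error"    # refused, reset, or handshake rejected


def tally(host, attempts=20, **kwargs):
    """Repeat the probe and count how often each outcome occurs."""
    return Counter(probe_handshake(host, **kwargs) for _ in range(attempts))
```

Running tally() once with strace attached to the daemon and once without would put a number on "most attempts work, but not all".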

Please let me know what I can do to debug and test this further.

Revision history for this message
Andreas Ntaflos (daff) wrote :
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Thanks. In your case, is the client also 12.04? An incompatibility between oneiric clients and precise hosts has been reported recently (hence marking this confirmed). If precise clients also cause a problem, that is new to me and good to know.

Changed in libvirt (Ubuntu):
status: New → Confirmed
importance: Undecided → Medium
importance: Medium → High
Revision history for this message
Andreas Ntaflos (daff) wrote :

In my case, the client was indeed 12.04, but I just tested with a 10.04 client (0.9.2-5ubuntu0+dnjl0~lucid0, from https://launchpad.net/~dnjl/+archive/virtualization) and the problem is the same. So what I can say is this:

 * 12.04 virsh client via TLS to 12.04 libvirt host doesn't work
 * 10.04 virsh client via TLS to 12.04 libvirt host doesn't work
 * 12.04 virsh client via TLS to 10.04 libvirt host works
 * 10.04 virsh client via TLS to 10.04 libvirt host works
 * All virsh clients to all libvirt hosts work with TCP or SSH
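
The list above can be encoded as data to check which factor the failures track. This is only an illustrative sketch: the RESULTS table restates the outcomes reported in this comment (True means the connection works), and failing_values is a made-up helper name.

```python
from itertools import product

FIELDS = ("client", "host", "transport")
RELEASES = ("10.04", "12.04")

# TLS outcomes exactly as listed above, keyed by (client, host, transport).
RESULTS = {
    ("12.04", "12.04", "tls"): False,
    ("10.04", "12.04", "tls"): False,
    ("12.04", "10.04", "tls"): True,
    ("10.04", "10.04", "tls"): True,
}
# "All virsh clients to all libvirt hosts work with TCP or SSH."
for client, host, transport in product(RELEASES, RELEASES, ("tcp", "ssh")):
    RESULTS[(client, host, transport)] = True


def failing_values(results):
    """For each field, collect the values that appear in failing rows."""
    failures = [key for key, works in results.items() if not works]
    return {
        name: {key[i] for key in failures}
        for i, name in enumerate(FIELDS)
    }


summary = failing_values(RESULTS)
```

Here summary has both releases under "client" but only "12.04" under "host" and only "tls" under "transport", so the failures follow the 12.04 host combined with TLS and do not depend on the client release.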

Anything else I can do?

Revision history for this message
Serge Hallyn (serge-hallyn) wrote : Re: [Bug 1001798] Re: Remote TLS connection to Libvirt 0.9.8 hangs (possibly a race condition and very possibly a regression)

Quoting Andreas Ntaflos (<email address hidden>):
> In my case, the client was indeed 12.04, but I just tested with a 10.04
> client (0.9.2-5ubuntu0+dnjl0~lucid0, from
> https://launchpad.net/~dnjl/+archive/virtualization) and the problem is
> the same. So what I can say is this:
>
> * 12.04 virsh client via TLS to 12.04 libvirt host doesn't work
> * 10.04 virsh client via TLS to 12.04 libvirt host doesn't work
> * 12.04 virsh client via TLS to 10.04 libvirt host works
> * 10.04 virsh client via TLS to 10.04 libvirt host works
> * All virsh clients to all libvirt hosts work with TCP or SSH

Thanks very much.

> Anything else I can do?

No, thanks much.

Revision history for this message
Malte Swart (malte.swart) wrote :

I have the same problem. I can't connect to a 12.04 libvirtd server from either Ubuntu 12.04 or Gentoo (libvirt 0.9.3) with TLS. SSH works.

Any update?

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Andreas,

I ran that test case on precise for 3.5 hours, with virtio network bridged with eth0 and without the vhost_net kernel module loaded, but network never hung.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Sorry, comment on wrong bug.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Testing some combinations: with a quantal server, a quantal client can run virsh -c qemu+tls://guest/server list just fine. A precise client usually works, but occasionally gives a warning as shown below. (I will try a precise server next, as I believe that's the real problem.)

ubuntu@server-17191:~$ virsh -c qemu+tls://10.55.60.186/session list
 Id Name                 State
----------------------------------

ubuntu@server-17191:~$ virsh -c qemu+tls://10.55.60.186/session list
 Id Name                 State
----------------------------------

ubuntu@server-17191:~$ virsh -c qemu+tls://10.55.60.186/session list
 Id Name                 State
----------------------------------

2012-09-27 16:24:33.223+0000: 18600: info : libvirt version: 0.9.8
2012-09-27 16:24:33.223+0000: 18600: warning : virNetClientIncomingEvent:1660 : Something went wrong during async message processing

Revision history for this message
Andreas Ntaflos (daff) wrote :

It's funny. I have three machines where I can always reproduce the behaviour in my original description and two machines where I can connect and list just fine. All running Precise. The two where it works are slightly newer, i.e. have been set up some time in the last two months, whereas the others are from May or so. Otherwise the machines are identical PowerEdge R710s. Around the same number of VMs, same CPUs, amount of RAM, network configuration, etc.

I don't know if it makes any difference in these tests, but it seems you have no VMs running on your virtualisation host. Maybe add some?

It may also be noteworthy that our VMs' disks are all LVM-based, which is why we are also bitten by bug #1027987. I wouldn't want to presume anything but maybe there is some relation between this bug and that.

Unfortunately I still need some time before I can test any of this out on some spare machines running Quantal.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Is this bug still unresolved?

Changed in libvirt (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Andreas Ntaflos (daff) wrote :

Since my last comment here over two years ago I haven't had the time or resources to debug this any further. We have since switched to using unencrypted Libvirt connections (qemu+tcp:///), which is "good enough" since all virtualisation hosts are on a separate and "secure" management subnet and VLAN.

But I tried the qemu+tls:/// connections again just now, to six different virtualisation hosts, and I *can't* reproduce this problem any more.

The hosts are all running Ubuntu 12.04.5, some with the Trusty HWE kernel (e.g. 3.13.0-40-generic), some with the original Precise kernel (e.g. 3.2.0-74-generic). Libvirt is installed in version 0.9.8-2ubuntu17.20. The Libvirt client in all cases is also an Ubuntu 12.04.5 machine, also running Libvirt 0.9.8-2ubuntu17.20.

We currently also use our Puppet CA and its issued certificates not only for Puppet but also for Libvirt and other services. I don't think this makes a difference, but two years ago, when I ran into this problem, we were using keys and certificates issued by our own internal CA.

So to me this looks resolved, but since I have no idea what caused the problem originally or what exactly has changed in Libvirt since then in that regard, I can only really say "WORKSFORME".

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Thanks very much for the update

Changed in libvirt (Ubuntu):
status: Incomplete → Fix Released