libvirt updates all iSCSI targets from host each time a pool is started

Bug #1738864 reported by Laz Peterson on 2017-12-18
This bug affects 1 person
Affects            Status        Importance  Assigned to  Milestone
libvirt (Ubuntu)   Fix Released  Undecided   Unassigned   -
Trusty             Won't Fix     Undecided   Unassigned   -
Xenial             Won't Fix     Undecided   Unassigned   -

Bug Description

Hello everyone, I'm a little confused about the behavior of libvirt in Ubuntu 16.04.3.

We have up to 140 iSCSI targets on a single storage host, and all of these are made available to our VM hosts. If I stop one of the iSCSI pools through virsh ("virsh pool-destroy iscsipool1") and start it back up again while running libvirtd in debug mode, I see that it runs a discovery and then proceeds to update every single target available on that host -- even targets that we do not use -- instead of simply logging in to the one target the pool needs.
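
For reference, the reproduction boils down to something like this (the pool name is from our setup; LIBVIRT_DEBUG is just one way to capture the daemon's debug output):

$ sudo systemctl stop libvirt-bin
$ sudo LIBVIRT_DEBUG=1 /usr/sbin/libvirtd 2> libvirtd-debug.log &
$ time virsh pool-destroy iscsipool1   # immediate
$ time virsh pool-start iscsipool1     # this is the slow step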

This turns a <1 second process into a minimum of 30 seconds; I just ran it with a stopwatch and clocked it at 64 seconds. So if we are doing maintenance on these hosts and reboot, it takes 90-120+ minutes to finish auto-starting all of the iSCSI pools. And of course, during this period the server is completely worthless as a VM host; libvirt is simply stuck until it finishes connecting everything. Manually logging in to the targets with iscsiadm, without any of the extra work libvirt does, connects them immediately -- which is what I would expect from libvirt as well.
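
For comparison, a manual login is something like this and returns immediately (the portal and IQN are placeholders for one of our targets):

$ iscsiadm --mode discovery --type sendtargets --portal 192.168.0.10:3260
$ iscsiadm --mode node --targetname iqn.2017-12.com.example:target1 \
    --portal 192.168.0.10:3260 --login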

And for each of the 140 iSCSI targets, libvirt runs an iscsiadm sendtargets discovery and then updates every single target before finally logging in to the one the respective pool actually needs.
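
The repeated per-target step in the log boils down to roughly the following, once per discovered target (the IQN and portal are placeholders):

$ iscsiadm --mode node --targetname iqn.2017-12.com.example:targetN \
    --portal 192.168.0.10:3260 --op update --name node.startup --value manual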

We also noticed that libvirt in Ubuntu 17.10 does not have this behavior -- or maybe it does, but it connects the iSCSI targets immediately. It is a much different process than on Ubuntu 16.04.3.

Any help would be greatly appreciated. Thank you so much.

Laz Peterson (laz-v) wrote :

Here is the debug log from libvirt. Starting on line 933, you will see libvirt discovering the targets from the host, and then the next 10,000+ lines are libvirt updating the target information for each of those available targets. Finally, on line 11,476, it does what I would expect it to do first: log in to the target.

This particular example took about 15 seconds to connect one iSCSI pool. But that number varies depending on unknown factors, from 15 seconds (which is rare in our case with 140+ targets) up to 90 seconds per pool.

And this same process -- discovering and updating all 140 targets on this one host -- is repeated for each individual pool that is started.

tags: added: iscsi
tags: added: storage

Hi Laz,
thank you for your report.

Thanks for the log; I mostly looked at a condensed version of it, like:
$ awk '/iscsiadm/ {gsub("[0-9]*: debug : virCommandRunAsync:2429 :",""); gsub("+0000",""); gsub("^2017-12-18 ",""); print $0}' 20171218-libvirt.txt | pastebinit
=> http://paste.ubuntu.com/26214102/

The version in 17.10 is libvirt 3.6.
So, according to your report, something between 1.3.1 (Xenial) and 3.6 (Artful) must have fixed it upstream.

I think what you are seeing is the effect of [3], which is active in 1.3.1 and essentially iterates over every discovered target to set "node.startup" to "manual" on each of them.
The fix for that came in 1.3.5 via [4], which uses the iscsiadm discovery option "nonpersistent" to get the same result in a better way.
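
At the iscsiadm level the difference is essentially this: instead of a plain sendtargets discovery followed by one "--op update" per discovered target, the discovery itself is done non-persistently, so there are no node records to fix up afterwards (the portal is a placeholder):

$ iscsiadm --mode discovery --type sendtargets --portal 192.168.0.10:3260 --op nonpersistent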

Fixing this in Xenial would mean backporting [4], but I'm not sure that backporting those changes from a newer libvirt wouldn't impose too much general regression risk for an SRU [1].
I agree it is uncomfortably slow in your case, but this is a rather big behavioral change (wanted in your case, I totally agree), and I'm always wary of [5], having run into issues almost exactly like it.
And since it is "only slow" (I hate saying that, having been a performance engineer in the past), the severity isn't as high as if it were outright broken.

OTOH the patch seems to apply cleanly, and the iscsiadm option is available even in Trusty (important for backports).
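
A quick way to verify that on a given release: run a non-persistent discovery and check that no new node records appear (on Ubuntu the node DB lives under /etc/iscsi/nodes; the portal is a placeholder):

$ iscsiadm --mode discovery --type sendtargets --portal 192.168.0.10:3260 --op nonpersistent
$ ls /etc/iscsi/nodes   # should not contain the targets just discovered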

I wonder if, instead of taking the risk of changing this in an SRU, the best option for you might be the Ubuntu Cloud Archive [2], which would allow you to stick with 16.04 while getting a supported newer virtualization stack.

I'm updating the bug tasks accordingly and look forward to a discussion about the way forward that is best, but also safe, for you and other Ubuntu users overall.

[1]: https://wiki.ubuntu.com/StableReleaseUpdates
[2]: https://wiki.ubuntu.com/OpenStack/CloudArchive
[3]: https://libvirt.org/git/?p=libvirt.git;a=commit;h=3c12b654
[4]: https://libvirt.org/git/?p=libvirt.git;a=commit;h=56057900
[5]: https://xkcd.com/1172/

Changed in libvirt (Ubuntu Trusty):
status: New → Won't Fix
Changed in libvirt (Ubuntu Xenial):
status: New → Confirmed
Changed in libvirt (Ubuntu):
status: New → Fix Released
Laz Peterson (laz-v) wrote :

Hello Christian, thanks for your quick response.

True, it is quite unfortunate that this problem is just a waiting game rather than a game-ending issue. And I definitely agree that forcing a change here might well end in a [5] kind of result.

Regarding [3], yes, I saw that yesterday while searching around. I had not seen [4] yet, but I can see why they wanted a better solution than the "large hammer" approach.

Now, regarding [2], that sounds like a very interesting possibility. I will say that while running Ubuntu 17.10 for a short time, we did experience quite noticeable instability on Xeon-based servers, as well as hard system resets on AMD-based servers. (This required us to go back to 16.04.3 on all hardware last weekend.) Do you think that those issues have anything to do with libvirt, or with other parts of the system?

Also for [2], do you think there would be any harm in updating to all of the newer packages from xenial-pike? Or what do you think the best approach would be? I wasn't aware of this option, and definitely don't know much about taking it safely into datacenter production. But depending on how reliable/stable all of these versions are on 16.04.3, I'd definitely be willing to go this route. After our 17.10 disaster, though, I just need to be cautious. :)

Very interested to hear more. Thanks again, Christian.

You'd add [2] and always take packages from there non-selectively.
So: a normal add-apt-repository, and after that just update/upgrade as usual according to your maintenance policies.
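
Concretely, for the xenial-pike level you mentioned, that would be roughly:

$ sudo add-apt-repository cloud-archive:pike
$ sudo apt update
$ sudo apt full-upgrade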

I'd say that while libvirt/qemu can break things, I have very rarely seen that manifest as system instability - so I'd assume other changes caused what you saw.
In general I'd recommend making the change on only a few systems to begin with and seeing whether they behave up to your expectations for a while.
Also, maybe do not make the change the week before Christmas, but rather in January :-)

On the SRU, I think we agree that we will not change the version in Xenial, for the reasons outlined before - thanks for your understanding. I'm marking the bug task accordingly.

I'll stay subscribed here to help you with the general discussion.

Changed in libvirt (Ubuntu Xenial):
status: Confirmed → Won't Fix
Laz Peterson (laz-v) wrote :

Ha ha -- yes, good point about January!

You are wonderful, thanks so much for your help, Christian. We're going to plan for one VM host to test, with only VMs that are part of an HA pair (and probably a nice big OpenStack test cluster), and leave the primary VM host on the stock 16.04.3 packages. That way, if everything goes nuts, at least we won't have to worry about any services going out.

I'm looking forward to this! What a pleasant surprise to learn about the Ubuntu Cloud Archive!

Take care, happy holidays to you Christian.

Laz Peterson (laz-v) wrote :

Christian, I couldn't hold back from giving this a try. FYI, it's working like a dream.

Thanks again!
