Need some kind of 'auto' boolean column in the Service table

Bug #1258625 reported by Matt Riedemann
Affects: OpenStack Compute (nova)
Status: Opinion
Importance: Wishlist
Assigned to: jichenjc

Bug Description

Bug 1250049 reported a problem with automatically disabling/enabling a host via the libvirt driver. Rather than fixing it the right way, i.e. adding a new column to the Service table that indicates whether an admin intentionally disabled the host or nova detected a failure and disabled it automatically, a hack was done instead: prefix the 'disabled_reason' with "AUTO:" and build some logic in the driver around that.
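A minimal sketch of the prefix convention described above, to make the objection concrete. The function names and the dict-based service record are illustrative, not nova's actual code; the point is that every consumer has to string-match the prefix, whereas a boolean column would make "auto-disabled" an explicit flag.

```python
# Hypothetical sketch of the "AUTO:" prefix hack; names are
# illustrative, not nova's real code.
DISABLE_PREFIX = "AUTO: "

def set_host_enabled(service, enabled, reason=None):
    """Disable or re-enable a compute service record.

    A disabled_reason carrying the AUTO: prefix marks the change as
    made by nova itself rather than by an admin.
    """
    if enabled:
        # Only clear a disable that nova itself applied; leave an
        # admin's explicit disable alone.
        if service["disabled"] and \
                (service["disabled_reason"] or "").startswith(DISABLE_PREFIX):
            service["disabled"] = False
            service["disabled_reason"] = None
    else:
        service["disabled"] = True
        service["disabled_reason"] = DISABLE_PREFIX + (reason or "")
    return service

def was_auto_disabled(service):
    # Consumers must string-match the prefix; a boolean column on the
    # Service table would replace this with a plain flag check.
    return service["disabled"] and \
        (service["disabled_reason"] or "").startswith(DISABLE_PREFIX)
```

Note that an admin-disabled host (reason without the prefix) is never re-enabled by the automatic path, which is the behavior the driver-side logic has to encode around the string prefix.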

The problem with that approach is that the ComputeFilter in the scheduler can't perform any kind of retry logic around it if needed, i.e. bug 1257644.

Right now, if the ComputeFilter encounters a disabled host, it just logs that at debug level and skips the host. If the host was automatically disabled because of a connection failure, we should at least log that as a warning in the scheduler (like we do now for hosts that haven't checked in for a while), or possibly build some retry logic around it to make things more robust in case the connection failure is just a hiccup that quickly resolves itself.
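The logging change proposed above can be sketched as follows. This is an assumption-laden illustration, not nova's actual ComputeFilter: the real filter takes a HostState object, and here a plain dict stands in for it, reusing the "AUTO:" prefix convention from the bug description.

```python
import logging

LOG = logging.getLogger(__name__)
DISABLE_PREFIX = "AUTO: "  # prefix convention from the bug description

def host_passes(host_state):
    """Sketch of the proposed ComputeFilter behavior: skip any
    disabled host, but escalate auto-disabled hosts to a warning so
    connection failures are visible in the scheduler log (and can be
    fingerprinted by elastic-recheck).

    host_state is a hypothetical dict; nova's real filter receives a
    HostState object.
    """
    if not host_state.get("disabled"):
        return True
    reason = host_state.get("disabled_reason") or ""
    if reason.startswith(DISABLE_PREFIX):
        # Auto-disabled, likely a dropped libvirt connection: warn.
        LOG.warning("Host %s was automatically disabled: %s",
                    host_state["host"], reason)
    else:
        # Admin-disabled: stay quiet, as today.
        LOG.debug("Host %s is disabled", host_state["host"])
    return False
```

Either way the host is skipped; only the log level changes, which is why the comment thread below questions whether the change is worth a schema migration on its own.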

One could maybe argue that some kind of connection retry logic could be built into the libvirt driver instead; I wouldn't be against that.

Changed in nova:
importance: Undecided → Wishlist
status: New → Triaged
jichenjc (jichenjc)
Changed in nova:
assignee: nobody → jichencom (jichenjc)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/80885

Changed in nova:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/86767

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/86797

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/86869

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/86885

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/87150

Revision history for this message
Matt Riedemann (mriedem) wrote :

Reposting some detailed comments from the review:

Talked with Dan Smith a bit this morning about this bug. For historical context, libvirt dropping the connection and auto-disabling the host was a pretty significant gate failure late last year when I wrote the bug, and I wrote it based on how hard the issue was to identify and track with logstash, specifically because the scheduler was dumping a debug message when the host was disabled regardless of whether it was auto-disabled.

So that led me to think that we should update the service table to add a column for auto-disabled and then the scheduler could check that and log appropriately when the host was down, i.e. warning rather than debug. Then we could fingerprint that in elastic-recheck.

Looking back, there must have been a reason I was focused on the scheduler log, and that's probably because the errors that show up in the compute logs from the libvirt driver are not specific enough to the connection dropping, so maybe we need to strengthen the logging that happens in the libvirt driver when this fails. I did fix an issue in the libvirt driver at the time with this:

https://review.openstack.org/#/c/60563/

That callback code was logging an error, but there have since been changes related to that because we're not supposed to be doing logging in those native threads, since it can cause deadlocks; jswarren fixed that here:

https://review.openstack.org/#/c/79617/

So now we should be getting the warning here, but it's probably very generic:

http://git.openstack.org/cgit/openstack/nova/tree/nova/virt/libvirt/driver.py?id=2014.1.rc2#n582

e.g. "libvirtError: Unable to read from monitor: Connection reset by peer".

Anyway, the rest of bug 1258625 talked about putting retry logic in the scheduler or in the libvirt driver when the connection is not available, so maybe that's the appropriate fix to chase here. I think the vmware driver is already doing some retry logic on connection issues; it seems the libvirt driver could do the same: disable the host, then retry for a while to see if libvirt comes back up, then re-enable the host so the scheduler can use it again.
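The retry idea above could look something like the sketch below. Everything here is an assumption for illustration: `connect` and `is_alive` are hypothetical caller-supplied callables standing in for the driver's connection plumbing, not nova's or libvirt-python's real API.

```python
import time

def reconnect_with_backoff(connect, is_alive, max_attempts=5, base_delay=1.0):
    """Sketch of driver-side retry for a transient libvirt connection
    drop: back off between attempts instead of leaving the host
    disabled on the first failure.

    connect() re-establishes the connection and may raise;
    is_alive(conn) reports whether the connection is usable.
    Returns the live connection, or None if all attempts failed
    (in which case the caller would keep the host disabled).
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            conn = connect()
            if is_alive(conn):
                return conn
        except Exception:
            pass  # treat any failure as "libvirt not back yet"
        if attempt < max_attempts:
            time.sleep(delay)
            delay *= 2  # exponential backoff between attempts
    return None
```

With something like this in the driver, a brief hiccup would resolve itself before the scheduler ever needs to care, which matches Dan's point below about keeping the scheduler isolated from this logic.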

Dan's point was that the scheduler should be isolated from this logic - if the host is disabled the scheduler skips it. In a single compute environment this is a bad failure, but in multi-compute deployment (i.e. production cloud) it should be less of an issue.

I still like the idea of moving the "AUTO:" prefix into the services table as a boolean column, but it's not worth it just to make a log message in the scheduler log go from debug to warning level.

So to summarize, it's probably more worthwhile to pursue better error diagnosis and auto-recovery in the libvirt driver for the intermittent connection drop than to add all this code for the scheduler to use.

Besides, when we get a newer version of libvirt in the gate (see blueprint support-libvirt-1x), the intermittent connection drop might be a non-issue. For all I know, maybe the fix here https://review.openstack.org/#/c/79617/ already cleans up some of the problem.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by jichenjc (<email address hidden>) on branch: master
Review: https://review.openstack.org/87150

Joe Gordon (jogo)
Changed in nova:
status: In Progress → Opinion