Need some kind of 'auto' boolean column in the Service table

Bug #1258625 reported by Matt Riedemann
Affects: OpenStack Compute (nova)
Status: Opinion
Importance: Wishlist
Assigned to: jichenjc

Bug Description

Bug 1250049 reported a problem with automatically disabling/enabling a host via the libvirt driver. Rather than fixing it the right way, i.e. adding a new column to the Service table that indicates whether an admin intentionally disabled the host or nova detected a failure and disabled it automatically, a hack was done instead: prefix the 'disabled_reason' with "AUTO:" and build some logic in the driver around that.
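A minimal sketch of the prefix convention described above, to make the objection concrete. The function names and the dict-based service record are illustrative, not nova's actual code; the point is that every consumer has to string-match the prefix, whereas a boolean column would make "auto-disabled" an explicit flag.

```python
# Hypothetical sketch of the "AUTO:" prefix hack; names are
# illustrative, not nova's real code.
DISABLE_PREFIX = "AUTO: "

def set_host_enabled(service, enabled, reason=None):
    """Disable or re-enable a compute service record.

    A disabled_reason carrying the AUTO: prefix marks the change as
    made by nova itself rather than by an admin.
    """
    if enabled:
        # Only clear a disable that nova itself applied; leave an
        # admin's explicit disable alone.
        if service["disabled"] and \
                (service["disabled_reason"] or "").startswith(DISABLE_PREFIX):
            service["disabled"] = False
            service["disabled_reason"] = None
    else:
        service["disabled"] = True
        service["disabled_reason"] = DISABLE_PREFIX + (reason or "")
    return service

def was_auto_disabled(service):
    # Consumers must string-match the prefix; a boolean column on the
    # Service table would replace this with a plain flag check.
    return service["disabled"] and \
        (service["disabled_reason"] or "").startswith(DISABLE_PREFIX)
```

Note that an admin-disabled host (reason without the prefix) is never re-enabled by the automatic path, which is the behavior the driver-side logic has to encode around the string prefix.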

The problem with that approach is that the ComputeFilter in the scheduler can't perform any kind of retry logic around it if needed, i.e. bug 1257644.

Right now, if the ComputeFilter encounters a disabled host, it just logs that at debug level and skips the host. If the host was automatically disabled because of a connection failure, we should at least log that as a warning in the scheduler (like we do now for hosts that haven't checked in for a while), or possibly build some retry logic around it to make things more robust in case the connection failure is just a hiccup that quickly resolves itself.
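The logging change proposed above can be sketched as follows. This is an assumption-laden illustration, not nova's actual ComputeFilter: the real filter takes a HostState object, and here a plain dict stands in for it, reusing the "AUTO:" prefix convention from the bug description.

```python
import logging

LOG = logging.getLogger(__name__)
DISABLE_PREFIX = "AUTO: "  # prefix convention from the bug description

def host_passes(host_state):
    """Sketch of the proposed ComputeFilter behavior: skip any
    disabled host, but escalate auto-disabled hosts to a warning so
    connection failures are visible in the scheduler log (and can be
    fingerprinted by elastic-recheck).

    host_state is a hypothetical dict; nova's real filter receives a
    HostState object.
    """
    if not host_state.get("disabled"):
        return True
    reason = host_state.get("disabled_reason") or ""
    if reason.startswith(DISABLE_PREFIX):
        # Auto-disabled, likely a dropped libvirt connection: warn.
        LOG.warning("Host %s was automatically disabled: %s",
                    host_state["host"], reason)
    else:
        # Admin-disabled: stay quiet, as today.
        LOG.debug("Host %s is disabled", host_state["host"])
    return False
```

Either way the host is skipped; only the log level changes, which is why the comment thread below questions whether the change is worth a schema migration on its own.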

One could maybe argue that some kind of connection retry logic could be built into the libvirt driver instead; I wouldn't be against that.

Changed in nova:
importance: Undecided → Wishlist
status: New → Triaged
jichenjc (jichenjc)
Changed in nova:
assignee: nobody → jichencom (jichenjc)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/80885

Changed in nova:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/86767

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/86797

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/86869

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/86885

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/87150

Revision history for this message
Matt Riedemann (mriedem) wrote :

Reposting some detailed comments from the review:

Talked with Dan Smith a bit this morning about this bug. For historical context, libvirt dropping the connection and auto-disabling the host was a pretty significant gate failure late last year when I wrote the bug, and I wrote it based on how hard the issue was to identify and track with logstash, specifically because the scheduler was dumping a debug message when the host was disabled regardless of whether it was auto-disabled.

So that led me to think that we should update the service table to add a column for auto-disabled and then the scheduler could check that and log appropriately when the host was down, i.e. warning rather than debug. Then we could fingerprint that in elastic-recheck.

Looking back, there must have been a reason I was focused on the scheduler log, and that's probably because the errors that show up in the compute logs from the libvirt driver are not specific enough to the connection dropping, so maybe we need to strengthen the logging that happens in the libvirt driver when this fails. I did fix an issue in the libvirt driver at the time with this:

https://review.openstack.org/#/c/60563/

That callback code was logging an error, but there have since been changes related to that because we're not supposed to be doing logging in those native threads, since it can cause deadlocks; jswarren fixed that here:

https://review.openstack.org/#/c/79617/

So now we should be getting the warning here, but it's probably very generic:

http://git.openstack.org/cgit/openstack/nova/tree/nova/virt/libvirt/driver.py?id=2014.1.rc2#n582

e.g. "libvirtError: Unable to read from monitor: Connection reset by peer".

Anyway, the rest of bug 1258625 talked about putting retry logic in the scheduler or in the libvirt driver when the connection is not available, so maybe that's the appropriate fix to chase here. I think the vmware driver is already doing some retry logic on connection issues; it seems the libvirt driver could do the same: disable the host, then retry for a while to see if libvirt comes back up, then re-enable the host so the scheduler can use it again.
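The retry idea above could look something like the sketch below. Everything here is an assumption for illustration: `connect` and `is_alive` are hypothetical caller-supplied callables standing in for the driver's connection plumbing, not nova's or libvirt-python's real API.

```python
import time

def reconnect_with_backoff(connect, is_alive, max_attempts=5, base_delay=1.0):
    """Sketch of driver-side retry for a transient libvirt connection
    drop: back off between attempts instead of leaving the host
    disabled on the first failure.

    connect() re-establishes the connection and may raise;
    is_alive(conn) reports whether the connection is usable.
    Returns the live connection, or None if all attempts failed
    (in which case the caller would keep the host disabled).
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            conn = connect()
            if is_alive(conn):
                return conn
        except Exception:
            pass  # treat any failure as "libvirt not back yet"
        if attempt < max_attempts:
            time.sleep(delay)
            delay *= 2  # exponential backoff between attempts
    return None
```

With something like this in the driver, a brief hiccup would resolve itself before the scheduler ever needs to care, which matches Dan's point below about keeping the scheduler isolated from this logic.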

Dan's point was that the scheduler should be isolated from this logic - if the host is disabled the scheduler skips it. In a single compute environment this is a bad failure, but in multi-compute deployment (i.e. production cloud) it should be less of an issue.

I still like the idea of moving the "AUTO:" prefix into the services table as a boolean column, but it's not worth it just to make a log message in the scheduler log go from debug to warning level.

So to summarize, it's probably more worthwhile to pursue better error diagnosis and auto-recovery in the libvirt driver for the intermittent connection drop than to add all this code for the scheduler to use.

Besides, when we get a newer version of libvirt in the gate (see blueprint support-libvirt-1x), the intermittent connection drop might be a non-issue. For all I know, maybe the fix here https://review.openstack.org/#/c/79617/ already cleans up some of the problem.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by jichenjc (<email address hidden>) on branch: master
Review: https://review.openstack.org/87150

Joe Gordon (jogo)
Changed in nova:
status: In Progress → Opinion