Recent change overwrites Admin set disabled status on libvirt error

Bug #1250049 reported by Phil Day
22
This bug affects 4 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Medium
jan grant

Bug Description

A recent change https://review.openstack.org/#/c/52189/9 introduced the automatic disable / re-enable of nova-compute when connection to libvirt is lost and recovered.

While the idea is a good one the implementation means that any existing disabled status is lost (its very common on a large system for specific hosts to have been disabled by an administrator for a number of reasons, e.g. new servers still being commissioned, planned maintenance, reserved capacity, etc). As implemented this change will remove that disables status - returning nodes to the state where instances are scheduled to them even when the admin has explicit tried to prevent this.

Suggest that this change is backed out and replaced by an additional status value on each service so that there is separation between manual service enable/disable and automatic enable/disable based on detected errors.

Revision history for this message
Daniel Berrange (berrange) wrote :

Looking at the change it current just passes "" as the disable reason string when connection to libvirtd is lost.

It seems that we could just enhance the code to pass a pre-determined string "Libvirtd offline" and then when connection to libivrtd is re-established, we can check for that disable reason, to see if we should automatically re-enable it or not.

If we wanted to avoid the string reason comparison, then adding eithe boolean flag to to track automatic vs manual disablement could be an option.

I think this could be done as a followup patch - don't see a compelling reason to revert the existing patch

description: updated
Revision history for this message
Phil Day (philip-day) wrote :

Change reference should be: https://review.openstack.org/#/c/52189/

@Dan: My reading of the code is that at the moment the disable reason will be set to: "Connection to libvirt lost: %s"

My reason for saying this needs to be reverted in the short term is that as it stands it brakes a production system in a potentially dangerous way by re-enabling hosts that are meant to be kept disabled. If there is a quick fix, i.e. if adding logic on the contents of the reason string is going to be acceptable to yourself and Dan Smith then I can see that could be done as rapid fix.

However if we have to go down the route of adding additional state to the ServiceGroup API, which will mean a DB migration, then I don't think the change can stay in the system until that works its way through the review process.

jan grant (jan-grant)
Changed in nova:
assignee: nobody → jan grant (jan-grant)
status: New → In Progress
Revision history for this message
jan grant (jan-grant) wrote :

https://review.openstack.org/#/c/56224

This is an implementation of @Dan's suggestion. Automatically disabling a host will prepend a fixed string to the reason field. A host will only be automatically enabled if it was automatically disabled; operator disabling can/will override this behaviour.

Changed in nova:
importance: Undecided → Medium
tags: added: havana-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/56224
Committed: http://github.com/openstack/nova/commit/8fc9d8108e946013a9c58f0cc20d8b4e430e6286
Submitter: Jenkins
Branch: master

commit 8fc9d8108e946013a9c58f0cc20d8b4e430e6286
Author: Jan Grant <email address hidden>
Date: Wed Nov 13 10:43:40 2013 +0000

    Conditionalise automatic enabling of disabled host

    The change Ib8d67838ceb73c5b1cdc9498c17b335e9e5bb6f3 introduced
    automatic behaviour to the nova-compute hosts, which will now
    automatically disable themselves if they lose contact with libvirtd,
    and re-endable themselves on a successful reconnection.

    However, this tramples over any operational service disablement
    that may be in place (on an otherwise working host) for any number
    of good reasons.

    As suggested in bug #1250049, this is a quick fix which attaches
    a prefix to any host disable_reason string, and only re-enables
    a host if that string is still present.

    Fixes-bug: #1250049
    Change-Id: Ia7544a357d1385db5fa92e500cb691f49436cfc9

Changed in nova:
status: In Progress → Fix Committed
Changed in nova:
milestone: none → icehouse-1
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Matt Riedemann (mriedem)
tags: added: libvirt
Thierry Carrez (ttx)
Changed in nova:
milestone: icehouse-1 → 2014.1
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.