Allow disabling the evacuate cleanup mechanism in compute manager

Bug #1461459 reported by OpenStack Infra
This bug affects 1 person
Affects                    Status        Importance  Assigned to       Milestone
OpenStack Compute (nova)   Invalid       Undecided   Unassigned
openstack-manuals          Fix Released  High        Alexandra Settle

Bug Description

https://review.openstack.org/174779
commit 6f1f9dbc211356a3d0e2d46d3a984d7ceee79ca6
Author: Tony Breeds <email address hidden>
Date: Tue Jan 27 11:17:54 2015 -0800

    Allow disabling the evacuate cleanup mechanism in compute manager

    This mechanism attempts to destroy any locally-running instances on
    startup if instance.host != self.host. The assumption is that the
    instance has been evacuated and is safely running elsewhere. This is
    a dangerous assumption to make, so this patch adds a configuration
    variable to disable this behavior if it's not desired.

    Note that disabling it may have implications for the case where
    instances *were* evacuated, given potential shared resources.
    To counter that problem, this patch also makes _init_instance()
    skip initialization of the instance if it appears to be owned
    by another host, logging a prominent warning in that case.

    As a result, if you have destroy_after_evacuate=False and you start
    a nova compute with an incorrect hostname, or run it twice from
    another host, then the worst that will happen is you get log
    warnings about the instances on the host being ignored. This should
    be an indication that something is wrong, but still allow for
    fixing it without any loss. If the configuration option is disabled
    and a legitimate evacuation does occur, simply enabling it and then
    restarting the compute service will cause the cleanup to occur.

    This is added to the workarounds config group because it is really
    only relevant while evacuate is fundamentally broken in this way.
    It needs to be refactored to be more robust, and once that is done,
    this should be able to go away.

    Conflicts:
            nova/compute/manager.py
            nova/tests/unit/compute/test_compute.py
            nova/tests/unit/compute/test_compute_mgr.py
            nova/utils.py

    NOTE: In nova/utils.py a new section has been introduced but
    only the option addressed by this backport has been included.

    DocImpact: New configuration option, and peril warning
    Partial-Bug: #1419785
    (cherry picked from commit 922148ac45c5a70da8969815b4f47e3c758d6974)

    -- squashed with commit --

    Create a 'workarounds' config group.

    This group is for very specific reasons.

    Use it if:
    - You are working around an issue in a system tool (e.g. libvirt or qemu) where
      the fix is in flight/discussed in that community.
    - The tool can be/is fixed in some distributions, and rather than patching the
      code, those distributions can trivially set a config option to get the
      "correct" behavior.
    In those cases, this is a good place for your workaround.

    (cherry picked from commit b1689b58409ab97ef64b8cec2ba3773aacca7ac5)

    --

    Change-Id: Ib9a3c72c096822dd5c65c905117ae14994c73e99
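
The startup behavior described in the commit message above can be sketched roughly as follows. This is a minimal illustration and not the actual nova code: the `Instance` class, the `handle_local_instances` function, and the `destroy` callback are hypothetical names introduced here. Only the `instance.host != self.host` comparison and the `destroy_after_evacuate` option come from the commit message.

```python
import logging
from dataclasses import dataclass

LOG = logging.getLogger(__name__)


@dataclass
class Instance:
    # Hypothetical stand-in for a nova instance record.
    uuid: str
    host: str  # host that the database says owns this instance


def handle_local_instances(local_instances, my_host, destroy_after_evacuate, destroy):
    """Decide what to do with each locally running instance on startup.

    Returns three lists of uuids: (destroyed, skipped, initialized).
    """
    destroyed, skipped, initialized = [], [], []
    for inst in local_instances:
        if inst.host == my_host:
            # Owned by this host: normal initialization.
            initialized.append(inst.uuid)
        elif destroy_after_evacuate:
            # Assumed to have been evacuated elsewhere: clean up locally.
            destroy(inst)
            destroyed.append(inst.uuid)
        else:
            # Cleanup disabled: leave the instance alone and warn loudly.
            LOG.warning(
                "Instance %s appears to be owned by host %s, not %s; ignoring it",
                inst.uuid, inst.host, my_host,
            )
            skipped.append(inst.uuid)
    return destroyed, skipped, initialized
```

With the option set to False, a compute service started with the wrong hostname would only log warnings in this sketch. Turning the option back on and calling the function again would then perform the cleanup, which mirrors the "enable it and restart" recovery path the commit message describes.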

tags: added: autogenerate-config-docs config-reference
Changed in openstack-manuals:
status: New → Confirmed
importance: Undecided → High
milestone: none → liberty
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-manuals (master)

Fix proposed to branch: master
Review: https://review.openstack.org/203429

Changed in openstack-manuals:
assignee: nobody → Gauvain Pocentek (gpocentek)
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-manuals (master)

Reviewed: https://review.openstack.org/203429
Committed: https://git.openstack.org/cgit/openstack/openstack-manuals/commit/?id=cdebb041d5893c0f17cc7c0fc7b891b25bb3cf55
Submitter: Jenkins
Branch: master

commit cdebb041d5893c0f17cc7c0fc7b891b25bb3cf55
Author: Gauvain Pocentek <email address hidden>
Date: Sun Jul 19 17:00:01 2015 +0200

    [config-ref] Nova option tables update

    Partial-Bug: #1472417
    Closes-Bug: #1465841
    Partial-Bug: #1461459
    Partial-Bug: #1454356
    Closes-Bug: #1450002

    Change-Id: I1ce5933ce20d2021f4286ca965823483940157fe

Tom Fifield (fifieldt)
Changed in openstack-manuals:
status: In Progress → Triaged
assignee: Gauvain Pocentek (gpocentek) → nobody
Atsushi SAKAI (sakaia) wrote :

Tom
Could you describe what remains to be done for this issue?
  From what I see below, the config option is already documented.
  Are you concerned about the "peril warning"?

> DocImpact: New configuration option, and peril warning

Tom Fifield (fifieldt) wrote :

Hi,

I moved it back to Triaged because Gauvain's patch only said "Partial-Bug" for this bug.

Changed in openstack-manuals:
milestone: liberty → mitaka
Changed in openstack-manuals:
assignee: nobody → khushbu (khushbuparakh)
Gauvain Pocentek (gpocentek) wrote :

The 'Partial-Bug' means that the configuration option is documented in the config-ref. Since a wrong configuration could lead to unwanted behavior, I believe that some kind of warning should probably be added somewhere in the docs.

Matt Riedemann (mriedem) wrote :

I think the DocImpact in the nova change was probably just to get the config options docs updated with the new workaround option.

If there is anything else we could do with this, it could be to note in the docs related to evacuate operations that if you're running nova < liberty, there is a potential data loss issue with the evacuate functionality if you don't have that patch and don't set the option appropriately.

For example:

http://docs.openstack.org/user-guide-admin/cli_nova_evacuate.html

http://docs.openstack.org/admin-guide-cloud/compute-node-down.html

There was a spec in liberty to make this smarter, but the existing problem description applies to nova compute nodes < liberty:

http://specs.openstack.org/openstack/nova-specs/specs/liberty/implemented/robustify_evacuate.html#problem-description

If the hostname changes on the compute node, or you have a typo in your configs (for example, multiple compute nodes managing the same vCenter and running at the same time), that evacuate code can delete your instances.

That's why setting workarounds.destroy_after_evacuate=False is a safe way around this until your computes are on liberty+. Leave the cleanup disabled unless you're sure you're cleaning up a failed compute node (a real evacuation rather than a misconfiguration or hostname change).
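
As a rough illustration of the setting mentioned above, the snippet below parses a nova.conf-style fragment and reads the option. It uses Python's standard `configparser` as a stand-in (nova itself loads its configuration differently), and `read_destroy_after_evacuate` is a hypothetical helper name. Falling back to True when the option is absent is an assumption, inferred from the commit message describing the option as a way to disable existing behavior.

```python
import configparser

# Fragment as it might appear in nova.conf on an affected compute node.
NOVA_CONF = """
[workarounds]
destroy_after_evacuate = False
"""


def read_destroy_after_evacuate(conf_text):
    """Return the boolean value of workarounds.destroy_after_evacuate."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # Assumption: the cleanup stays enabled when the option is not set.
    return parser.getboolean(
        "workarounds", "destroy_after_evacuate", fallback=True
    )


if __name__ == "__main__":
    print(read_destroy_after_evacuate(NOVA_CONF))
```

The `[workarounds]` group and the option name are taken from this bug report. Anything else in the file would be deployment-specific.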

Changed in nova:
status: New → Invalid
Anne Gentle (annegentle) wrote :

Place this as a warning in the docs:

If the hostname changes on the compute node, or you have a typo in your configs (for example, multiple compute nodes managing the same vCenter and running at the same time), that evacuate code can delete your instances.

Since user-guide-admin has recently been refactored, we might need to ask Joseph Robinson where this warning should go.

Anne Gentle (annegentle) wrote :
Changed in openstack-manuals:
milestone: mitaka → newton
Khushbuparakh (khushbuparakh) wrote :

https://review.openstack.org/309799

I am still working on it, to add more details to the troubleshooting file.

Changed in openstack-manuals:
status: Triaged → In Progress
Changed in openstack-manuals:
assignee: khushbu (khushbuparakh) → Christian Berendt (berendt)
Changed in openstack-manuals:
assignee: Christian Berendt (berendt) → Lana (loquacity)
Changed in openstack-manuals:
assignee: Lana (loquacity) → Alexandra Settle (alexandra-settle)
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/309799
Committed: https://git.openstack.org/cgit/openstack/openstack-manuals/commit/?id=f0fe0d40bcc4ce984e2ca2bddfc992dd7e256396
Submitter: Jenkins
Branch: master

commit f0fe0d40bcc4ce984e2ca2bddfc992dd7e256396
Author: khushbuparakh <email address hidden>
Date: Sun Apr 24 11:09:59 2016 -0500

    Adding peril warning

    Adding warning in compute node down. More content required
    in the troubleshooting section in order to fully close
    this bug.

    Change-Id: Ida409e1fc8b6c3112b07fb09bad65621894ab0c9
    Partial-Bug: #1461459

Alexandra Settle (alexandra-settle) wrote :

As suggested, the peril warning was successfully added in the docs but only noted as a partial bug.
I think this suffices as a fix.

Changed in openstack-manuals:
status: In Progress → Fix Released