VMware: running a redundant nova compute deletes running instances

Bug #1419785 reported by Gary Kotton
This bug affects 3 people
Affects                    Status        Importance  Assigned to   Milestone
OpenStack Compute (nova)   Fix Released  High        Gary Kotton
Juno                       Fix Released  Undecided   Unassigned

Bug Description

When more than one nova-compute is configured to manage the same cluster, rebooting one of the computes will delete all running instances.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/154029

Changed in nova:
assignee: nobody → Gary Kotton (garyk)
status: New → In Progress
Gary Kotton (garyk)
Changed in nova:
importance: Undecided → Critical
Revision history for this message
Matthew Booth (mbooth-9) wrote :

Some context: this happens because _destroy_evacuated_instances in compute.manager does (lightly edited for clarity):

        # Every instance the hypervisor reports, regardless of which
        # compute service owns it in the database.
        local_instances = self._get_instances_on_driver(context, filters)
        for instance in local_instances:
            # instance.host != self.host is taken to mean the instance
            # was evacuated elsewhere, so the local copy is destroyed.
            if instance.host != self.host:
                ...DESTROY...

The only instances which will be destroyed are the ones for which instance.host != self.host.

The meaning of self.host in this context appears to be 'hypervisor'. However, self.host is also a service endpoint. Historically there was a one-to-one relationship between the two, but there are now a couple of drivers for which this no longer makes sense.

I think the correct fix for this would be something like adding driver.get_hypervisor_id() which returns a driver-specific identifier for the hypervisor location. Instance.host would then be set to this value. HA nova instances would then ensure that this returned the same value for all Novas managing the same hypervisor.
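
A minimal sketch of what that interface could look like, assuming a hypothetical get_hypervisor_id() method (it does not exist in nova today; all names below are illustrative):

    import socket

    class ComputeDriver(object):
        def get_hypervisor_id(self):
            # Drivers with a one-to-one service/hypervisor mapping keep
            # today's behaviour and identify themselves by service host.
            return socket.gethostname()

    class VMwareVCDriver(ComputeDriver):
        def __init__(self, cluster_moref):
            self._cluster_moref = cluster_moref

        def get_hypervisor_id(self):
            # Every nova-compute managing the same vCenter cluster
            # returns the same stable identifier, so instance.host is
            # identical on all of them.
            return 'vcenter-cluster-%s' % self._cluster_moref

With instance.host populated from get_hypervisor_id() rather than the service host, any of the redundant computes would recognise the cluster's instances as its own.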

However, that's a spec and a bunch of work, and this is a critical issue.

Note that there is no problem in the above code if the active and standby node have the same value of self.host. The immediate workaround would seem to be to configure the active and standby nodes accordingly. This would presumably assume simultaneous failover of dns/ip.
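
Concretely, that workaround means pinning nova's host option (which otherwise defaults to the system hostname) to the same value on both nodes. A minimal nova.conf sketch, with an illustrative hostname:

    [DEFAULT]
    # Must be identical on the active and standby nova-compute so that
    # instance.host == self.host regardless of which node is running.
    host = vmware-cluster1-compute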

For this specific issue, I would prefer to see a solution which is able to detect this situation and refuse to start Nova.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/154907

Changed in nova:
assignee: Gary Kotton (garyk) → Matthew Booth (mbooth-9)
Revision history for this message
Gary Kotton (garyk) wrote :

The fix that Matt suggests is not viable in my opinion. It does not support HA at all, which kind of defeats the purpose.

Revision history for this message
Matthew Booth (mbooth-9) wrote :

Supporting HA isn't in the scope of this Critical bug. Nova doesn't currently support HA, and adding it will require a spec and a significant amount of work. This fix will address the bug as described here.

That said, I believe HA will still be supportable if all HA nodes have the same hostname.

tags: added: vmware
Changed in nova:
assignee: Matthew Booth (mbooth-9) → Gary Kotton (garyk)
Revision history for this message
Michael Still (mikal) wrote :

Cannot be critical, as it affects a single driver.

Changed in nova:
importance: Critical → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matthew Booth (<email address hidden>) on branch: master
Review: https://review.openstack.org/154907
Reason: I think this can be better done in the DB, which is cleaner and applies to all drivers.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/158269

Changed in nova:
assignee: Gary Kotton (garyk) → Matthew Booth (mbooth-9)
Revision history for this message
Matthew Booth (mbooth-9) wrote :

Michael, I believe this bug meets the definition of critical here:

https://wiki.openstack.org/wiki/BugTriage#Task_2:_Prioritize_confirmed_bugs_.28bug_supervisors.29

because it results in data loss. Severe data loss, in fact. It also affects both the VMware and Ironic drivers.

Changed in nova:
importance: High → Critical
Revision history for this message
Joe Gordon (jogo) wrote :

Critical is for things that impact all users.

Changed in nova:
importance: Critical → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/159890

Changed in nova:
assignee: Matthew Booth (mbooth-9) → Dan Smith (danms)
Changed in nova:
assignee: Dan Smith (danms) → Gary Kotton (garyk)
Changed in nova:
assignee: Gary Kotton (garyk) → Matthew Booth (mbooth-9)
Changed in nova:
assignee: Matthew Booth (mbooth-9) → Dan Smith (danms)
Changed in nova:
assignee: Dan Smith (danms) → Gary Kotton (garyk)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/159890
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=922148ac45c5a70da8969815b4f47e3c758d6974
Submitter: Jenkins
Branch: master

commit 922148ac45c5a70da8969815b4f47e3c758d6974
Author: Dan Smith <email address hidden>
Date: Fri Feb 27 07:30:10 2015 -0800

    Allow disabling the evacuate cleanup mechanism in compute manager

    This mechanism attempts to destroy any locally-running instances on
    startup if instance.host != self.host. The assumption is that the
    instance has been evacuated and is safely running elsewhere. This is
    a dangerous assumption to make, so this patch adds a configuration
    variable to disable this behavior if it's not desired.

    Note that disabling it may have implications for the case where
    instances *were* evacuated, given potential shared resources.
    To counter that problem, this patch also makes _init_instance()
    skip initialization of the instance if it appears to be owned
    by another host, logging a prominent warning in that case.

    As a result, if you have destroy_after_evacuate=False and you start
    a nova compute with an incorrect hostname, or run it twice from
    another host, then the worst that will happen is you get log
    warnings about the instances on the host being ignored. This should
    be an indication that something is wrong, but still allow for
    fixing it without any loss. If the configuration option is disabled
    and a legitimate evacuation does occur, simply enabling it and then
    restarting the compute service will cause the cleanup to occur.

    This is added to the workarounds config group because it is really
    only relevant while evacuate is fundamentally broken in this way.
    It needs to be refactored to be more robust, and once that is done,
    this should be able to go away.

    DocImpact: New configuration option, and peril warning
    Partial-Bug: #1419785
    Change-Id: Ib9a3c72c096822dd5c65c905117ae14994c73e99
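
For reference, disabling the new mechanism is a one-line change in nova.conf; the option lives in the workarounds group per the commit above, and the default (True) preserves the existing cleanup behaviour:

    [workarounds]
    # Skip the startup cleanup that destroys local instances whose
    # instance.host differs from this service's host.
    destroy_after_evacuate = False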

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matthew Booth (<email address hidden>) on branch: master
Review: https://review.openstack.org/158269
Reason: Fucked if I know why.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/174779

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by garyk (<email address hidden>) on branch: master
Review: https://review.openstack.org/154029
Reason: Need to discuss this

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/juno)

Reviewed: https://review.openstack.org/174779
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6f1f9dbc211356a3d0e2d46d3a984d7ceee79ca6
Submitter: Jenkins
Branch: stable/juno

commit 6f1f9dbc211356a3d0e2d46d3a984d7ceee79ca6
Author: Tony Breeds <email address hidden>
Date: Tue Jan 27 11:17:54 2015 -0800

    Allow disabling the evacuate cleanup mechanism in compute manager

    This mechanism attempts to destroy any locally-running instances on
    startup if instance.host != self.host. The assumption is that the
    instance has been evacuated and is safely running elsewhere. This is
    a dangerous assumption to make, so this patch adds a configuration
    variable to disable this behavior if it's not desired.

    Note that disabling it may have implications for the case where
    instances *were* evacuated, given potential shared resources.
    To counter that problem, this patch also makes _init_instance()
    skip initialization of the instance if it appears to be owned
    by another host, logging a prominent warning in that case.

    As a result, if you have destroy_after_evacuate=False and you start
    a nova compute with an incorrect hostname, or run it twice from
    another host, then the worst that will happen is you get log
    warnings about the instances on the host being ignored. This should
    be an indication that something is wrong, but still allow for
    fixing it without any loss. If the configuration option is disabled
    and a legitimate evacuation does occur, simply enabling it and then
    restarting the compute service will cause the cleanup to occur.

    This is added to the workarounds config group because it is really
    only relevant while evacuate is fundamentally broken in this way.
    It needs to be refactored to be more robust, and once that is done,
    this should be able to go away.

    Conflicts:
            nova/compute/manager.py
            nova/tests/unit/compute/test_compute.py
            nova/tests/unit/compute/test_compute_mgr.py
            nova/utils.py

    NOTE: In nova/utils.py a new section has been introduced but
    only the option addressed by this backport has been included.

    DocImpact: New configuration option, and peril warning
    Partial-Bug: #1419785
    (cherry picked from commit 922148ac45c5a70da8969815b4f47e3c758d6974)

    -- squashed with commit --

    Create a 'workarounds' config group.

    This group exists for very specific cases.

    If you're:
    - working around an issue in a system tool (e.g. libvirt or qemu) where the
      fix is in flight/discussed in that community, or
    - dealing with a tool that can be/is fixed in some distributions, where rather
      than patching the code those distributions can trivially set a config option
      to get the "correct" behavior,
    this is a good place for your workaround.

    (cherry picked from commit b1689b58409ab97ef64b8cec2ba3773aacca7ac5)

    --

    Change-Id: Ib9a3c72c096822dd5c65c905117ae14994c73e99

tags: added: in-stable-juno
Revision history for this message
Dan Smith (danms) wrote :

The actual fix for this is to make nova-compute not attempt to delete instances unless they were actually evacuated. That fix has been committed here, so this should now be resolved:

https://review.openstack.org/#/c/183354
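
In rough terms, evacuations now leave a migration record behind, and the startup cleanup only considers instances that actually have one. A minimal sketch of that decision logic (illustrative names only, not the merged code):

    def should_destroy_on_startup(instance, service_host, evacuated_uuids):
        # evacuated_uuids: UUIDs of instances with an evacuation
        # migration record naming this host as the source.
        return (instance.uuid in evacuated_uuids
                and instance.host != service_host)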

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → liberty-rc1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: liberty-rc1 → 12.0.0