nova-compute should stop handling virt lifecycle events when it's shutting down

Bug #1444630 reported by Matt Riedemann
This bug affects 2 people
Affects                    Status        Importance  Assigned to         Milestone
OpenStack Compute (nova)   Fix Released  Medium      Matt Riedemann
  Juno                     Fix Released  Medium      Vladik Romanovsky
  Kilo                     Fix Released  Medium      Matt Riedemann

Bug Description

This is a follow on to bug 1293480 and related to bug 1408176 and bug 1443186.

There can be a race when rebooting a compute host: libvirt shuts down the guest VMs and sends STOPPED lifecycle events up to nova-compute, which then tries to stop them via the stop API. That sometimes works and sometimes doesn't - the compute service can go down with a vm_state of ACTIVE and a task_state of powering-off, which isn't resolved on host reboot.

Sometimes the stop API completes and the instance is stopped with power_state=4 (shutdown) in the nova database. When the host comes back up and libvirt restarts, it starts the guest VMs, which sends the STARTED lifecycle event, and nova handles that; but because the vm_state in the nova database is STOPPED while the power_state from the hypervisor is 1 (running), nova thinks the instance started up unexpectedly and stops it:

http://git.openstack.org/cgit/openstack/nova/tree/nova/compute/manager.py?id=2015.1.0rc1#n6145

So nova shuts the running guest down.
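
To make the conflict concrete, here is a simplified, illustrative-only sketch of that decision (the function and argument names are made up for this example; the real logic lives in ComputeManager._sync_instance_power_state):

    # power_state constants as used by nova (1 == RUNNING, 4 == SHUTDOWN)
    RUNNING = 1
    SHUTDOWN = 4

    def sync_instance_power_state(db_vm_state, hv_power_state, stop_instance):
        """Simplified view of the sync behaviour described above."""
        if db_vm_state == 'stopped' and hv_power_state == RUNNING:
            # The DB snapshot wins: the guest that libvirt just restarted
            # gets powered off again via the stop API, even though the
            # hypervisor is supposed to be the authority on power state.
            stop_instance()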

Actually the block in:

http://git.openstack.org/cgit/openstack/nova/tree/nova/compute/manager.py?id=2015.1.0rc1#n6145

conflicts with the statement in power_state.py:

http://git.openstack.org/cgit/openstack/nova/tree/nova/compute/power_state.py?id=2015.1.0rc1#n19

"The hypervisor is always considered the authority on the status
of a particular VM, and the power_state in the DB should be viewed as a
snapshot of the VMs's state in the (recent) past."

Anyway, that's a different issue, but the point is that when nova-compute is shutting down it should stop accepting lifecycle events from the hypervisor (virt driver code), since it can't reliably act on them anyway - any sync-up that needs to happen can be left to init_host() in the compute manager.
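
A rough sketch of that direction (hedged - this is not the actual patch, just an illustration assuming the virt driver exposes a register_event_listener() hook for lifecycle events):

    class ComputeManagerSketch(object):
        def __init__(self, driver, host):
            self.driver = driver
            self.host = host

        def init_host(self):
            # Normal startup: subscribe to lifecycle events from the virt
            # driver and reconcile any state drift here.
            self.driver.register_event_listener(self.handle_events)

        def cleanup_host(self):
            # Service shutdown: unhook the callback so late STOPPED/STARTED
            # events are dropped instead of triggering stop API calls.
            self.driver.register_event_listener(None)
            self.driver.cleanup_host(host=self.host)

        def handle_events(self, event):
            pass  # power-state sync, omitted in this sketch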

Matt Riedemann (mriedem)
Changed in nova:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Matt Riedemann (mriedem)
tags: added: kilo-backport-potential
Revision history for this message
Matt Riedemann (mriedem) wrote :

Attaching some logs sent from someone at IBM who recreated this on Juno with a debug patch (https://review.openstack.org/#/c/169782/) for logging:

Hi, I finished another round of testing, this time all the VMs were in SHUTOFF state after hypervisor reboot (), here are the key time points in the log file:
13:41:47 Triggered hypervisor reboot, "Emitting event" arrived
13:45:33 Nova compute server started after hypervisor started up
13:46:25 Finished VM state sync up
For more details please check the attached log file: compute_kvm_reboot.2.log.zip
Thanks!

================================ Before host reboot: =============================
================= on kvm001 node Before KVM Reboot ===================
[root@hkg02kvm001ccz023 ~]# date
Wed Apr 15 13:39:32 UTC 2015
[root@hkg02kvm001ccz023 ~]# virsh list
Id Name State
----------------------------------------------------
3 instance-000000a2 running
4 instance-00000058 running

================= on controller node Before KVM Reboot ===================
[root@hkg02ops001ccz023 ~]# date
Wed Apr 15 13:39:52 UTC 2015
[root@hkg02ops001ccz023 ~]# nova list
+--------------------------------------+-------+---------+------------+-------------+---------------------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+-------+---------+------------+-------------+---------------------------------------+
| e53dcdcd-1e19-4a89-8648-1373b4e29e6a | zy001 | ACTIVE | - | Running | Shared-Custom-Network1=192.168.100.18 |
| 3bcdec02-bb42-4eb7-bfca-eca1686f735b | zy002 | SHUTOFF | - | Shutdown | Shared-Custom-Network1=192.168.100.19 |
| e0638150-6ef0-4e98-884d-fb4cfda140a3 | zy004 | SHUTOFF | - | Shutdown | Shared-Custom-Network1=192.168.100.21 |
| 793cd8ba-fcb2-4e42-8b83-fcb8bdf519e6 | zy005 | ACTIVE | - | Running | Shared-Custom-Network1=192.168.100.25 |
+--------------------------------------+-------+---------+------------+-------------+---------------------------------------+

================================ After host reboot: =============================
================= on kvm001 After KVM Reboot ===================
[root@hkg02kvm001ccz023 ~]# date
Wed Apr 15 13:47:46 UTC 2015
[root@hkg02kvm001ccz023 ~]# virsh list
Id Name State
----------------------------------------------------

[root@hkg02kvm001ccz023 ~]# nova list
+--------------------------------------+-------+---------+------------+-------------+---------------------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+-------+---------+------------+-------------+---------------------------------------+
| e53dcdcd-1e19-4a89-8648-1373b4e29e6a | zy001 | SHUTOFF | - | Shutdown | Shared-Custom-Network1=192.168.100.18 |
| 3bcdec02-bb42-4eb7-bfca-eca1686f735b | zy002 | SHUTOFF | - | Shutdown | Shared-Custom-Network1=192.168.100.19...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/174069

Changed in nova:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/174069
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d1fb8d0fbdd6cb95c43b02f754409f1c728e8cd0
Submitter: Jenkins
Branch: master

commit d1fb8d0fbdd6cb95c43b02f754409f1c728e8cd0
Author: Matt Riedemann <email address hidden>
Date: Wed Apr 15 11:51:26 2015 -0700

    compute: stop handling virt lifecycle events in cleanup_host()

    When rebooting a compute host, guest VMs can be getting shutdown
    automatically by the hypervisor and the virt driver is sending events to
    the compute manager to handle them. If the compute service is still up
    while this happens it will try to call the stop API to power off the
    instance and update the database to show the instance as stopped.

    When the compute service comes back up and events come in from the virt
    driver that the guest VMs are running, nova will see that the vm_state
    on the instance in the nova database is STOPPED and shut down the
    instance by calling the stop API (basically ignoring what the virt
    driver / hypervisor tells nova is the state of the guest VM).

    Alternatively, if the compute service shuts down after changing the
    instance task_state to 'powering-off' but before the stop API cast is
    complete, the instance can be in a strange vm_state/task_state
    combination that requires the admin to manually reset the task_state to
    recover the instance.

    Let's just try to avoid some of this mess by disconnecting the event
    handling when the compute service is shutting down like we do for
    neutron VIF plugging events. There could still be races here if the
    compute service is shutting down after the hypervisor (e.g. libvirtd),
    but this is at least a best attempt to mitigate the potential
    damage.

    Closes-Bug: #1444630
    Related-Bug: #1293480
    Related-Bug: #1408176

    Change-Id: I1a321371dff7933cdd11d31d9f9c2a2f850fd8d9

Changed in nova:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/174477

tags: added: kilo-rc-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/kilo)

Reviewed: https://review.openstack.org/174477
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b19764d2c6a8160102a806c1d6811c4182a8bac8
Submitter: Jenkins
Branch: stable/kilo

commit b19764d2c6a8160102a806c1d6811c4182a8bac8
Author: Matt Riedemann <email address hidden>
Date: Wed Apr 15 11:51:26 2015 -0700

    compute: stop handling virt lifecycle events in cleanup_host()

    When rebooting a compute host, guest VMs can be getting shutdown
    automatically by the hypervisor and the virt driver is sending events to
    the compute manager to handle them. If the compute service is still up
    while this happens it will try to call the stop API to power off the
    instance and update the database to show the instance as stopped.

    When the compute service comes back up and events come in from the virt
    driver that the guest VMs are running, nova will see that the vm_state
    on the instance in the nova database is STOPPED and shut down the
    instance by calling the stop API (basically ignoring what the virt
    driver / hypervisor tells nova is the state of the guest VM).

    Alternatively, if the compute service shuts down after changing the
    instance task_state to 'powering-off' but before the stop API cast is
    complete, the instance can be in a strange vm_state/task_state
    combination that requires the admin to manually reset the task_state to
    recover the instance.

    Let's just try to avoid some of this mess by disconnecting the event
    handling when the compute service is shutting down like we do for
    neutron VIF plugging events. There could still be races here if the
    compute service is shutting down after the hypervisor (e.g. libvirtd),
    but this is at least a best attempt to mitigate the potential
    damage.

    Closes-Bug: #1444630
    Related-Bug: #1293480
    Related-Bug: #1408176

    Change-Id: I1a321371dff7933cdd11d31d9f9c2a2f850fd8d9
    (cherry picked from commit d1fb8d0fbdd6cb95c43b02f754409f1c728e8cd0)

Thierry Carrez (ttx)
tags: removed: kilo-backport-potential kilo-rc-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/159275
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d09785b97a282e8538642f6f8bcdd8491197ed74
Submitter: Jenkins
Branch: master

commit d09785b97a282e8538642f6f8bcdd8491197ed74
Author: Matt Riedemann <email address hidden>
Date: Wed Feb 25 14:13:45 2015 -0800

    Add config option to disable handling virt lifecycle events

    Historically the _sync_power_states periodic task has had the potential
    for race conditions and several changes have been made to try and
    tighten up this code:

    cc5388bbe81aba635fb757e202d860aeed98f3e8
    aa1792eb4c1d10e9a192142ce7e20d37871d916a
    baabab45e0ae0e9e35872cae77eb04bdb5ee0545
    bd8329b34098436d18441a8129f3f20af53c2b91

    The handle_lifecycle_events method which gets power state change events
    from the compute driver (currently only implemented by the libvirt
    driver) and calls _sync_instance_power_state - the same method that the
    _sync_power_states periodic task uses, except the periodic task at least
    locks when it's running - expands the scope for race problems in the
    compute manager so cloud providers should be able to turn it off. It is
    also known to have races with reboot where rebooted instances are
    automatically shut down because of delayed lifecycle events reporting
    that the instance is stopped even though it's running.

    This is consistent with the view that Nova should manage its own state
    and not rely on external events telling it what to do about state
    changes. For example, in _sync_instance_power_state, if the Nova
    database thinks an instance is stopped but the hypervisor says it's
    running, the compute manager issues a force-stop on the instance.

    Also, although not documented (at least from what I can find), Nova has
    historically held a stance that it does not support out-of-band
    discovery and management of instances, so allowing external events to
    change state somewhat contradicts that stance and should be at least a
    configurable deployment option.

    DocImpact: New config option "handle_virt_lifecycle_events" in the
               DEFAULT group of nova.conf. By default the value is True
               so there is no upgrade impact or change in functionality.

    Related-Bug: #1293480
    Partial-Bug: #1443186
    Partial-Bug: #1444630

    Change-Id: I26a1bc70939fb40dc38e9c5c43bf58ed1378bcc7
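
For illustration, here is a minimal oslo.config-style sketch (assumed helper names and layout, not the actual nova module) of how such a boolean flag can gate whether lifecycle events are handled at all:

    from oslo_config import cfg

    CONF = cfg.CONF
    CONF.register_opts([
        cfg.BoolOpt('handle_virt_lifecycle_events', default=True,
                    help='Whether the compute manager acts on power-state '
                         'events emitted by the virt driver.'),
    ])

    def init_virt_events(manager):
        if CONF.handle_virt_lifecycle_events:
            manager.driver.register_event_listener(manager.handle_events)
        # Otherwise rely on the _sync_power_states periodic task and
        # init_host() to reconcile instance state with the hypervisor.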

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/179284

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/179284
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5228d4e418734164ffa5ccd91d2865d9cc659c00
Submitter: Jenkins
Branch: master

commit 906ab9d6522b3559b4ad36d40dec3af20397f223
Author: He Jie Xu <email address hidden>
Date: Thu Apr 16 07:09:34 2015 +0800

    Update rpc version aliases for kilo

    Update all of the rpc client API classes to include a version alias
    for the latest version implemented in Kilo. This alias is needed when
    doing rolling upgrades from Kilo to Liberty. With this in place, you can
    ensure all services only send messages that both Kilo and Liberty will
    understand.

    Closes-Bug: #1444745

    Conflicts:
     nova/conductor/rpcapi.py

    NOTE(alex_xu): The conflict is due to there are some logs already added
    into the master.

    Change-Id: I2952aec9aae747639aa519af55fb5fa25b8f3ab4
    (cherry picked from commit 78a8b5802ca148dcf37c5651f75f2126d261266e)

commit f191a2147a21c7e50926b288768a96900cf4c629
Author: Hans Lindgren <email address hidden>
Date: Fri Apr 24 13:10:39 2015 +0200

    Add security group calls missing from latest compute rpc api version bump

    The recent compute rpc api version bump missed out on the security group
    related calls that are part of the api.

    One possible reason is that both compute and security group client side
    rpc api:s share a single target, which is of little value and only cause
    mistakes like this.

    This change eliminates future problems like this by combining them into
    one to get a 1:1 relationship between client and server api:s.

    Change-Id: I9207592a87fab862c04d210450cbac47af6a3fd7
    Closes-Bug: #1448075
    (cherry picked from commit bebd00b117c68097203adc2e56e972d74254fc59)

commit a2872a9262985bd0ee2c6df4f7593947e0516406
Author: Dan Smith <email address hidden>
Date: Wed Apr 22 09:02:03 2015 -0700

    Fix migrate_flavor_data() to catch instances with no instance_extra rows

    The way the query was being performed previously, we would not see any
    instances that didn't have a row in instance_extra. This could happen if
    an instance hasn't been touched for several releases, or if the data
    set is old.

    The fix is a simple change to use outerjoin instead of join. This patch
    includes a test that ensures that instances with no instance_extra rows
    are included in the migration. If we query an instance without such a
    row, we create it before doing a save on the instance.

    Closes-Bug: #1447132
    Change-Id: I2620a8a4338f5c493350f26cdba3e41f3cb28de7
    (cherry picked from commit 92714accc49e85579f406de10ef8b3b510277037)

commit e3a7b83834d1ae2064094e9613df75e3b07d77cd
Author: OpenStack Proposal Bot <email address hidden>
Date: Thu Apr 23 02:18:41 2015 +0000

    Updated from global requirements

    Change-Id: I5d4acd36329fe2dccb5772fed3ec55b442597150

commit 8c9b5e620eef3233677b64cd234ed2551e6aa182
Author: Divya <email address hidden>
Date: Tue Apr 21 08:26:29 2015 +0200

    Control create/delete flavor api permissions using policy.json

    The permissions of ...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/192244

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/juno)

Reviewed: https://review.openstack.org/192244
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=7bc4be781564c6b9e7a519aecea84ddbee6bd935
Submitter: Jenkins
Branch: stable/juno

commit 7bc4be781564c6b9e7a519aecea84ddbee6bd935
Author: Matt Riedemann <email address hidden>
Date: Wed Apr 15 11:51:26 2015 -0700

    compute: stop handling virt lifecycle events in cleanup_host()

    When rebooting a compute host, guest VMs can be getting shutdown
    automatically by the hypervisor and the virt driver is sending events to
    the compute manager to handle them. If the compute service is still up
    while this happens it will try to call the stop API to power off the
    instance and update the database to show the instance as stopped.

    When the compute service comes back up and events come in from the virt
    driver that the guest VMs are running, nova will see that the vm_state
    on the instance in the nova database is STOPPED and shut down the
    instance by calling the stop API (basically ignoring what the virt
    driver / hypervisor tells nova is the state of the guest VM).

    Alternatively, if the compute service shuts down after changing the
    instance task_state to 'powering-off' but before the stop API cast is
    complete, the instance can be in a strange vm_state/task_state
    combination that requires the admin to manually reset the task_state to
    recover the instance.

    Let's just try to avoid some of this mess by disconnecting the event
    handling when the compute service is shutting down like we do for
    neutron VIF plugging events. There could still be races here if the
    compute service is shutting down after the hypervisor (e.g. libvirtd),
    but this is at least a best attempt to mitigate the potential
    damage.

    Closes-Bug: #1444630
    Related-Bug: #1293480
    Related-Bug: #1408176

    Conflicts:
     nova/compute/manager.py
     nova/tests/unit/compute/test_compute_mgr.py

    Change-Id: I1a321371dff7933cdd11d31d9f9c2a2f850fd8d9
    (cherry picked from commit d1fb8d0fbdd6cb95c43b02f754409f1c728e8cd0)

tags: added: in-stable-juno
Thierry Carrez (ttx)
Changed in nova:
milestone: none → liberty-1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: liberty-1 → 12.0.0
Revision history for this message
Marian Horban (mhorban) wrote :

Libvirt event threads are not stopped when the nova-compute service is stopped. That's why, when nova-compute is restarted with the SIGHUP signal, we can see this traceback:

2015-11-30 10:03:06.013 INFO nova.service [-] Starting compute node (version 13.0.0)
2015-11-30 10:03:06.013 DEBUG nova.virt.libvirt.host [-] Starting native event thread from (pid=17505) _init_events /opt/stack/nova/nova/virt/libvirt/host.py:452
2015-11-30 10:03:06.014 DEBUG nova.virt.libvirt.host [-] Starting green dispatch thread from (pid=17505) _init_events /opt/stack/nova/nova/virt/libvirt/host.py:458
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/poll.py", line 115, in wait
    listener.cb(fileno)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 214, in main
    result = function(*args, **kwargs)
  File "/opt/stack/nova/nova/utils.py", line 1158, in context_wrapper
    return func(*args, **kwargs)
  File "/opt/stack/nova/nova/virt/libvirt/host.py", line 248, in _dispatch_thread
    self._dispatch_events()
  File "/opt/stack/nova/nova/virt/libvirt/host.py", line 353, in _dispatch_events
    assert _c
AssertionError
Removing descriptor: 9

The threads that were started should be stopped when the nova-compute service shuts down.
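
A hedged sketch of one way to do that (hypothetical helper, not the nova libvirt Host implementation): give the dispatch loop a shutdown flag that the service's stop path can set and join on.

    import threading

    class EventDispatcherSketch(object):
        """Hypothetical stand-in for the libvirt event dispatch thread."""

        def __init__(self):
            self._shutdown = threading.Event()
            self._thread = None

        def start(self):
            self._thread = threading.Thread(target=self._dispatch_loop)
            self._thread.daemon = True
            self._thread.start()

        def _dispatch_loop(self):
            while not self._shutdown.is_set():
                # poll and dispatch pending libvirt events here (omitted)
                self._shutdown.wait(0.5)

        def stop(self):
            # Called from the service's stop/cleanup path so that a SIGHUP
            # restart does not race with stale dispatch threads.
            self._shutdown.set()
            if self._thread is not None:
                self._thread.join()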

Changed in nova:
status: Fix Released → In Progress
assignee: Matt Riedemann (mriedem) → Marian Horban (mhorban)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/259066

Changed in nova:
assignee: Marian Horban (mhorban) → nobody
status: In Progress → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/259066
Reason: Please open a new bug for tracking this rather than re-opening something that was already marked as fixed.

Matt Riedemann (mriedem)
Changed in nova:
status: Confirmed → Fix Released
assignee: nobody → Matt Riedemann (mriedem)