nova-live-migration evacuation fails if volumes created on subnode c-vol backend

Bug #1868234 reported by Lee Yarwood
This bug affects 1 person
Affects                   Status        Importance  Assigned to  Milestone
OpenStack Compute (nova)  Fix Released  High        Lee Yarwood
  Stein                   Fix Released  High        Lee Yarwood
  Train                   Fix Released  High        Lee Yarwood
  Ussuri                  Fix Released  High        Lee Yarwood

Bug Description

Description
===========

Change I8af2ad741ca08c3d88efb9aa817c4d1470491a23 started to correctly fence the subnode during evacuation testing. However, it missed that c-vol and g-api are also deployed on these nodes. As a result, boot-from-volume (BFV) evacuation testing fails if the volume was created on the subnode's c-vol backend.

https://zuul.opendev.org/t/openstack/build/c78d3ab4e6a748b4a53c6ff6dc273106/log/logs/screen-n-cpu.txt#7060

Mar 19 19:43:26.844295 ubuntu-bionic-rax-ord-0015339373 nova-compute[9838]: ERROR nova.compute.manager [None req-512a96c8-8b32-49c7-8d29-7ff300ed4482 demo admin] [instance: 702ff125-d947-4a28-853b-82dcd58b990e] Setting instance vm_state to ERROR: ClientException: The server has either erred or is incapable of performing the requested operation. (HTTP 500)

https://zuul.opendev.org/t/openstack/build/c78d3ab4e6a748b4a53c6ff6dc273106/log/logs/screen-c-api.txt#1936

Mar 19 19:43:26.262818 ubuntu-bionic-rax-ord-0015339373 <email address hidden>[27200]: ERROR cinder.api.middleware.fault [req-512a96c8-8b32-49c7-8d29-7ff300ed4482 req-826f7c01-3c02-4d9e-9046-8a15d7fa9b61 demo admin] Caught error: <class 'oslo_messaging.exceptions.MessagingTimeout'> Timed out waiting for a reply to message ID 23fabce9b79441198fbe4fe71c0ac7ab: MessagingTimeout: Timed out waiting for a reply to message ID 23fabce9b79441198fbe4fe71c0ac7ab
Mar 19 19:43:26.262818 ubuntu-bionic-rax-ord-0015339373 <email address hidden>[27200]: ERROR
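
The two excerpts above are tied together by the request ID req-512a96c8-8b32-49c7-8d29-7ff300ed4482, which appears in both the n-cpu and c-api logs. A minimal sketch of how that correlation could be automated is shown here. The helper names are hypothetical and the sample lines are shortened copies of the log lines above, not the full logs.

```python
import re

# Matches request IDs of the form seen in the logs above,
# e.g. req-512a96c8-8b32-49c7-8d29-7ff300ed4482.
REQUEST_ID = re.compile(r"req-[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}")


def request_ids(line):
    """Return the set of request IDs found in one log line."""
    return set(REQUEST_ID.findall(line))


def shared_request_ids(lines_a, lines_b):
    """Return request IDs that appear in both collections of log lines."""
    ids_a = set().union(*(request_ids(line) for line in lines_a))
    ids_b = set().union(*(request_ids(line) for line in lines_b))
    return ids_a & ids_b


# Shortened copies of the n-cpu and c-api lines quoted in this report.
nova_line = (
    "ERROR nova.compute.manager [None "
    "req-512a96c8-8b32-49c7-8d29-7ff300ed4482 demo admin] "
    "Setting instance vm_state to ERROR"
)
cinder_line = (
    "ERROR cinder.api.middleware.fault "
    "[req-512a96c8-8b32-49c7-8d29-7ff300ed4482 "
    "req-826f7c01-3c02-4d9e-9046-8a15d7fa9b61 demo admin] Caught error"
)

if __name__ == "__main__":
    print(shared_request_ids([nova_line], [cinder_line]))
```

The shared ID is what links the HTTP 500 seen by nova-compute to the MessagingTimeout raised inside the Cinder API.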

Ultimately, we shouldn't run these services on the computes, but for now we should limit the services stopped on the subnode to n-cpu and q-agt.
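
The proposed restriction can be sketched as a simple allow-list filter. This is an illustration only, not the change that was merged: the helper names are hypothetical, and it assumes devstack runs each service as a systemd unit named devstack@<service>.service.

```python
# Services that must be stopped to fence the compute host. c-vol and g-api
# are deliberately absent so that volumes hosted on the subnode stay reachable.
FENCED_SERVICES = ("n-cpu", "q-agt")


def units_to_stop(deployed, fenced=FENCED_SERVICES):
    """Return systemd unit names for the deployed services that need fencing.

    Assumption: units follow the devstack@<service>.service naming scheme.
    """
    return [f"devstack@{name}.service" for name in deployed if name in fenced]


def stop_command(deployed):
    """Build the argv for stopping the fenced units, or [] if none apply."""
    units = units_to_stop(deployed)
    return ["sudo", "systemctl", "stop", *units] if units else []


if __name__ == "__main__":
    # Services running on the subnode according to this report.
    subnode_services = ["n-cpu", "q-agt", "c-vol", "g-api"]
    print(" ".join(stop_command(subnode_services)))
```

Using an explicit allow-list rather than stopping every devstack unit means any other service later added to the subnode keeps running unless it is listed for fencing.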

Steps to reproduce
==================
Run the nova-live-migration job. If volumes are created on the subnode's c-vol backend, evacuation testing will fail.

Expected result
===============
nova-live-migration passes.

Actual result
=============
nova-live-migration fails.

Environment
===========
1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/

   Master or stable/train with I8af2ad741ca08c3d88efb9aa817c4d1470491a23 applied.

2. Which hypervisor did you use?
   (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
   What's the version of that?

   Libvirt + KVM

3. Which storage type did you use?
   (For example: Ceph, LVM, GPFS, ...)
   What's the version of that?

   N/A

4. Which networking type did you use?
   (For example: nova-network, Neutron with OpenVSwitch, ...)

   N/A

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/714057

Changed in nova:
assignee: nobody → Lee Yarwood (lyarwood)
status: New → In Progress
Lee Yarwood (lyarwood)
Changed in nova:
importance: Undecided → High
tags: added: live-migration volumes
tags: added: evacuate
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/714057
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1e16b3184d4e298c454ede7c56040f6d70276a0c
Submitter: Zuul
Branch: master

commit 1e16b3184d4e298c454ede7c56040f6d70276a0c
Author: Lee Yarwood <email address hidden>
Date: Fri Mar 20 09:19:55 2020 +0000

    nova-live-migration: Only stop n-cpu and q-agt during evacuation testing

    I8af2ad741ca08c3d88efb9aa817c4d1470491a23 started to correctly fence the
    subnode ahead of evacuation testing but missed that c-vol and g-api
    were also running on the host. As a result the BFV evacuation test will
    fail if the volume being used is created on the c-vol backend hosted on
    the subnode.

    This change now avoids this by limiting the services stopped ahead of
    the evacuation on the subnode to n-cpu and q-agt.

    Change-Id: Ia7c317e373e4037495d379d06eda19a71412d409
    Closes-Bug: #1868234

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/713961
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=91d410b92f86524dffb781c7a49ecb3ec8ac8a56
Submitter: Zuul
Branch: stable/train

commit 91d410b92f86524dffb781c7a49ecb3ec8ac8a56
Author: Lee Yarwood <email address hidden>
Date: Wed Mar 18 15:17:27 2020 +0000

    nova-live-migration: Ensure subnode is fenced during evacuation testing

    As stated in the forced-down API [1]:

    > Setting a service forced down without completely fencing it will
    > likely result in the corruption of VMs on that host.

    Previously only the libvirtd service was stopped on the subnode prior to
    calling this API, allowing n-cpu, q-agt and the underlying guest domains
    to continue running on the host.

    This change now ensures all devstack services are stopped on the subnode
    and all active domains destroyed.

    It is hoped that this will resolve bug #1813789 where evacuations have
    timed out due to VIF plugging issues on the new destination host.

    [1] https://docs.openstack.org/api-ref/compute/?expanded=update-forced-down-detail#update-forced-down

    NOTE(lyarwood): The following change is squashed here to allow both to
    pass the gate without encountering additional failures.

    nova-live-migration: Only stop n-cpu and q-agt during evacuation testing

    I8af2ad741ca08c3d88efb9aa817c4d1470491a23 started to correctly fence the
    subnode ahead of evacuation testing but missed that c-vol and g-api
    were also running on the host. As a result the BFV evacuation test will
    fail if the volume being used is created on the c-vol backend hosted on
    the subnode.

    This change now avoids this by limiting the services stopped ahead of
    the evacuation on the subnode to n-cpu and q-agt.

    Change-Id: Ia7c317e373e4037495d379d06eda19a71412d409
    Closes-Bug: #1868234
    (cherry picked from commit 1e16b3184d4e298c454ede7c56040f6d70276a0c)

    Related-Bug: #1813789
    Change-Id: I8af2ad741ca08c3d88efb9aa817c4d1470491a23
    (cherry picked from commit b097959c1cbc9af1d90c7502286bc3e20972201f)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/713962
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e7cd8b49e21852b5fafec7215d40e758cde92a35
Submitter: Zuul
Branch: stable/stein

commit e7cd8b49e21852b5fafec7215d40e758cde92a35
Author: Lee Yarwood <email address hidden>
Date: Wed Mar 18 15:17:27 2020 +0000

    nova-live-migration: Ensure subnode is fenced during evacuation testing

    As stated in the forced-down API [1]:

    > Setting a service forced down without completely fencing it will
    > likely result in the corruption of VMs on that host.

    Previously only the libvirtd service was stopped on the subnode prior to
    calling this API, allowing n-cpu, q-agt and the underlying guest domains
    to continue running on the host.

    This change now ensures all devstack services are stopped on the subnode
    and all active domains destroyed.

    It is hoped that this will resolve bug #1813789 where evacuations have
    timed out due to VIF plugging issues on the new destination host.

    [1] https://docs.openstack.org/api-ref/compute/?expanded=update-forced-down-detail#update-forced-down

    NOTE(lyarwood): The following change is squashed here to allow both to
    pass the gate without encountering additional failures.

    nova-live-migration: Only stop n-cpu and q-agt during evacuation testing

    I8af2ad741ca08c3d88efb9aa817c4d1470491a23 started to correctly fence the
    subnode ahead of evacuation testing but missed that c-vol and g-api
    were also running on the host. As a result the BFV evacuation test will
    fail if the volume being used is created on the c-vol backend hosted on
    the subnode.

    This change now avoids this by limiting the services stopped ahead of
    the evacuation on the subnode to n-cpu and q-agt.

    Change-Id: Ia7c317e373e4037495d379d06eda19a71412d409
    Closes-Bug: #1868234
    (cherry picked from commit 1e16b3184d4e298c454ede7c56040f6d70276a0c)

    Related-Bug: #1813789
    Change-Id: I8af2ad741ca08c3d88efb9aa817c4d1470491a23
    (cherry picked from commit b097959c1cbc9af1d90c7502286bc3e20972201f)
    (cherry picked from commit 91d410b92f86524dffb781c7a49ecb3ec8ac8a56)
