Instance stopping fails randomly due to already stopped instances

Bug #1980736 reported by suzhengwei
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
masakari
Fix Released
Undecided
suzhengwei

Bug Description

when an instance qemu-kvm pid abnormally terminated, it triggered an instance failure notification. The recovery workflow for the instance faiure is to stop then start the instance. In my env, the instance stop action random failed, because the nova-compute would periodically sync the instance status. The instance showed active but calling stop action it just returned 409 "code": 409, "message": "Cannot 'stop' instance be66077f-1092-4490-bf2c-a5be6268fd52 while it is in vm_state stopped".

masakari-engine.log
2022-07-05 17:41:47.818 28 INFO masakari.compute.nova [req-ea99511d-29a5-43d7-8b99-a03f8fd46d87 req-b21745a8-69d0-419a-8045-c0f4bc90d15b nova - - - -] Call get server command for instance be66077f-1092-4490-bf2c-a5be6268fd52
2022-07-05 17:41:48.185 28 DEBUG novaclient.v2.client [req-ea99511d-29a5-43d7-8b99-a03f8fd46d87 req-b21745a8-69d0-419a-8045-c0f4bc90d15b nova - - - -] REQ: curl -g -i -X GET http://100.2.244.113:5080/nova/v2.1/8ae58d5fe1864792bba4dcd79b9b4ccd/servers/be66077f-1092-4490-bf2c-a5be6268fd52 -H "Accept: application/json" -H "OpenStack-API-Version: compute 2.56" -H "User-Agent: python-novaclient" -H "X-Auth-Token: {SHA256}6107c32c79f4d88bb6c96020b8c1de43a98a7747651cacc0825874d6b6decf7b" -H "X-OpenStack-Nova-API-Version: 2.56" -H "X-OpenStack-Request-ID: req-ea99511d-29a5-43d7-8b99-a03f8fd46d87" _http_log_request /var/lib/kolla/venv/lib/python3.6/site-packages/keystoneauth1/session.py:517
2022-07-05 17:41:48.445 28 DEBUG novaclient.v2.client [req-ea99511d-29a5-43d7-8b99-a03f8fd46d87 req-b21745a8-69d0-419a-8045-c0f4bc90d15b nova - - - -] RESP: [200] Connection: keep-alive Content-Length: 2085 Content-Type: application/json Date: Tue, 05 Jul 2022 09:41:48 GMT Openstack-Api-Version: compute 2.56 Vary: OpenStack-API-Version, X-OpenStack-Nova-API-Version X-Compute-Request-Id: req-9e1f1d7d-dbbc-4979-aca3-1c702cd12cd2 X-Openstack-Nova-Api-Version: 2.56 X-Openstack-Request-Id: req-9e1f1d7d-dbbc-4979-aca3-1c702cd12cd2 _http_log_response /var/lib/kolla/venv/lib/python3.6/site-packages/keystoneauth1/session.py:548
2022-07-05 17:41:48.446 28 DEBUG novaclient.v2.client [req-ea99511d-29a5-43d7-8b99-a03f8fd46d87 req-b21745a8-69d0-419a-8045-c0f4bc90d15b nova - - - -] RESP BODY: {"server": {"id": "be66077f-1092-4490-bf2c-a5be6268fd52", "name": "scyrbd007", "status": "ACTIVE", "tenant_id": "c2bcf4067e8a4bacb20619ddadbd7377", "user_id": "21c0dcf1bc78464c8df149a1ae8a5304", "metadata": {"panic": "", "cpu_mode": "host-model", "source_type": "volume", "serial_number": "0daf20e6-7103-4bb4-935d-4b96062e635c"}, "hostId": "6560346edd56072f5c98239a5dfb938aaa52b45a00fa56f90a940773", "image": "", "flavor": {"vcpus": 1, "ram": 512, "disk": 1, "ephemeral": 0, "swap": 0, "original_name": "m1.tiny", "extra_specs": {"capabilities:hypervisor_type": "QEMU"}}, "created": "2022-07-05T06:49:26Z", "updated": "2022-07-05T08:51:03Z", "addresses": {"zqx_net": [{"version": 4, "addr": "192.168.67.74", "OS-EXT-IPS:type": "fixed", "OS-EXT-IPS-MAC:mac_addr": "fa:16:3e:4e:ab:9f"}]}, "accessIPv4": "", "accessIPv6": "", "links": [{"rel": "self", "href": "http://100.2.244.113:5080/v2.1/8ae58d5fe1864792bba4dcd79b9b4ccd/servers/be66077f-1092-4490-bf2c-a5be6268fd52"}, {"rel": "bookmark", "href": "http://100.2.244.113:5080/8ae58d5fe1864792bba4dcd79b9b4ccd/servers/be66077f-1092-4490-bf2c-a5be6268fd52"}], "OS-DCF:diskConfig": "AUTO", "progress": 0, "locked": false, "description": "", "tags": [], "config_drive": "True", "OS-EXT-AZ:availability_zone": "nova", "OS-EXT-SRV-ATTR:hypervisor_hostname": "node3.cluster.local", "OS-EXT-SRV-ATTR:host": "node3.cluster.local", "OS-EXT-SRV-ATTR:instance_name": "instance-00000038", "OS-EXT-SRV-ATTR:reservation_id": "r-wfbn2dyz", "OS-EXT-SRV-ATTR:launch_index": 0, "OS-EXT-SRV-ATTR:hostname": "scyrbd007", "OS-EXT-SRV-ATTR:kernel_id": "", "OS-EXT-SRV-ATTR:ramdisk_id": "", "OS-EXT-SRV-ATTR:root_device_name": "/dev/vda", "OS-EXT-SRV-ATTR:user_data": null, "host_status": "UP", "OS-EXT-STS:task_state": null, "OS-EXT-STS:vm_state": "active", "OS-EXT-STS:power_state": 1, "os-extended-volumes:volumes_attached": [{"id": "9f982463-b85d-4f48-9a1c-6edb9b1c7a74", "delete_on_termination": false}], "key_name": null, "security_groups": [{"name": "default"}], "OS-SRV-USG:launched_at": "2022-07-05T06:50:08.000000", "OS-SRV-USG:terminated_at": null}} _http_log_response /var/lib/kolla/venv/lib/python3.6/site-packages/keystoneauth1/session.py:580
2022-07-05 17:41:48.446 28 DEBUG novaclient.v2.client [req-ea99511d-29a5-43d7-8b99-a03f8fd46d87 req-b21745a8-69d0-419a-8045-c0f4bc90d15b nova - - - -] GET call to nova for http://100.2.244.113:5080/nova/v2.1/8ae58d5fe1864792bba4dcd79b9b4ccd/servers/be66077f-1092-4490-bf2c-a5be6268fd52 used request id req-9e1f1d7d-dbbc-4979-aca3-1c702cd12cd2 request /var/lib/kolla/venv/lib/python3.6/site-packages/keystoneauth1/session.py:943
2022-07-05 17:41:48.447 28 DEBUG masakari.compute.nova [req-ea99511d-29a5-43d7-8b99-a03f8fd46d87 req-9411b91e-ed81-4a81-8ca5-a48716c436f3 nova - - - -] Creating a Nova client using "nova" user novaclient /var/lib/kolla/venv/lib/python3.6/site-packages/masakari/compute/nova.py:109
2022-07-05 17:41:48.449 28 INFO masakari.compute.nova [req-ea99511d-29a5-43d7-8b99-a03f8fd46d87 req-9411b91e-ed81-4a81-8ca5-a48716c436f3 nova - - - -] Call stop server command for instance be66077f-1092-4490-bf2c-a5be6268fd52
2022-07-05 17:41:48.812 28 DEBUG novaclient.v2.client [req-ea99511d-29a5-43d7-8b99-a03f8fd46d87 req-9411b91e-ed81-4a81-8ca5-a48716c436f3 nova - - - -] REQ: curl -g -i -X POST http://100.2.244.113:5080/nova/v2.1/8ae58d5fe1864792bba4dcd79b9b4ccd/servers/be66077f-1092-4490-bf2c-a5be6268fd52/action -H "Accept: application/json" -H "Content-Type: application/json" -H "OpenStack-API-Version: compute 2.56" -H "User-Agent: python-novaclient" -H "X-Auth-Token: {SHA256}d18bb647ea28f4167bb4b4fe97cf41980cc9704dd8857a6bf09314a5c98c3788" -H "X-OpenStack-Nova-API-Version: 2.56" -H "X-OpenStack-Request-ID: req-ea99511d-29a5-43d7-8b99-a03f8fd46d87" -d '{"os-stop": null}' _http_log_request /var/lib/kolla/venv/lib/python3.6/site-packages/keystoneauth1/session.py:517
2022-07-05 17:41:48.924 28 DEBUG novaclient.v2.client [req-ea99511d-29a5-43d7-8b99-a03f8fd46d87 req-9411b91e-ed81-4a81-8ca5-a48716c436f3 nova - - - -] RESP: [409] Connection: keep-alive Content-Length: 143 Content-Type: application/json; charset=UTF-8 Date: Tue, 05 Jul 2022 09:41:48 GMT Openstack-Api-Version: compute 2.56 Vary: OpenStack-API-Version, X-OpenStack-Nova-API-Version X-Compute-Request-Id: req-3b6d1ff0-dff8-4d0f-b2b7-6368387b9b27 X-Openstack-Nova-Api-Version: 2.56 X-Openstack-Request-Id: req-3b6d1ff0-dff8-4d0f-b2b7-6368387b9b27 _http_log_response /var/lib/kolla/venv/lib/python3.6/site-packages/keystoneauth1/session.py:548
2022-07-05 17:41:48.925 28 DEBUG novaclient.v2.client [req-ea99511d-29a5-43d7-8b99-a03f8fd46d87 req-9411b91e-ed81-4a81-8ca5-a48716c436f3 nova - - - -] RESP BODY: {"conflictingRequest": {"code": 409, "message": "Cannot 'stop' instance be66077f-1092-4490-bf2c-a5be6268fd52 while it is in vm_state stopped"}} _http_log_response /var/lib/kolla/venv/lib/python3.6/site-packages/keystoneauth1/session.py:580

nova-compute.log
2022-07-05 17:41:48.344 444801 INFO nova.compute.manager [- req-None - - - - -] [instance: be66077f-1092-4490-bf2c-a5be6268fd52] VM Stopped (Lifecycle Event)
2022-07-05 17:41:48.382 444801 INFO nova.compute.manager [None req-9aff95d8-54f8-4a20-9db7-1a529805435c - - - - -] Lifecycle Event is 1
2022-07-05 17:41:48.422 444801 INFO nova.compute.manager [None req-9aff95d8-54f8-4a20-9db7-1a529805435c - - - - -] [instance: be66077f-1092-4490-bf2c-a5be6268fd52] During _sync_instance_power_state the DB power_state (1) does not match the vm_power_state from the hypervisor (4). Updating power_state in the DB to match the hypervisor.
2022-07-05 17:41:48.476 444801 WARNING nova.compute.manager [None req-9aff95d8-54f8-4a20-9db7-1a529805435c - - - - -] [instance: be66077f-1092-4490-bf2c-a5be6268fd52] Instance shutdown by itself. Calling the stop API. Current vm_state: active, current task_state: None, original DB power_state: 1, current VM power_state: 4
2022-07-05 17:41:48.562 444801 INFO nova.compute.manager [None req-9aff95d8-54f8-4a20-9db7-1a529805435c - - - - -] [instance: be66077f-1092-4490-bf2c-a5be6268fd52] Instance is already powered off in the hypervisor when stop is called.
2022-07-05 17:41:48.642 444801 INFO nova.virt.driver [None req-9aff95d8-54f8-4a20-9db7-1a529805435c - - - - -] [instance: be66077f-1092-4490-bf2c-a5be6268fd52] Instance already shutdown.
2022-07-05 17:41:48.647 444801 INFO nova.virt.driver [- req-None - - - - -] [instance: be66077f-1092-4490-bf2c-a5be6268fd52] Instance destroyed successfully.

suzhengwei (sue.sam)
Changed in masakari:
assignee: nobody → suzhengwei (sue.sam)
Changed in masakari:
status: New → In Progress
summary: - stop instance random failed
+ Instance stopping fails randomly due to already stopped instances
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to masakari (master)

Reviewed: https://review.opendev.org/c/openstack/masakari/+/850866
Committed: https://opendev.org/openstack/masakari/commit/7241101163550b3cd981262468dc523ccb25cc9e
Submitter: "Zuul (22348)"
Branch: master

commit 7241101163550b3cd981262468dc523ccb25cc9e
Author: suzhengwei <email address hidden>
Date: Mon Jul 25 17:28:31 2022 +0800

    Fix stopping stopped instances

    Nova-compute would periodically sync the instance status. So in the
    instance failure recovery workflow,it would stop instance random failed
    because of conflict, which will terminate the recovery workflow.
    This patch can catch the Confict exception and continue the recovery
    workflow if the instance already stopped.

    Closes-Bug: 1980736
    Change-Id: I59a1f9d7078614c1ddc8f4c362e967a15b8ec5e8

Changed in masakari:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to masakari (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/masakari/+/856585

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to masakari (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/masakari/+/856586

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to masakari (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/masakari/+/856587

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to masakari (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/masakari/+/856587
Committed: https://opendev.org/openstack/masakari/commit/1680a7320a36faa463d3e6b098b26f478a324996
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 1680a7320a36faa463d3e6b098b26f478a324996
Author: suzhengwei <email address hidden>
Date: Mon Jul 25 17:28:31 2022 +0800

    Fix stopping stopped instances

    Nova-compute would periodically sync the instance status. So in the
    instance failure recovery workflow,it would stop instance random failed
    because of conflict, which will terminate the recovery workflow.
    This patch can catch the Confict exception and continue the recovery
    workflow if the instance already stopped.

    Closes-Bug: 1980736
    Change-Id: I59a1f9d7078614c1ddc8f4c362e967a15b8ec5e8
    (cherry picked from commit 7241101163550b3cd981262468dc523ccb25cc9e)

tags: added: in-stable-wallaby
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to masakari (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/masakari/+/856586
Committed: https://opendev.org/openstack/masakari/commit/ebfbd6de4357b89d836e9fdb1bb3b3565dce023b
Submitter: "Zuul (22348)"
Branch: stable/xena

commit ebfbd6de4357b89d836e9fdb1bb3b3565dce023b
Author: suzhengwei <email address hidden>
Date: Mon Jul 25 17:28:31 2022 +0800

    Fix stopping stopped instances

    Nova-compute would periodically sync the instance status. So in the
    instance failure recovery workflow,it would stop instance random failed
    because of conflict, which will terminate the recovery workflow.
    This patch can catch the Confict exception and continue the recovery
    workflow if the instance already stopped.

    Closes-Bug: 1980736
    Change-Id: I59a1f9d7078614c1ddc8f4c362e967a15b8ec5e8
    (cherry picked from commit 7241101163550b3cd981262468dc523ccb25cc9e)

tags: added: in-stable-xena
tags: added: in-stable-yoga
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to masakari (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/masakari/+/856585
Committed: https://opendev.org/openstack/masakari/commit/f2c4aada79b455836ac733c41daea474dc5979e3
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit f2c4aada79b455836ac733c41daea474dc5979e3
Author: suzhengwei <email address hidden>
Date: Mon Jul 25 17:28:31 2022 +0800

    Fix stopping stopped instances

    Nova-compute would periodically sync the instance status. So in the
    instance failure recovery workflow,it would stop instance random failed
    because of conflict, which will terminate the recovery workflow.
    This patch can catch the Confict exception and continue the recovery
    workflow if the instance already stopped.

    Closes-Bug: 1980736
    Change-Id: I59a1f9d7078614c1ddc8f4c362e967a15b8ec5e8
    (cherry picked from commit 7241101163550b3cd981262468dc523ccb25cc9e)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/masakari 14.0.0.0rc1

This issue was fixed in the openstack/masakari 14.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/masakari 11.0.2

This issue was fixed in the openstack/masakari 11.0.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/masakari 12.0.1

This issue was fixed in the openstack/masakari 12.0.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/masakari 13.0.1

This issue was fixed in the openstack/masakari 13.0.1 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.