evacuation failure causes POST_FAILURE in nova-ovs-hybrid-plug job

Bug #1971563 reported by Balazs Gibizer
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Undecided
Unassigned
neutron
Confirmed
Undecided
Unassigned

Bug Description

It is similar on the surface to https://bugs.launchpad.net/nova/+bug/1970642 but in that cinder returned http 500 but here neutron vif plug timed out.

This is the instance that failed: https://zuul.opendev.org/t/openstack/build/518f8641b9a7419391b0f99f795f26bd/log/job-output.txt#57033
And this is the vif timeout: https://zuul.opendev.org/t/openstack/build/518f8641b9a7419391b0f99f795f26bd/log/controller/logs/screen-n-cpu.txt#15986

May 03 16:53:41.879039 ubuntu-focal-ovh-bhs1-0029531414 nova-compute[96742]: WARNING nova.compute.manager [None req-6fcb7ce5-c9d7-4f0c-8547-c71bd0f09496 demo admin] [instance: 3a81145d-d263-4e1d-8ec3-faf38fed34f2] Timeout waiting for ['network-vif-plugged-b6dc2b79-ed38-4907-86e2-bdff1c5a9b9f'] for instance with vm_state error and task_state rebuild_spawning. Event states are: network-vif-plugged-b6dc2b79-ed38-4907-86e2-bdff1c5a9b9f: timed out after 300.00 seconds: eventlet.timeout.Timeout: 300 seconds

I see an ACTIVE -> ACTIVE port status transition in the neutron log
https://zuul.opendev.org/t/openstack/build/518f8641b9a7419391b0f99f795f26bd/log/controller/logs/screen-q-svc.txt#17035 that seems to fit the timing when the plug happened based on
https://zuul.opendev.org/t/openstack/build/518f8641b9a7419391b0f99f795f26bd/log/controller/logs/screen-n-cpu.txt#15700

So I think neutron failed to sent the vif plug.

Tags: gate-failure
tags: added: gate-failure
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

It is not super frequent I see two hits in the last two weeks: https://paste.opendev.org/show/bst9YNOiZEbIyDBGrc5f/

Revision history for this message
sean mooney (sean-k-mooney) wrote :
Download full text (3.9 KiB)

looking at this brifely it appears that the neutron server has rescived the provisioning complete notification form the dhcp agent and l2 agent

May 03 16:47:04.296348 ubuntu-focal-ovh-bhs1-0029531414 neutron-server[89517]: DEBUG neutron.db.provisioning_blocks [None req-89d4fdba-e0f4-4778-b944-ed87f5102ff1 None None] Provisioning complete for port b6dc2b79-ed38-4907-86e2-bdff1c5a9b9f triggered by entity DHCP.

then since the port is not bound we see

May 03 16:47:04.325819 ubuntu-focal-ovh-bhs1-0029531414 neutron-server[89517]: DEBUG neutron.plugins.ml2.plugin [None req-89d4fdba-e0f4-4778-b944-ed87f5102ff1 None None] Port b6dc2b79-ed38-4907-86e2-bdff1c5a9b9f cannot update to ACTIVE because it is not bound. {{(pid=89517) _port_provisioned /opt/stack/neutron/neutron/plugins/ml2/plugin.py:352}}

later we see the l2 agent wire up the port

May 03 16:47:22.223506 ubuntu-focal-ovh-bhs1-0029531414 neutron-server[89517]: DEBUG neutron.plugins.ml2.rpc [None req-d5636366-a9cc-4a96-a55f-72faf18b8834 None None] Device b6dc2b79-ed38-4907-86e2-bdff1c5a9b9f up at agent ovs-agent-ubuntu-focal-ovh-bhs1-0029531417 {{(pid=89517) update_device_up /opt/stack/neutron/neutron/plugins/ml2/rpc.py:296}}

May 03 16:47:22.277185 ubuntu-focal-ovh-bhs1-0029531414 neutron-server[89517]: DEBUG neutron.db.provisioning_blocks [None req-d5636366-a9cc-4a96-a55f-72faf18b8834 None None] Provisioning for port b6dc2b79-ed38-4907-86e2-bdff1c5a9b9f completed by entity L2. {{(pid=89517) provisioning_complete /opt/stack/neutron/neutron/db/provisioning_blocks.py:137}}

and the neutron server moves the port to active

May 03 16:47:22.346079 ubuntu-focal-ovh-bhs1-0029531414 neutron-server[89517]: DEBUG neutron.plugins.ml2.plugin [None req-d5636366-a9cc-4a96-a55f-72faf18b8834 None None] Current status of the port b6dc2b79-ed38-4907-86e2-bdff1c5a9b9f is: DOWN; New status is: ACTIVE {{(pid=89517) _update_individual_port_db_status /opt/stack/neutron/neutron/plugins/ml2/plugin.py:2297}}

neutron sends the event at

May 03 16:47:22.410536 ubuntu-focal-ovh-bhs1-0029531414 neutron-server[89517]: INFO neutron.notifiers.nova [-] Nova event matching ['req-17e4fee0-ad06-4350-af09-1db0d331d6b5'] response: {'server_uuid': '3a81145d-d263-4e1d-8ec3-faf38fed34f2', 'name': 'network-vif-plugged', 'status': 'completed', 'tag': 'b6dc2b79-ed38-4907-86e2-bdff1c5a9b9f', 'code': 200}

we then have additional events where the l2 agent marks the port as provisioned

May 03 16:47:24.195188 ubuntu-focal-ovh-bhs1-0029531414 neutron-server[89517]: DEBUG neutron.plugins.ml2.rpc [None req-d5636366-a9cc-4a96-a55f-72faf18b8834 None None] Device b6dc2b79-ed38-4907-86e2-bdff1c5a9b9f up at agent ovs-agent-ubuntu-focal-ovh-bhs1-0029531417 {{(pid=89517) update_device_up /opt/stack/neutron/neutron/plugins/ml2/rpc.py:296}}

May 03 16:47:24.228930 ubuntu-focal-ovh-bhs1-0029531414 neutron-server[89517]: DEBUG neutron.db.provisioning_blocks [None req-d5636366-a9cc-4a96-a55f-72faf18b8834 None None] Provisioning complete for port b6dc2b79-ed38-4907-86e2-bdff1c5a9b9f triggered by entity L2. {{(pid=89517) provisioning_complete /opt/stack/neutron/neutron/db/provisioning_blocks.py:145}}

May 03 16:47:24.229155 ubuntu-focal-ovh-bhs1-002...

Read more...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/840446

Changed in nova:
status: New → In Progress
Revision history for this message
Elvira García Ruiz (elviragr) wrote :

I saw this other bug that was filed long ago about a similar problem, adding in case it can be of any use: https://bugs.launchpad.net/neutron/+bug/1808171

Changed in neutron:
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/840446
Committed: https://opendev.org/openstack/nova/commit/a3a593ad55480d240d0dd17b0dcf1185ea1498b8
Submitter: "Zuul (22348)"
Branch: master

commit a3a593ad55480d240d0dd17b0dcf1185ea1498b8
Author: Balazs Gibizer <email address hidden>
Date: Wed May 4 14:20:56 2022 +0200

    Enable live_migration_events in nova-ovs-hybrid-plug

    The flag was added to prevent the source host neutron agents to trigger
    a vif-plugged event too early during a live migration. But actually it
    can be used in a more generic sense as the code filter on migration_to
    binding profile attribute.

    We saw too early vif-plugged events from neutron during evacuation in
    post part of the nova-ovs-hybrid-plug job. So this patch enables the
    workaround flag for this job too.

    Closes-Bug: #1971563
    Change-Id: Ifd20ece3a4f126da16f077247c2f1e072edb7163

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/nova/+/840931

Revision history for this message
Elvira García Ruiz (elviragr) wrote :

The fix released for this bug might not be necessary anymore. That config option in Neutron is no longer used and it always defaults to True: https://review.opendev.org/c/openstack/neutron/+/840448

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

@Elvira: You are correct. On master the neutron change makes the nova fix unnecessary. However the nova fix is needed on stable branches as the neutron change will not be backported. Anyhow I will propose the revert on master after the stable backports are landed.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 26.0.0.0rc1

This issue was fixed in the openstack/nova 26.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/yoga)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/yoga
Review: https://review.opendev.org/c/openstack/nova/+/840931
Reason: stable/yoga branch of openstack/nova is about to be deleted. To be able to do that, all open patches need to be abandoned. Please cherry pick the patch to unmaintained/yoga if you want to further work on this patch.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.