ml2/ovn refuses to bind port due to dead agent randomly in the nova-live-migrate ci job

Bug #2020215 reported by sean mooney
This bug affects 1 person
Affects: neutron
Status: Confirmed
Importance: High
Assigned to: sean mooney
Milestone: (none)

Bug Description

we have seen random failures of

test_volume_backed_live_migration[id-5071cf17-3004-4257-ae61-73a84e28badd,multinode,volume]

in the nova-live-migration job with the following error

Details: {'code': 400, 'message': 'Migration pre-check error: Binding failed for port e3308a61-39ff-4064-abb2-76de0d2139dc, please check neutron logs for more information.'}

looking at the neutron log we see

May 09 00:10:26.714817 np0033982852 neutron-server[78010]: WARNING neutron.plugins.ml2.drivers.ovn.mech_driver.mech_driver [req-25d762eb-ffb1-45df-badb-6e02f89e0152 req-f0c9ff35-90a0-49e5-8005-93f3c2bb3ab4 service neutron] Refusing to bind port e3308a61-39ff-4064-abb2-76de0d2139dc to dead agent: <neutron.plugins.ml2.drivers.ovn.agent.neutron_agent.ControllerAgent object at 0x7f6a7a6d2950>

May 09 00:10:26.716243 np0033982852 neutron-server[78010]: ERROR neutron.plugins.ml2.managers [req-25d762eb-ffb1-45df-badb-6e02f89e0152 req-f0c9ff35-90a0-49e5-8005-93f3c2bb3ab4 service neutron] Failed to bind port e3308a61-39ff-4064-abb2-76de0d2139dc on host np0033982853 for vnic_type normal using segments [{'id': '1770965e-ddf9-4519-96b1-943912334f78', 'network_type': 'geneve', 'physical_network': None, 'segmentation_id': 525, 'network_id': '745f0724-2779-4d60-845c-8f673d567d0d'}]

and the following in the neutron-ovn-metadata-agent log on the host the VM is migrating to.

May 09 00:10:23.765529 np0033982853 neutron-ovn-metadata-agent[38857]: DEBUG neutron.agent.ovn.metadata.agent [-] Delaying updating chassis table for 10 seconds {{(pid=38857) run /opt/stack/neutron/neutron/agent/ovn/metadata/agent.py:243}}

This looks like it might be related to

https://github.com/openstack/neutron/commit/628442aed7400251f12809a45605bd717f494c4e

This modified the code to add some randomness due to https://bugs.launchpad.net/neutron/+bug/1991817

but that seems to negatively impact the stability of the agent.

to fix this I will propose a patch to change the interval from

interval = randint(0, cfg.CONF.agent_down_time // 2)

to

interval = randint(0, cfg.CONF.agent_down_time // 3)

to increase the likelihood that we send the heartbeat in time.

when we are making calls to privsep and ovs, the logs stop for multiple seconds while those operations are happening, and if that happens at the wrong time I believe this leads to us missing the heartbeat interval.
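To make the effect of the proposed divisor change concrete, here is a minimal sketch of the arithmetic. It assumes agent_down_time is 75 seconds (an assumption for illustration; the value in the CI job may differ), and the helper names are hypothetical, not neutron code. It only shows the worst-case random delay and how much of agent_down_time is left over for stalls such as slow privsep or ovs calls.

```python
AGENT_DOWN_TIME = 75  # assumed value of cfg.CONF.agent_down_time, in seconds


def max_heartbeat_delay(agent_down_time, divisor):
    """Largest value randint(0, agent_down_time // divisor) can return."""
    return agent_down_time // divisor


def remaining_slack(agent_down_time, divisor):
    """Seconds left before agent_down_time elapses after the worst-case delay."""
    return agent_down_time - max_heartbeat_delay(agent_down_time, divisor)


for divisor in (2, 3):
    print(
        f"// {divisor}: max delay "
        f"{max_heartbeat_delay(AGENT_DOWN_TIME, divisor)}s, "
        f"slack {remaining_slack(AGENT_DOWN_TIME, divisor)}s"
    )
```

Under that assumption, moving from // 2 to // 3 lowers the worst-case delay from 37 to 25 seconds, which leaves 50 seconds of slack instead of 38.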

Changed in neutron:
assignee: nobody → sean mooney (sean-k-mooney)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/883687

Changed in neutron:
status: New → Confirmed
tags: added: ovn
Changed in neutron:
importance: Undecided → High
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/883687
Committed: https://opendev.org/openstack/neutron/commit/5e0c102830a18850e35f746160867613e96d1dbc
Submitter: "Zuul (22348)"
Branch: master

commit 5e0c102830a18850e35f746160867613e96d1dbc
Author: Sean Mooney <email address hidden>
Date: Wed May 31 13:23:32 2023 +0100

    Send ovn heatbeat more often.

    This change modifies the metadata agent heatbeat
    to use a random offset with a max delay of 10 seconds.

    The orgial reason for the current logic was to mitigate
    https://bugs.launchpad.net/neutron/+bug/1991817
    so the logic to spread the heatbeats is maintained but
    we now set an upper bound on the delay.

    Close-Bug: #2020215
    Change-Id: I4d382793255520b9c44ca2aaacebcbda9a432dde
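
The commit message describes keeping the random spread while bounding the delay at 10 seconds. One way such a bound could look is sketched below. This is an illustration only, not the merged neutron code: the function name, the cap parameter and the choice to combine the cap with agent_down_time // 3 are all assumptions.

```python
from random import randint


def capped_heartbeat_delay(agent_down_time, cap=10):
    """Pick a random delay in seconds that never exceeds `cap`.

    Hypothetical helper: the spread still scales with agent_down_time
    for small values, but is bounded above by `cap` for large ones.
    """
    upper = min(agent_down_time // 3, cap)
    return randint(0, upper)


# With an assumed agent_down_time of 75, the delay is always in 0..10.
delay = capped_heartbeat_delay(75)
print(f"Delaying updating chassis table for {delay} seconds")
```

With this shape, raising agent_down_time no longer raises the worst-case heartbeat delay, while agents still pick different offsets and so avoid updating the chassis table all at once.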

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/neutron/+/892592

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/892592
Committed: https://opendev.org/openstack/neutron/commit/98c4ae595b7962e71fa1007d60b782da37528fee
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 98c4ae595b7962e71fa1007d60b782da37528fee
Author: Sean Mooney <email address hidden>
Date: Wed May 31 13:23:32 2023 +0100

    Send ovn heatbeat more often.

    This change modifies the metadata agent heatbeat
    to use a random offset with a max delay of 10 seconds.

    The orgial reason for the current logic was to mitigate
    https://bugs.launchpad.net/neutron/+bug/1991817
    so the logic to spread the heatbeats is maintained but
    we now set an upper bound on the delay.

    Close-Bug: #2020215
    Change-Id: I4d382793255520b9c44ca2aaacebcbda9a432dde
    (cherry picked from commit 5e0c102830a18850e35f746160867613e96d1dbc)

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/neutron/+/892746

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/zed)

Fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/neutron/+/898940

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/898940
Committed: https://opendev.org/openstack/neutron/commit/d1e34461334b6711523dcbd7b3365e32bf2af6dc
Submitter: "Zuul (22348)"
Branch: stable/zed

commit d1e34461334b6711523dcbd7b3365e32bf2af6dc
Author: Sean Mooney <email address hidden>
Date: Wed May 31 13:23:32 2023 +0100

    Send ovn heatbeat more often.

    This change modifies the metadata agent heatbeat
    to use a random offset with a max delay of 10 seconds.

    The orgial reason for the current logic was to mitigate
    https://bugs.launchpad.net/neutron/+bug/1991817
    so the logic to spread the heatbeats is maintained but
    we now set an upper bound on the delay.

    Close-Bug: #2020215
    Change-Id: I4d382793255520b9c44ca2aaacebcbda9a432dde
    (cherry picked from commit 5e0c102830a18850e35f746160867613e96d1dbc)

tags: added: in-stable-zed
tags: added: in-stable-yoga
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/892746
Committed: https://opendev.org/openstack/neutron/commit/7746912334092b3c71290b7a063bf286545252e1
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit 7746912334092b3c71290b7a063bf286545252e1
Author: Sean Mooney <email address hidden>
Date: Wed May 31 13:23:32 2023 +0100

    Send ovn heatbeat more often.

    This change modifies the metadata agent heatbeat
    to use a random offset with a max delay of 10 seconds.

    The orgial reason for the current logic was to mitigate
    https://bugs.launchpad.net/neutron/+bug/1991817
    so the logic to spread the heatbeats is maintained but
    we now set an upper bound on the delay.

    Close-Bug: #2020215
    Change-Id: I4d382793255520b9c44ca2aaacebcbda9a432dde
    (cherry picked from commit 5e0c102830a18850e35f746160867613e96d1dbc)
