gate-grenade-dsvm-neutron-dvr-multinode-ubuntu-xenial fails constantly

Bug #1713927 reported by Jakub Libosvar
This bug affects 1 person
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Brian Haley
Changed in neutron:
assignee: nobody → venkata anil (anil-venkata)
Changed in neutron:
assignee: venkata anil (anil-venkata) → nobody
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

I tried to reproduce the problem by restarting the agent in a two-node setup, but so far I have not been able to reproduce it.

Changed in neutron:
status: New → Confirmed
Kevin Benton (kevinbenton) wrote :

My analysis of the failure from these logs: http://logs.openstack.org/32/498932/1/check/gate-grenade-dsvm-neutron-dvr-multinode-ubuntu-xenial/6ba16a7/logs/grenade.sh.txt.gz

The floating IP is 172.24.5.5, which is associated with port fc637413, which in turn is bound to ubuntu-xenial-2-node-rax-ord-10690035. This *is not* the subnode-2 host.

However, you can see that subnode-2 sets up the floating IP on the interface and arpings for it as though the floating IP were on that host: http://logs.openstack.org/32/498932/1/check/gate-grenade-dsvm-neutron-dvr-multinode-ubuntu-xenial/6ba16a7/logs/subnode-2/screen-q-l3.txt.gz#_Aug_29_21_25_58_221845

Note that this occurs after the Neutron server is upgraded while subnode-2 l3 agent remains running in the Pike version. The server is offline long enough for the Pike subnode-2 agent to do a full sync on revival, at which point it attempts the floating IP setup for the address that doesn't belong to it.

It definitely seems like the issue is that both the subnode-2 and the main node l3 agents are trying to set up the floating IP. Note that subnode-2 is in dvr_snat mode.

So @Swami, it looks like we are missing some logic on the centralized router to not try to host these floating IPs.

To reproduce it, try running a dvr mode node and a dvr_snat mode node, and do the following without even worrying about upgrade tests:

1. Boot a VM on the dvr node, associate a floating IP with it, and ensure you can ping it.
2. Stop the neutron-server process long enough that the report_state calls fail on the dvr_snat node.
3. Start the neutron-server process again and watch the sync process on the dvr_snat node.
4. The dvr_snat node should try to arping for the floating IP that belongs on the dvr node.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/499585

Ihar Hrachyshka (ihar-hrachyshka) wrote :
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/500077

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/500077
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=188f5c29a918e40379671d7d5dda729d3952ab64
Submitter: Jenkins
Branch: stable/pike

commit 188f5c29a918e40379671d7d5dda729d3952ab64
Author: Jakub Libosvar <email address hidden>
Date: Thu Aug 31 12:54:47 2017 +0200

    dvr: Don't raise KeyError in _get_floatingips_bound_to_host

    We thought _get_floatingips_bound_to_host is not needed but removing the
    method caused sending garps for fip that doesn't belong to node during
    the full-sync.

    This patch just replaces dict lookup with get() method, so fips are
    filtered based on presence on the host and if host is not set on fip, it
    won't raise a KeyError.

    Note: This patch hasn't been merged in master yet because of KeyError happening in grenade job now.

    Co-Authored-By: Swaminathan Vasudevan <email address hidden>

    Related-bug: #1712412
    Related-bug: #1713927

    Change-Id: I0fbc772d757fb13b788f9df8d6d7d28d288d054a
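
The difference between the plain dict lookup and get() described in the commit message can be illustrated with a minimal sketch. This is not the Neutron code: the function names and the dict layout below are hypothetical, and only the 'host' key is taken from the commit message.

```python
# Illustrative only: the dict layout and both function names are assumptions,
# not the actual _get_floatingips_bound_to_host implementation.

floating_ips = [
    {"floating_ip_address": "172.24.5.5", "host": "node-a"},
    {"floating_ip_address": "172.24.5.15", "host": "node-b"},
    {"floating_ip_address": "172.24.5.20"},  # 'host' was never set on this one
]


def strict_lookup(fips, host):
    """Filter with fip['host']; raises KeyError if any fip lacks 'host'."""
    return [fip for fip in fips if fip["host"] == host]


def fips_bound_to_host(fips, host):
    """Filter with fip.get('host'); a fip without 'host' is just skipped."""
    return [fip for fip in fips if fip.get("host") == host]


if __name__ == "__main__":
    print(fips_bound_to_host(floating_ips, "node-a"))
    try:
        strict_lookup(floating_ips, "node-a")
    except KeyError as exc:
        print("strict lookup failed on missing key:", exc)
```

With get(), a floating IP that has no host entry compares as None and is filtered out for any named host, which is the behaviour the commit message describes.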

tags: added: in-stable-pike
Jakub Libosvar (libosvar) wrote :

It looks like the KeyError was actually caused by the issue and ignoring it brings back what happened before we reverted Swami's patch.

Looking at the logs:

This is from the node that doesn't have the port bound:
Sep 03 10:26:08.092721 ubuntu-xenial-2-node-inap-mtl01-10753438 neutron-l3-agent[5210]: DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'fip-634d8488-5d69-454c-b8bc-5c7f61f0779d', 'arping', '-U', '-I', 'fg-7addba21-cf', '-c', '1', '-w', '1.5', '172.24.5.15'] {{(pid=5210) execute_rootwrap_daemon /opt/stack/old/neutron/neutron/agent/linux/utils.py:108}}
Sep 03 10:26:08.164393 ubuntu-xenial-2-node-inap-mtl01-10753438 neutron-l3-agent[5210]: DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'fip-634d8488-5d69-454c-b8bc-5c7f61f0779d', 'arping', '-A', '-I', 'fg-7addba21-cf', '-c', '1', '-w', '1.5', '172.24.5.15'] {{(pid=5210) execute_rootwrap_daemon

This is the correct node:
Sep 03 10:31:51.838473 ubuntu-xenial-2-node-inap-mtl01-10753438-867050 neutron-l3-agent[11997]: DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'fip-634d8488-5d69-454c-b8bc-5c7f61f0779d', 'arping', '-U', '-I', 'fg-dc66f98d-28', '-c', '1', '-w', '1.5', '172.24.5.15'] {{(pid=11997) execute_rootwrap_daemon /opt/stack/old/neutron/neutron/agent/linux/utils.py:108}}
Sep 03 10:31:51.898297 ubuntu-xenial-2-node-inap-mtl01-10753438-867050 neutron-l3-agent[11997]: DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'fip-634d8488-5d69-454c-b8bc-5c7f61f0779d', 'arping', '-A', '-I', 'fg-dc66f98d-28', '-c', '1', '-w', '1.5', '172.24.5.15'] {{(pid=11997) execute_rootwrap_daemon /opt/stack/old/neutron/neutron/agent/linux/utils.py:108}}

The correct node is the last one to send out GARPs, but we still fail to ping the address about 2 minutes later:
10:34:01.918 | PING 172.24.5.15 (172.24.5.15) 56(84) bytes of data.
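
A check like the one done by hand above (which hosts ran arping for the same floating IP) could be scripted. The following is a rough sketch under assumptions: the regular expression is written only against the line shape visible in this excerpt, and the helper names are hypothetical.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the excerpt in this comment:
# "<Mon> <day> <time> <hostname> neutron-l3-agent[<pid>]: ... 'arping', '-U', ... '<ip>']"
LINE_RE = re.compile(
    r"^\w{3} \d{2} [\d:.]+ (?P<host>\S+) neutron-l3-agent\[\d+\]: .*"
    r"'arping', '-(?P<mode>[UA])', .*'(?P<ip>\d+\.\d+\.\d+\.\d+)'\]"
)


def garp_senders(lines):
    """Map each announced IP to the set of hostnames that ran arping for it."""
    senders = defaultdict(set)
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            senders[match.group("ip")].add(match.group("host"))
    return senders


def conflicting_ips(lines):
    """Return only the IPs announced from more than one host."""
    return {ip: sorted(hosts)
            for ip, hosts in garp_senders(lines).items() if len(hosts) > 1}


def sample_line(timestamp, host, pid, mode, device):
    """Rebuild one DEBUG line in the shape shown in the excerpt (for the demo)."""
    cmd = ['ip', 'netns', 'exec', 'fip-634d8488-5d69-454c-b8bc-5c7f61f0779d',
           'arping', mode, '-I', device, '-c', '1', '-w', '1.5', '172.24.5.15']
    return (f"{timestamp} {host} neutron-l3-agent[{pid}]: DEBUG "
            f"neutron.agent.linux.utils [-] Running command (rootwrap daemon): {cmd}")


lines = [
    sample_line("Sep 03 10:26:08.092721",
                "ubuntu-xenial-2-node-inap-mtl01-10753438", 5210, "-U", "fg-7addba21-cf"),
    sample_line("Sep 03 10:31:51.838473",
                "ubuntu-xenial-2-node-inap-mtl01-10753438-867050", 11997, "-U", "fg-dc66f98d-28"),
]

if __name__ == "__main__":
    for ip, hosts in conflicting_ips(lines).items():
        print(ip, "announced by", hosts)
```

Fed the l3 agent logs from both nodes, a script along these lines would flag 172.24.5.15 as announced from two hosts, which matches the observation in this comment.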

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/499585
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=32700292615b528a9095b99f795a290033d982d5
Submitter: Jenkins
Branch: master

commit 32700292615b528a9095b99f795a290033d982d5
Author: Jakub Libosvar <email address hidden>
Date: Thu Aug 31 12:54:47 2017 +0200

    dvr: Don't raise KeyError in _get_floatingips_bound_to_host

    We thought _get_floatingips_bound_to_host is not needed but removing the
    method caused sending garps for fip that doesn't belong to node during
    the full-sync.

    This patch just replaces dict lookup with get() method, so fips are
    filtered based on presence on the host and if host is not set on fip, it
    won't raise a KeyError.

    Co-Authored-By: Swaminathan Vasudevan <email address hidden>

    Related-bug: #1712412
    Related-bug: #1713927

    Change-Id: I0fbc772d757fb13b788f9df8d6d7d28d288d054a

Changed in neutron:
assignee: nobody → Swaminathan Vasudevan (swaminathan-vasudevan)
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/499725
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=afd1995d91353874d04ac2d12be0162cea27c1d1
Submitter: Jenkins
Branch: master

commit afd1995d91353874d04ac2d12be0162cea27c1d1
Author: Swaminathan Vasudevan <email address hidden>
Date: Thu Aug 31 08:50:50 2017 -0700

    DVR: Fix agent to process only floatingips that have a host match

    The agent is not currently checking for the host bound
    before configuring the floatingip. That leads to
    floatingips being configured on multiple hosts.

    This is a partial fix on the agent side to prevent
    configuring a floatingip ip that is not bound to
    this host.

    Related-Bug: #1712412
    Related-Bug: #1713927

    Change-Id: I1bc8c42425f97234f56412a2f109a996d9f896de

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/502493

Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Brian Haley (brian-haley)
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/502493
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=2e6654067af4374443a502de8a42da361c76c6eb
Submitter: Jenkins
Branch: stable/pike

commit 2e6654067af4374443a502de8a42da361c76c6eb
Author: Swaminathan Vasudevan <email address hidden>
Date: Thu Aug 31 08:50:50 2017 -0700

    DVR: Fix agent to process only floatingips that have a host match

    The agent is not currently checking for the host bound
    before configuring the floatingip. That leads to
    floatingips being configured on multiple hosts.

    This is a partial fix on the agent side to prevent
    configuring a floatingip ip that is not bound to
    this host.

    Related-Bug: #1712412
    Related-Bug: #1713927

    Change-Id: I1bc8c42425f97234f56412a2f109a996d9f896de
    (cherry picked from commit afd1995d91353874d04ac2d12be0162cea27c1d1)

Changed in neutron:
assignee: Brian Haley (brian-haley) → Swaminathan Vasudevan (swaminathan-vasudevan)
Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Brian Haley (brian-haley)
Brian Haley (brian-haley) wrote :

Failure rate of this job is now under 20% (actually zero now for some reason), so I'm going to lower the priority. We have at least one more DVR patch to merge for the server-side code to completely fix the original problem.

Changed in neutron:
importance: Critical → High
Changed in neutron:
assignee: Brian Haley (brian-haley) → Swaminathan Vasudevan (swaminathan-vasudevan)
Ihar Hrachyshka (ihar-hrachyshka) wrote :

@Brian, what's the final fix that we need to land to claim this fixed?

Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Brian Haley (brian-haley)
Brian Haley (brian-haley) wrote :

We still need https://review.openstack.org/#/c/500143/ - don't know why it doesn't show up here as it does have a closes-bug for this bug in the commit message.

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/500143
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7bff99ac4a5ef0d4b2cc6f77f5679bb8e01f86d7
Submitter: Jenkins
Branch: master

commit 7bff99ac4a5ef0d4b2cc6f77f5679bb8e01f86d7
Author: Brian Haley <email address hidden>
Date: Fri Sep 1 13:52:51 2017 -0400

    DVR: Always initialize floating IP host

    With a recent change to the neutron server code, the server was
    processing floating IPs that were not bound to the respective
    agent during fullsync operation.

    Change to always initialize floating IP host info so callers
    can determine if info should be sent to an agent or not.

    Also changed the logic that decides when the server should
    send a floating IP to an agent to be easier to understand.

    Closes-bug: #1713927
    Change-Id: Ic916225e0a11c3fb8cd94437ca063e0d3295a569
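
The idea in this commit message (always give a floating IP a host value so that callers can decide whether to send it to an agent) can be sketched roughly as follows. This is a simplified illustration, not the merged server code: the function names, the `bound_host_by_fip_id` mapping, and the use of None for an unbound floating IP are all assumptions.

```python
# Simplified illustration; names and data layout are hypothetical and the
# real server logic handles more cases (for example dvr_snat agents).

def init_fip_host(floating_ips, bound_host_by_fip_id):
    """Ensure every floating IP dict has a 'host' key (None when unbound)."""
    for fip in floating_ips:
        fip["host"] = bound_host_by_fip_id.get(fip["id"])
    return floating_ips


def fips_for_agent(floating_ips, agent_host):
    """Select the floating IPs the server would send to the given agent."""
    return [fip for fip in floating_ips if fip["host"] == agent_host]


if __name__ == "__main__":
    fips = init_fip_host(
        [{"id": "fip-1"}, {"id": "fip-2"}, {"id": "fip-3"}],
        {"fip-1": "node-a", "fip-2": "node-b"},  # fip-3 has no bound host
    )
    print(fips_for_agent(fips, "node-a"))
```

Because 'host' is always present after initialization, the selection step can use a plain lookup and never has to guess whether a missing key means "unbound" or "not yet computed".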

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/509630

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Swaminathan Vasudevan (<email address hidden>) on branch: master
Review: https://review.openstack.org/499859
Reason: An alternate patch just merged. So I can abandon this patch.

tags: added: neutron-proactive-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/509630
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=6051e3d672a8401272dc345d11d10bc097f116ed
Submitter: Zuul
Branch: stable/pike

commit 6051e3d672a8401272dc345d11d10bc097f116ed
Author: Brian Haley <email address hidden>
Date: Fri Sep 1 13:52:51 2017 -0400

    DVR: Always initialize floating IP host

    With a recent change to the neutron server code, the server was
    processing floating IPs that were not bound to the respective
    agent during fullsync operation.

    Change to always initialize floating IP host info so callers
    can determine if info should be sent to an agent or not.

    Also changed the logic that decides when the server should
    send a floating IP to an agent to be easier to understand.

    Closes-bug: #1713927
    Change-Id: Ic916225e0a11c3fb8cd94437ca063e0d3295a569
    (cherry picked from commit 7bff99ac4a5ef0d4b2cc6f77f5679bb8e01f86d7)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.0.0.0b1

This issue was fixed in the openstack/neutron 12.0.0.0b1 development milestone.

tags: removed: neutron-proactive-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.2

This issue was fixed in the openstack/neutron 11.0.2 release.
