OVS tunneling between multiple neutron nodes misconfigured if amqp is restarted

Bug #1385234 reported by Giulio Fidente
This bug affects 2 people
Affects         Status        Importance  Assigned to     Milestone
neutron         Won't Fix     Medium      Unassigned
oslo.messaging  Fix Released  Undecided   Unassigned
tripleo         Fix Released  High        Giulio Fidente

Bug Description

After a deployment with multiple controllers completes, inspecting the GRE tunnels that the neutron ovs-agent creates in OVS shows that some neutron nodes may be missing tunnels to each other or to the compute nodes.

This happens because the ovs-agents get disconnected from the rabbit cluster without noticing; as a result, they can neither receive updates from other nodes nor publish their own.

The disconnection may follow a reconfiguration of a rabbit node or the VIP moving to a different node when rabbit is load balanced, or it may even occur _during_ tripleo overcloud deployment because of rabbit cluster configuration changes.

This was observed using Kombu 3.0.33 as well as 2.5.

Using more aggressive (lower) kernel keepalive probe intervals seems to improve reliability, but a more appropriate fix seems to be heartbeat support in oslo.messaging.
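To illustrate why lower keepalive values help, here is a rough sketch of how long the kernel takes to declare an idle, dead TCP peer. The function name is hypothetical. The first set of values is the stock Linux default for net.ipv4.tcp_keepalive_time, tcp_keepalive_intvl and tcp_keepalive_probes; the "aggressive" set is only an assumed example, not necessarily what any patch on this bug uses.

```python
def dead_peer_detection_seconds(keepalive_time, keepalive_intvl, keepalive_probes):
    """Approximate worst-case seconds before the kernel drops an idle, dead peer.

    The kernel waits keepalive_time seconds of idleness, then sends up to
    keepalive_probes probes spaced keepalive_intvl seconds apart.
    """
    return keepalive_time + keepalive_intvl * keepalive_probes


# Stock Linux defaults: 7200 s idle, 75 s between probes, 9 probes.
linux_defaults = dead_peer_detection_seconds(7200, 75, 9)

# Assumed "aggressive" values, for illustration only.
aggressive = dead_peer_detection_seconds(5, 1, 5)

print(linux_defaults)  # 7875
print(aggressive)      # 10
```

With the defaults, a silently dropped rabbit connection can go unnoticed for more than two hours, which is consistent with agents missing tunnel updates; keepalive only applies to sockets that have SO_KEEPALIVE enabled.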

Tags: ovs
summary: - OVS tunneling between multiple neutron nodes breaks if amqp is restarted
+ OVS tunneling between multiple neutron nodes misconfigured if amqp is
+ restarted
description: updated
Giulio Fidente (gfidente) wrote :
Changed in oslo.messaging:
status: New → Incomplete
status: Incomplete → New
Changed in neutron:
importance: Undecided → High
Ben Nemec (bnemec)
Changed in tripleo:
status: New → Triaged
Romil Gupta (romilg)
Changed in neutron:
assignee: nobody → Romil Gupta (romilg)
tags: added: ovs
Changed in neutron:
importance: High → Medium
status: New → Confirmed
Mehdi Abaakouk (sileht) wrote :

Hi,

Can you test whether the issue still exists with oslo.messaging 1.5.0? Timeout and reconnection handling have received many fixes, and I think this one should not occur anymore.

Cheers

Mehdi Abaakouk (sileht)
Changed in oslo.messaging:
status: New → Incomplete
Giulio Fidente (gfidente) wrote :

Unfortunately this seems to persist. It looks like there are patches for a fix still in review; see https://review.openstack.org/#/c/126330/

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-incubator (master)

Fix proposed to branch: master
Review: https://review.openstack.org/142524

Changed in tripleo:
assignee: nobody → Giulio Fidente (gfidente)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/142527

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-incubator (master)

Reviewed: https://review.openstack.org/142524
Committed: https://git.openstack.org/cgit/openstack/tripleo-incubator/commit/?id=bdfde7186ba77c66bfa4b492fe7847d54d008a29
Submitter: Jenkins
Branch: master

commit bdfde7186ba77c66bfa4b492fe7847d54d008a29
Author: Giulio Fidente <email address hidden>
Date: Wed Dec 17 18:55:54 2014 +0100

    Add sysctl to all images

    This is needed to customize the default kernel keepalive timings, see
    https://bugs.launchpad.net/oslo.messaging/+bug/856764/comments/19

    Change-Id: I9a8ace4712c8a3b71f63b0bafe381e3bc6c707da
    Partial-Bug: 1301431
    Partial-Bug: 1385240
    Partial-Bug: 1385234

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/142527
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=2f7f4ed50c53e25041bf29d317c5f3358e46e706
Submitter: Jenkins
Branch: master

commit 2f7f4ed50c53e25041bf29d317c5f3358e46e706
Author: Giulio Fidente <email address hidden>
Date: Wed Dec 17 19:06:28 2014 +0100

    Set more aggressive keepalive timings

    We want to customize the default kernel keepalive timings and
    make them more aggressive to workaround lack of hearbeat support
    in the Oslo RabbitMQ client, see:

    https://bugs.launchpad.net/oslo.messaging/+bug/856764/comments/19
    and
    https://bugs.launchpad.net/oslo.messaging/+bug/856764/comments/70

    Change-Id: Ieac08f595086acb8dd336e33efc705ee0b8a3a87
    Closes-Bug: 1301431
    Closes-Bug: 1385240
    Closes-Bug: 1385234
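The commit above changes keepalive timings kernel-wide through sysctl. As a point of comparison only, and not what this change does, the same timings can be requested per socket on Linux. The sketch below uses the standard Python socket module; the helper name and the default values are assumptions chosen for illustration, and the TCP_KEEP* constants are Linux-specific.

```python
import socket


def enable_keepalive(sock, idle=5, interval=1, count=5):
    """Turn on TCP keepalive for one socket and override the kernel defaults.

    idle:     seconds of inactivity before the first probe (TCP_KEEPIDLE)
    interval: seconds between probes (TCP_KEEPINTVL)
    count:    unanswered probes before the connection is dropped (TCP_KEEPCNT)
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock


# The options can be set before the socket is connected.
sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE))
sock.close()
```

The sysctl approach has the advantage of needing no client code changes, but it only affects sockets on which the application has already enabled SO_KEEPALIVE.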

Changed in tripleo:
status: In Progress → Fix Committed
Derek Higgins (derekh)
Changed in tripleo:
status: Fix Committed → Fix Released
QingchuanHao (haoqingchuan-28) wrote :

oslo.messaging fixed this bug in 1.8.1; see:
https://bugs.launchpad.net/nova/+bug/856764

Armando Migliaccio (armando-migliaccio) wrote :

This bug is > 172 days without activity. We are unsetting assignee and milestone and setting status to Incomplete in order to allow its expiry in 60 days.

If the bug is still valid, then update the bug status.

Changed in neutron:
assignee: Romil Gupta (romilg) → nobody
status: Confirmed → Incomplete
Armando Migliaccio (armando-migliaccio) wrote :

This bug is > 180 days without activity. We are unsetting assignee and milestone and setting status to Incomplete in order to allow its expiry in 60 days.

If the bug is still valid, then update the bug status.

Ben Nemec (bnemec)
Changed in oslo.messaging:
status: Incomplete → Fix Released
Changed in neutron:
status: Incomplete → Won't Fix