OVS tunneling between multiple neutron nodes misconfigured if amqp is restarted

Bug #1385234 reported by Giulio Fidente on 2014-10-24
This bug affects 2 people
Affects         Importance  Assigned to
neutron         Medium      Unassigned
oslo.messaging  Undecided   Unassigned
tripleo         High        Giulio Fidente

Bug Description

After a deployment with multiple controllers completes, inspecting the GRE tunnels that the neutron ovs-agent creates in OVS shows that some neutron nodes may be missing tunnels to each other or to the compute nodes.
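A quick way to spot the symptom is to compare the tunnels each node actually has against the set it should have. The following is a minimal sketch, assuming every agent is expected to hold a tunnel to every other agent (a full mesh); the function name, the IP addresses, and the `observed` mapping are hypothetical and would in practice be collected from the OVS tunnel ports on each node.

```python
from itertools import combinations


def missing_tunnels(node_ips, observed):
    """Return node pairs that lack a tunnel in at least one direction.

    node_ips: iterable of tunnel endpoint IPs, one per node.
    observed: dict mapping a node's IP to the set of remote IPs
              for which that node has a tunnel port.
    """
    missing = []
    for a, b in combinations(sorted(node_ips), 2):
        a_sees_b = b in observed.get(a, set())
        b_sees_a = a in observed.get(b, set())
        if not (a_sees_b and b_sees_a):
            missing.append((a, b))
    return missing


# Hypothetical data: two controllers and one compute node.
nodes = ["192.0.2.10", "192.0.2.11", "192.0.2.20"]
observed = {
    "192.0.2.10": {"192.0.2.11", "192.0.2.20"},
    "192.0.2.11": {"192.0.2.10"},  # no tunnel to the compute node
    "192.0.2.20": {"192.0.2.10"},
}

print(missing_tunnels(nodes, observed))
```

An empty result means the mesh is complete under the stated assumption; any pair in the result is a candidate for the problem described here.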

This happens because ovs-agents get disconnected from the rabbit cluster without noticing; as a result, they can neither receive updates from other nodes nor publish their own.
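A client can only notice this kind of silent disconnection if it checks liveness actively, for example by expecting periodic heartbeats from the peer. The sketch below illustrates the idea only; `HeartbeatMonitor` and its methods are hypothetical names and do not reflect the oslo.messaging or Kombu API.

```python
import time


class HeartbeatMonitor:
    """Track when the peer was last heard from (illustrative only)."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout      # seconds of silence tolerated
        self._clock = clock         # injectable for testing
        self._last_seen = clock()

    def beat(self):
        """Record that a heartbeat (or any traffic) just arrived."""
        self._last_seen = self._clock()

    def is_alive(self):
        """True while the peer has been heard from within the timeout."""
        return self._clock() - self._last_seen < self.timeout


# Demonstration with a fake clock so that no real waiting is needed.
now = [0.0]
monitor = HeartbeatMonitor(timeout=60, clock=lambda: now[0])

now[0] = 30.0
monitor.beat()                 # heartbeat received at t=30
now[0] = 80.0
print(monitor.is_alive())      # 50 s of silence, still within 60 s
now[0] = 95.0
print(monitor.is_alive())      # 65 s of silence, treat as disconnected
```

When `is_alive()` turns false, the client would tear down the connection and reconnect instead of waiting indefinitely on a dead socket.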

The disconnection may follow a reconfiguration of a rabbit node or the VIP moving to a different node when rabbit is load balanced, or it may even occur _during_ tripleo overcloud deployment due to rabbit cluster configuration changes.

This was observed using Kombu 3.0.33 as well as 2.5.

Using more aggressive (lower) kernel keepalive probe intervals seems to improve reliability, but a more appropriate fix seems to be heartbeat support in oslo.messaging.
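To make the keepalive trade-off concrete, the sketch below computes how long the kernel takes to declare an idle, dead peer gone, and shows the per-socket equivalents of the `net.ipv4.tcp_keepalive_time`, `net.ipv4.tcp_keepalive_intvl` and `net.ipv4.tcp_keepalive_probes` sysctls. The helper names are hypothetical, the `TCP_KEEP*` socket options are Linux-specific, and the "aggressive" values are illustrative rather than the ones used in any fix for this bug. Note that the kernel timings only apply to sockets that have `SO_KEEPALIVE` enabled.

```python
import socket


def keepalive_detection_seconds(idle, interval, probes):
    """Time for the kernel to give up on an idle, unresponsive peer."""
    return idle + interval * probes


def enable_keepalive(sock, idle, interval, probes):
    """Turn on TCP keepalive for one socket with explicit timings."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These constants exist on Linux; skip any that are unavailable.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # idle seconds before first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", probes),      # unanswered probes before giving up
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


# Linux defaults: 7200 s idle, 75 s interval, 9 probes.
print(keepalive_detection_seconds(7200, 75, 9))   # 7875 seconds
# Illustrative aggressive values (assumption, not taken from the fix).
print(keepalive_detection_seconds(5, 1, 5))       # 10 seconds

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock, idle=5, interval=1, probes=5)
sock.close()
```

With the defaults, a dead connection can go unnoticed for more than two hours, which is consistent with agents staying disconnected long after rabbit comes back.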

Tags: ovs
summary: - OVS tunneling between multiple neutron nodes breaks if amqp is restarted
+ OVS tunneling between multiple neutron nodes misconfigured if amqp is
+ restarted
description: updated
Giulio Fidente (gfidente) wrote :
Changed in oslo.messaging:
status: New → Incomplete
status: Incomplete → New
Changed in neutron:
importance: Undecided → High
Ben Nemec (bnemec) on 2014-10-30
Changed in tripleo:
status: New → Triaged
Romil Gupta (romilg) on 2014-11-16
Changed in neutron:
assignee: nobody → Romil Gupta (romilg)
tags: added: ovs
Changed in neutron:
importance: High → Medium
status: New → Confirmed
Mehdi Abaakouk (sileht) wrote :

Hi,

Can you test with oslo.messaging 1.5.0 to see whether the issue still exists? Timeout and reconnection handling have received many fixes, and I think this one should not occur anymore.

Cheers

Mehdi Abaakouk (sileht) on 2014-12-03
Changed in oslo.messaging:
status: New → Incomplete
Giulio Fidente (gfidente) wrote :

Unfortunately this seems to persist. It looks like patches for a fix are still under review, see https://review.openstack.org/#/c/126330/

Fix proposed to branch: master
Review: https://review.openstack.org/142524

Changed in tripleo:
assignee: nobody → Giulio Fidente (gfidente)
status: Triaged → In Progress

Reviewed: https://review.openstack.org/142524
Committed: https://git.openstack.org/cgit/openstack/tripleo-incubator/commit/?id=bdfde7186ba77c66bfa4b492fe7847d54d008a29
Submitter: Jenkins
Branch: master

commit bdfde7186ba77c66bfa4b492fe7847d54d008a29
Author: Giulio Fidente <email address hidden>
Date: Wed Dec 17 18:55:54 2014 +0100

    Add sysctl to all images

    This is needed to customize the default kernel keepalive timings, see
    https://bugs.launchpad.net/oslo.messaging/+bug/856764/comments/19

    Change-Id: I9a8ace4712c8a3b71f63b0bafe381e3bc6c707da
    Partial-Bug: 1301431
    Partial-Bug: 1385240
    Partial-Bug: 1385234

Reviewed: https://review.openstack.org/142527
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=2f7f4ed50c53e25041bf29d317c5f3358e46e706
Submitter: Jenkins
Branch: master

commit 2f7f4ed50c53e25041bf29d317c5f3358e46e706
Author: Giulio Fidente <email address hidden>
Date: Wed Dec 17 19:06:28 2014 +0100

    Set more aggressive keepalive timings

    We want to customize the default kernel keepalive timings and
    make them more aggressive to work around lack of heartbeat support
    in the Oslo RabbitMQ client, see:

    https://bugs.launchpad.net/oslo.messaging/+bug/856764/comments/19
    and
    https://bugs.launchpad.net/oslo.messaging/+bug/856764/comments/70

    Change-Id: Ieac08f595086acb8dd336e33efc705ee0b8a3a87
    Closes-Bug: 1301431
    Closes-Bug: 1385240
    Closes-Bug: 1385234

Changed in tripleo:
status: In Progress → Fix Committed
Derek Higgins (derekh) on 2014-12-24
Changed in tripleo:
status: Fix Committed → Fix Released
QingchuanHao (haoqingchuan-28) wrote :

oslo.messaging fixed this bug in 1.8.1, see:
https://bugs.launchpad.net/nova/+bug/856764

This bug is > 172 days without activity. We are unsetting assignee and milestone and setting status to Incomplete in order to allow its expiry in 60 days.

If the bug is still valid, then update the bug status.

Changed in neutron:
assignee: Romil Gupta (romilg) → nobody
status: Confirmed → Incomplete

This bug is > 180 days without activity. We are unsetting assignee and milestone and setting status to Incomplete in order to allow its expiry in 60 days.

If the bug is still valid, then update the bug status.

Ben Nemec (bnemec) on 2018-12-04
Changed in oslo.messaging:
status: Incomplete → Fix Released