ovs tunnel state not syncing (failure pinging overcloud instance)

Bug #1292105 reported by Ben Nemec on 2014-03-13
This bug affects 4 people
Affects     Importance   Assigned to
neutron     Critical     Unassigned
Icehouse    Undecided    Unassigned
tripleo     Critical     Robert Collins

Bug Description

I saw this in a recent CI overcloud run: http://logs.openstack.org/66/74866/8/check-tripleo/check-tripleo-overcloud-precise/aa490f1/console.html

2014-03-12 20:01:46.509 | Timing out after 300 seconds:
2014-03-12 20:01:46.509 | COMMAND=ping -c 1 192.0.2.46
2014-03-12 20:01:46.509 | OUTPUT=PING 192.0.2.46 (192.0.2.46) 56(84) bytes of data.
2014-03-12 20:01:46.509 | From 192.0.2.46 icmp_seq=1 Destination Host Unreachable

It appears as though everything ran fine up until it tried to ping the booted overcloud instance. I'm fairly certain it has nothing to do with my change, so I wanted to open a bug to track it in case anyone else runs into a similar problem.

Dan Prince (dan-prince) wrote :

This appears to be happening to all Fedora 20 jobs as of today.

Dan Prince (dan-prince) wrote :

Debugging things here:

https://review.openstack.org/#/c/100374/

I also ran the experimental jobs, which should tell us whether both the Fedora and Ubuntu overcloud jobs fail.

Changed in tripleo:
importance: Medium → Critical
Changed in tripleo:
assignee: nobody → Robert Collins (lifeless)
status: Triaged → In Progress
Marios Andreou (marios-b) wrote :

+1, yup. I came here via https://review.openstack.org/#/c/95396/ ("Check for relevant environment variables").

Ben Nemec (bnemec) wrote :

So I managed to reproduce this in my local devtest environment, and I built a second user image with stackuser so I could log in to it for debugging.

The user instance isn't getting an address from DHCP on default-net, so the ping to the floating IP naturally can't work. I can see DHCP requests on br-int of the compute node, but they don't seem to be getting any further than that (if they are, I haven't found the right system/namespace/port to listen on).

That's about as far as I've gotten debugging the problem. I'm going to try setting up a second devtest system (which hopefully won't hit this problem) so I'll have something working to compare with.

Dan Prince (dan-prince) wrote :

Trying a test revert of a suspicious neutron commit here:

https://review.openstack.org/#/c/101384/

Robert Collins (lifeless) wrote :

The GRE overlay network is not coming up properly.
On novacompute0:
$ sudo ovs-vsctl show
fff1865e-f1dd-4556-9b24-cf93a597cd17
    Bridge br-tun
        Port br-tun
            Interface br-tun
                type: internal
        Port patch-int
            Interface patch-int
                type: patch
                options: {peer=patch-tun}
    Bridge br-int
        fail_mode: secure
        Port br-int
            Interface br-int
                type: internal
        Port patch-tun
            Interface patch-tun
                type: patch
                options: {peer=patch-int}
    ovs_version: "2.1.2"

and on novacompute1:
# ovs-vsctl show
0ed498b8-983a-4da7-8cd7-e1eff3b00ded
    Bridge br-int
        fail_mode: secure
        Port br-int
            Interface br-int
                type: internal
        Port "qvod92b0ee8-ef"
            tag: 1
            Interface "qvod92b0ee8-ef"
        Port patch-tun
            Interface patch-tun
                type: patch
                options: {peer=patch-int}
    Bridge br-tun
        Port br-tun
            Interface br-tun
                type: internal
        Port "gre-c0000203"
            Interface "gre-c0000203"
                type: gre
                options: {df_default="true", in_key=flow, local_ip="192.0.2.6", out_key=flow, remote_ip="192.0.2.3"}
        Port patch-int
            Interface patch-int
                type: patch
                options: {peer=patch-tun}
    ovs_version: "2.1.2"

What it should have on br-tun is this (from the controller):

$ sudo ovs-vsctl show
6a11686d-c375-4be8-b346-725f8d890540
    Bridge br-ex
        Port "qg-c39ccf5a-0a"
            Interface "qg-c39ccf5a-0a"
                type: internal
        Port br-ex
            Interface br-ex
                type: internal
        Port "eth0"
            Interface "eth0"
    Bridge br-tun
        Port br-tun
            Interface br-tun
                type: internal
        Port "gre-c0000203"
            Interface "gre-c0000203"
                type: gre
                options: {df_default="true", in_key=flow, local_ip="192.0.2.4", out_key=flow, remote_ip="192.0.2.3"}
        Port patch-int
            Interface patch-int
                type: patch
                options: {peer=patch-tun}
        Port "gre-c0000206"
            Interface "gre-c0000206"
                type: gre
                options: {df_default="true", in_key=flow, local_ip="192.0.2.4", out_key=flow, remote_ip="192.0.2.6"}
    Bridge br-int
        fail_mode: secure
        Port patch-tun
            Interface patch-tun
                type: patch
                options: {peer=patch-int}
        Port "tap357190d8-c9"
            tag: 1
            Interface "tap357190d8-c9"
                type: internal
        Port "qr-b474f6f4-e8"
            tag: 1
            Interface "qr-b474f6f4-e8"
                type: internal
        Port br-int
            Interface br-int
                type: internal
    ovs_version: "2.1.2"

Note the mesh of point-to-point links.
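
The listings above can be checked mechanically. What follows is a minimal Python sketch, not part of neutron or TripleO: it pulls the remote_ip of each tunnel interface out of `ovs-vsctl show` text and compares the result with the full mesh expected for a set of tunnel endpoints. All function names are hypothetical. The assumption that novacompute0 is 192.0.2.3 is inferred from the listings rather than stated in them. The sketch also decodes the port names, which in these listings look like "gre-" followed by the remote IP in hex.

```python
import ipaddress
import re

# Matches the remote_ip option of a tunnel interface in `ovs-vsctl show` output.
_REMOTE_IP = re.compile(r'remote_ip="([0-9.]+)"')


def tunnel_remote_ips(ovs_vsctl_show_text):
    """Return the set of remote_ip values found in `ovs-vsctl show` text."""
    return set(_REMOTE_IP.findall(ovs_vsctl_show_text))


def port_name_to_ip(port_name):
    """Decode a name such as 'gre-c0000203' into '192.0.2.3'.

    Assumes the 'gre-<8 hex digits>' pattern visible in the listings.
    """
    prefix, _, hex_part = port_name.partition("-")
    if prefix != "gre" or len(hex_part) != 8:
        raise ValueError("unexpected port name: %r" % port_name)
    return str(ipaddress.IPv4Address(int(hex_part, 16)))


def missing_tunnels(local_ip, all_ips, observed_remote_ips):
    """Return the peers a full mesh requires that are not configured."""
    expected = set(all_ips) - {local_ip}
    return expected - set(observed_remote_ips)


# Tunnel endpoints seen in the listings (novacompute0 = 192.0.2.3 is inferred).
NODES = ["192.0.2.3", "192.0.2.4", "192.0.2.6"]

# The only tunnel port reported on novacompute1 (local_ip 192.0.2.6).
NOVACOMPUTE1 = '''
        Port "gre-c0000203"
            Interface "gre-c0000203"
                type: gre
                options: {df_default="true", in_key=flow, local_ip="192.0.2.6", out_key=flow, remote_ip="192.0.2.3"}
'''

print(port_name_to_ip("gre-c0000206"))  # 192.0.2.6
print(sorted(missing_tunnels("192.0.2.6", NODES, tunnel_remote_ips(NOVACOMPUTE1))))
# ['192.0.2.4']
```

Run against novacompute1, the check reports that the tunnel to the controller (192.0.2.4) is missing, which matches what the listings show by eye.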

Robert Collins (lifeless) wrote :

We should start capturing:
 - console-log of all OC instances
 - ovs-vsctl show on all seed/UC/OC machines in the state dump. It's there, but we need to make that more obvious: it's stuck in the tar at the moment.

Robert Collins (lifeless) wrote :

So the faulty bridge is br-tun, not br-int, but it's possibly something more subtle, like a dropped RPC packet, since the tunnel agent ports are maintained in a delta fashion in fdb_add.

Robert Collins (lifeless) wrote :

The run at http://logs.openstack.org/74/100374/2/experimental-tripleo/check-tripleo-overcloud-precise/0180696/logs/ (with neutron in debug mode) also shows the bridge issue. I'm looking in the agent logs for clues about the failure to sync.

summary: - CI failed pinging overcloud instance
+ ovs tunnel state not syncing (failure pinging overcloud instance)
Robert Collins (lifeless) wrote :

This run that Ben did (http://logs.openstack.org/74/100374/2/experimental-tripleo/check-tripleo-overcloud-precise/0180696) has debug neutron logs, and you can see that each agent receives only one tunnel update: the initial one, rather than the initial one followed by another as each new agent joins.

Robert Collins (lifeless) wrote :

FWIW the agents 'appear' fine.
+--------------------------------------+--------------------+-------------------------------------+-------+----------------+
| id | agent_type | host | alive | admin_state_up |
+--------------------------------------+--------------------+-------------------------------------+-------+----------------+
| 276544ef-f8ac-482a-a019-664954386cfd | Metadata agent | overcloud-controller0-weema5lgmm2g | :-) | True |
| a7c30165-659c-40f2-a6d0-6df19a2202b5 | L3 agent | overcloud-controller0-weema5lgmm2g | :-) | True |
| b719b7ea-df66-4af2-85e4-9aae35da9c3f | Open vSwitch agent | overcloud-novacompute1-65l24kligq75 | :-) | True |
| bd744905-1d5b-48c5-b202-edc2a261d495 | Open vSwitch agent | overcloud-controller0-weema5lgmm2g | :-) | True |
| cd834c39-f822-4cee-84c4-ae68e89efb7d | DHCP agent | overcloud-controller0-weema5lgmm2g | :-) | True |
| d46a0049-e76f-4400-8ac5-6791f72d0198 | Open vSwitch agent | overcloud-novacompute0-uj2tl4avvl7q | :-) | True |
+--------------------------------------+--------------------+-------------------------------------+-------+----------------+

Fix proposed to branch: master
Review: https://review.openstack.org/101395

Changed in neutron:
assignee: nobody → Robert Collins (lifeless)
status: New → In Progress

Fix proposed to branch: master
Review: https://review.openstack.org/101447

Changed in neutron:
assignee: Robert Collins (lifeless) → Kevin Benton (kevinbenton)

Reviewed: https://review.openstack.org/101395
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ccaad827b2bdcdcf4b538c5d5367f14cfc22c08f
Submitter: Jenkins
Branch: master

commit ccaad827b2bdcdcf4b538c5d5367f14cfc22c08f
Author: Robert Collins <email address hidden>
Date: Fri Jun 20 14:59:49 2014 +1200

    Revert "ovs-agent: Ensure integration bridge is created"

    This reverts commit e5cdad90f97d3a54a493eca19e7a3ff643426de1.

    TripleO's multi-node testing shows that this patch caused a failure to receive
    tunnel updates, leading to the first node up to have no agent ports, the second
    to have one agent port, the third to have 2 agent ports, etc. Needless to say,
    this doesn't work all that well :)

    Change-Id: Ie90dd4d113a404948dd5debad48065b7db48faa5
    Closes-Bug: #1292105
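
The symptom described in this commit message can be reproduced with a toy model. The sketch below is purely illustrative Python, not neutron code: each agent learns about the peers that existed when it joined, and the deliver_updates flag (a made-up name) controls whether later joins are announced to agents that are already up. The join order used here is inferred from the ovs-vsctl listings earlier in this report.

```python
def simulate_joins(ips, deliver_updates):
    """Return {ip: set of tunnel peers} after the nodes join in order."""
    peers = {}
    for new_ip in ips:
        # Initial sync: the new agent learns about every agent already up.
        peers[new_ip] = set(peers)
        if deliver_updates:
            # Follow-up update: agents already up learn about the new one.
            for ip in peers:
                if ip != new_ip:
                    peers[ip].add(new_ip)
    return peers


# Inferred join order: novacompute0, novacompute1, controller.
order = ["192.0.2.3", "192.0.2.6", "192.0.2.4"]

broken = simulate_joins(order, deliver_updates=False)
print([len(broken[ip]) for ip in order])   # [0, 1, 2]

healthy = simulate_joins(order, deliver_updates=True)
print([len(healthy[ip]) for ip in order])  # [2, 2, 2]
```

With updates suppressed, the first node ends up with no tunnel ports, the second with one and the third with two, which is the pattern reported above.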

Changed in neutron:
status: In Progress → Fix Committed
Mark McLoughlin (markmc) on 2014-06-20
Changed in neutron:
importance: Undecided → Critical

Change abandoned by Ben Nemec (<email address hidden>) on branch: master
Review: https://review.openstack.org/100374
Reason: Retiring this again since CI is happy now. We do still need a way to configure log levels properly though (I see there are a couple of bugs open already relating to that so I'm not opening a new one).

Dan Prince (dan-prince) on 2014-06-24
Changed in tripleo:
status: In Progress → Fix Committed
Jay Dobies (jdob) on 2014-06-24
Changed in tripleo:
status: Fix Committed → Fix Released

Reviewed: https://review.openstack.org/101447
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3445715f98e3aaa967f1661e403e45ad85392909
Submitter: Jenkins
Branch: master

commit 3445715f98e3aaa967f1661e403e45ad85392909
Author: Kevin Benton <email address hidden>
Date: Wed Jun 18 22:34:00 2014 -0700

    OVS agent: Correct bridge setup ordering

    This patch fixes three issues that combined to break tunnel networks.

    The first issue was that the failure mode was being set on the
    integration bridge after the canary flow was installed. This
    immediately wiped out the canary flow.

    The second problem was that the main loop would sync tunnel networks
    right before resetting the tunnel bridge when the canary flow was
    missing. This meant that it would sync all of the tunnels via RPC and
    then wipe them out.

    The final issue was that after resetting the tunnel bridge in the
    main loop, the tunnel_sync variable was not set to True, so the
    tunnels would never be resynchronized.

    This patch addresses the three issues in the following ways:
    1. Set the failure mode on the bridge before the canary flow is
    installed.
    2. Run the OVS restart logic before the tunnel synchronization.
    3. If the restart logic is triggered, set tunnel_sync to True to
    trigger synchronization.

    Closes-Bug: #1292105
    Change-Id: I6381e3fee49910127c420dd2e3205c64cdb9e185
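
The ordering problem described in this commit message can be illustrated with a small model. This is a hedged sketch in Python, not the neutron implementation: ToyAgent and its attributes are hypothetical names, and the missing canary flow (the first issue) is modelled simply by starting with canary_present set to False.

```python
class ToyAgent:
    """Tiny stand-in for an agent main loop; not the neutron code."""

    def __init__(self, server_tunnels):
        self.server_tunnels = set(server_tunnels)  # what an RPC sync would return
        self.bridge_tunnels = set()                # ports on the tunnel bridge
        self.tunnel_sync = True                    # sync requested at startup
        self.canary_present = False                # canary flow was wiped out

    def _sync_tunnels(self):
        self.bridge_tunnels = set(self.server_tunnels)
        self.tunnel_sync = False

    def _reset_tunnel_bridge(self):
        self.bridge_tunnels.clear()
        self.canary_present = True  # canary flow is reinstalled

    def iteration_buggy(self):
        # Sync first, then notice the missing canary and wipe the bridge.
        # tunnel_sync stays False, so the tunnels never come back.
        if self.tunnel_sync:
            self._sync_tunnels()
        if not self.canary_present:
            self._reset_tunnel_bridge()

    def iteration_fixed(self):
        # Handle the restart first and request a resync afterwards.
        if not self.canary_present:
            self._reset_tunnel_bridge()
            self.tunnel_sync = True
        if self.tunnel_sync:
            self._sync_tunnels()


buggy = ToyAgent({"192.0.2.3", "192.0.2.6"})
fixed = ToyAgent({"192.0.2.3", "192.0.2.6"})
for _ in range(3):
    buggy.iteration_buggy()
    fixed.iteration_fixed()

print(sorted(buggy.bridge_tunnels))  # []
print(sorted(fixed.bridge_tunnels))  # ['192.0.2.3', '192.0.2.6']
```

In the buggy ordering the tunnels are created and then wiped in the same iteration, and nothing ever asks for them again. The fixed ordering corresponds to points 2 and 3 of the list above.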

Change abandoned by lifeless (<email address hidden>) on branch: master
Review: https://review.openstack.org/101040

Kyle Mestery (mestery) on 2014-07-02
Changed in neutron:
milestone: none → juno-2

Change abandoned by Rossella Sblendido (<email address hidden>) on branch: stable/icehouse
Review: https://review.openstack.org/104517

Baohua Yang (yangbaohua) on 2014-07-23
Changed in neutron:
assignee: Kevin Benton (kevinbenton) → Baohua Yang (yangbaohua)
assignee: Baohua Yang (yangbaohua) → nobody
Changed in neutron:
status: Fix Committed → Fix Released

Reviewed: https://review.openstack.org/104517
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ea5ecf9b535053fe3bd32fa87409bb606a655555
Submitter: Jenkins
Branch: stable/icehouse

commit ea5ecf9b535053fe3bd32fa87409bb606a655555
Author: Kevin Benton <email address hidden>
Date: Wed Jun 18 22:34:00 2014 -0700

    OVS agent: Correct bridge setup ordering

    This patch fixes three issues that combined to break tunnel networks.

    The first issue was that the failure mode was being set on the
    integration bridge after the canary flow was installed. This
    immediately wiped out the canary flow.

    The second problem was that the main loop would sync tunnel networks
    right before resetting the tunnel bridge when the canary flow was
    missing. This meant that it would sync all of the tunnels via RPC and
    then wipe them out.

    The final issue was that after resetting the tunnel bridge in the
    main loop, the tunnel_sync variable was not set to True, so the
    tunnels would never be resynchronized.

    This patch addresses the three issues in the following ways:
    1. Set the failure mode on the bridge before the canary flow is
    installed.
    2. Run the OVS restart logic before the tunnel synchronization.
    3. If the restart logic is triggered, set tunnel_sync to True to
    trigger synchronization.

    Conflicts:
     neutron/tests/unit/openvswitch/test_ovs_tunnel.py

    Closes-Bug: #1292105
    Change-Id: I6381e3fee49910127c420dd2e3205c64cdb9e185
    (cherry-picked from commit 0e5c50ecb2e2317ca6908a9bcec3bcb8ca6d3af8)

tags: added: in-stable-icehouse

Change abandoned by mark mcclain (<email address hidden>) on branch: stable/icehouse
Review: https://review.openstack.org/101638
Reason: Abandoning since https://review.openstack.org/#/c/104517/ already merged.

Thierry Carrez (ttx) on 2014-10-16
Changed in neutron:
milestone: juno-2 → 2014.2