neutron-openvswitch-agent does not recreate flows after ovsdb-server restarts

Bug #1290486 reported by Jon-Paul Sullivan on 2014-03-10
This bug affects 13 people
Affects             Importance  Assigned to
neutron             High        Eugene Nikanorov
neutron (Icehouse)  High        Kyle Mestery
tripleo             Critical    James Polley

Bug Description

The DHCP requests were not being responded to after they were seen on the undercloud network interface. The neutron services were restarted in an attempt to ensure they had the newest configuration and knew they were supposed to respond to the requests.

Rather than using the heat stack create (called in devtest_overcloud.sh) to test, it was simple to use the following to directly boot a baremetal node.

    nova boot --flavor $(nova flavor-list | grep "|[[:space:]]*baremetal[[:space:]]*|" | awk '{print $2}') \
          --image $(nova image-list | grep "|[[:space:]]*overcloud-control[[:space:]]*|" | awk '{print $2}') \
          bm-test1

Whilst the baremetal node was attempting to PXE boot, a restart of the neutron services was performed. This allowed the baremetal node to boot.

It has been observed that a neutron restart was needed for each subsequent reboot of the baremetal nodes to succeed.

Robert Collins (lifeless) wrote :

This suggests that neutron events to the agent were not propagating properly - almost certainly a neutron bug. Can you reproduce this?

Changed in tripleo:
status: New → Triaged
importance: Undecided → Critical
summary: - Baremetal: DHCP requests not being responded to
+ dhcp agent not serving responses

Oh, can you check syslog and confirm that there are no errors there? This might be a duplicate of bug 1271344

There are no occurrences of "configured address" in any of the syslog files for the undercloud. Given that, I do not believe it is a duplicate of bug 1271344.

root@undercloud-undercloud-q5d4s2sbkzx6:/var/log# for i in syslog syslog.1 syslog.2.gz syslog.3.gz syslog.4.gz syslog.5.gz syslog.6.gz syslog.7.gz ; do (zcat $i || cat $i) | grep -e "configured address " ; done

gzip: syslog: not in gzip format

gzip: syslog.1: not in gzip format
root@undercloud-undercloud-q5d4s2sbkzx6:/var/log#

James Polley (tchaypo) on 2014-03-21
Changed in tripleo:
assignee: nobody → James Polley (tchaypo)
James Polley (tchaypo) wrote :

I believe I was able to reproduce this on my setup. If I observed what I think I observed, it's definitely not a duplicate of bug 1271344 - that bug talks about re-assigning an IP from one VM to a new VM, but in my case even the existing VMs were not getting a response when they rebooted.

After restarting neutron-dhcp-agent, the VMs started getting responses and came back on the network.

I'm not sure what triggered the error state in my case - I left my setup for ~18 hours, came back, and it was in the error state. Next step is to dig into the logs to see if I can see likely problems, and see if I'm able to trigger the error condition.

James Polley (tchaypo) on 2014-03-21
Changed in tripleo:
status: Triaged → In Progress
James Polley (tchaypo) wrote :

After leaving my environment alone for a few days, I've got the bug again.

tcpdump running on the br-ctlplane interface does show the dhcp requests coming in; but a tcpdump running inside ip netns on the tap interface doesn't see them.

Shortly after restarting neutron-openvswitch-agent, traffic started flowing again. neutron-server and neutron-dhcp-agent had also been restarted, but no change was observed in ~15 seconds after restarting each of them.

Robert Collins (lifeless) wrote :

I believe this is the race condition Clint identified on the weekend: we're trying to do things before ovs-db is up and running and neutron-openvswitch-agent is not handling ovs-db being down properly - it should back off and retry, or alternatively, do a full sync once the db is available.
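The back-off-and-retry behaviour Robert suggests could be sketched roughly as follows. This is a hypothetical helper, not neutron code; `connect` stands in for whatever call opens the ovsdb connection, and the delay cap and deadline are illustrative values:

```python
import time


def retry_with_backoff(connect, max_delay=30, deadline=300, sleep=time.sleep):
    """Keep retrying `connect` with exponential backoff until it succeeds
    or roughly `deadline` seconds of cumulative waiting have elapsed."""
    delay, waited = 1, 0
    while True:
        try:
            return connect()
        except ConnectionError:
            if waited >= deadline:
                raise  # give up; let the caller decide what to do
            sleep(delay)
            waited += delay
            delay = min(delay * 2, max_delay)  # back off, capped
```

On reconnect the agent would then need to do a full sync, since any events that occurred while the connection was down have been missed.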

James Polley (tchaypo) wrote :

I've been able to track down what I believe is the root problem.

If ovsdb-server (run by the openvswitch-switch service) restarts, the neutron-openvswitch-agent loses its connection and needs to be manually restarted in order to reconnect.

Causes of this bug I've seen have included ovsdb-server segfaulting, being kill -9ed, and being gracefully restarted with "service openvswitch-switch restart".

The errors recorded in /var/log/upstart/neutron-openvswitch-agent.log vary depending on why ovsdb-server went away:

2014-03-23 20:10:01.883 20375 ERROR neutron.agent.linux.ovsdb_monitor [req-a776b981-b86b-4437-ab65-0c6be6070094 None] Error received from ovsdb monitor: ovsdb-client: unix:/var/run/openvswitch/db.sock: receive failed (End of file)
2014-03-24 01:40:17.617 20375 ERROR neutron.agent.linux.ovsdb_monitor [req-a776b981-b86b-4437-ab65-0c6be6070094 None] Error received from ovsdb monitor: 2014-03-24T01:40:17Z|00001|fatal_signal|WARN|terminating with signal 15 (Terminated)
2014-03-24 04:08:59.718 8455 ERROR neutron.agent.linux.ovsdb_monitor [req-d2c2cbd5-a77a-4455-84ac-0a8ec69b41e8 None] Error received from ovsdb monitor: ovsdb-client: unix:/var/run/openvswitch/db.sock: receive failed (End of file)
2014-03-24 22:44:22.174 8455 ERROR neutron.agent.linux.ovsdb_monitor [req-d2c2cbd5-a77a-4455-84ac-0a8ec69b41e8 None] Error received from ovsdb monitor: ovsdb-client: unix:/var/run/openvswitch/db.sock: receive failed (End of file)
2014-03-24 22:44:52.220 8455 ERROR neutron.agent.linux.ovsdb_monitor [req-d2c2cbd5-a77a-4455-84ac-0a8ec69b41e8 None] Error received from ovsdb monitor: ovsdb-client: failed to connect to "unix:/var/run/openvswitch/db.sock" (Connection refused)
2014-03-24 22:45:22.266 8455 ERROR neutron.agent.linux.ovsdb_monitor [req-d2c2cbd5-a77a-4455-84ac-0a8ec69b41e8 None] Error received from ovsdb monitor: ovsdb-client: failed to connect to "unix:/var/run/openvswitch/db.sock" (Connection refused)
2014-03-24 22:45:52.310 8455 ERROR neutron.agent.linux.ovsdb_monitor [req-d2c2cbd5-a77a-4455-84ac-0a8ec69b41e8 None] Error received from ovsdb monitor: ovsdb-client: failed to connect to "unix:/var/run/openvswitch/db.sock" (Connection refused)
2014-03-24 22:46:22.355 8455 ERROR neutron.agent.linux.ovsdb_monitor [req-d2c2cbd5-a77a-4455-84ac-0a8ec69b41e8 None] Error received from ovsdb monitor: ovsdb-client: failed to connect to "unix:/var/run/openvswitch/db.sock" (Connection refused)
2014-03-24 22:49:27.179 8455 ERROR neutron.agent.linux.ovsdb_monitor [req-d2c2cbd5-a77a-4455-84ac-0a8ec69b41e8 None] Error received from ovsdb monitor: 2014-03-24T22:49:27Z|00001|fatal_signal|WARN|terminating with signal 15 (Terminated)
2014-03-24 22:55:45.441 16033 ERROR neutron.agent.linux.ovsdb_monitor [req-5fe682ce-138e-46d6-aa7e-f0d43ab576ee None] Error received from ovsdb monitor: ovsdb-client: unix:/var/run/openvswitch/db.sock: receive failed (End of file)

In all cases, the result is the same: until neutron-openvswitch-agent is restarted, no traffic is passed onto the tapXXXXX interface inside the dhcp-XXXXX netns.

James Polley (tchaypo) on 2014-03-26
summary: - dhcp agent not serving responses
+ neutron-openvswitch-agent must be restarted after ovsdb-server failure
+ in order to pass traffic
Kyle Mestery (mestery) on 2014-03-31
Changed in neutron:
assignee: nobody → Kyle Mestery (mestery)
Kyle Mestery (mestery) on 2014-03-31
Changed in neutron:
importance: Undecided → High
tags: added: icehouse-rc-potential

After discussing with @marun in-channel, we think this could be due to the polling minimization monitor work done in Neutron. That is the only part of the code with a persistent connection to OVSDB. @marun indicated this was easy enough to verify: Look at the process list for a subprocess of the agent that calls ovsdb-client, and make sure it is killed/spawned again after OVSDB is restarted.

I'll try this myself tonight and see what happens locally. Would be good if you folks could try this as well! The default timeout for the monitor is 30 seconds BTW.

James Polley (tchaypo) wrote :

Just after a ``service openvswitch-switch`` restart:

root@undercloud-undercloud-ojtyffepm45g:~# service openvswitch-switch restart
openvswitch-switch stop/waiting
openvswitch-switch start/running
root@undercloud-undercloud-ojtyffepm45g:~# ps -ef f | grep -C3 [o]vsdb
root 8426 1 0 Mar24 ? S 0:12 tcpdump -ni tapbcf76f51-14
neutron 25679 1 0 01:30 ? Ss 0:00 /opt/stack/venvs/neutron/bin/python /opt/stack/venvs/neutron/bin/neutron-openvswitch-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini --config-dir /etc/neutron
root 25826 25679 0 01:30 ? Z 0:00 \_ [sudo] <defunct>
root 26028 1 0 01:32 ? S<s 0:00 ovsdb-server: monitoring pid 26029 (healthy)
root 26029 26028 0 01:32 ? S< 0:00 \_ ovsdb-server /etc/openvswitch/conf.db -vconsole:emer -vsyslog:err -vfile:info --remote=punix:/var/run/openvswitch/db.sock --private-key=db:Open_vSwitch,SSL,private_key --certificate=db:Open_vSwitch,SSL,certificate --bootstrap-ca-cert=db:Open_vSwitch,SSL,ca_cert --no-chdir --log-file=/var/log/openvswitch/ovsdb-server.log --pidfile=/var/run/openvswitch/ovsdb-server.pid --detach --monitor
root 26037 1 0 01:32 ? S<s 0:00 ovs-vswitchd: monitoring pid 26038 (healthy)
root 26038 26037 0 01:32 ? S<L 0:00 \_ ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach --monitor
root 26039 26038 0 01:32 ? S< 0:00 \_ ovs-vswitchd: worker process for pid 26038

Just over 30s later:

root@undercloud-undercloud-ojtyffepm45g:~# ps -ef f | grep -C3 [o]vsdb
nobody 8414 1 0 Mar24 ? S 0:00 dnsmasq --no-hosts --no-resolv --strict-order --bind-interfaces --interface=tapbcf76f51-14 --except-interface=lo --pid-file=/var/run/neutron/dhcp/44ab7a66-fc35-4b49-9a15-9dc2227ee414/pid --dhcp-hostsfile=/var/run/neutron/dhcp/44ab7a66-fc35-4b49-9a15-9dc2227ee414/host --dhcp-optsfile=/var/run/neutron/dhcp/44ab7a66-fc35-4b49-9a15-9dc2227ee414/opts --leasefile-ro --dhcp-range=set:tag0,192.0.2.0,static,86400s --dhcp-lease-max=256 --conf-file= --domain=openstacklocal
root 8426 1 0 Mar24 ? S 0:12 tcpdump -ni tapbcf76f51-14
neutron...


tags: added: icehouse-backport-potential ovs
removed: icehouse-rc-potential
Endre Karlson (endre-karlson) wrote :

I can verify that I have the same error. If vswitchd dies / is restarted all flow entries are gone causing the network to not work.

Endre Karlson (endre-karlson) wrote :

I am on ubuntu 14.04 with ovs 2.0

Also I am finding that the neutron agent doesn't kill off the ovsdb-client processes properly; a missing rootwrap filter is reported:
Stderr: 'sudo: unable to resolve host svg-cn03\n/usr/bin/neutron-rootwrap: Unauthorized command: kill -9 29831 (no filter matched)\n' execute /usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py:74
2014-04-01 22:56:38.203 20967 ERROR neutron.agent.linux.async_process [-] An error occurred while killing [['ovsdb-client', 'monitor', 'Interface', 'name,ofport', '--format=json']].
2014-04-01 22:56:38.203 20967 TRACE neutron.agent.linux.async_process Traceback (most recent call last):
2014-04-01 22:56:38.203 20967 TRACE neutron.agent.linux.async_process File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/async_process.py", line 160, in _kill_process
2014-04-01 22:56:38.203 20967 TRACE neutron.agent.linux.async_process utils.execute(['kill', '-9', pid], root_helper=self.root_helper)
2014-04-01 22:56:38.203 20967 TRACE neutron.agent.linux.async_process File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py", line 76, in execute
2014-04-01 22:56:38.203 20967 TRACE neutron.agent.linux.async_process raise RuntimeError(m)
2014-04-01 22:56:38.203 20967 TRACE neutron.agent.linux.async_process RuntimeError:
2014-04-01 22:56:38.203 20967 TRACE neutron.agent.linux.async_process Command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'kill', '-9', '29831']
2014-04-01 22:56:38.203 20967 TRACE neutron.agent.linux.async_process Exit code: 99
2014-04-01 22:56:38.203 20967 TRACE neutron.agent.linux.async_process Stdout: ''
2014-04-01 22:56:38.203 20967 TRACE neutron.agent.linux.async_process Stderr: 'sudo: unable to resolve host svg-cn03\n/usr/bin/neutron-rootwrap: Unauthorized command: kill -9 29831 (no filter matched)\n'

James Polley (tchaypo) wrote :

Contrary to what I said in IRC this morning, I'm actually not on Trusty:

root@undercloud-undercloud-ojtyffepm45g:~# lsb_release -rc
Release: 13.10
Codename: saucy
root@undercloud-undercloud-ojtyffepm45g:~# ovsdb-server --version
ovsdb-server (Open vSwitch) 1.10.2
Compiled Sep 23 2013 15:02:24
root@undercloud-undercloud-ojtyffepm45g:~# neutron --version
2.3.4.36

I don't have any logs showing problems killing the client; in fact, my /var/log/auth.log shows the kill happening quite successfully:

root@undercloud-undercloud-ojtyffepm45g:/var/log# grep kill auth.log.1
Mar 24 01:38:53 undercloud-undercloud-ojtyffepm45g sudo: neutron : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf kill -9 20611
root@undercloud-undercloud-ojtyffepm45g:/var/log# zgrep kill auth.log.2.gz
Mar 20 01:03:19 undercloud-undercloud-ojtyffepm45g sudo: neutron : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf kill -HUP 3891
<<snip>>

I can reproduce the problem through a simple ``service openvswitch-switch restart``; here are the logs I see when I do that:

==> upstart/openvswitch-switch.log <==
 * Killing ovs-vswitchd (1236)
 * Killing ovsdb-server (1226)

==> auth.log <==
Apr 1 22:18:45 undercloud-undercloud-ojtyffepm45g sudo: pam_unix(sudo:session): session closed for user root

==> upstart/neutron-openvswitch-agent.log <==
2014-04-01 22:18:45.198 27450 ERROR neutron.agent.linux.ovsdb_monitor [req-642d9e73-e9fd-4e37-9364-0cc9f89956f6 None] Error received from ovsdb monitor: ovsdb-client: unix:/var/run/openvswitch/db.sock: receive failed (End of file)
openvswitch-switch stop/waiting

==> auth.log <==
Apr 1 22:18:45 undercloud-undercloud-ojtyffepm45g sudo: neutron : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf ovs-vsctl --timeout=10 list-ports br-int
Apr 1 22:18:45 undercloud-undercloud-ojtyffepm45g sudo: pam_unix(sudo:session): session opened for user root by (uid=0)

==> syslog <==
Apr 1 22:18:45 undercloud-undercloud-ojtyffepm45g ovs-vsctl: 00001|reconnect|WARN|unix:/var/run/openvswitch/db.sock: connection attempt failed (No such file or directory)
Apr 1 22:18:45 undercloud-undercloud-ojtyffepm45g ovs-vsctl: 00002|vsctl|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)

==> auth.log <==
Apr 1 22:18:45 undercloud-undercloud-ojtyffepm45g sudo: pam_unix(sudo:session): session closed for user root

==> upstart/neutron-openvswitch-agent.log <==
2014-04-01 22:18:45.497 27450 ERROR neutron.agent.linux.ovs_lib [req-642d9e73-e9fd-4e37-9364-0cc9f89956f6 None] Unable to execute ['ovs-vsctl', '--timeout=10', 'list-ports', 'br-int']. Exception:
Command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ovs-vsctl', '--timeout=10', 'list-ports', 'br-int']
Exit code: 1
Stdout: ''
Stderr: '2014-04-01T22:18:45Z|00001|reconnect|WARN|unix:/var/run/openvswitch/db.sock: connection attempt failed (No such file or directory)\novs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)\n'
2014-04-01 22:18:45.511 27450 ERROR n...

James Polley (tchaypo) wrote :

My last comment was perhaps a bit longer than it needed to be.

The tl;dr version is that after ovsdb-server is restarted, n-o-a starts a new ovsdb-client (in ekarlson's case the old client is not killed, but a new one does get started). The new ovsdb-client doesn't re-add the flows, so no traffic flows.

When n-o-a is restarted, the flows are recreated.

The brute-force workaround is to restart neutron-openvswitch-agent each time ovsdb-server is restarted; a better solution might be for the new ovsdb-client process to be a trigger for checking and re-adding the required flows.
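That brute-force workaround could be automated with something like the following watchdog sketch. This is an assumption-laden illustration, not part of neutron: the pidfile path matches the ovsdb-server command line shown in the `ps` output earlier, while the restart command and polling interval are guesses based on the service names used in this report:

```python
import subprocess
import time

# Path taken from the ovsdb-server invocation seen in `ps` output above.
PIDFILE = "/var/run/openvswitch/ovsdb-server.pid"


def read_pid(path=PIDFILE):
    """Return the pid recorded in `path`, or None if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def watch(restart_cmd=("service", "neutron-openvswitch-agent", "restart"),
          interval=5):
    """Poll the ovsdb-server pidfile; if the pid changes (i.e. the daemon
    was restarted), restart the agent so it reprograms its flows."""
    last = read_pid()
    while True:
        time.sleep(interval)
        current = read_pid()
        if current is not None and current != last:
            subprocess.check_call(restart_cmd)
        last = current
```

This papers over the symptom only; the proper fix is for the agent itself to detect the restart, as discussed below in the thread.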

summary: - neutron-openvswitch-agent must be restarted after ovsdb-server failure
- in order to pass traffic
+ neutron-openvswitch-agent does not recreate flows after ovsdb-server
+ restarts
Maru Newby (maru) wrote :

James: The client monitor should always trigger polling on respawn, but I don't think there is a functional test for that condition. I'll work on addressing that oversight.

Marios Andreou (marios-b) wrote :

Do we actually have a valid reproducer for this? I have set up a test environment to understand the bug and try and reproduce it. I have used dev/test up to getting the undercloud deployed. I then build and register my overcloud-control and overcloud-compute images.

Instead of deploying an overcloud I just booted overcloud-control as suggested by Jon-Paul in the original report:

[root@undercloud-undercloud-asurl7euj5oa ~]# nova boot --flavor baremetal --image overcloud-control bm-test1

OK. I then ssh into undercloud and run tcpdump to monitor dhcp as suggested by James [1] on both br-ctlplane
and the netns tap:

tcpdump -i br-ctlplane -vvv -s 1500 '(port 67 or port 68)'

ip netns exec qdhcp-f6ec58e6-601b-4ed2-9c1c-512dfccbe0a9 tcpdump -i tap7fb1f038-88 -vvv -s 1500 '(port 67 or port 68)'

I then did a nova reboot --hard on my bm-test1 node. I can see the requests on br-ctlplane and the requests and replies on the netns tap. Fine so far.

I try to induce the reproducer suggested by James @ [2] - on the undercloud I kill ovsdb-server:

ps ax | grep ovsdb-server
kill -9 ...

Repeat the nova reboot --hard and I can still see DHCP requests/replies. Have I done something wrong with my setup above? Most of the 'reproduced' comments above suggest an element of time, 'when I came back to my setup' etc.

thanks, marios

[1] https://bugs.launchpad.net/neutron/+bug/1290486/comments/5
[2] https://bugs.launchpad.net/neutron/+bug/1290486/comments/7

Changed in neutron:
status: New → Confirmed
James Polley (tchaypo) wrote :

I've run through my testing again; similar to what Marios did, except that I let the full devtest build happen, and ran "nova reboot --hard overcloud-NovaCompute0-bd3lkfo6ta2h "; I can't imagine that small difference in procedure would matter.

I don't even need to hard-kill ovsdb-server; a simple "service openvswitch-switch restart" is enough to put my setup into the error state, where traffic is seen on br-ctlplane but not the netns interface, and there are no flows listed on br-int:

    root@undercloud-undercloud-6taqd6dgghrg:~# ovs-ofctl dump-flows br-int
    NXST_FLOW reply (xid=0x4):
     cookie=0x0, duration=303.916s, table=0, n_packets=172, n_bytes=10292, idle_age=1, priority=0 actions=NORMAL

After a "service neutron-openvswitch-agent restart", the flows come back:

    root@undercloud-undercloud-6taqd6dgghrg:~# ovs-ofctl dump-flows br-int
    NXST_FLOW reply (xid=0x4):
     cookie=0x0, duration=1.124s, table=0, n_packets=2, n_bytes=160, idle_age=1, priority=3,in_port=1,vlan_tci=0x0000 actions=mod_vlan_vid:1,NORMAL
     cookie=0x0, duration=2.034s, table=0, n_packets=3, n_bytes=258, idle_age=1, priority=2,in_port=1 actions=drop
     cookie=0x0, duration=2.821s, table=0, n_packets=0, n_bytes=0, idle_age=2, priority=1 actions=NORMAL

and I can see traffic on the interface.

I'll be online just before the TripleO meeting at 1900UTC April 29th; I'll ping Marios to see if we can figure out what's different between our setups.

Michael Kazakov (gnomino) wrote :

I have the same bug: the network freezes for some seconds after a service neutron-plugin-openvswitch-agent restart, and flows are not recreated after a service openvswitch-switch restart.

Marios Andreou (marios-b) wrote :

OK, after a quick chat with tchaypo on IRC: he has an Ubuntu dev environment whilst I'm on F20. I am still unable to replicate on F20 x86.

06:39 < tchaypo> marios: fwiw I've replicated on both i686 and amd64 - but only saucy, and only running on trusty

I *think* I replicated the bug, but not by using 'kill -9' on ovsdb-server as above; rather by killing ovs-vswitchd directly. Then I messed up my setup, so I am rebuilding now to try and confirm this.

Marios Andreou (marios-b) wrote :

OK, so for the record, in an F20 environment this cannot be reproduced by just restarting ovsdb-server as documented by James @ [1]. I *could* reproduce it, but only by restarting the openvswitch service altogether.

NOTE: one thing that confused me is that in Fedora the service is 'openvswitch' and not 'openvswitch-switch' as in Ubuntu; please someone correct me if I'm wrong.

After doing 'service openvswitch restart' I could no longer see the DHCP traffic on the tap interface of the internal bridge (only on br-ctlplane for example). As suggested above, doing a 'service neutron-openvswitch-agent restart' fixed the problem and traffic again flows through/to the internal bridge.

thanks! marios

[1] https://bugs.launchpad.net/neutron/+bug/1290486/comments/7

Kyle Mestery (mestery) wrote :

Thanks for looking into this one a bit more Marios! I'm going to try again recreating this without tripleo as well, I suspect this bug happens even without tripleo. I'll report back once I've done that.

Eugene Nikanorov (enikanorov) wrote :

This bug was also observed by our deployers; that's why I have transferred it to 'confirmed'.

Kyle Mestery (mestery) wrote :

I am continuing to have trouble reproducing this issue locally. Here's what I've tried:

1. Single node instance running devstack. Have tried with both Ubuntu 12.04 and 13.10.
2. OVS versions 1.4.6 (with ubuntu 12.04) and 1.10.2 (ubuntu 13.10).
3. Bring the instance up with devstack with the latest upstream master.
4. Boot a VM, verify it gets an IP address.
5. Stop openvswitch (all services).
6. Verify the OVS agent begins to fail connecting.
7. Restart openvswitch.
8. Boot another VM.

For step 8, the VM continues to get an IP address. So, I'm wondering what is different in what I'm trying vs. what happens with the tripleo setup.

Kyle Mestery (mestery) wrote :

One other note here: I never see anything other than a "NORMAL" flow on my br-int. I'm curious to know how our configs are different such that you are getting mod_vlan flows on yours. Can you share a bit more? I even tried with multiple different networks as well.

Kyle Mestery (mestery) wrote :

The difference here is that I was using GRE tunnels and tripleo is using VLANs underneath. I will know later tonight if I can recreate this now, stay tuned.

Kyle Mestery (mestery) wrote :

I've been able to confirm that with VLAN networks I can recreate this now with a single-node devstack instance. The flows are not programmed. It appears that the rpc_loop() code is not detecting an OVS restart as a signal to reprogram flows for ports.

Endre Karlson (endre-karlson) wrote :

This is actually extremely annoying, and it seems like a neutron bug rather than a tripleo bug.

I have 2 compute nodes out of 4 where this happens all the time, seemingly at random (or at least I haven't been able to figure out why yet).

Any suggestions? I can help with logs etc.

Kyle Mestery (mestery) wrote :

I have a patch I'm testing for this bug now. The basic idea is to bubble up the OVSDB restart to the agent so it can reprogram the bridges. I hope to push this out for review later today once I complete some additional testing.

Fix proposed to branch: master
Review: https://review.openstack.org/95060

Changed in neutron:
status: Confirmed → In Progress
Kyle Mestery (mestery) on 2014-05-27
Changed in neutron:
milestone: none → juno-1
Kyle Mestery (mestery) wrote :

I believe there is a red herring in this report: It's actually the restart of ovs-vswitchd which is causing the loss of all flows, not ovsdb-server. In my own testing, restarting ovsdb does not trigger the loss of flows. Restarting ovs-vswitchd does. I'm going to modify my patch to take that into account and resubmit.

Reviewed: https://review.openstack.org/95060
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8e9f00a19dab98e5cfc7ca32beb9f17ebb5bc1bb
Submitter: Jenkins
Branch: master

commit 8e9f00a19dab98e5cfc7ca32beb9f17ebb5bc1bb
Author: Kyle Mestery <email address hidden>
Date: Fri May 16 04:21:32 2014 +0000

    Reprogram flows when ovs-vswitchd restarts

    When OVS is restarted, by default it will not reprogram flows which were
    programmed. For the case of the OVS agent, this means a restart will cause
    all traffic to be switched using the NORMAL action. This is undesirable for
    a number of reasons, including obvious security reasons.

    This change provides a way for the agent to check if a restart of ovs-vswitchd
    has happened in the main agent loop. If a restart of ovs-vswitchd is detected,
    the agent will run through the setup of the bridges on the host and reprogram
    flows for all the ports connected.

    DocImpact
    This changes adds a new table (table 23) to the integration bridge, with a
    single 'drop' flow. This is used to monitor OVS restarts and to reprogram
    flows from the agent.

    Change-Id: If9e07465c43115838de23e12a4e0087c9218cea2
    Closes-Bug: #1290486
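The canary-flow technique the commit message describes can be sketched as follows. This is a simplified model, not the actual neutron code: the `Bridge` class is a stand-in for a real OVS bridge (which the agent drives via ovs-ofctl/ovs-vsctl), and only the detect-and-reprogram logic is shown. The table number matches the one the commit says it adds:

```python
CANARY_TABLE = 23  # dedicated table added by the fix, holding one drop flow


class Bridge:
    """Toy stand-in for an OVS bridge, keyed by table number."""

    def __init__(self):
        self.flows = {}

    def add_flow(self, table, actions):
        self.flows[table] = actions

    def dump_flows(self, table):
        return self.flows.get(table)

    def restart(self):
        # Simulates an ovs-vswitchd restart: all programmed flows are lost.
        self.flows.clear()


def setup_canary(bridge):
    bridge.add_flow(CANARY_TABLE, "drop")


def ovs_restarted(bridge):
    # If the canary flow is gone, ovs-vswitchd must have restarted
    # and wiped everything else along with it.
    return bridge.dump_flows(CANARY_TABLE) is None


def rpc_loop_iteration(bridge, reprogram):
    if ovs_restarted(bridge):
        reprogram(bridge)      # re-run bridge setup, re-add all port flows
        setup_canary(bridge)   # re-arm the canary for the next iteration
```

The key design point is that the check is cheap (one flow lookup per loop iteration) and needs no extra connection or signal from OVS itself.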

Changed in neutron:
status: In Progress → Fix Committed
James Polley (tchaypo) wrote :

In my cursory testing, this fix seems to have fixed the problems we saw.

Changed in tripleo:
status: In Progress → Fix Committed
James Denton (james-denton) wrote :

This patch has fixed the issue for us as well.

Alan Pevec (apevec) on 2014-06-04
tags: added: security
Carlos Goncalves (cgoncalves) wrote :

Kyle, as per your comment #8, is it valid to assume that if minimize_polling is set to False, we are not hit by this bug? If so, that would be a workaround for icehouse while review #96919 is not merged and distribution packages are not updated. My guess is that such a backport won't make it in time for 2014.1.1, which is due tomorrow, unfortunately.

Alan Pevec (apevec) on 2014-06-04
tags: removed: icehouse-backport-potential
Roman Podoliaka (rpodolyaka) wrote :

Still seeing this with neutron as of commit 53b701a3f91530c9462a9cb0690aaf68efd45f6d
(ubuntu saucy, linux 3.11, openvswitch-server 1.10.2)

Steps to reproduce:
1. Run devtest.sh
2. Start pinging a user VM using the floating ip.
3. ssh to the controller node.
4. Do: sudo service openvswitch-switch restart

The user VM becomes unreachable until neutron-openvswitch-agent is restarted on the controller node.

Changed in tripleo:
status: Fix Committed → Triaged
Changed in neutron:
status: Fix Committed → Confirmed
Roman Podoliaka (rpodolyaka) wrote :

Looks like ovs periodically crashes: http://paste.openstack.org/show/82970/

Kyle Mestery (mestery) wrote :

Roman, I think this is a different bug, because you're using a floating IP, which means the L3 agent needs to recreate its flows as well. Can you file a separate one to track that issue? I'll assign that one to myself and address this in the L3 agent as well.

Kyle Mestery (mestery) wrote :

Roman, I actually just tried to reproduce this with a single-node setup. What I did was this:

1. Run devstack to setup an all-in one with ML2 and VLANs.
2. Create a VM. Assign a floating IP.
3. Ping the floating IP from the host.
4. Restart OVS.
5. The ping keeps working.

So, I'm wondering what's different here. I'll set up a multi-node devstack and verify now, but I think my prior comments about the L3 agent from #37 were incorrect. Are you sure you're running up to commit 53b701a3f91530c9462a9cb0690aaf68efd45f6d on all of your nodes?

Reviewed: https://review.openstack.org/96919
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d00446be1739c93921e3b88763e05fc194ea9b2b
Submitter: Jenkins
Branch: stable/icehouse

commit d00446be1739c93921e3b88763e05fc194ea9b2b
Author: Kyle Mestery <email address hidden>
Date: Fri May 16 04:21:32 2014 +0000

    Reprogram flows when ovs-vswitchd restarts

    When OVS is restarted, by default it will not reprogram flows which were
    programmed. For the case of the OVS agent, this means a restart will cause
    all traffic to be switched using the NORMAL action. This is undesirable for
    a number of reasons, including obvious security reasons.

    This change provides a way for the agent to check if a restart of ovs-vswitchd
    has happened in the main agent loop. If a restart of ovs-vswitchd is detected,
    the agent will run through the setup of the bridges on the host and reprogram
    flows for all the ports connected.

    DocImpact
    This changes adds a new table (table 23) to the integration bridge, with a
    single 'drop' flow. This is used to monitor OVS restarts and to reprogram
    flows from the agent.

    Conflicts:
     neutron/plugins/openvswitch/common/constants.py

    Change-Id: If9e07465c43115838de23e12a4e0087c9218cea2
    Closes-Bug: #1290486
    (cherry picked from commit 8e9f00a19dab98e5cfc7ca32beb9f17ebb5bc1bb)

Roman Podoliaka (rpodolyaka) wrote :

Kyle, just reproduced this on Neutron master (d6634da6eb073e4a17d8b877c2662a15cbf0a4be) on two-nodes setup: 1 control + 1 compute node. Restart of neutron-openvswitch-agent on the compute node fixes the problem.

Here is what I see in neutron-openvswitch-agent logs on the compute node: http://paste.openstack.org/show/83533/ (14:31 is the moment I restarted ovs service).

Kyle Mestery (mestery) wrote :

Roman, can you file a new bug to track this issue? I am not sure this is the same issue. Also, please put in detailed steps of how you reproduced this. I've verified this fix does indeed work. You are 100% sure you're running the latest code on the control node as well?

Changed in neutron:
status: Confirmed → Fix Committed
Kyle Mestery (mestery) wrote :

Moving back to "Fix Committed" state. Roman will file a new bug to track the new issue with the L3 agent.

Thierry Carrez (ttx) on 2014-06-12
Changed in neutron:
status: Fix Committed → Fix Released
Roman Podoliaka (rpodolyaka) wrote :

Sorry for the long delay in replying; I wasn't subscribed to the notifications and missed your comments :(

So I've double-checked I'm running the neutron master. This doesn't seem to be an L3-agent issue. I'm running Neutron master as of commit 24718e6f1764e95f0c393ba042546e3584981b31 (Ubuntu 14.04, 3.13.0-29-generic, OVS 2.0.1+git20140120-0ubuntu2).

Steps to reproduce:

1. Run tripleo devtest story. This will give you a 3 node cluster - 1 controller node + 2 compute nodes. Neutron ML2 plugin is used, OVS agents are run on each node.
2. SSH to a controller node.
3. Start pinging the VM using its private IP address from a DHCP agent namespace.
4. SSH to a compute node running the VM.
5. Restart the OVS.
6. The ping stops working until neutron-openvswitch-agent is restarted on the compute node.

Right after OVS restart I see this in neutron-openvswitch-agent log: http://paste.openstack.org/show/84472/
The complete log is here http://paste.openstack.org/show/84473/ (there are some errors, but I'm not sure they are related to this problem, 'WAS HERE' is a string I log to ensure the code of your fix is executed)

Changed in neutron:
assignee: Kyle Mestery (mestery) → Eugene Nikanorov (enikanorov)
status: Fix Released → New
James Polley (tchaypo) on 2014-06-20
Changed in tripleo:
status: Triaged → Fix Released
Kyle Mestery (mestery) wrote :

Roman and James, you're both working in TripleO, and yet one of you is saying this bug isn't fixed, and one is saying it is fixed. We need to coordinate on IRC to see what's going on here, as I can no longer reproduce this at all.

Kyle Mestery (mestery) on 2014-07-10
Changed in neutron:
status: New → Incomplete
James Polley (tchaypo) wrote :

I've tried to follow Roman's steps above, but I can't reproduce this problem. I'm not sure if this is because I didn't do exactly the same thing, though.

I ran a standard devtest build from trunk, but with --no-undercloud, so I've only got a seed and the overcloud.

9:04:17 0 130 polleyj@bill:~/.cache/tripleo (master)$ nova list
+--------------------------------------+-------------------------------------+--------+------------+-------------+--------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+-------------------------------------+--------+------------+-------------+--------------------+
| 2ccc4069-7801-4de2-8d22-a4da62aacb42 | overcloud-NovaCompute0-pok6xaae4p2j | ACTIVE | - | Running | ctlplane=192.0.2.3 |
| 4d99893a-be5f-48f9-8932-1123cdcaf3e0 | overcloud-NovaCompute1-fci7it3qq57q | ACTIVE | - | Running | ctlplane=192.0.2.6 |
| ba8dd7db-6189-4426-909d-84e63ec44c7b | overcloud-controller0-vyxuppnmkdf2 | ACTIVE | - | Running | ctlplane=192.0.2.4 |
+--------------------------------------+-------------------------------------+--------+------------+-------------+--------------------+

I sshed to 192.0.2.4 and used:

sudo ip netns exec qdhcp-8b8a6df3-f19f-4fa5-bed5-b13e5cbbe70c ping 192.0.2.3

to ping out of the correct interface.

To restart OVS, I sshed into 192.0.2.3 and ran "service openvswitch-switch restart".

The ping running on the controller node didn't see any packets get dropped.

Roman, am I missing some step from your process?

Kyle Mestery (mestery) wrote :

Thanks for trying this out, James! Roman, I'm also keen to see what we may have missed in your steps to reproduce, as I can't reproduce this either.

Ilya Shakhat (shakhat) wrote :

Roman, can you try to reproduce this once again? The issue should be fixed by https://review.openstack.org/#/c/101447/

Ilya Shakhat (shakhat) wrote :

The issue Roman reports is tracked as bug https://bugs.launchpad.net/tripleo/+bug/1292105 and resolved by patch https://review.openstack.org/#/c/101447/
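
For context, the general approach behind the linked fix was to have the agent detect an OVS restart by checking for a "canary" flow it installed at startup: if that flow has vanished, all flows were wiped and a full resync is needed. The sketch below is a hedged illustration of that idea only, with the OVS interaction stubbed out; the class names, table number, and method names are illustrative, not the actual Neutron code.

```python
# Hedged sketch of canary-flow restart detection. FakeBridge stands in for
# an OVS bridge (real code would shell out to ovs-ofctl / ovs-vsctl); the
# table number and all names here are illustrative assumptions.

CANARY_TABLE = 23  # illustrative table reserved for the canary flow


class FakeBridge:
    """Minimal stand-in for an OVS bridge's flow table."""

    def __init__(self):
        self.flows = {}

    def add_flow(self, table, actions):
        self.flows[table] = actions

    def dump_flows(self, table):
        # Returns None when the table holds no flow, mimicking an empty dump.
        return self.flows.get(table)

    def wipe(self):
        # Simulates an ovsdb-server/ovs-vswitchd restart losing every flow.
        self.flows.clear()


class Agent:
    """Toy agent loop: resyncs flows when the canary disappears."""

    def __init__(self, bridge):
        self.bridge = bridge
        self.resyncs = 0
        self.setup_flows()

    def setup_flows(self):
        # Install the canary alongside the normal forwarding flows.
        self.bridge.add_flow(CANARY_TABLE, "drop")

    def ovs_restarted(self):
        # Canary missing => OVS restarted and wiped the flow tables.
        return self.bridge.dump_flows(CANARY_TABLE) is None

    def rpc_loop_iteration(self):
        if self.ovs_restarted():
            self.resyncs += 1
            self.setup_flows()  # recreate all flows from scratch
```

Without a check like this, the agent keeps polling happily while the bridge forwards nothing, which matches the symptom in this bug: traffic only recovers once the agent itself is restarted and reprograms the flows.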

Changed in neutron:
status: Incomplete → Fix Committed
Thierry Carrez (ttx) on 2014-10-01
Changed in neutron:
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2014-10-16
Changed in neutron:
milestone: juno-1 → 2014.2
Hua Zhang (zhhuabj) wrote :

Are you sure the patch https://review.openstack.org/#/c/101447/ resolves this issue? I can still hit this problem after confirming my environment contains this patch.

It definitely solved my problem, but it seems to have caused problems for other people.

There's a thread starting at http://lists.openstack.org/pipermail/openstack-dev/2014-October/049311.html; it continues into November, and there's one last post from me in December.

On Thu, Dec 25, 2014 at 11:19 AM, Hua Zhang <email address hidden> wrote:

> are you sure the patch https://review.openstack.org/#/c/101447/ can
> resolve this question ? I still can hit this problem after confirming
> my env has contained this patch.
