neutron-openvswitch-agent does not recreate flows after ovsdb-server restarts

Bug #1290486 reported by Jon-Paul Sullivan
Affects             Status        Importance  Assigned to       Milestone
neutron             Fix Released  High        Eugene Nikanorov
neutron (icehouse)  Fix Released  High        Kyle Mestery
tripleo             Fix Released  Critical    James Polley

Bug Description

The DHCP requests were not being responded to after they were seen on the undercloud network interface. The neutron services were restarted in an attempt to ensure they had the newest configuration and knew they were supposed to respond to the requests.

Rather than using the heat stack create (called in devtest_overcloud.sh) to test, it was simpler to use the following to boot a baremetal node directly.

    nova boot --flavor $(nova flavor-list | grep "|[[:space:]]*baremetal[[:space:]]*|" | awk '{print $2}') \
          --image $(nova image-list | grep "|[[:space:]]*overcloud-control[[:space:]]*|" | awk '{print $2}') \
          bm-test1

Whilst the baremetal node was attempting to PXE boot, a restart of the neutron services was performed. This allowed the baremetal node to boot.

It has been observed that a neutron restart was needed for each subsequent reboot of the baremetal nodes to succeed.

Tags: ovs security
Revision history for this message
Robert Collins (lifeless) wrote :

This suggests that neutron events to the agent were not propagating properly - almost certainly a neutron bug. Can you reproduce this?

Changed in tripleo:
status: New → Triaged
importance: Undecided → Critical
summary: - Baremetal: DHCP requests not being responded to
+ dhcp agent not serving responses
Revision history for this message
Robert Collins (lifeless) wrote : Re: dhcp agent not serving responses

Oh, can you check syslog and confirm that there are no errors there? This might be a duplicate of bug 1271344.

Revision history for this message
Jon-Paul Sullivan (jonpaul-sullivan) wrote :

There are no occurrences of "configured address" in any of the syslog files for the undercloud. Given that, I do not believe it is a duplicate of bug 1271344.

root@undercloud-undercloud-q5d4s2sbkzx6:/var/log# for i in syslog syslog.1 syslog.2.gz syslog.3.gz syslog.4.gz syslog.5.gz syslog.6.gz syslog.7.gz ; do (zcat $i || cat $i) | grep -e "configured address " ; done

gzip: syslog: not in gzip format

gzip: syslog.1: not in gzip format
root@undercloud-undercloud-q5d4s2sbkzx6:/var/log#

James Polley (tchaypo)
Changed in tripleo:
assignee: nobody → James Polley (tchaypo)
Revision history for this message
James Polley (tchaypo) wrote :

I believe I was able to reproduce this on my setup. If I observed what I think I observed, it's definitely not a duplicate of bug 1271344 - that bug talks about re-assigning an IP from one VM to a new VM, but in my case even the existing VMs were not getting a response when they rebooted.

After restarting neutron-dhcp-agent, the VMs started getting responses and came back on the network.

I'm not sure what triggered the error state in my case - I left my setup for ~18 hours, came back, and it was in the error state. Next step is to dig into the logs to see if I can see likely problems, and see if I'm able to trigger the error condition.

James Polley (tchaypo)
Changed in tripleo:
status: Triaged → In Progress
Revision history for this message
James Polley (tchaypo) wrote :

After leaving my environment alone for a few days, I've got the bug again.

tcpdump running on the br-ctlplane interface does show the dhcp requests coming in; but a tcpdump running inside ip netns on the tap interface doesn't see them.

Shortly after restarting neutron-openvswitch-agent, traffic started flowing again. neutron-server and neutron-dhcp-agent had also been restarted, but no change was observed in ~15 seconds after restarting each of them.

Revision history for this message
Robert Collins (lifeless) wrote :

I believe this is the race condition Clint identified over the weekend: we're trying to do things before ovsdb is up and running, and neutron-openvswitch-agent is not handling ovsdb being down properly - it should back off and retry, or alternatively, do a full sync once the db is available.
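
As a minimal sketch of the back-off-and-resync behaviour described above (illustrative only, not neutron code; ovsdb_is_available() and full_sync() are hypothetical stand-ins for whatever the agent actually uses to probe ovsdb and rebuild its state):

    import time

    def wait_for_ovsdb(ovsdb_is_available, full_sync, max_backoff=30):
        """Back off while ovsdb is down, then resync once it returns."""
        delay = 1
        while not ovsdb_is_available():
            time.sleep(delay)
            delay = min(delay * 2, max_backoff)  # exponential back-off, capped
        full_sync()  # ovsdb is back: rebuild bridges and flows from scratch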

Revision history for this message
James Polley (tchaypo) wrote :

I've been able to track down what I believe is the root problem.

If ovsdb-server (run by the openvswitch-switch service) restarts, the neutron-openvswitch-agent loses its connection and needs to be manually restarted in order to reconnect.

Causes of this bug I've seen have included ovsdb-server segfaulting, being kill -9ed, and being gracefully restarted with "service openvswitch-switch restart".

The errors recorded in /var/log/upstart/neutron-openvswitch-agent.log vary depending on why ovsdb-server went away:

2014-03-23 20:10:01.883 20375 ERROR neutron.agent.linux.ovsdb_monitor [req-a776b981-b86b-4437-ab65-0c6be6070094 None] Error received from ovsdb monitor: ovsdb-client: unix:/var/run/openvswitch/db.sock: receive failed (End of file)
2014-03-24 01:40:17.617 20375 ERROR neutron.agent.linux.ovsdb_monitor [req-a776b981-b86b-4437-ab65-0c6be6070094 None] Error received from ovsdb monitor: 2014-03-24T01:40:17Z|00001|fatal_signal|WARN|terminating with signal 15 (Terminated)
2014-03-24 04:08:59.718 8455 ERROR neutron.agent.linux.ovsdb_monitor [req-d2c2cbd5-a77a-4455-84ac-0a8ec69b41e8 None] Error received from ovsdb monitor: ovsdb-client: unix:/var/run/openvswitch/db.sock: receive failed (End of file)
2014-03-24 22:44:22.174 8455 ERROR neutron.agent.linux.ovsdb_monitor [req-d2c2cbd5-a77a-4455-84ac-0a8ec69b41e8 None] Error received from ovsdb monitor: ovsdb-client: unix:/var/run/openvswitch/db.sock: receive failed (End of file)
2014-03-24 22:44:52.220 8455 ERROR neutron.agent.linux.ovsdb_monitor [req-d2c2cbd5-a77a-4455-84ac-0a8ec69b41e8 None] Error received from ovsdb monitor: ovsdb-client: failed to connect to "unix:/var/run/openvswitch/db.sock" (Connection refused)
2014-03-24 22:45:22.266 8455 ERROR neutron.agent.linux.ovsdb_monitor [req-d2c2cbd5-a77a-4455-84ac-0a8ec69b41e8 None] Error received from ovsdb monitor: ovsdb-client: failed to connect to "unix:/var/run/openvswitch/db.sock" (Connection refused)
2014-03-24 22:45:52.310 8455 ERROR neutron.agent.linux.ovsdb_monitor [req-d2c2cbd5-a77a-4455-84ac-0a8ec69b41e8 None] Error received from ovsdb monitor: ovsdb-client: failed to connect to "unix:/var/run/openvswitch/db.sock" (Connection refused)
2014-03-24 22:46:22.355 8455 ERROR neutron.agent.linux.ovsdb_monitor [req-d2c2cbd5-a77a-4455-84ac-0a8ec69b41e8 None] Error received from ovsdb monitor: ovsdb-client: failed to connect to "unix:/var/run/openvswitch/db.sock" (Connection refused)
2014-03-24 22:49:27.179 8455 ERROR neutron.agent.linux.ovsdb_monitor [req-d2c2cbd5-a77a-4455-84ac-0a8ec69b41e8 None] Error received from ovsdb monitor: 2014-03-24T22:49:27Z|00001|fatal_signal|WARN|terminating with signal 15 (Terminated)
2014-03-24 22:55:45.441 16033 ERROR neutron.agent.linux.ovsdb_monitor [req-5fe682ce-138e-46d6-aa7e-f0d43ab576ee None] Error received from ovsdb monitor: ovsdb-client: unix:/var/run/openvswitch/db.sock: receive failed (End of file)

In all cases, the result is the same: until neutron-openvswitch-agent is restarted, no traffic is passed onto the tapXXXXX interface inside the dhcp-XXXXX netns.

James Polley (tchaypo)
summary: - dhcp agent not serving responses
+ neutron-openvswitch-agent must be restarted after ovsdb-server failure
+ in order to pass traffic
Kyle Mestery (mestery)
Changed in neutron:
assignee: nobody → Kyle Mestery (mestery)
Kyle Mestery (mestery)
Changed in neutron:
importance: Undecided → High
tags: added: icehouse-rc-potential
Revision history for this message
Kyle Mestery (mestery) wrote : Re: neutron-openvswitch-agent must be restarted after ovsdb-server failure in order to pass traffic

After discussing with @marun in-channel, we think this could be due to the polling minimization monitor work done in Neutron. That is the only part of the code with a persistent connection to OVSDB. @marun indicated this was easy enough to verify: Look at the process list for a subprocess of the agent that calls ovsdb-client, and make sure it is killed/spawned again after OVSDB is restarted.

I'll try this myself tonight and see what happens locally. Would be good if you folks could try this as well! The default timeout for the monitor is 30 seconds BTW.

Revision history for this message
James Polley (tchaypo) wrote :

Just after a ``service openvswitch-switch restart``:

root@undercloud-undercloud-ojtyffepm45g:~# service openvswitch-switch restart
openvswitch-switch stop/waiting
openvswitch-switch start/running
root@undercloud-undercloud-ojtyffepm45g:~# ps -ef f | grep -C3 [o]vsdb
root 8426 1 0 Mar24 ? S 0:12 tcpdump -ni tapbcf76f51-14
neutron 25679 1 0 01:30 ? Ss 0:00 /opt/stack/venvs/neutron/bin/python /opt/stack/venvs/neutron/bin/neutron-openvswitch-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini --config-dir /etc/neutron
root 25826 25679 0 01:30 ? Z 0:00 \_ [sudo] <defunct>
root 26028 1 0 01:32 ? S<s 0:00 ovsdb-server: monitoring pid 26029 (healthy)
root 26029 26028 0 01:32 ? S< 0:00 \_ ovsdb-server /etc/openvswitch/conf.db -vconsole:emer -vsyslog:err -vfile:info --remote=punix:/var/run/openvswitch/db.sock --private-key=db:Open_vSwitch,SSL,private_key --certificate=db:Open_vSwitch,SSL,certificate --bootstrap-ca-cert=db:Open_vSwitch,SSL,ca_cert --no-chdir --log-file=/var/log/openvswitch/ovsdb-server.log --pidfile=/var/run/openvswitch/ovsdb-server.pid --detach --monitor
root 26037 1 0 01:32 ? S<s 0:00 ovs-vswitchd: monitoring pid 26038 (healthy)
root 26038 26037 0 01:32 ? S<L 0:00 \_ ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach --monitor
root 26039 26038 0 01:32 ? S< 0:00 \_ ovs-vswitchd: worker process for pid 26038

Just over 30s later:

root@undercloud-undercloud-ojtyffepm45g:~# ps -ef f | grep -C3 [o]vsdb
nobody 8414 1 0 Mar24 ? S 0:00 dnsmasq --no-hosts --no-resolv --strict-order --bind-interfaces --interface=tapbcf76f51-14 --except-interface=lo --pid-file=/var/run/neutron/dhcp/44ab7a66-fc35-4b49-9a15-9dc2227ee414/pid --dhcp-hostsfile=/var/run/neutron/dhcp/44ab7a66-fc35-4b49-9a15-9dc2227ee414/host --dhcp-optsfile=/var/run/neutron/dhcp/44ab7a66-fc35-4b49-9a15-9dc2227ee414/opts --leasefile-ro --dhcp-range=set:tag0,192.0.2.0,static,86400s --dhcp-lease-max=256 --conf-file= --domain=openstacklocal
root 8426 1 0 Mar24 ? S 0:12 tcpdump -ni tapbcf76f51-14
neutron...


tags: added: icehouse-backport-potential ovs
removed: icehouse-rc-potential
Revision history for this message
Endre Karlson (endre-karlson) wrote :

I can verify that I have the same error. If vswitchd dies or is restarted, all flow entries are gone, causing the network to stop working.

Revision history for this message
Endre Karlson (endre-karlson) wrote :

I am on ubuntu 14.04 with ovs 2.0

Also, I am finding that the neutron agent doesn't kill off the ovsdb-clients properly due to a missing rootwrap filter; it says:
Stderr: 'sudo: unable to resolve host svg-cn03\n/usr/bin/neutron-rootwrap: Unauthorized command: kill -9 29831 (no filter matched)\n' execute /usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py:74
2014-04-01 22:56:38.203 20967 ERROR neutron.agent.linux.async_process [-] An error occurred while killing [['ovsdb-client', 'monitor', 'Interface', 'name,ofport', '--format=json']].
2014-04-01 22:56:38.203 20967 TRACE neutron.agent.linux.async_process Traceback (most recent call last):
2014-04-01 22:56:38.203 20967 TRACE neutron.agent.linux.async_process File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/async_process.py", line 160, in _kill_process
2014-04-01 22:56:38.203 20967 TRACE neutron.agent.linux.async_process utils.execute(['kill', '-9', pid], root_helper=self.root_helper)
2014-04-01 22:56:38.203 20967 TRACE neutron.agent.linux.async_process File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py", line 76, in execute
2014-04-01 22:56:38.203 20967 TRACE neutron.agent.linux.async_process raise RuntimeError(m)
2014-04-01 22:56:38.203 20967 TRACE neutron.agent.linux.async_process RuntimeError:
2014-04-01 22:56:38.203 20967 TRACE neutron.agent.linux.async_process Command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'kill', '-9', '29831']
2014-04-01 22:56:38.203 20967 TRACE neutron.agent.linux.async_process Exit code: 99
2014-04-01 22:56:38.203 20967 TRACE neutron.agent.linux.async_process Stdout: ''
2014-04-01 22:56:38.203 20967 TRACE neutron.agent.linux.async_process Stderr: 'sudo: unable to resolve host svg-cn03\n/usr/bin/neutron-rootwrap: Unauthorized command: kill -9 29831 (no filter matched)\n'

Revision history for this message
James Polley (tchaypo) wrote :

Contrary to what I said in IRC this morning, I'm actually not on Trusty:

root@undercloud-undercloud-ojtyffepm45g:~# lsb_release -rc
Release: 13.10
Codename: saucy
root@undercloud-undercloud-ojtyffepm45g:~# ovsdb-server --version
ovsdb-server (Open vSwitch) 1.10.2
Compiled Sep 23 2013 15:02:24
root@undercloud-undercloud-ojtyffepm45g:~# neutron --version
2.3.4.36

I don't have any logs showing problems killing the client; in fact, my /var/log/auth.log shows the kill happening quite successfully:

root@undercloud-undercloud-ojtyffepm45g:/var/log# grep kill auth.log.1
Mar 24 01:38:53 undercloud-undercloud-ojtyffepm45g sudo: neutron : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf kill -9 20611
root@undercloud-undercloud-ojtyffepm45g:/var/log# zgrep kill auth.log.2.gz
Mar 20 01:03:19 undercloud-undercloud-ojtyffepm45g sudo: neutron : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf kill -HUP 3891
<<snip>>

I can reproduce the problem through a simple ``service openvswitch-switch restart``; here are the logs I see when I do that:

==> upstart/openvswitch-switch.log <==
 * Killing ovs-vswitchd (1236)
 * Killing ovsdb-server (1226)

==> auth.log <==
Apr 1 22:18:45 undercloud-undercloud-ojtyffepm45g sudo: pam_unix(sudo:session): session closed for user root

==> upstart/neutron-openvswitch-agent.log <==
2014-04-01 22:18:45.198 27450 ERROR neutron.agent.linux.ovsdb_monitor [req-642d9e73-e9fd-4e37-9364-0cc9f89956f6 None] Error received from ovsdb monitor: ovsdb-client: unix:/var/run/openvswitch/db.sock: receive failed (End of file)
openvswitch-switch stop/waiting

==> auth.log <==
Apr 1 22:18:45 undercloud-undercloud-ojtyffepm45g sudo: neutron : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf ovs-vsctl --timeout=10 list-ports br-int
Apr 1 22:18:45 undercloud-undercloud-ojtyffepm45g sudo: pam_unix(sudo:session): session opened for user root by (uid=0)

==> syslog <==
Apr 1 22:18:45 undercloud-undercloud-ojtyffepm45g ovs-vsctl: 00001|reconnect|WARN|unix:/var/run/openvswitch/db.sock: connection attempt failed (No such file or directory)
Apr 1 22:18:45 undercloud-undercloud-ojtyffepm45g ovs-vsctl: 00002|vsctl|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)

==> auth.log <==
Apr 1 22:18:45 undercloud-undercloud-ojtyffepm45g sudo: pam_unix(sudo:session): session closed for user root

==> upstart/neutron-openvswitch-agent.log <==
2014-04-01 22:18:45.497 27450 ERROR neutron.agent.linux.ovs_lib [req-642d9e73-e9fd-4e37-9364-0cc9f89956f6 None] Unable to execute ['ovs-vsctl', '--timeout=10', 'list-ports', 'br-int']. Exception:
Command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ovs-vsctl', '--timeout=10', 'list-ports', 'br-int']
Exit code: 1
Stdout: ''
Stderr: '2014-04-01T22:18:45Z|00001|reconnect|WARN|unix:/var/run/openvswitch/db.sock: connection attempt failed (No such file or directory)\novs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)\n'
2014-04-01 22:18:45.511 27450 ERROR n...

Revision history for this message
James Polley (tchaypo) wrote :

My last comment was perhaps a bit longer than it needed to be.

The tl;dr version is that after ovsdb-server is restarted, n-o-a starts a new ovsdb-client (in ekarlson's case the old client is not killed, but a new one does get started). The new ovsdb-client doesn't re-add the flows, so no traffic flows.

When n-o-a is restarted, the flows are recreated.

The brute-force workaround is to restart neutron-openvswitch-agent each time ovsdb-server is restarted; a better solution might be for the new ovsdb-client process to be a trigger for checking and re-adding the required flows.
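
To illustrate the "respawn as trigger" idea, here is a hedged sketch (not the neutron implementation) in which a small watcher notices that the monitored ovsdb-client has been respawned and invokes a flow-resync callback; get_monitor_pid and resync_flows are assumed helpers, not real agent APIs:

    class MonitorRespawnWatcher(object):
        """Call a resync callback whenever the ovsdb-client monitor respawns."""

        def __init__(self, get_monitor_pid, resync_flows):
            self._get_pid = get_monitor_pid
            self._resync = resync_flows
            self._last_pid = get_monitor_pid()

        def check(self):
            pid = self._get_pid()
            if pid != self._last_pid:
                # A new ovsdb-client was spawned, which implies ovsdb-server
                # went away at some point: re-check and re-add required flows.
                self._last_pid = pid
                self._resync()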

summary: - neutron-openvswitch-agent must be restarted after ovsdb-server failure
- in order to pass traffic
+ neutron-openvswitch-agent does not recreate flows after ovsdb-server
+ restarts
Revision history for this message
Maru Newby (maru) wrote :

James: The client monitor should always trigger polling on respawn, but I don't think there is a functional test for that condition. I'll work on addressing that oversight.

Revision history for this message
Marios Andreou (marios-b) wrote :

Do we actually have a valid reproducer for this? I have set up a test environment to understand the bug and try to reproduce it. I have used devtest up to getting the undercloud deployed. I then built and registered my overcloud-control and overcloud-compute images.

Instead of deploying an overcloud I just booted overcloud-control as suggested by Jon-Paul in the original report:

[root@undercloud-undercloud-asurl7euj5oa ~]# nova boot --flavor baremetal --image overcloud-control bm-test1

OK. I then ssh into undercloud and run tcpdump to monitor dhcp as suggested by James [1] on both br-ctlplane
and the netns tap:

tcpdump -i br-ctlplane -vvv -s 1500 '(port 67 or port 68)'

ip netns exec qdhcp-f6ec58e6-601b-4ed2-9c1c-512dfccbe0a9 tcpdump -i tap7fb1f038-88 -vvv -s 1500 '(port 67 or port 68)'

I then did a nova reboot --hard on my bm-test1 node. I can see the requests on br-ctlplane and the requests and replies on the netns tap. Fine so far.

I try to induce the reproducer suggested by James @ [2] - on the undercloud I kill ovsdb-server:

ps ax | grep ovsdb-server
kill -9 ...

I repeated the nova reboot --hard and I can still see requests/replies for DHCP. Have I done something wrong with my setup above? Most of the 'reproduced' comments above suggest an element of time ('when I came back to my setup', etc.).

thanks, marios

[1] https://bugs.launchpad.net/neutron/+bug/1290486/comments/5
[2] https://bugs.launchpad.net/neutron/+bug/1290486/comments/7

Changed in neutron:
status: New → Confirmed
Revision history for this message
James Polley (tchaypo) wrote :

I've run through my testing again; similar to what Marios did, except that I let the full devtest build happen, and ran "nova reboot --hard overcloud-NovaCompute0-bd3lkfo6ta2h "; I can't imagine that small difference in procedure would matter.

I don't even need to hard-kill ovsdb-server; a simple "service openvswitch-switch restart" is enough to put my setup into the error state, where traffic is seen on br-ctlplane but not the netns interface, and there are no flows listed on br-int:

    root@undercloud-undercloud-6taqd6dgghrg:~# ovs-ofctl dump-flows br-int
    NXST_FLOW reply (xid=0x4):
     cookie=0x0, duration=303.916s, table=0, n_packets=172, n_bytes=10292, idle_age=1, priority=0 actions=NORMAL

After a "service neutron-openvswitch-agent restart", the flows come back:

    root@undercloud-undercloud-6taqd6dgghrg:~# ovs-ofctl dump-flows br-int
    NXST_FLOW reply (xid=0x4):
     cookie=0x0, duration=1.124s, table=0, n_packets=2, n_bytes=160, idle_age=1, priority=3,in_port=1,vlan_tci=0x0000 actions=mod_vlan_vid:1,NORMAL
     cookie=0x0, duration=2.034s, table=0, n_packets=3, n_bytes=258, idle_age=1, priority=2,in_port=1 actions=drop
     cookie=0x0, duration=2.821s, table=0, n_packets=0, n_bytes=0, idle_age=2, priority=1 actions=NORMAL

and I can see traffic on the interface.

I'll be online just before the TripleO meeting at 1900UTC April 29th; I'll ping Marios to see if we can figure out what's different between our setups.

Revision history for this message
Michael Kazakov (gnomino) wrote :

I have the same bug:
the network freezes for some seconds after a "service neutron-plugin-openvswitch-agent restart",
and flows are not recreated after a "service openvswitch-switch restart".

Revision history for this message
Marios Andreou (marios-b) wrote :

OK, after a quick chat with tchaypo on IRC: he has an Ubuntu dev environment whilst I'm on F20. I am still unable to replicate on F20 x86.

06:39 < tchaypo> marios: fwiw I've replicated on both i686 and amd64 - but only saucy, and only running on trusty

I *think* I replicated the bug, not by using the 'kill -9 ovsdb-server' as above but rather by killing ovs-vswitchd directly. Then I messed up my setup, so I am rebuilding now to try and confirm this.

Revision history for this message
Marios Andreou (marios-b) wrote :

OK, so for the record, in a F20 environment, this cannot be reproduced by just restarting ovsdb-server as documented by James @ [1]. I *could* reproduce it, but only by restarting the openvswitch service altogether.

NOTE: one thing that confused me is that in Fedora the service is 'openvswitch' and not 'openvswitch-switch' as in Ubuntu; please someone correct me if I'm wrong.

After doing 'service openvswitch restart' I could no longer see the DHCP traffic on the tap interface of the internal bridge (only on br-ctlplane for example). As suggested above, doing a 'service neutron-openvswitch-agent restart' fixed the problem and traffic again flows through/to the internal bridge.

thanks! marios

[1] https://bugs.launchpad.net/neutron/+bug/1290486/comments/7

Revision history for this message
Kyle Mestery (mestery) wrote :

Thanks for looking into this one a bit more, Marios! I'm going to try recreating this without tripleo as well; I suspect this bug happens even without tripleo. I'll report back once I've done that.

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

This bug was also observed by our deployers; that's why I have transferred it to 'confirmed'.

Revision history for this message
Kyle Mestery (mestery) wrote :

I am continuing to have trouble reproducing this issue locally. Here's what I've tried:

1. Single node instance running devstack. Have tried with both Ubuntu 12.04 and 13.10.
2. OVS versions 1.4.6 (with ubuntu 12.04) and 1.10.2 (ubuntu 13.10).
3. Bring the instance up with devstack with the latest upstream master.
4. Boot a VM, verify it gets an IP address.
5. Stop openvswitch (all services).
6. Verify the OVS agent begins to fail connecting.
7. Restart openvswitch.
8. Boot another VM.

For step 8, the VM continues to get an IP address. So, I'm wondering what is different in what I'm trying vs. what happens with the tripleo setup.

Revision history for this message
Kyle Mestery (mestery) wrote :

One other note here: I never see anything other than a "NORMAL" flow on my br-int. I'm curious to know how our configs are different such that you are getting mod_flows with VLANs on yours. Can you share a bit more? I even tried with multiple different networks as well.

Revision history for this message
Kyle Mestery (mestery) wrote :

The difference here is that I was using GRE tunnels and tripleo is using VLANs underneath. I will know later tonight if I can recreate this now, stay tuned.

Revision history for this message
Kyle Mestery (mestery) wrote :

I've been able to confirm that, using VLAN networks, I can recreate this with a single-node devstack instance. The flows are not programmed. It appears that the rpc_loop() code is not detecting an OVS restart as a signal to program flows for ports.

Revision history for this message
Endre Karlson (endre-karlson) wrote :

This is actually extremely annoying and seems like a neutron bug rather than a tripleo bug.

I have 2 compute nodes out of 4 where this happens all the time, seemingly at random (or at least I haven't been able to figure out why yet).

Any suggestions? I can help with logs etc.

Revision history for this message
Kyle Mestery (mestery) wrote :

I have a patch I'm testing for this bug now. The basic idea is to bubble up the OVSDB restart to the agent so it can reprogram the bridges. I hope to push this out for review later today once I complete some additional testing.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/95060

Changed in neutron:
status: Confirmed → In Progress
Kyle Mestery (mestery)
Changed in neutron:
milestone: none → juno-1
Revision history for this message
Kyle Mestery (mestery) wrote :

I believe there is a red herring in this report: It's actually the restart of ovs-vswitchd which is causing the loss of all flows, not ovsdb-server. In my own testing, restarting ovsdb does not trigger the loss of flows. Restarting ovs-vswitchd does. I'm going to modify my patch to take that into account and resubmit.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/95060
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8e9f00a19dab98e5cfc7ca32beb9f17ebb5bc1bb
Submitter: Jenkins
Branch: master

commit 8e9f00a19dab98e5cfc7ca32beb9f17ebb5bc1bb
Author: Kyle Mestery <email address hidden>
Date: Fri May 16 04:21:32 2014 +0000

    Reprogram flows when ovs-vswitchd restarts

    When OVS is restarted, by default it will not reprogram flows which were
    programmed. For the case of the OVS agent, this means a restart will cause
    all traffic to be switched using the NORMAL action. This is undesirable for
    a number of reasons, including obvious security reasons.

    This change provides a way for the agent to check if a restart of ovs-vswitchd
    has happened in the main agent loop. If a restart of ovs-vswitchd is detected,
    the agent will run through the setup of the bridges on the host and reprogram
    flows for all the ports connected.

    DocImpact
    This changes adds a new table (table 23) to the integration bridge, with a
    single 'drop' flow. This is used to monitor OVS restarts and to reprogram
    flows from the agent.

    Change-Id: If9e07465c43115838de23e12a4e0087c9218cea2
    Closes-Bug: #1290486
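
As a hedged illustration of the canary-flow technique the commit message describes (the table number comes from the commit message; run_ofctl() is an assumed helper wrapping ovs-ofctl, and the rest approximates rather than reproduces the merged code):

    CANARY_TABLE = 23  # table added by the fix, holding a single 'drop' flow

    def install_canary_flow(run_ofctl):
        # Installed at agent startup; it disappears if ovs-vswitchd restarts
        # and loses its programmed flows.
        run_ofctl(['add-flow', 'br-int',
                   'table=%d,priority=0,actions=drop' % CANARY_TABLE])

    def ovs_restarted(run_ofctl):
        # Checked from the main agent loop: an empty canary table means OVS
        # lost its state, so bridges and per-port flows must be reprogrammed
        # just as at agent startup.
        output = run_ofctl(['dump-flows', 'br-int',
                            'table=%d' % CANARY_TABLE])
        return 'actions=drop' not in output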

Changed in neutron:
status: In Progress → Fix Committed
Revision history for this message
James Polley (tchaypo) wrote :

In my cursory testing, this fix seems to have fixed the problems we saw.

Changed in tripleo:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/icehouse)

Fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/96919

Revision history for this message
James Denton (james-denton) wrote :

This patch has fixed the issue for us as well.

Alan Pevec (apevec)
tags: added: security
Revision history for this message
Carlos Goncalves (cgoncalves) wrote :

Kyle, as per your comment #8, is it valid to assume that if minimize_polling is set to False we are not hit by this bug? If so, that would be a workaround for icehouse while review #96919 is not merged and distribution packages are not updated. My guess is that such a backport won't make it in time for 2014.1.1, which is due tomorrow, unfortunately.
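
For anyone wanting to try that workaround, the option would go in the OVS agent config file already referenced above; the section and option placement here are an assumption on my part, so check your distribution's sample config before relying on it:

    # /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini
    [agent]
    # Assumed placement: with polling minimization disabled, the agent polls
    # instead of relying on the persistent ovsdb-client monitor.
    minimize_polling = False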

Alan Pevec (apevec)
tags: removed: icehouse-backport-potential
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Still seeing this with neutron as of commit 53b701a3f91530c9462a9cb0690aaf68efd45f6d
(ubuntu saucy, linux 3.11, openvswitch-server 1.10.2)

Steps to reproduce:
1. Run devtest.sh
2. Start pinging a user VM using the floating ip.
3. ssh to the controller node.
4. Do: sudo service openvswitch-switch restart

The user VM becomes unreachable until neutron-openvswitch-agent is restarted on the controller node.

Changed in tripleo:
status: Fix Committed → Triaged
Changed in neutron:
status: Fix Committed → Confirmed
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Looks like ovs periodically crashes: http://paste.openstack.org/show/82970/

Revision history for this message
Kyle Mestery (mestery) wrote :

Roman, I think this is a different bug, because you're using a floating IP, which means the L3 agent needs to recreate its flows as well. Can you file a separate one to track that issue? I'll assign that one to myself and address this in the L3 agent as well.

Revision history for this message
Kyle Mestery (mestery) wrote :

Roman, I actually just tried to reproduce this with a single-node setup. What I did was this:

1. Run devstack to setup an all-in one with ML2 and VLANs.
2. Create a VM. Assign a floating IP.
3. Ping the floating IP from the host.
4. Restart OVS.
5. The ping keeps working.

So, I'm wondering what's different here. I'll set up a multi-node devstack and verify now, but I think my prior comments about the L3 agent from #37 were incorrect. Are you sure you're running up to commit 53b701a3f91530c9462a9cb0690aaf68efd45f6d on all of your nodes?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/icehouse)

Reviewed: https://review.openstack.org/96919
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d00446be1739c93921e3b88763e05fc194ea9b2b
Submitter: Jenkins
Branch: stable/icehouse

commit d00446be1739c93921e3b88763e05fc194ea9b2b
Author: Kyle Mestery <email address hidden>
Date: Fri May 16 04:21:32 2014 +0000

    Reprogram flows when ovs-vswitchd restarts

    When OVS is restarted, by default it will not reprogram flows which were
    programmed. For the case of the OVS agent, this means a restart will cause
    all traffic to be switched using the NORMAL action. This is undesirable for
    a number of reasons, including obvious security reasons.

    This change provides a way for the agent to check if a restart of ovs-vswitchd
    has happened in the main agent loop. If a restart of ovs-vswitchd is detected,
    the agent will run through the setup of the bridges on the host and reprogram
    flows for all the ports connected.

    DocImpact
    This changes adds a new table (table 23) to the integration bridge, with a
    single 'drop' flow. This is used to monitor OVS restarts and to reprogram
    flows from the agent.

    Conflicts:
     neutron/plugins/openvswitch/common/constants.py

    Change-Id: If9e07465c43115838de23e12a4e0087c9218cea2
    Closes-Bug: #1290486
    (cherry picked from commit 8e9f00a19dab98e5cfc7ca32beb9f17ebb5bc1bb)

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Kyle, just reproduced this on Neutron master (d6634da6eb073e4a17d8b877c2662a15cbf0a4be) on a two-node setup: 1 control + 1 compute node. Restarting neutron-openvswitch-agent on the compute node fixes the problem.

Here is what I see in neutron-openvswitch-agent logs on the compute node: http://paste.openstack.org/show/83533/ (14:31 is the moment I restarted ovs service).

Revision history for this message
Kyle Mestery (mestery) wrote :

Roman, can you file a new bug to track this issue? I am not sure this is the same issue. Also, please put in detailed steps of how you reproduced this. I've verified this fix does indeed work. Are you 100% sure you're running the latest code on the control node as well?

Changed in neutron:
status: Confirmed → Fix Committed
Revision history for this message
Kyle Mestery (mestery) wrote :

Moving back to "Fix Committed" state. Roman will file a new bug to track the new issue with the L3 agent.

Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Sorry for the slow reply; I wasn't subscribed to the notifications and missed your comments :(

So I've double-checked I'm running the neutron master. This doesn't seem to be an L3-agent issue. I'm running Neutron master as of commit 24718e6f1764e95f0c393ba042546e3584981b31 (Ubuntu 14.04, 3.13.0-29-generic, OVS 2.0.1+git20140120-0ubuntu2).

Steps to reproduce:

1. Run the tripleo devtest story. This will give you a 3-node cluster: 1 controller node + 2 compute nodes. The Neutron ML2 plugin is used, and OVS agents run on each node.
2. SSH to a controller node.
3. Start pinging the VM using its private IP address from a DHCP agent namespace.
4. SSH to a compute node running the VM.
5. Restart the OVS.
6. The ping stops working until neutron-openvswitch-agent is restarted on the compute node.

Right after OVS restart I see this in neutron-openvswitch-agent log: http://paste.openstack.org/show/84472/
The complete log is here http://paste.openstack.org/show/84473/ (there are some errors, but I'm not sure they are related to this problem; 'WAS HERE' is a string I log to ensure the code of your fix is executed).

Changed in neutron:
assignee: Kyle Mestery (mestery) → Eugene Nikanorov (enikanorov)
status: Fix Released → New
James Polley (tchaypo)
Changed in tripleo:
status: Triaged → Fix Released
Revision history for this message
Kyle Mestery (mestery) wrote :

Roman and James, you're both working in TripleO, and yet one of you is saying this bug isn't fixed, and one is saying it is fixed. We need to coordinate on IRC to see what's going on here, as I can no longer reproduce this at all.

Kyle Mestery (mestery)
Changed in neutron:
status: New → Incomplete
Revision history for this message
James Polley (tchaypo) wrote :

I've tried to follow Roman's steps above, but I can't reproduce this problem. I'm not sure if this is because I didn't quite do exactly the same thing though.

I ran a standard devtest build from trunk, but with --no-undercloud, so I've only got a seed and the overcloud.

9:04:17 0 130 polleyj@bill:~/.cache/tripleo (master)$ nova list
+--------------------------------------+-------------------------------------+--------+------------+-------------+--------------------+
| ID                                   | Name                                | Status | Task State | Power State | Networks           |
+--------------------------------------+-------------------------------------+--------+------------+-------------+--------------------+
| 2ccc4069-7801-4de2-8d22-a4da62aacb42 | overcloud-NovaCompute0-pok6xaae4p2j | ACTIVE | -          | Running     | ctlplane=192.0.2.3 |
| 4d99893a-be5f-48f9-8932-1123cdcaf3e0 | overcloud-NovaCompute1-fci7it3qq57q | ACTIVE | -          | Running     | ctlplane=192.0.2.6 |
| ba8dd7db-6189-4426-909d-84e63ec44c7b | overcloud-controller0-vyxuppnmkdf2  | ACTIVE | -          | Running     | ctlplane=192.0.2.4 |
+--------------------------------------+-------------------------------------+--------+------------+-------------+--------------------+

I sshed to 192.0.2.4 and used:

sudo ip netns exec qdhcp-8b8a6df3-f19f-4fa5-bed5-b13e5cbbe70c ping 192.0.2.3

to try to ping out the right interface.

To restart the OVS, I sshed into 192.0.2.3 and ran "service openvswitch-switch restart"

The ping running on the controller node didn't see any packets get dropped.

Roman, am I missing some step from your process?

Revision history for this message
Kyle Mestery (mestery) wrote :

Thanks for trying this out James! Roman, I'm also keen to see what we may have missed in your steps to reproduce, as I can't reproduce this either.

Revision history for this message
Ilya Shakhat (shakhat) wrote :

Roman, can you try the repro once again? The issue should be fixed by https://review.openstack.org/#/c/101447/

Revision history for this message
Ilya Shakhat (shakhat) wrote :

The issue that Roman complains about is tracked as bug https://bugs.launchpad.net/tripleo/+bug/1292105 and resolved by patch https://review.openstack.org/#/c/101447/

Changed in neutron:
status: Incomplete → Fix Committed
Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: juno-1 → 2014.2
Revision history for this message
Hua Zhang (zhhuabj) wrote :

Are you sure the patch https://review.openstack.org/#/c/101447/ resolves this issue? I can still hit this problem after confirming my env contains this patch.

Revision history for this message
James Polley (tchaypo) wrote : Re: [Bug 1290486] Re: neutron-openvswitch-agent does not recreate flows after ovsdb-server restarts

It definitely solved my problem - but it seems to have caused problems for other people.

There's a thread starting at http://lists.openstack.org/pipermail/openstack-dev/2014-October/049311.html - it continues into November and there's one last post from me in December.

