Restarting neutron-openvswitch-agent gives ERROR "Switch connection timeout"

Bug #1611237 reported by yujie
This bug affects 2 people
Affects     Status         Importance   Assigned to          Milestone
devstack    Invalid        Undecided    Unassigned
neutron     Fix Released   High         IWAMOTO Toshihiro

Bug Description

Environment: devstack master, ubuntu 14.04

After ./stack.sh finishes, kill the neutron-openvswitch-agent process and then restart it with: /usr/bin/python /usr/local/bin/neutron-openvswitch-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini

The log shows:
2016-08-08 11:02:06.346 ERROR ryu.lib.hub [-] hub: uncaught exception: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/ryu/lib/hub.py", line 54, in _launch
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/ryu/controller/controller.py", line 97, in __call__
    self.ofp_ssl_listen_port)
  File "/usr/local/lib/python2.7/dist-packages/ryu/controller/controller.py", line 120, in server_loop
    datapath_connection_factory)
  File "/usr/local/lib/python2.7/dist-packages/ryu/lib/hub.py", line 117, in __init__
    self.server = eventlet.listen(listen_info)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/convenience.py", line 43, in listen
    sock.bind(addr)
  File "/usr/lib/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 98] Address already in use

and
ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [-] Switch connection timeout

In Kilo I could start the ovs-agent this way without problems; I do not know whether this is the right way to start the ovs-agent on master.
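
A quick way to confirm the symptom is to check whether the native of_interface listen socket is still held by the old process. A minimal sketch, assuming the usual defaults of of_listen_address=127.0.0.1 and of_listen_port=6633 from ml2_conf.ini (adjust if yours differ):

import socket

# Try to bind the address the agent's built-in OpenFlow controller listens on.
# 127.0.0.1:6633 is an assumption based on the usual defaults, not taken from
# this bug report; change it to match your [ovs] section if needed.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    sock.bind(("127.0.0.1", 6633))
except socket.error as err:
    # A leftover agent/Ryu thread still holds the socket, so a new agent
    # hits "Address already in use" and then the switch connection timeout.
    print("port still in use: %s" % err)
else:
    print("port is free; a fresh agent should be able to listen")
finally:
    sock.close()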

Revision history for this message
Brian Haley (brian-haley) wrote :

Are you doing this from inside the screen session?

$ screen -r
$ ctrl-shift-A-"
find q-agt
^c
(last command, return)

That worked fine for me just now on a 2-3 day old devstack from master.

Revision history for this message
Nate Johnston (nate-johnston) wrote :

I just fired up a fresh devstack from master and had the same experience as Brian.

Revision history for this message
Brian Haley (brian-haley) wrote :

Oh, you are using Ryu; I am just running the "regular" OVS agent.

Revision history for this message
yujie (16189455-d) wrote :

@Brian Haley, I used ps -ef | grep neutron-openvswitch-agent to find the process and killed it; I did not do it from the screen session.

I just want to run the regular ovs agent, but after devstack's ./stack.sh finishes, it uses Ryu automatically.

Revision history for this message
Shashank Kumar Shankar (shashank-kumar-shankar) wrote :

I had the same error; a fresh ./unstack and ./stack fixed the issue.

Revision history for this message
yujie (16189455-d) wrote :

Thanks Shashank. Sometimes I need to change code and look at the effect; if every change requires re-running ./stack, it will be difficult to work with.

Revision history for this message
Brian Haley (brian-haley) wrote :

@Yujie - since it is the Ryu code complaining, you will need to work with them; that code is not in the neutron tree.

Changed in neutron:
status: New → Invalid
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

I am reopening the bug; we use the Ryu local controller by default for the OVS agent, and we probably forget to kill the controller before exiting from the agent.

Changed in neutron:
status: Invalid → New
Revision history for this message
Brian Haley (brian-haley) wrote :

If the submitter has more information, can you please add it so we can try to track down this failure?

Changed in neutron:
status: New → Incomplete
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Brian said in IRC he could not reproduce the issue locally; please provide detailed steps to reproduce the issue.

Revision history for this message
Victor Morales (electrocucaracha) wrote :
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :
Changed in neutron:
status: Incomplete → Confirmed
importance: Undecided → High
tags: added: ovs
Revision history for this message
Victor Morales (electrocucaracha) wrote :

We have detected the same issue during the execution of a grenade job. The following errors appeared while the neutron-openvswitch-agent was starting:

2016-11-07 19:37:42.028 28353 ERROR neutron.agent.linux.utils [req-76bce4d1-4005-4227-bfe4-c5d46ebe29b1 - -] Exit code: 255; Stdin: ; Stdout: ; Stderr: sysctl: cannot stat /proc/sys/net/bridge/bridge-nf-call-arptables: No such file or directory
...
2016-11-07 19:49:37.564 28353 ERROR ryu.lib.hub [req-76bce4d1-4005-4227-bfe4-c5d46ebe29b1 - -] hub: uncaught exception: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/ryu/lib/hub.py", line 54, in _launch
    return func(*args, **kwargs)
  File "/opt/stack/old/neutron/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ovs_ryuapp.py", line 37, in agent_main_wrapper
    ovs_agent.main(bridge_classes)
  File "/opt/stack/old/neutron/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 2175, in main
    agent.daemon_loop()
  File "/usr/local/lib/python2.7/dist-packages/osprofiler/profiler.py", line 154, in wrapper
    return f(*args, **kwargs)
  File "/opt/stack/old/neutron/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 2096, in daemon_loop
    self.rpc_loop(polling_manager=pm)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/opt/stack/old/neutron/neutron/agent/linux/polling.py", line 42, in get_polling_manager
    pm.stop()
  File "/opt/stack/old/neutron/neutron/agent/linux/polling.py", line 61, in stop
    self._monitor.stop()
  File "/opt/stack/old/neutron/neutron/agent/linux/async_process.py", line 129, in stop
    self._kill(kill_signal)
  File "/opt/stack/old/neutron/neutron/agent/linux/async_process.py", line 163, in _kill
    pid = self.pid
  File "/opt/stack/old/neutron/neutron/agent/linux/async_process.py", line 159, in pid
    run_as_root=self.run_as_root)
  File "/opt/stack/old/neutron/neutron/agent/linux/utils.py", line 240, in get_root_helper_child_pid
    pid = find_child_pids(pid)[0]
  File "/opt/stack/old/neutron/neutron/agent/linux/utils.py", line 181, in find_child_pids
    return []
  File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/old/neutron/neutron/agent/linux/utils.py", line 173, in find_child_pids
    log_fail_as_error=False)
  File "/opt/stack/old/neutron/neutron/agent/linux/utils.py", line 144, in execute
    raise ProcessExecutionError(msg, returncode=returncode)
ProcessExecutionError: Exit code: -15; Stdin: ; Stdout: ; Stderr: Signal 15 (TERM) caught by ps (procps-ng version 3.3.10).
ps:display.c:66: please report this bug

2016-11-07 19:50:50.460 28353 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [-] Failed reporting state!

This seems to keep the port's address in use and affects the target phase in grenade:

2016-11-07 19:54:54.806 11875 ERROR ryu.lib.hub [-] hub: uncaught exception: Traceback (most recent call last):
  File "/usr/local/lib/pyth...

Revision history for this message
Brian Haley (brian-haley) wrote :

I found this bug:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=732410

And I'm assuming Ubuntu has picked that up in their 3.3.10?

I guess it's starting to look less like a neutron bug when ps is failing.

Revision history for this message
Brian Haley (brian-haley) wrote :

And this is the commit from the git repo:

commit d06aaaaf2bd8f3b5f0235e75f4f04c0ad69c7d6d
Author: Craig Small <email address hidden>
Date: Tue Jan 14 22:23:58 2014 +1100

    ps: ignore SIGCONT

    SIGCONT is a continue signal. It seems that some zsh setups can send
    this signal, causing ps to abort. This is not what "continue" means.
    This change just uses the default handler which will continue a stopped
    process.

    References:
      http://bugs.debian.org/732410
      http://www.zsh.org/cgi-bin/mla/redirect?WORKERNUMBER=32251

    Signed-off-by: Craig Small <email address hidden>

diff --git a/NEWS b/NEWS
index 1c710a3..a2afaa3 100644
--- a/NEWS
+++ b/NEWS
@@ -1,6 +1,7 @@
 procps-ng-3.3.10
 ----------------
   * sysctl --system loads default config file - Debian #732920
+ * ps doesn't exit on SIGCONT

 procps-ng-3.3.9
 ---------------
diff --git a/ps/display.c b/ps/display.c
index c20285d..693154b 100644
--- a/ps/display.c
+++ b/ps/display.c
@@ -563,6 +563,7 @@ int main(int argc, char *argv[]){
     default:
       sigaction(i,&sa,NULL);
     case 0:
+ case SIGCONT:
     case SIGINT: /* ^C */
     case SIGTSTP: /* ^Z */
     case SIGTTOU: /* see stty(1) man page */

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Probably related Ubuntu bug: https://bugs.launchpad.net/ubuntu/+source/procps/+bug/1055551 Sadly, there seems to be no traction there for years.

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

This will probably be fixed by the switch to Xenial, where we have a fixed 2:3.3.10-4ubuntu2 package.

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

I have checked whether EL7 includes the fix. It does, as does any 3.3.10+ package, because the fix was included in the upstream tarball.

Changed in neutron:
status: Confirmed → Invalid
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :
Changed in neutron:
status: Invalid → Confirmed
Revision history for this message
Sean M. Collins (scollins) wrote :
Revision history for this message
Brian Haley (brian-haley) wrote :

Also, the newest error is with SIGTERM:

ProcessExecutionError: Exit code: -15; Stdin: ; Stdout: ; Stderr: Signal 15 (TERM) caught by ps (procps-ng version 3.3.10).
ps:display.c:66: please report this bug

It could be that procps-ng needs to catch that as well; the latest code doesn't do it.

And we're not the first to see it:

http://www.freelists.org/post/procps/procpsng-nit

Revision history for this message
IWAMOTO Toshihiro (iwamoto) wrote :

The following script, which mimics how devstack and grenade stop q-agt, can reproduce this bug a few percent of the time.

The ovs agent must be explicitly shut down (see neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp.agent_main_wrapper). If that thread dies of an exception, the agent fails to terminate. This can be confirmed by the guru meditation report in the comment #19 log.

sudo pkill -9 -f /usr/local/bin/neutron-rootwrap-daemon
pkill -g $(cat /opt/stack/status/stack/q-agt.pid )
sleep 1
pkill -g $(cat /opt/stack/status/stack/q-agt.pid )

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/400581

Changed in neutron:
assignee: nobody → IWAMOTO Toshihiro (iwamoto)
status: Confirmed → In Progress
Changed in neutron:
assignee: IWAMOTO Toshihiro (iwamoto) → Jakub Libosvar (libosvar)
Changed in neutron:
assignee: Jakub Libosvar (libosvar) → IWAMOTO Toshihiro (iwamoto)
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Why does grenade/devstack stop processes with SIGKILL? I believe it should be SIGTERM, no?

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Added the devstack project so it can consider changing the way we kill services.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/400581
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3e45c19eedccd800d7c6e5f0fce57295cf3a7b02
Submitter: Jenkins
Branch: master

commit 3e45c19eedccd800d7c6e5f0fce57295cf3a7b02
Author: IWAMOTO Toshihiro <email address hidden>
Date: Tue Nov 22 16:59:54 2016 +0900

    ovs-agent: Catch exceptions in agent_main_wrapper

    When of_interface=native, the ovs agent code is run as a ryuapp thread,
    which means it must be properly shut down or the process fails to
    terminate. Catch exceptions and make sure that the agent terminates,
    even if in unlucky cases.

    Change-Id: I7aebeaa00e2416a275d9ecd940eb28c819349656
    Closes-Bug: #1611237
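
For illustration, a minimal, self-contained sketch of the pattern this commit describes; run_agent and close_ryu_app are stand-ins, not the real neutron or Ryu calls. The point is simply that if the agent thread dies of an exception, the Ryu app must still be torn down so its threads stop and the process can exit:

import logging

LOG = logging.getLogger(__name__)

def run_agent():
    # Stand-in for ovs_agent.main(); raise to simulate the unlucky case
    # where a killed ps turns into a ProcessExecutionError.
    raise RuntimeError("simulated agent failure")

def close_ryu_app():
    # Stand-in for closing the Ryu app manager so all of its threads stop.
    LOG.warning("closing the Ryu app so the process can terminate")

def agent_main_wrapper():
    try:
        run_agent()
    except Exception:
        LOG.exception("Agent main thread died of an exception")
        # Without this teardown the remaining Ryu threads keep the process
        # alive and the OpenFlow listen socket stays bound.
        close_ryu_app()

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    agent_main_wrapper()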

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/401907

Revision history for this message
IWAMOTO Toshihiro (iwamoto) wrote :

If an agent were killed by SIGKILL, we wouldn't see this bug.

Correct shutdown sequence:

1. ovs-agent receives SIGTERM; _handle_sigterm is called as a signal handler, setting the catch_sigterm flag.
2. Control exits from rpc_loop's while loop, causing main() to terminate.
3. app_manager.AppManager.get_instance().close is called to clean up all ryu threads.

The bug situation:

1. devstack sends SIGTERM to the process *group* of ovs-agent (note this happens twice).
2. The ovs-agent is respawning its helper processes, probably due to the above SIGTERM.
3. If you are very unlucky, a ps process spawned to confirm process existence is killed, causing a ProcessExecutionError.
4. The thread running agent_main_wrapper is terminated by the exception, without cleaning up the other ryu threads.
   Note: if of_interface=ovs-ofctl, an exception will terminate the agent.
5. Because SIGTERM is only handled by a signal handler that sets a flag, and the loop that would act on the flag is gone, the ovs-agent fails to terminate. (A minimal sketch of this flag-based shutdown follows.)
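
A minimal, self-contained sketch of the flag-based shutdown described above; catch_sigterm, _handle_sigterm and rpc_loop mirror the agent's names, but everything else here is simplified. SIGTERM only sets a flag, so if the thread that checks the flag has already died, nothing ever acts on the signal:

import os
import signal
import threading
import time

class MiniAgent(object):
    def __init__(self):
        self.catch_sigterm = False
        signal.signal(signal.SIGTERM, self._handle_sigterm)

    def _handle_sigterm(self, signum, frame):
        # Only sets a flag; a still-running rpc_loop must notice it.
        self.catch_sigterm = True

    def rpc_loop(self):
        while not self.catch_sigterm:
            time.sleep(0.2)  # stand-in for one iteration of agent work
        print("SIGTERM seen, exiting loop and cleaning up")

if __name__ == "__main__":
    agent = MiniAgent()
    # Send ourselves a SIGTERM shortly after startup to show the clean path.
    threading.Timer(1.0, os.kill, args=(os.getpid(), signal.SIGTERM)).start()
    agent.rpc_loop()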

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/402569

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/newton)

Reviewed: https://review.openstack.org/401907
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c0ef390676e385169f56335c256aef736d7bb946
Submitter: Jenkins
Branch: stable/newton

commit c0ef390676e385169f56335c256aef736d7bb946
Author: IWAMOTO Toshihiro <email address hidden>
Date: Tue Nov 22 16:59:54 2016 +0900

    ovs-agent: Catch exceptions in agent_main_wrapper

    When of_interface=native, the ovs agent code is run as a ryuapp thread,
    which means it must be properly shut down or the process fails to
    terminate. Catch exceptions and make sure that the agent terminates,
    even if in unlucky cases.

    Change-Id: I7aebeaa00e2416a275d9ecd940eb28c819349656
    Closes-Bug: #1611237
    (cherry picked from commit 3e45c19eedccd800d7c6e5f0fce57295cf3a7b02)

tags: added: in-stable-newton
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/newton)

Related fix proposed to branch: stable/newton
Review: https://review.openstack.org/404026

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/402569
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=949bce960d04ca541387aac9c787ca04af496df8
Submitter: Jenkins
Branch: master

commit 949bce960d04ca541387aac9c787ca04af496df8
Author: Jakub Libosvar <email address hidden>
Date: Fri Nov 25 08:15:43 2016 -0500

    ovs-agent: Close ryu app on all exceptions

    Previous patch closes app only when ovs-agent raises an exception. This
    leaves some corner cases where exceptions inheriting from BaseException
    are raised. It's better to be defensive and always close app on error.

    Change-Id: Icaaaecc4d00e3a280c3af2e403499bb7ac9e8aa6
    Related-bug: 1611237
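
A small plain-Python illustration of the corner case this follow-up covers (not the neutron code itself): SystemExit and KeyboardInterrupt derive from BaseException rather than Exception, and eventlet's GreenletExit behaves the same way, so an "except Exception" handler never sees them; only a path that always runs guarantees the Ryu app still gets closed:

def close_ryu_app():
    print("closing the Ryu app")

def run_agent():
    # Simulate something BaseException-derived escaping the agent thread.
    raise SystemExit("killed")

try:
    try:
        run_agent()
    except Exception:
        # Never reached: SystemExit is not a subclass of Exception.
        print("handled by 'except Exception'")
        close_ryu_app()
    finally:
        # Reached no matter what was raised, which is the defensive
        # behaviour the patch describes.
        close_ryu_app()
except BaseException as exc:
    print("escaped 'except Exception': %s" % type(exc).__name__)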

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/newton)

Reviewed: https://review.openstack.org/404026
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e4890a37e43af0c84ebbfc5484e398771d0e91db
Submitter: Jenkins
Branch: stable/newton

commit e4890a37e43af0c84ebbfc5484e398771d0e91db
Author: Jakub Libosvar <email address hidden>
Date: Fri Nov 25 08:15:43 2016 -0500

    ovs-agent: Close ryu app on all exceptions

    Previous patch closes app only when ovs-agent raises an exception. This
    leaves some corner cases where exceptions inheriting from BaseException
    are raised. It's better to be defensive and always close app on error.

    Change-Id: Icaaaecc4d00e3a280c3af2e403499bb7ac9e8aa6
    Related-bug: 1611237

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 10.0.0.0b2

This issue was fixed in the openstack/neutron 10.0.0.0b2 development milestone.

tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/421953

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/mitaka)

Related fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/421955

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/mitaka)

Reviewed: https://review.openstack.org/421953
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a1ca8ee5a1cca549cd294295a6af2a396b9b3957
Submitter: Jenkins
Branch: stable/mitaka

commit a1ca8ee5a1cca549cd294295a6af2a396b9b3957
Author: IWAMOTO Toshihiro <email address hidden>
Date: Tue Nov 22 16:59:54 2016 +0900

    ovs-agent: Catch exceptions in agent_main_wrapper

    When of_interface=native, the ovs agent code is run as a ryuapp thread,
    which means it must be properly shut down or the process fails to
    terminate. Catch exceptions and make sure that the agent terminates,
    even if in unlucky cases.

    Change-Id: I7aebeaa00e2416a275d9ecd940eb28c819349656
    Closes-Bug: #1611237
    (cherry picked from commit 3e45c19eedccd800d7c6e5f0fce57295cf3a7b02)

tags: added: in-stable-mitaka
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/mitaka)

Reviewed: https://review.openstack.org/421955
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7e616cb39cd3394aae454ee6a7869811de0f9ec3
Submitter: Jenkins
Branch: stable/mitaka

commit 7e616cb39cd3394aae454ee6a7869811de0f9ec3
Author: Jakub Libosvar <email address hidden>
Date: Fri Nov 25 08:15:43 2016 -0500

    ovs-agent: Close ryu app on all exceptions

    Previous patch closes app only when ovs-agent raises an exception. This
    leaves some corner cases where exceptions inheriting from BaseException
    are raised. It's better to be defensive and always close app on error.

    Change-Id: Icaaaecc4d00e3a280c3af2e403499bb7ac9e8aa6
    Related-bug: 1611237
    (cherry picked from commit 949bce960d04ca541387aac9c787ca04af496df8)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 9.2.0

This issue was fixed in the openstack/neutron 9.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 8.4.0

This issue was fixed in the openstack/neutron 8.4.0 release.

Revision history for this message
Robert Davidson (rdavidso) wrote :

Hmm - we're running Neutron 9.2.0 but still having this problem:

2017-04-19 20:14:48.481 25633 ERROR ryu.lib.hub [-] hub: uncaught exception: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ryu/lib/hub.py", line 54, in _launch
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/ryu/controller/controller.py", line 97, in __call__
    self.ofp_ssl_listen_port)
  File "/usr/lib/python2.7/site-packages/ryu/controller/controller.py", line 120, in server_loop
    datapath_connection_factory)
  File "/usr/lib/python2.7/site-packages/ryu/lib/hub.py", line 117, in __init__
    self.server = eventlet.listen(listen_info)
  File "/usr/lib/python2.7/site-packages/eventlet/convenience.py", line 43, in listen
    sock.bind(addr)
  File "/usr/lib64/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 98] Address already in use

[root@hostname network-scripts]# rpm -qa | grep neutron
python-neutron-lib-0.4.0-1.el7.noarch
openstack-neutron-metering-agent-9.2.0-1.el7.noarch
openstack-neutron-fwaas-9.0.0-1.el7.noarch
python-neutron-9.2.0-1.el7.noarch
openstack-neutron-9.2.0-1.el7.noarch
openstack-neutron-lbaas-9.2.0-1.el7.noarch
python2-neutronclient-6.0.0-2.el7.noarch
openstack-neutron-common-9.2.0-1.el7.noarch
python-neutron-lbaas-9.2.0-1.el7.noarch
openstack-neutron-openvswitch-9.2.0-1.el7.noarch
python-neutron-fwaas-9.0.0-1.el7.noarch
openstack-neutron-ml2-9.2.0-1.el7.noarch

And we definitely have the fix from this ticket when I go and look at the ovs_ryuapp.py, so it's not a case of us having a bad package.

Revision history for this message
Sean Dague (sdague) wrote :

Automatically discovered version kilo in description. If this is incorrect, please update the description to include 'https://api.launchpad.net/1.0/devstack version: ...'

Revision history for this message
Sean Dague (sdague) wrote :

Automatically discovered version kilo in description. If this is incorrect, please update the description to include 'devstack version: ...'

Sean Dague (sdague)
tags: added: openstack-version.kilo
Changed in devstack:
status: New → Invalid