brutal stop of ovs-agent doesn't kill ryu controller

Bug #1663458 reported by Emilien Macchi on 2017-02-10
Affects: neutron | Importance: High | Assigned to: Ihar Hrachyshka
Affects: tripleo | Importance: High | Assigned to: Unassigned

Bug Description

It seems like when we kill neutron-ovs-agent and start it again, the ryu controller fails to start because the previous instance (in eventlet) is still running.

(... ovs agent is failing to start and is brutally killed)

Trying to start the process 5 minutes later:
INFO neutron.common.config [-] /usr/bin/neutron-openvswitch-agent version 10.0.0.0rc2.dev33
INFO ryu.base.app_manager [-] loading app neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp
INFO ryu.base.app_manager [-] loading app ryu.app.ofctl.service
INFO ryu.base.app_manager [-] loading app ryu.controller.ofp_handler
INFO ryu.base.app_manager [-] instantiating app neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp of OVSNeutronAgentRyuApp
INFO ryu.base.app_manager [-] instantiating app ryu.controller.ofp_handler of OFPHandler
INFO ryu.base.app_manager [-] instantiating app ryu.app.ofctl.service of OfctlService
ERROR ryu.lib.hub [-] hub: uncaught exception: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ryu/lib/hub.py", line 54, in _launch
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/ryu/controller/controller.py", line 97, in __call__
    self.ofp_ssl_listen_port)
  File "/usr/lib/python2.7/site-packages/ryu/controller/controller.py", line 120, in server_loop
    datapath_connection_factory)
  File "/usr/lib/python2.7/site-packages/ryu/lib/hub.py", line 117, in __init__
    self.server = eventlet.listen(listen_info)
  File "/usr/lib/python2.7/site-packages/eventlet/convenience.py", line 43, in listen
    sock.bind(addr)
  File "/usr/lib64/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 98] Address already in use
INFO neutron.agent.ovsdb.native.vlog [-] tcp:127.0.0.1:6640: connecting...
INFO neutron.agent.ovsdb.native.vlog [-] tcp:127.0.0.1:6640: connected
INFO neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_bridge [-] Bridge br-int has datapath-ID 0000badb62a6184f
ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [-] Switch connection timeout
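The "Address already in use" failure at the bottom of the traceback can be reproduced with two plain sockets; this sketch uses an arbitrary free port rather than Ryu's actual OpenFlow listen port:

```python
import errno
import socket

# First "controller" binds the port and keeps it open, the way a
# leftover listener from the killed agent would.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 16633))
first.listen(1)

# The restarted agent tries the same address and fails exactly like
# ryu's eventlet.listen() did in the traceback above.
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
caught = None
try:
    second.bind(("127.0.0.1", 16633))
except OSError as exc:
    caught = exc.errno
second.close()
first.close()

print(caught == errno.EADDRINUSE)  # → True (errno 98 on Linux)
```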

I haven't figured out yet how the previous instance of ovs agent was killed (my theory is that Puppet killed it but I don't have the killing code yet, I'll update the bug asap).

Emilien Macchi (emilienm) wrote:

On TripleO side, I'm trying to move neutron ovs agent to step 5, so we're sure it starts after neutron-server (which is run at step 4): https://review.openstack.org/#/c/431725/

Changed in tripleo:
status: New → Triaged
assignee: nobody → Emilien Macchi (emilienm)
milestone: none → ocata-rc1
importance: Undecided → Critical
tags: added: alert ci

The problem here is that the backoff RPC client raises the tunnel_sync timeout up to 480 seconds, and the agent does not respond to SIGTERM immediately, but only after the tunnel_sync call in flight returns. So systemd decides, after 90 seconds of waiting, to kill the agent with SIGKILL, which leaves the ryu green thread unkilled. In other words, the default systemd wait time for SIGTERM processing (90s) is much shorter than the maximum time to wait for an RPC reply (TRANSPORT.conf.rpc_response_timeout * 10). Either we raise the systemd wait time to that level, or we lower the ceiling for the backoff. Probably the former, which would mean Won't Fix for neutron and a change in RDO, but I need to think about it a bit more.
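As a sketch of the former option, an RDO-side systemd drop-in could raise the stop timeout above the worst-case RPC wait. The 620s value assumes the default rpc_response_timeout of 60s times the 10x backoff ceiling; the unit and file names are illustrative:

```ini
# /etc/systemd/system/neutron-openvswitch-agent.service.d/stop-timeout.conf
[Service]
# Wait longer than rpc_response_timeout * 10 (60s * 10 = 600s)
# before escalating SIGTERM to SIGKILL.
TimeoutStopSec=620
```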

tags: added: needs-attention ovs
removed: alert ci
Changed in neutron:
assignee: nobody → Ihar Hrachyshka (ihar-hrachyshka)

Another note: backoff client is present in Mitaka+, so all branches are affected. Not sure why you hit it with master only. Apparently TripleO changed service startup ordering so that the issue is now exposed?

Reviewed: https://review.openstack.org/431725
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=bb63f514d22ea82d17947a5972b4da16e66b5a36
Submitter: Jenkins
Branch: master

commit bb63f514d22ea82d17947a5972b4da16e66b5a36
Author: Emilien Macchi <email address hidden>
Date: Thu Feb 9 14:34:13 2017 -0500

    Run nova-cell_v2-discover_hosts at step 5

    We need to run nova-cell_v2-discover_hosts at the very end of the
    deployment because nova database needs to be aware of all registred
    compute hosts.

    1. Move keystone resources management at step 3.
    2. Move nova-compute service at step 4.
    3. Move nova-placement-api at step 3.
    5. Run nova-cell_v2-discover_hosts at step 5 on one nova-api node.
    6. Run neutron-ovs-agent at step 5 to avoid racy deployments where
       it starts before neutron-server when doing HA deployments.

    With that change, we expect Nova aware of all compute services deployed
    in TripleO during an initial deployment.

    Depends-On: If943157b2b4afeb640919e77ef0214518e13ee15
    Change-Id: I6f2df2a83a248fb5dc21c2bd56029eb45b66ceae
    Related-Bug: #1663273
    Related-Bug: #1663458

Changed in tripleo:
importance: Critical → High
milestone: ocata-rc1 → pike-1
tags: added: juno-backport-potential
Changed in tripleo:
assignee: Emilien Macchi (emilienm) → nobody

I removed juno-backport-potential because Juno is not supported by upstream. I hope it does not get in the way of how tripleo manages backports (?).

tags: removed: juno-backport-potential

From Neutron perspective, there are several points to look at:

1. it seems like the process does not give up the Ryu port / exit after SIGKILL is received. Is that normal kernel behavior? I would expect the process to be nuked, and hence all resources it held to be freed. Did it get into some state that doesn't guarantee resources are freed (zombie?)? Can we do anything about it on the Neutron side, like monitoring the parent's state in the Ryu thread and self-exiting if the parent's 'heartbeat' stops?

2. it's probably wrong that the agent doesn't get to exit because it is waiting for an RPC reply. We should explore ways to abort the agent without waiting for the backed-off timeout to expire.

3. do we use the right backoff ceiling? isn't 480 seconds too much to lock the whole agent?
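A minimal sketch of the self-exit idea from point 1, assuming it runs inside a helper child process (such as neutron-rootwrap-daemon) rather than a green thread, with a changed ppid serving as the parent "heartbeat":

```python
import os
import threading
import time

def watch_parent(poll_interval=1.0):
    """Exit the current process once its parent dies.

    After the parent is SIGKILLed, the child is reparented (its ppid
    changes, typically to 1), so polling os.getppid() works as a cheap
    parent heartbeat.
    """
    original_ppid = os.getppid()

    def _watch():
        while True:
            if os.getppid() != original_ppid:
                # Parent is gone: release our listen sockets by exiting.
                os._exit(1)
            time.sleep(poll_interval)

    watcher = threading.Thread(target=_watch, daemon=True)
    watcher.start()
    return watcher
```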

From RDO perspective, we may need to look at tuning systemd unit term/kill timeouts to reflect default backoff behavior.

I checked whether other services show similar behavior with regard to postponing shutdown until the backoff timeout is up. I was hoping that by using oslo.service and relying on its signal handlers, we could avoid waiting so long to shut down a service. Sadly, that doesn't seem to be the case: the L3 agent, which already uses the oslo.service library, still hangs for 480 secs after the signal to exit is received.

Related fix proposed to branch: master
Review: https://review.openstack.org/433276

Changed in neutron:
status: New → Confirmed
status: Confirmed → In Progress
importance: Undecided → High
IWAMOTO Toshihiro (iwamoto) wrote:

SIGKILL leaves the neutron-rootwrap-daemon started by ovs-agent running, and it is still listening on port 6633.

It has been fixed in oslo.rootwrap.

https://bugs.launchpad.net/oslo.rootwrap/+bug/1658973
https://review.openstack.org/#/c/425069/
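This follows from basic process semantics: SIGKILL terminates only the targeted process, not its children, so a helper like neutron-rootwrap-daemon keeps running and keeps its sockets. A minimal demonstration with plain subprocesses (not neutron code):

```python
import os
import signal
import subprocess
import sys
import time

# "Parent" spawns a long-lived child, the way ovs-agent spawns
# neutron-rootwrap-daemon, and reports the child's pid.
parent = subprocess.Popen(
    [sys.executable, "-c",
     "import subprocess, time;"
     "c = subprocess.Popen(['sleep', '60']);"
     "print(c.pid, flush=True);"
     "time.sleep(60)"],
    stdout=subprocess.PIPE,
)
child_pid = int(parent.stdout.readline())

# SIGKILL the parent, as systemd eventually does to the agent.
parent.send_signal(signal.SIGKILL)
parent.wait()
time.sleep(0.5)

# The orphaned child is still alive (signal 0 only probes existence).
try:
    os.kill(child_pid, 0)
    survived = True
except OSError:
    survived = False
print("orphaned child still running:", survived)

os.kill(child_pid, signal.SIGKILL)  # clean up the orphan
```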

IWAMOTO Toshihiro (iwamoto) wrote:

It is easily reproducible with devstack and sigkilling ovs-agent.

Change abandoned by Kevin Benton (<email address hidden>) on branch: master
Review: https://review.openstack.org/433276
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Reviewed: https://review.openstack.org/432481
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=aac17c6be90c2640ce9df4b02027d8fc01944fd8
Submitter: Jenkins
Branch: master

commit aac17c6be90c2640ce9df4b02027d8fc01944fd8
Author: Ihar Hrachyshka <email address hidden>
Date: Thu Feb 9 06:43:55 2017 +0000

    Reconcile quitting_rpc_timeout with backoff RPC client

    With backoff client, setting .timeout property on it doesn't take any
    effect. It means that starting from Mitaka, we broke
    quitting_rpc_timeout option.

    Now, when the TERM signal is received, we reset the dict capturing
    per-method timeouts; and we cap waiting times by the value of the
    option. This significantly reduces time needed for the agent to
    gracefully shut down.

    Change-Id: I2d86ed7a6f337395bfcfdb0698ec685cf384f172
    Related-Bug: #1663458
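The approach this fix describes (on SIGTERM, drop the learned per-method timeouts and cap future ones by quitting_rpc_timeout) can be sketched with a toy backoff client; all names here are illustrative, not neutron's actual classes:

```python
class BackoffClient:
    """Toy RPC client whose per-method timeouts back off after each
    failure, illustrating the capping fix."""

    def __init__(self, default_timeout=60, max_timeout=600):
        self.default_timeout = default_timeout
        self.max_timeout = max_timeout
        self._method_timeouts = {}  # method name -> learned timeout

    def timeout_for(self, method):
        learned = self._method_timeouts.get(method, self.default_timeout)
        return min(learned, self.max_timeout)

    def record_timeout(self, method):
        # Double the timeout after each failure, up to the ceiling.
        self._method_timeouts[method] = min(
            self.timeout_for(method) * 2, self.max_timeout)

    def set_max_timeout(self, ceiling):
        # What the fix does on SIGTERM: lower the ceiling to
        # quitting_rpc_timeout and forget the learned timeouts.
        self.max_timeout = min(self.max_timeout, ceiling)
        self._method_timeouts.clear()


client = BackoffClient()
for _ in range(5):
    client.record_timeout("tunnel_sync")
print(client.timeout_for("tunnel_sync"))  # 600: fully backed off

client.set_max_timeout(10)                # SIGTERM received
print(client.timeout_for("tunnel_sync"))  # 10: capped for shutdown
```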

Changed in neutron:
milestone: none → pike-1

Change abandoned by Ihar Hrachyshka (<email address hidden>) on branch: master
Review: https://review.openstack.org/433276
Reason: Not feeling great about the approach. We may want to work with oslo.messaging folks to make RPC communication interruptable (bugs already reported). With rootwrap fix it should be less of an issue.

The agent now reduces the timeout for RPC requests on shutdown. It doesn't affect requests already in flight, so there is still an issue, but that should first be solved in oslo.messaging, for which bug 1672836 was reported. We may revisit how we interrupt RPC communication in the future when we have support for that in oslo. For now, let's close the bug.

Changed in neutron:
status: In Progress → Fix Released
tags: removed: needs-attention

Reviewed: https://review.openstack.org/448728
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=33a898ad13a2efbc4bdeb90a06f94b3550c1f8d0
Submitter: Jenkins
Branch: stable/newton

commit 33a898ad13a2efbc4bdeb90a06f94b3550c1f8d0
Author: Ihar Hrachyshka <email address hidden>
Date: Thu Feb 9 06:43:55 2017 +0000

    Reconcile quitting_rpc_timeout with backoff RPC client

    With backoff client, setting .timeout property on it doesn't take any
    effect. It means that starting from Mitaka, we broke
    quitting_rpc_timeout option.

    Now, when the TERM signal is received, we reset the dict capturing
    per-method timeouts; and we cap waiting times by the value of the
    option. This significantly reduces time needed for the agent to
    gracefully shut down.

    Change-Id: I2d86ed7a6f337395bfcfdb0698ec685cf384f172
    Related-Bug: #1663458
    (cherry picked from commit aac17c6be90c2640ce9df4b02027d8fc01944fd8)

tags: added: in-stable-newton

Reviewed: https://review.openstack.org/448727
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=10f6e89e82edbdf8b44829527ca12bc2574ee472
Submitter: Jenkins
Branch: stable/ocata

commit 10f6e89e82edbdf8b44829527ca12bc2574ee472
Author: Ihar Hrachyshka <email address hidden>
Date: Thu Feb 9 06:43:55 2017 +0000

    Reconcile quitting_rpc_timeout with backoff RPC client

    With backoff client, setting .timeout property on it doesn't take any
    effect. It means that starting from Mitaka, we broke
    quitting_rpc_timeout option.

    Now, when the TERM signal is received, we reset the dict capturing
    per-method timeouts; and we cap waiting times by the value of the
    option. This significantly reduces time needed for the agent to
    gracefully shut down.

    Change-Id: I2d86ed7a6f337395bfcfdb0698ec685cf384f172
    Related-Bug: #1663458
    (cherry picked from commit aac17c6be90c2640ce9df4b02027d8fc01944fd8)

tags: added: in-stable-ocata
Changed in tripleo:
milestone: pike-1 → pike-2
Changed in tripleo:
milestone: pike-2 → pike-3
Changed in tripleo:
milestone: pike-3 → pike-rc1
Ben Nemec (bnemec) wrote:

It looks like this was fixed a while ago. Feel free to reopen if I'm mistaken.

Changed in tripleo:
status: Triaged → Fix Released