rootwrap daemon may return output of previous command

Bug #1654287 reported by Daniel Alvarez
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
neutron
Fix Released
High
Daniel Alvarez
oslo.rootwrap
Fix Released
Critical
IWAMOTO Toshihiro

Bug Description

The functional test for netns_cleanup has failed in the gate today [0].

Apparently, when trying to get the list of devices (ip_lib.get_devices() 'find /sys/class/net -maxdepth 1 -type 1 -printf %f') through rootwrap_daemon, it's getting the output of the previous command instead ('netstat -nlp'). This causes that the netns_cleanup module tries to unplug random devices which correspond to the actual output of the 'netstat' command.

This bug doesn't look related to the test itself but to rootwrap_daemon? Maybe due to long output to the netstat command?

Relevant part of the log

2017-01-05 12:17:04.609 27615 DEBUG neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-cf2030c6-c924-45bb-b13b-6774d275b394', 'netstat', '-nlp'] execute_rootwrap_daemon neutron/agent/linux/utils.py:108
2017-01-05 12:17:04.613 27615 DEBUG neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Exit code: 0 execute neutron/agent/linux/utils.py:149
2017-01-05 12:17:04.614 27615 DEBUG neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-cf2030c6-c924-45bb-b13b-6774d275b394', 'find', '/sys/class/net', '-maxdepth', '1', '-type', 'l', '-printf', '%f '] execute_rootwrap_daemon neutron/agent/linux/utils.py:108
2017-01-05 12:17:04.645 27615 DEBUG neutron.agent.ovsdb.native.vlog [-] [POLLIN] on fd 14 __log_wakeup /opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovs/poller.py:202
2017-01-05 12:17:04.686 27615 DEBUG neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Exit code: 0 execute neutron/agent/linux/utils.py:149
2017-01-05 12:17:04.688 27615 DEBUG neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-cf2030c6-c924-45bb-b13b-6774d275b394', 'ip', 'link', 'delete', 'Active'] execute_rootwrap_daemon neutron/agent/linux/utils.py:108
2017-01-05 12:17:04.746 27615 DEBUG neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Exit code: 0 execute neutron/agent/linux/utils.py:149
2017-01-05 12:17:04.747 27615 DEBUG neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-cf2030c6-c924-45bb-b13b-6774d275b394', 'ip', 'link', 'delete', 'Internet'] execute_rootwrap_daemon neutron/agent/linux/utils.py:108
2017-01-05 12:17:04.758 27615 DEBUG neutron.agent.ovsdb.native.vlog [-] [POLLIN] on fd 14 __log_wakeup /opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovs/poller.py:202
2017-01-05 12:17:04.815 27615 DEBUG neutron.agent.ovsdb.native.vlog [-] [POLLIN] on fd 14 __log_wakeup /opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovs/poller.py:202
2017-01-05 12:17:04.822 27615 DEBUG neutron.agent.ovsdb.native.vlog [-] [POLLIN] on fd 7 __log_wakeup /opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovs/poller.py:202
2017-01-05 12:17:04.822 27615 DEBUG neutron.agent.ovsdb.impl_idl [-] Running txn command(idx=0): InterfaceToBridgeCommand(name=Internet) do_commit neutron/agent/ovsdb/impl_idl.py:100
2017-01-05 12:17:04.823 27615 DEBUG neutron.agent.ovsdb.impl_idl [-] Transaction aborted do_commit neutron/agent/ovsdb/impl_idl.py:124
2017-01-05 12:17:04.824 27615 DEBUG neutron.cmd.netns_cleanup [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Unable to find bridge for device: Internet unplug_device neutron/cmd/netns_cleanup.py:138
2017-01-05 12:17:04.824 27615 DEBUG neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-cf2030c6-c924-45bb-b13b-6774d275b394', 'ip', 'link', 'delete', 'connections'] execute_rootwrap_daemon neutron/agent/linux/utils.py:108
....
2017-01-05 12:17:06.388 27615 DEBUG neutron.cmd.netns_cleanup [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Unable to find bridge for device: Path unplug_device neutron/cmd/netns_cleanup.py:138
2017-01-05 12:17:06.389 27615 DEBUG neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap daemon): ['ip', '-o', 'netns', 'list'] execute_rootwrap_daemon neutron/agent/linux/utils.py:108
2017-01-05 12:17:06.454 27615 ERROR neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Exit code: 1; Stdin: ; Stdout: ; Stderr: Cannot find device "Path"

2017-01-05 12:17:06.454 27615 ERROR neutron.cmd.netns_cleanup [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Error unable to destroy namespace: qrouter-cf2030c6-c924-45bb-b13b-6774d275b394
2017-01-05 12:17:06.454 27615 ERROR neutron.cmd.netns_cleanup Traceback (most recent call last):
2017-01-05 12:17:06.454 27615 ERROR neutron.cmd.netns_cleanup File "neutron/cmd/netns_cleanup.py", line 250, in destroy_namespace
2017-01-05 12:17:06.454 27615 ERROR neutron.cmd.netns_cleanup ip.garbage_collect_namespace()
2017-01-05 12:17:06.454 27615 ERROR neutron.cmd.netns_cleanup File "neutron/agent/linux/ip_lib.py", line 222, in garbage_collect_namespace
2017-01-05 12:17:06.454 27615 ERROR neutron.cmd.netns_cleanup if self.namespace and self.netns.exists(self.namespace):
2017-01-05 12:17:06.454 27615 ERROR neutron.cmd.netns_cleanup File "neutron/agent/linux/ip_lib.py", line 888, in exists
2017-01-05 12:17:06.454 27615 ERROR neutron.cmd.netns_cleanup run_as_root=cfg.CONF.AGENT.use_helper_for_ns_read)
2017-01-05 12:17:06.454 27615 ERROR neutron.cmd.netns_cleanup File "neutron/agent/linux/ip_lib.py", line 107, in _execute
2017-01-05 12:17:06.454 27615 ERROR neutron.cmd.netns_cleanup log_fail_as_error=log_fail_as_error)
2017-01-05 12:17:06.454 27615 ERROR neutron.cmd.netns_cleanup File "neutron/agent/linux/utils.py", line 147, in execute
2017-01-05 12:17:06.454 27615 ERROR neutron.cmd.netns_cleanup raise ProcessExecutionError(msg, returncode=returncode)
2017-01-05 12:17:06.454 27615 ERROR neutron.cmd.netns_cleanup ProcessExecutionError: Exit code: 1; Stdin: ; Stdout: ; Stderr: Cannot find device "Path"

[0] http://logs.openstack.org/51/396651/15/check/gate-neutron-dsvm-functional-ubuntu-xenial/5d268f0/logs/dsvm-functional-logs/neutron.tests.functional.cmd.test_netns_cleanup.NetnsCleanupTest.test_cleanup_network_namespaces_cleans_dhcp_and_l3_namespaces.txt.gz

Changed in neutron:
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/416998

Revision history for this message
Boden R (boden) wrote : Re: functional test netns_cleanup failing in gate

I was trying to determine how frequently this is occurring, but my logstash skills are weak.
If I search for "Error unable to destroy namespace" I get no hits in the last 30 days [1].
If I search for "Stderr: Cannot find device" I get a number of hits starting on 01/02/2017 [2].
It seems neither of these queries are showing a good view of the occurrences of this issue.

In either case, this is a gate failure and as such marking as high importance. If others want further justification as to the importance I dig deeper into the failures/logstash.

[1] http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Error%20unable%20to%20destroy%20namespace%5C%22
[2] http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Stderr%3A%20Cannot%20find%20device%5C%22

Changed in neutron:
importance: Undecided → High
Revision history for this message
Daniel Alvarez (dalvarezs) wrote :

After adding further logs we can see this at [1]. So looks like rootwrap-daemon is actually returning the output of the previous command:

2017-01-11 17:15:38.179 27681 DEBUG neutron.agent.linux.utils [req-45c79637-b3f6-4a55-87e9-4974410bf578 - - - - -] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-65b9d72c-685d-4646-9d54-19a63d9d968f', 'find', '/sys/class/net', '-maxdepth', '1', '-type', 'l', '-printf', '%f '] execute_rootwrap_daemon neutron/agent/linux/utils.py:108
2017-01-11 17:15:38.250 27681 DEBUG neutron.agent.linux.utils [req-45c79637-b3f6-4a55-87e9-4974410bf578 - - - - -] Exit code: 0 execute neutron/agent/linux/utils.py:149
2017-01-11 17:15:38.251 27681 DEBUG neutron.agent.linux.ip_lib [req-45c79637-b3f6-4a55-87e9-4974410bf578 - - - - -] get_devices(): Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
Active UNIX domain sockets (only servers)
Proto RefCnt Flags Type State I-Node PID/Program name Path
 get_devices neutron/agent/linux/ip_lib.py:140

[1] http://logs.openstack.org/73/373973/13/check/gate-neutron-dsvm-functional-ubuntu-xenial/20b6a30/logs/dsvm-functional-logs/neutron.tests.functional.cmd.test_netns_cleanup.NetnsCleanupTest.test_cleanup_network_namespaces_cleans_dhcp_and_l3_namespaces.txt.gz

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/419541

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/421325

Changed in neutron:
assignee: nobody → Daniel Alvarez (dalvarezs)
status: Confirmed → In Progress
Revision history for this message
Daniel Alvarez (dalvarezs) wrote : Re: functional test netns_cleanup failing in gate

I've been working with @jlibosva and we've determined that if eventlet.timeout.Timeout exception was raised when a process is reading the output of a command executed through rootwrap-daemon, then this output will be read by the next command. All susbsequent calls to that instance of rootwrap-daemon will read the previous command output instead of their own.

Although we have implemented a workaround for this specific case of netns-cleanup, we should make sure that there are no calls to rootwrap-daemon inside a wait_until_true loop and/or fix rootwrap-daemon to allow processes to be interrupted before they're done with reading the output.

tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Jakub Libosvar (<email address hidden>) on branch: master
Review: https://review.openstack.org/419541

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/422984

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/421325
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3f9f740d81b51be5e563069c720366fa90ade9ee
Submitter: Jenkins
Branch: master

commit 3f9f740d81b51be5e563069c720366fa90ade9ee
Author: Daniel Alvarez <email address hidden>
Date: Thu Jan 12 01:06:01 2017 +0000

    Fix netns_cleanup interrupted on rwd I/O

    Functional tests for netns_cleanup have been failing a few times
    in the gate lately. After thorough tests we've seen that the issue was
    related to using rootwrap-daemon inside a wait_until_true loop. When
    timeout fired while utils.execute() was reading from rootwrap-daemon,
    it got interrupted and the output of the last command was not read.
    Therefore, next calls to utils.execute() would read the output of
    their previous command rather than their own, leading to unexpected
    results.

    This fix will poll existing processes in the namespace without making
    use of the wait_until_true loop. Instead, it will check elapsed time
    and raise the exception if timeout is exceeded.

    Also, i'm removing debug traces introduced in
    327f7fc4d54bbaaed3778b5eb3c51a037a9a178f which helped finding the root
    cause of this bug.

    Change-Id: Ie233261e4be36eecaf6ec6d0532f0f5e2e996cd2
    Closes-Bug: #1654287

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/newton)

Reviewed: https://review.openstack.org/422984
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5ff38b7f306d1aa1046c129f317c962c6b7b93ea
Submitter: Jenkins
Branch: stable/newton

commit 5ff38b7f306d1aa1046c129f317c962c6b7b93ea
Author: Daniel Alvarez <email address hidden>
Date: Thu Jan 12 01:06:01 2017 +0000

    Fix netns_cleanup interrupted on rwd I/O

    Functional tests for netns_cleanup have been failing a few times
    in the gate lately. After thorough tests we've seen that the issue was
    related to using rootwrap-daemon inside a wait_until_true loop. When
    timeout fired while utils.execute() was reading from rootwrap-daemon,
    it got interrupted and the output of the last command was not read.
    Therefore, next calls to utils.execute() would read the output of
    their previous command rather than their own, leading to unexpected
    results.

    This fix will poll existing processes in the namespace without making
    use of the wait_until_true loop. Instead, it will check elapsed time
    and raise the exception if timeout is exceeded.

    Also, i'm removing debug traces introduced in
    327f7fc4d54bbaaed3778b5eb3c51a037a9a178f which helped finding the root
    cause of this bug.

    (cherry picked from commit 3f9f740d81b51be5e563069c720366fa90ade9ee)
    Change-Id: Ie233261e4be36eecaf6ec6d0532f0f5e2e996cd2
    Closes-Bug: #1654287

tags: added: in-stable-newton
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 10.0.0.0b3

This issue was fixed in the openstack/neutron 10.0.0.0b3 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 9.2.0

This issue was fixed in the openstack/neutron 9.2.0 release.

tags: removed: neutron-proactive-backport-potential
summary: - functional test netns_cleanup failing in gate
+ rootwrap daemon may return output of previous command
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/510161

Changed in oslo.rootwrap:
importance: Undecided → Critical
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :
Changed in oslo.rootwrap:
status: New → Confirmed
Ken Giusti (kgiusti)
Changed in oslo.rootwrap:
assignee: nobody → ChangBo Guo(gcb) (glongwave)
Revision history for this message
ChangBo Guo(gcb) (glongwave) wrote :

From the issue description, it shows the caller can't flush the reading socket used by rootwrap-daemon when caller has exception. A Test case in oslo.rootwrap is helpful to reproduce the issue. then we can try to fix it rootwrap by providing a way to flush data or document how to avoid the issues.

Changed in oslo.rootwrap:
assignee: ChangBo Guo(gcb) (glongwave) → nobody
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.rootwrap (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/514547

Revision history for this message
IWAMOTO Toshihiro (iwamoto) wrote :
Revision history for this message
IWAMOTO Toshihiro (iwamoto) wrote :

In reply to #15,

the root cause is that a connection to a rootwrap daemon is shared among eventlet threads.
Instructing users to flush data after an eventlet context switch is not a real solution.

multiprocessing seems to have been written in multithreading in mind, but unfortunately it is written for threading, which is a fork based thread implementation and isn't compatible with eventlet.

I think we have 2 options:

1. make a new connection every time when Client.execute is called.
   This is not easy as multiprocessing.BaseProxy has a class variable to share the connection, and it's buried deep in the multiprocessing lib.

2. protect Client.execute with a eventlet mutex against simultaneous calls.
   My assumption is that oslo.rootwrap doesn't want to depend on eventlet. eventlet detection can be a bit tricky.

Revision history for this message
YAMAMOTO Takashi (yamamoto) wrote :

Toshihiro,

for the option #2, why eventlet detection is necessary?
i guess you can use an arbitrary mechanism, say threading.Lock, and let eventlet patch it.

Changed in oslo.rootwrap:
assignee: nobody → IWAMOTO Toshihiro (iwamoto)
status: Confirmed → In Progress
Revision history for this message
IWAMOTO Toshihiro (iwamoto) wrote :

Not directly related. FYI:

https://github.com/eventlet/eventlet/issues/210
multiprocessing compatibility

Revision history for this message
IWAMOTO Toshihiro (iwamoto) wrote :

Possibly due to this bug.

http://logs.openstack.org/04/492404/17/check/legacy-neutron-dsvm-fullstack/849d6f6/logs/testr_results.html.gz

  File "neutron/tests/common/net_helpers.py", line 196, in get_unused_port
    end = int(port_range.split()[0]) - 1
ValueError: invalid literal for int() with base 10: 'State'

There are some more "invalid literal for int" query hits at logstash.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Jakub Libosvar (<email address hidden>) on branch: master
Review: https://review.openstack.org/510161
Reason: Abandoning in favor of https://review.openstack.org/#/c/514547/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.rootwrap (master)

Reviewed: https://review.openstack.org/514547
Committed: https://git.openstack.org/cgit/openstack/oslo.rootwrap/commit/?id=7711a6ce31c3ffa8249bfeac53a4d9e306a7299f
Submitter: Zuul
Branch: master

commit 7711a6ce31c3ffa8249bfeac53a4d9e306a7299f
Author: IWAMOTO Toshihiro <email address hidden>
Date: Tue Oct 24 16:27:13 2017 +0900

    Protect rootwrap daemon socket against multiple threads

    Wrap the call with eventlet.Semaphore. Simultaneous Client.execute
    calls can fail badly. Alternatively, rootwrap daemon connections
    could be made every time when Client.execute is called, without
    using a semaphore.

    Change-Id: Id9d38832c67f2d81d382cda797a48fee943a27f1
    Closes-bug: #1654287

Changed in oslo.rootwrap:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.rootwrap 5.11.0

This issue was fixed in the openstack/oslo.rootwrap 5.11.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.rootwrap (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/520753

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.rootwrap (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/520754

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.rootwrap (stable/pike)

Reviewed: https://review.openstack.org/520753
Committed: https://git.openstack.org/cgit/openstack/oslo.rootwrap/commit/?id=801938bbf0e061343dfb64786af52aae911e7b2f
Submitter: Zuul
Branch: stable/pike

commit 801938bbf0e061343dfb64786af52aae911e7b2f
Author: IWAMOTO Toshihiro <email address hidden>
Date: Tue Oct 24 16:27:13 2017 +0900

    Protect rootwrap daemon socket against multiple threads

    Wrap the call with eventlet.Semaphore. Simultaneous Client.execute
    calls can fail badly. Alternatively, rootwrap daemon connections
    could be made every time when Client.execute is called, without
    using a semaphore.

    Change-Id: Id9d38832c67f2d81d382cda797a48fee943a27f1
    Closes-bug: #1654287
    (cherry picked from commit 7711a6ce31c3ffa8249bfeac53a4d9e306a7299f)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.rootwrap (stable/ocata)

Reviewed: https://review.openstack.org/520754
Committed: https://git.openstack.org/cgit/openstack/oslo.rootwrap/commit/?id=3ab92c6091fb175973a96244c5c1a4d3226f7373
Submitter: Zuul
Branch: stable/ocata

commit 3ab92c6091fb175973a96244c5c1a4d3226f7373
Author: IWAMOTO Toshihiro <email address hidden>
Date: Tue Oct 24 16:27:13 2017 +0900

    Protect rootwrap daemon socket against multiple threads

    Wrap the call with eventlet.Semaphore. Simultaneous Client.execute
    calls can fail badly. Alternatively, rootwrap daemon connections
    could be made every time when Client.execute is called, without
    using a semaphore.

    Change-Id: Id9d38832c67f2d81d382cda797a48fee943a27f1
    Closes-bug: #1654287
    (cherry picked from commit 7711a6ce31c3ffa8249bfeac53a4d9e306a7299f)

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.rootwrap 5.4.2

This issue was fixed in the openstack/oslo.rootwrap 5.4.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.rootwrap 5.9.2

This issue was fixed in the openstack/oslo.rootwrap 5.9.2 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.