Fullstack linux bridge agent sometimes refuses to die during test clean up, failing the test

Bug #1558819 reported by Assaf Muller
This bug affects 2 people
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Assaf Muller

Bug Description

Paste of failure:
http://paste.openstack.org/show/491014/

When you look at the LB agent logs, you start seeing RPC errors because neutron-server is unable to access the DB. What happens is that fullstack times out trying to kill the LB agent and moves on to the other cleanups. It deletes the DB for the test, but the agents and neutron-server live on, which results in errors when they try to access the DB. The DB errors are a symptom rather than the cause: the root cause is that the agent refuses to die, for a reason that is not yet known.

The code that tries to stop the agent is AsyncProcess.stop(block=True, signal=9).
Another detail that might be relevant is that the agent lives in a namespace.

To reproduce locally, go to the VM running the fullstack tests and load all CPUs to 100%, then run:
tox -e dsvm-fullstack TestLinuxBridgeConnectivitySameNetwork

Revision history for this message
Assaf Muller (amuller) wrote :

We had the same symptom with the OVS agent, which was fixed via https://review.openstack.org/#/c/234770/. At this point I have absolutely no idea whether it's related.

tags: added: linuxbridge
Changed in neutron:
milestone: none → newton-1
Assaf Muller (amuller)
description: updated
Revision history for this message
Assaf Muller (amuller) wrote :

I started looking into this. When trying to kill the Linuxbridge agent, we use AsyncProcess.stop, which calls _kill_process. It looks like in failed runs it's trying to kill the wrong pid. I added a print statement, and the pid it's trying to kill does not exist, while the linuxbridge agent is alive and well with a different pid. We get a RuntimeError saying that the process was not found, and then utils.wait_until_true(lambda: not self.is_active()) times out, because it does wait on the correct pid.

Changed in neutron:
assignee: nobody → Slawek Kaplonski (slaweq)
Revision history for this message
John Schwarz (jschwarz) wrote :

Interesting that the pid doesn't exist. Off the top of my head, AFAIR, async_process has something that can trace child processes, etc. Perhaps the LinuxBridge agent is daemonizing, killing the parent process in the process and preventing async_process from working properly?

Revision history for this message
Assaf Muller (amuller) wrote :

@John, keep in mind that this sometimes works but sometimes doesn't.

Revision history for this message
Assaf Muller (amuller) wrote :

Under normal operation, pstree returns 'sudo ip netns...' > 'linuxbridge agent'. Typically get_root_helper_child_pid returns the child pid (the linuxbridge agent), we kill that pid, and all is well. I added a bunch of prints and found out that when it fails, get_root_helper_child_pid returns a child of linuxbridge_agent (??), 'ps -p %s' doesn't find anything for that pid, we try to kill it (which obviously fails) and the agent lives on. I'm currently not sure which child get_root_helper_child_pid is finding. Could it be just any process that the LB agent happens to have executed at the moment we checked for its children, like 'ip' / 'brctl' commands and so on?
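
The suspected race can be illustrated with a small simulation. This is a minimal sketch, not neutron's code: the process table, the pids and the helper names (find_children, deepest_child) are hypothetical, and the only assumption taken from the comment above is that the lookup keeps descending until it finds a process with no children.

```python
# Hypothetical in-memory process table: pid -> (parent pid, command line).
# It stands in for what pstree/ps would report at the moment of the lookup.
PROCESS_TABLE = {
    100: (1, "sudo ip netns exec ns1 neutron-linuxbridge-agent"),
    101: (100, "neutron-linuxbridge-agent"),
    102: (101, "ip -o link show"),  # short-lived helper spawned by the agent
}


def find_children(pid, table):
    """Return the pids whose parent is `pid`, lowest pid first."""
    return sorted(p for p, (ppid, _cmd) in table.items() if ppid == pid)


def deepest_child(pid, table):
    """Follow the first child repeatedly until a process has no children."""
    while True:
        children = find_children(pid, table)
        if not children:
            return pid
        pid = children[0]


# The lookup runs while the agent happens to have a helper command running.
racy_pid = deepest_child(100, PROCESS_TABLE)

# A moment later the helper has exited, but the agent is still running.
table_after = {p: v for p, v in PROCESS_TABLE.items() if p != 102}
pid_still_exists = racy_pid in table_after
agent_alive = 101 in table_after
```

In this model the kill is aimed at the helper's pid, which is already gone by the time the signal is sent, while the agent (pid 101 here) is never signalled. That would also explain why the failure is intermittent: it only happens when the lookup coincides with a helper invocation.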

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/294798

Changed in neutron:
assignee: Slawek Kaplonski (slaweq) → Assaf Muller (amuller)
status: Confirmed → In Progress
Revision history for this message
Assaf Muller (amuller) wrote :

I pushed https://review.openstack.org/#/c/294798/. I can no longer reproduce this bug in my local environment.

Revision history for this message
John Schwarz (jschwarz) wrote :

@Assaf, it isn't clear from your comment - can you no longer reproduce with the specified patch (hinting it might fix it), or even without the specified patch (something else changed)?

Also, I added Kuba as a subscriber - IIRC he worked on the whole get_root_helper_child_pid (or something in that neighborhood) and might be able to explain why it was implemented this way. My thinking is that if that function always returns the child pid, this should have popped up a lot sooner (e.g. ovs agent with ip monitor, l3 agent with keepalived...).

Perhaps the cause isn't get_root_helper_child_pid, but the way linuxbridge spawns new processes?

Revision history for this message
Assaf Muller (amuller) wrote :

I mean that the patch fixes the bug.

Revision history for this message
John Schwarz (jschwarz) wrote :

A deeper look at 'ps -ef' shows that most (if not all) processes spawned by other agents (l3 agent, ovs agent...) are daemonized (keepalived, ip monitor... have their PPID == 1), so that explains why get_root_helper_child_pid doesn't see them and why it didn't reproduce sooner. I'm fine with modifying the function as Assaf proposed.
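
The PPID observation can be checked mechanically. The sketch below parses a made-up, trimmed 'ps -ef'-style listing; the pids, command lines and the helper names (parse_ps, children_of) are hypothetical and only illustrate why a child walk starting from an agent never sees processes that were re-parented to pid 1.

```python
# Hypothetical, trimmed 'ps -ef'-style listing (UID, PID, PPID, CMD only).
PS_OUTPUT = """\
UID   PID  PPID CMD
root  200     1 sudo neutron-rootwrap neutron-l3-agent
root  201   200 neutron-l3-agent
root  310     1 keepalived
root  400     1 sudo ip netns exec ns1 neutron-linuxbridge-agent
root  401   400 neutron-linuxbridge-agent
root  402   401 brctl show
"""


def parse_ps(output):
    """Turn the listing into (pid, ppid, cmd) tuples, skipping the header."""
    rows = []
    for line in output.splitlines()[1:]:
        _uid, pid, ppid, cmd = line.split(None, 3)
        rows.append((int(pid), int(ppid), cmd))
    return rows


def children_of(pid, rows):
    """Return the command lines of the direct children of `pid`."""
    return [cmd for _child_pid, ppid, cmd in rows if ppid == pid]


ROWS = parse_ps(PS_OUTPUT)

# keepalived was daemonized (PPID == 1), so the l3 agent has no visible child.
l3_children = children_of(201, ROWS)

# The linuxbridge agent's helper is a direct child, so a child walk finds it.
lb_children = children_of(401, ROWS)
```

In this sample the l3 agent has no children to descend into, whereas the linuxbridge agent does, which matches the explanation for why the problem did not show up with the other agents.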

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/299920

Changed in neutron:
assignee: Assaf Muller (amuller) → Jakub Libosvar (libosvar)
Changed in neutron:
assignee: Jakub Libosvar (libosvar) → Assaf Muller (amuller)
Changed in neutron:
assignee: Assaf Muller (amuller) → Jakub Libosvar (libosvar)
Changed in neutron:
assignee: Jakub Libosvar (libosvar) → Assaf Muller (amuller)
Changed in neutron:
assignee: Assaf Muller (amuller) → Jakub Libosvar (libosvar)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Jakub Libosvar (<email address hidden>) on branch: master
Review: https://review.openstack.org/299920
Reason: Its purpose has been fulfilled.

Changed in neutron:
assignee: Jakub Libosvar (libosvar) → Assaf Muller (amuller)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/294798
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=fd93e19f2a415b3803700fc491749daba01a4390
Submitter: Jenkins
Branch: master

commit fd93e19f2a415b3803700fc491749daba01a4390
Author: Assaf Muller <email address hidden>
Date: Fri Mar 18 16:29:26 2016 -0400

    Change get_root_helper_child_pid to stop when it finds cmd

    get_root_helper_child_pid recursively finds the child of pid,
    until it can no longer find a child. However, the intention is
    not to find the deepest child, but to strip away root helpers.
    For example 'sudo neutron-rootwrap x' is supposed to find the
    pid of x. However, in cases 'x' spawned quick lived children of
    its own (For example: ip / brctl / ovs invocations),
    get_root_helper_child_pid returned those pids if called in
    the wrong time.

    Change-Id: I582aa5c931c8bfe57f49df6899445698270bb33e
    Closes-Bug: #1558819
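
The idea in the commit message, stopping at the expected command instead of at the deepest child, can be sketched as follows. This is an illustration under assumptions, not the merged patch: the process table is an in-memory stand-in, and child_pid_matching_cmd and find_children are hypothetical names. Matching on the full argument list (rather than a substring) is a choice made here so that the root helper's own command line, which contains the agent's name, does not match.

```python
# Hypothetical process table: pid -> (parent pid, argv of the process).
PROCESS_TABLE = {
    100: (1, ["sudo", "ip", "netns", "exec", "ns1",
              "neutron-linuxbridge-agent"]),
    101: (100, ["neutron-linuxbridge-agent"]),
    102: (101, ["brctl", "show"]),  # short-lived child of the agent
}


def find_children(pid, table):
    """Return the pids whose parent is `pid`, lowest pid first."""
    return sorted(p for p, (ppid, _argv) in table.items() if ppid == pid)


def child_pid_matching_cmd(pid, expected_cmd, table):
    """Descend from `pid` and stop at the first process running expected_cmd.

    Root helpers are stripped away one level at a time, but the walk never
    continues past the process it was asked to find.
    """
    while True:
        if table[pid][1] == expected_cmd:
            return pid
        children = find_children(pid, table)
        if not children:
            raise RuntimeError(
                "no process running %r below the starting pid"
                % (expected_cmd,))
        pid = children[0]


# Even while the agent has a helper running, the agent's own pid is returned.
found = child_pid_matching_cmd(
    100, ["neutron-linuxbridge-agent"], PROCESS_TABLE)
```

With this shape, the presence or absence of pid 102 no longer affects the result, so the timing of the lookup stops mattering.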

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/321768

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/321770

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 9.0.0.0b1

This issue was fixed in the openstack/neutron 9.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/321770
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d7c8e53d9008ae9a9338caea84634e1ad57bed0c
Submitter: Jenkins
Branch: stable/liberty

commit d7c8e53d9008ae9a9338caea84634e1ad57bed0c
Author: Assaf Muller <email address hidden>
Date: Fri Mar 18 16:29:26 2016 -0400

    Change get_root_helper_child_pid to stop when it finds cmd

    get_root_helper_child_pid recursively finds the child of pid,
    until it can no longer find a child. However, the intention is
    not to find the deepest child, but to strip away root helpers.
    For example 'sudo neutron-rootwrap x' is supposed to find the
    pid of x. However, in cases 'x' spawned quick lived children of
    its own (For example: ip / brctl / ovs invocations),
    get_root_helper_child_pid returned those pids if called in
    the wrong time.

    Conflicts:
     neutron/tests/contrib/functional-testing.filters

    Change-Id: I582aa5c931c8bfe57f49df6899445698270bb33e
    Closes-Bug: #1558819
    (cherry picked from commit fd93e19f2a415b3803700fc491749daba01a4390)
    (cherry picked from commit 1d714c35add69ba1237ba63a5725e336892f3b9f)

tags: added: in-stable-liberty
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/mitaka)

Reviewed: https://review.openstack.org/321768
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1d714c35add69ba1237ba63a5725e336892f3b9f
Submitter: Jenkins
Branch: stable/mitaka

commit 1d714c35add69ba1237ba63a5725e336892f3b9f
Author: Assaf Muller <email address hidden>
Date: Fri Mar 18 16:29:26 2016 -0400

    Change get_root_helper_child_pid to stop when it finds cmd

    get_root_helper_child_pid recursively finds the child of pid,
    until it can no longer find a child. However, the intention is
    not to find the deepest child, but to strip away root helpers.
    For example 'sudo neutron-rootwrap x' is supposed to find the
    pid of x. However, in cases 'x' spawned quick lived children of
    its own (For example: ip / brctl / ovs invocations),
    get_root_helper_child_pid returned those pids if called in
    the wrong time.

    Conflicts:
     neutron/tests/contrib/functional-testing.filters

    Change-Id: I582aa5c931c8bfe57f49df6899445698270bb33e
    Closes-Bug: #1558819
    (cherry picked from commit fd93e19f2a415b3803700fc491749daba01a4390)

tags: added: in-stable-mitaka
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 7.2.0

This issue was fixed in the openstack/neutron 7.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 8.3.0

This issue was fixed in the openstack/neutron 8.3.0 release.
