functional tests random failure on TestProcessMonitor.test_respawn_handler trying to kill a process

Bug #1513771 reported by Miguel Angel Ajo
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
neutron
Fix Released
High
Attila Czira

Bug Description

2015-11-06 08:16:42.407 | 2015-11-06 08:16:42.364 | ==============================
2015-11-06 08:16:42.408 | 2015-11-06 08:16:42.366 | Failed 1 tests - output below:
2015-11-06 08:16:42.408 | 2015-11-06 08:16:42.367 | ==============================
2015-11-06 08:16:42.408 | 2015-11-06 08:16:42.369 |
2015-11-06 08:16:42.408 | 2015-11-06 08:16:42.370 | neutron.tests.functional.agent.linux.test_process_monitor.TestProcessMonitor.test_respawn_handler
2015-11-06 08:16:42.408 | 2015-11-06 08:16:42.372 | -------------------------------------------------------------------------------------------------
2015-11-06 08:16:42.409 | 2015-11-06 08:16:42.373 |
2015-11-06 08:16:42.409 | 2015-11-06 08:16:42.374 | Captured traceback:
2015-11-06 08:16:42.409 | 2015-11-06 08:16:42.376 | ~~~~~~~~~~~~~~~~~~~
2015-11-06 08:16:42.409 | 2015-11-06 08:16:42.377 | Traceback (most recent call last):
2015-11-06 08:16:42.409 | 2015-11-06 08:16:42.378 | File "neutron/tests/functional/agent/linux/test_process_monitor.py", line 108, in test_respawn_handler
2015-11-06 08:16:42.409 | 2015-11-06 08:16:42.380 | self._kill_last_child()
2015-11-06 08:16:42.410 | 2015-11-06 08:16:42.381 | File "neutron/tests/functional/agent/linux/test_process_monitor.py", line 77, in _kill_last_child
2015-11-06 08:16:42.410 | 2015-11-06 08:16:42.383 | self._child_processes[-1].disable()
2015-11-06 08:16:42.410 | 2015-11-06 08:16:42.385 | File "neutron/agent/linux/external_process.py", line 109, in disable
2015-11-06 08:16:42.410 | 2015-11-06 08:16:42.386 | utils.execute(cmd, run_as_root=True)
2015-11-06 08:16:42.410 | 2015-11-06 08:16:42.388 | File "neutron/agent/linux/utils.py", line 159, in execute
2015-11-06 08:16:42.411 | 2015-11-06 08:16:42.389 | raise RuntimeError(m)
2015-11-06 08:16:42.411 | 2015-11-06 08:16:42.391 | RuntimeError:
2015-11-06 08:16:42.411 | 2015-11-06 08:16:42.393 | Command: ['sudo', 'kill', '-9', 'None']
2015-11-06 08:16:42.411 | 2015-11-06 08:16:42.394 | Exit code: 1
2015-11-06 08:16:42.413 | 2015-11-06 08:16:42.396 | Stdin:
2015-11-06 08:16:42.415 | 2015-11-06 08:16:42.397 | Stdout:
2015-11-06 08:16:42.416 | 2015-11-06 08:16:42.399 | Stderr: kill: failed to parse argument: 'None'
2015-11-06 08:16:42.418 | 2015-11-06 08:16:42.400 |
2015-11-06 08:16:42.419 | 2015-11-06 08:16:42.402 |
2015-11-06 08:16:42.440 | 2015-11-06 08:16:42.403 |
2015-11-06 08:16:42.441 | 2015-11-06 08:16:42.405 | Captured pythonlogging:
2015-11-06 08:16:42.441 | 2015-11-06 08:16:42.407 | ~~~~~~~~~~~~~~~~~~~~~~~
2015-11-06 08:16:42.441 | 2015-11-06 08:16:42.408 | 2015-11-06 08:14:13,232 ERROR [neutron.agent.linux.utils] Unable to convert value in /tmp/tmpPzSZY1/tmpLtA4wt/external/pids/test-uuid-1.pid
2015-11-06 08:16:42.441 | 2015-11-06 08:16:42.410 | 2015-11-06 08:14:13,243 ERROR [neutron.agent.linux.utils]
2015-11-06 08:16:42.441 | 2015-11-06 08:16:42.411 | Command: ['sudo', 'kill', '-9', 'None']
2015-11-06 08:16:42.441 | 2015-11-06 08:16:42.413 | Exit code: 1
2015-11-06 08:16:42.442 | 2015-11-06 08:16:42.414 | Stdin:
2015-11-06 08:16:42.442 | 2015-11-06 08:16:42.416 | Stdout:
2015-11-06 08:16:42.442 | 2015-11-06 08:16:42.418 | Stderr: kill: failed to parse argument: 'None'
2015-11-06 08:16:42.442 | 2015-11-06 08:16:42.420 |
2015-11-06 08:16:42.442 | 2015-11-06 08:16:42.421 |
2015-11-06 08:16:42.442 | 2015-11-06 08:16:42.423 |
2015-11-06 08:16:42.443 | 2015-11-06 08:16:42.425 |

Changed in neutron:
importance: Undecided → Low
Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :
tags: added: gate-failure
Changed in neutron:
assignee: Miguel Angel Ajo (mangelajo) → nobody
importance: Low → High
status: New → Confirmed
Revision history for this message
Assaf Muller (amuller) wrote :

Two hits today. One failure was with the same error message, but a different test.

1)
http://logs.openstack.org/00/263400/6/check/gate-neutron-dsvm-functional/881aa9c/testr_results.html.gz
neutron.tests.functional.agent.linux.test_keepalived.KeepalivedManagerTestCase.test_keepalived_respawns

2)
http://logs.openstack.org/00/263400/6/check/gate-neutron-dsvm-functional/580b5f0/testr_results.html.gz
neutron.tests.functional.agent.linux.test_process_monitor.TestProcessMonitor.test_respawn_handler

Changed in neutron:
assignee: nobody → John Perkins (john-d-perkins)
Changed in neutron:
assignee: John Perkins (john-d-perkins) → nobody
Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

Ok, reading the test to see if I understand why could this be happening.

Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

Initial analysis:

Looking at [1] and [2] this smells of a race condition, or a bug in the daemon code used by simple-daemon (dummy daemon) code.

In [1], simple-daemon fails to create the pid file (I suppose any of the top directories did not exist) and in [2] we're trying to kill one of the processes to expect a spawn, but we can't read the pid file to kill the right pid.

[1] http://logs.openstack.org/00/263400/6/check/gate-neutron-dsvm-functional/580b5f0/console.html.gz#_2016-01-05_12_03_11_407
[2] http://logs.openstack.org/00/263400/6/check/gate-neutron-dsvm-functional/580b5f0/console.html.gz#_2016-01-05_12_04_59_571

Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

How can the state_path directory go away in the middle of a test?

Could two different test runs using the same directory and somehow deleting it?

I could make the simple daemon code more robust and create the whole path up to the pid file, but, I guess this is probably affecting other tests.

@amuller, I any idea?, let's talk about this this monday.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

0 hits during the last week.

Changed in neutron:
status: Confirmed → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/446991

Changed in neutron:
assignee: nobody → Attila Czira (aczira)
status: Expired → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/446991
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9049c32c7c7314e05250551807b4164082da4695
Submitter: Jenkins
Branch: master

commit 9049c32c7c7314e05250551807b4164082da4695
Author: Attila Czira <email address hidden>
Date: Fri Mar 17 11:43:11 2017 +0100

    Stabilizing process monitor function test case

    The problem seems to be that the TC is starting new processes
    parallelly but it does not wait for them to start properly instead
    it tries to kill one of them immediately but in some cases if the
    startup phase is delayed the process that is about to be killed is
    not started yet.
    The fix is to introduce a waiting phase until all the child
    processes are up and running right after the initial spawning
    phase.
    As the problem is not functional the fix is difficult to test.
    I executed this TC in a loop with this fix for a 5000 times and
    the problem has not appeared. Without the fix I had 3.8% failure
    rate.

    Change-Id: I19a8a4c35efc35a002478204fd24910d73c9c8c2
    closes-bug: #1513771

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.0.0b1

This issue was fixed in the openstack/neutron 11.0.0.0b1 development milestone.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.