utah test is hanging and does not timeout

Bug #1194533 reported by Para Siva
This bug affects 1 person
Affects: UTAH
Status: Fix Released
Importance: High
Assigned to: Max Brustkern
Milestone: 0.14

Bug Description

Starting from 20130624, the floodlight server tests on amd64 have been hanging when run using utah. The test does not appear to time out either. I could not figure out the reason for the hang, but killing the utah process completes the Jenkins job.

This job has also been impacted by another bug, bug 1181315, but the current hang does not appear to be related to that kernel bug.

Please see run #58 of the saucy-server-amd64-smoke-floodlight test in the internal instance for the impacted job.
In https://jenkins.qa.ubuntu.com/view/Saucy/view/Smoke%20Testing/job/saucy-server-amd64-smoke-floodlight/56/ I had to log into the VM concerned and kill the utah process to make the job complete.

The issue does not appear to occur on i386.

Related branches

Para Siva (psivaa)
description: updated
Andy Doan (doanac)
Changed in utah:
importance: Undecided → High
assignee: nobody → Max Brustkern (nuclearbob)
Revision history for this message
Andy Doan (doanac) wrote :

this seems like a similar failure:
http://10.98.0.1:8080/job/sru_utah_kernel-raring-virtual_amd64-kvm-virtual/5/consoleText

Revision history for this message
Max Brustkern (nuclearbob) wrote : Re: [Bug 1194533] Re: utah test is hanging and does not timeout

I've got this recreated on alderamin. The next process in the list after
the utah client process is a zombie process for sh. I'm guessing the
refusal of that process to exit is what's causing the problem, so I'll see
if I can figure out why that's happening.
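
A zombie child under a still-running parent normally means the child has already exited but has not been reaped yet, which commonly happens when the parent is blocked reading the child's stdout pipe because some background process still holds the pipe open. A minimal, hypothetical Python sketch of that pattern follows; it is not utah or floodlight code, and the idea that setup.sh leaves a daemon holding the pipe open is only an assumption here.

# Hypothetical minimal reproduction of the symptom: communicate() blocks
# waiting for EOF on the child's stdout pipe, the sh child exits (showing
# up as <defunct> because it has not been reaped yet), but the backgrounded
# grandchild inherited the pipe's write end and keeps it open, so EOF
# never arrives.
import subprocess

child = subprocess.Popen(
    ["sh", "-c", "sleep 600 & echo setup done"],  # stand-in for a setup step
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
)

# Hangs until the sleeping grandchild releases the pipe; in the meantime
# ps shows the sh child in the Z (zombie) state.
out, _ = child.communicate()
print(out)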

Revision history for this message
Max Brustkern (nuclearbob) wrote :

The cases I've recreated so far using the full utah client all end up with a zombie sh process. I can drill down further on why that fails to time out, but I'm also not able to execute one of the test cases to completion. test_openflow appears to hang when I run it directly in Python, without utah.

Revision history for this message
Max Brustkern (nuclearbob) wrote :

Awesomely, the bug disappears when I try to use pdb.

Revision history for this message
Para Siva (psivaa) wrote :

I think the reason for the test_openflow hang is the kernel bug that is mentioned in this bug's description, bug 1181315.

Revision history for this message
Max Brustkern (nuclearbob) wrote :

Any chance that could cause the process to become a zombie? I'll read that bug a little better.

Revision history for this message
Para Siva (psivaa) wrote :

I don't think that test alone is the cause of the zombie process. When I run the test alone using 'sudo python test_openflow/test.py', the test completes without any problem. I did this after running setup.sh manually. But when I run the test again using 'sudo utah -r lp:ubuntu-test-cases/server/runlists/floodlight.run', I get a zombie sh process. And this is what I get for pstree -p -s 15586:
init(1)───sshd(980)───sshd(1252)───sshd(1379)───bash(1380)───sudo(15567)───utah(15568)───sh(15586)
(where 15586 is the pid of the zombie sh process)
Could this suggest that utah is responsible for creating the zombie process?

root 15567 0.0 0.1 56388 1988 pts/0 S+ 10:57 0:00 sudo utah -r lp:ubuntu-test-cases/server/runlists/floodlight.run
root 15568 0.0 8.7 186232 89440 pts/0 S+ 10:57 0:00 /usr/bin/python /usr/bin/utah -r lp:ubuntu-test-cases/server/runlists/floodlight.run
root 15586 0.0 0.0 0 0 pts/0 Z+ 10:57 0:00 [sh] <defunct>
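
For reference, the "Z+" state in the ps listing above is what marks the zombie. A quick way to confirm the same thing programmatically is to read the state field from /proc/<pid>/stat; the helper below is purely illustrative and not part of utah, and the pid is the one from the output above.

def is_zombie(pid):
    """Return True if the process with this pid is in the Z (zombie) state."""
    with open("/proc/%d/stat" % pid) as f:
        # /proc/<pid>/stat looks like: "15586 (sh) Z 15568 ..."; the state
        # letter is the first field after the closing parenthesis of comm.
        state = f.read().rsplit(")", 1)[1].split()[0]
    return state == "Z"

print(is_zombie(15586))  # True while utah has not yet reaped the sh child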

Revision history for this message
Para Siva (psivaa) wrote :

I believe that setup.sh, when run via utah, is causing the hang. Here are some tests that I did:

1. Run setup.sh manually first and then use utah to run the whole suite, i.e. 'sudo utah -r lp:ubuntu-test-cases/server/runlists/floodlight.run' ==> the tests completed without issues.

2. Once the above is complete, run 'sudo utah -r lp:ubuntu-test-cases/server/runlists/floodlight.run' again ==> the tests hang and the zombie sh is created.
(When the whole suite is run using utah, the cleanup script will have uninstalled all the packages that I installed manually with setup.sh in step 1. Hence the installation steps in setup.sh would now actually be carried out, which might be what causes the hang.)

3. Just to confirm: run setup.sh, then cleanup.sh, and then run the tests using 'sudo utah -r lp:ubuntu-test-cases/server/runlists/floodlight.run' ==> hang occurs.

4. Tried step 1 again just to confirm, and the test completed without any problem.

Note: Depending on the kernel version that you are using, test_openflow may hang during the process (bug 1181315). To make sure that it does not hang, it's best to use a kernel that does not have that bug, from http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.9-rc5-raring/

Revision history for this message
Max Brustkern (nuclearbob) wrote :

I've confirmed that setup hangs even if both test cases are removed and I just run a '/bin/true' test. As in psivaa's recreation, if setup is run again without running cleanup, things pass. I'm going to look into running the setup step with a timeout and see how that goes.

Revision history for this message
Max Brustkern (nuclearbob) wrote :

Imposing a timeout on the ts_setup step appears to resolve the issue here. I'll discuss with the team how we can integrate that without breaking existing test setups.
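
For context, one common way to impose such a timeout on a setup step is to run it in its own process group and kill the whole group when the deadline expires, so that any daemon the script leaves behind cannot keep the runner waiting on it. The sketch below is only an illustration of that idea, not utah's actual implementation, and setup.sh is just a placeholder path.

import os
import signal
import subprocess

def run_step_with_timeout(cmd, timeout):
    # Start the step as the leader of a new session/process group so that
    # stray daemons it launches can be killed together with it.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)  # kill the whole group
        proc.wait()
        raise

# Example: run_step_with_timeout(["sh", "-c", "./setup.sh"], timeout=3600)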

Revision history for this message
Andy Doan (doanac) wrote :

So simply giving this thing a timeout makes the whole bug go away? Could we just give an arbitrarily large value like 24 hours and see this issue go away?

Revision history for this message
Max Brustkern (nuclearbob) wrote :

No. When I give it a timeout, it does time out, but that doesn't really
solve the problem; it just solves what was scaring me, which was the idea
that the utah client would never time out when running something.

Revision history for this message
Para Siva (psivaa) wrote :

Just to make sure that it's recorded: this bug is blocking me from testing the kernels to help fix bug 1181315.

Revision history for this message
Max Brustkern (nuclearbob) wrote :

I've run this reasonably extensively using utah.process.run, with and without a timeout, with streaming set to True and False. I cannot get the failure to occur running just the run function out of context. It's possible that it's interacting with the rest of utah in a way that breaks things, but at this point I suspect the failure may be elsewhere.

Revision history for this message
Max Brustkern (nuclearbob) wrote :

To clarify, elsewhere in utah. I can still recreate the problem very easily by running the full utah client.

Revision history for this message
Max Brustkern (nuclearbob) wrote :

That branch fixes the problem for me.

Andy Doan (doanac)
Changed in utah:
status: New → Fix Committed
Revision history for this message
Andy Doan (doanac) wrote :

I upgraded aldebaran, kicked off a job, and it seems to be working now:
  https://jenkins.qa.ubuntu.com/view/Saucy/view/Smoke%20Testing/job/saucy-server-amd64-smoke-floodlight/67/

Changed in utah:
status: Fix Committed → Fix Released
milestone: none → 0.14
Revision history for this message
Para Siva (psivaa) wrote :

Thank you for the fix, guys. Given the complications in reproducing this, it's a big help!
