Job errors with "ErrorMessage: None", possibly due to job timeout

Bug #1226017 reported by Andrew McDermott
18
This bug affects 2 people
Affects Status Importance Assigned to Milestone
LAVA Dispatcher
Confirmed
Medium
Tyler Baker
linaro-networking
Won't Fix
Medium
Unassigned

Bug Description

The following job:

  https://validation.linaro.org/scheduler/job/72876

reports back with:

  Dispatcher error or test failure

  ErrorMessage: None

The job itself eventually runs the `jtreg' tests which can be seen in
the output starting at:

  11:50:45 AM INFO: [ACTION-B] lava_test_shell is started with {u'timeout': 180000, u'testdef_repos': [{u'git-repo': u'git...

and the timeout for that test is 50 hours which, as an aside, I'm
suspecting is not long enough for this particular test/job. I would have
expected LAVA to fail with a `timeout error' but I don't see that.

And looking through the test runs I see that test 'lava_test_shell #13'
fails with:

https://validation.linaro.org/dashboard/streams/public/team/linaro/pre-built-vexpress64/bundles/0a75f2ca9c78f77fbadf327f8371ea7e5f7287c5/344362fe-1dd9-11e3-8b3b-001e671608cd/result/13/

Lava failed at action lava_test_shell with error:None
Traceback (most recent call last):
  File "/srv/lava<email address hidden>/lava_dispatcher/job.py", line 254, in run
    action.run(**params)
  File "/srv/lava<email address hidden>/lava_dispatcher/actions/lava_test_shell.py", line 559, in run
    while self._keep_running(runner, timeout, signal_director):
  File "/srv/lava<email address hidden>/lava_dispatcher/actions/lava_test_shell.py", line 574, in _keep_running
    idx = runner._connection.expect(patterns, timeout=timeout)
  File "/srv/lava<email address hidden>/lava_dispatcher/utils.py", line 206, in expect
    return super(logging_spawn, self).expect(*args, **kw)
  File "/srv/lava/.cache/eggs/pexpect-2.4-py2.7.egg/pexpect.py", line 1316, in expect
    return self.expect_list(compiled_pattern_list, timeout, searchwindowsize)
  File "/srv/lava/.cache/eggs/pexpect-2.4-py2.7.egg/pexpect.py", line 1330, in expect_list
    return self.expect_loop(searcher_re(pattern_list), timeout, searchwindowsize)
  File "/srv/lava/.cache/eggs/pexpect-2.4-py2.7.egg/pexpect.py", line 1383, in expect_loop
    c = self.read_nonblocking (self.maxread, timeout)
  File "/srv/lava/.cache/eggs/pexpect-2.4-py2.7.egg/pexpect.py", line 820, in read_nonblocking
    r,w,e = self.__select([self.child_fd], [], [], timeout)
  File "/srv/lava/.cache/eggs/pexpect-2.4-py2.7.egg/pexpect.py", line 1554, in __select
    return select.select (iwtd, owtd, ewtd, timeout)
KeyboardInterrupt

Where is KeyboardInterrupt coming from?

The other thing I notice is that the job itself seems to run for exactly
24 hours (and not 50 hours as specified for the jtreg test case).

From:

 https://validation.linaro.org/dashboard/streams/public/team/linaro/pre-built-vexpress64/bundles/0a75f2ca9c78f77fbadf327f8371ea7e5f7287c5/

 Started at: Sept. 15, 2013, 7:34 a.m.
 Finished at: Sept. 16, 2013, 7:34 a.m.

summary: - Job errors with "ErrorMessage: None", possibly due to job timeot
+ Job errors with "ErrorMessage: None", possibly due to job timeout
description: updated
Revision history for this message
Andrew Johnson (andy-johnson) wrote : Re: [Bug 1226017] Re: Job errors with "ErrorMessage: None", possibly due to job timeout
Download full text (9.2 KiB)

I can't find anywhere in the message which particular target is in effect
when this happens. Any clue?

On 16 September 2013 08:20, Andrew McDermott <email address hidden>wrote:

> ** Description changed:
>
> The following job:
>
> - https://validation.linaro.org/scheduler/job/72876
> + https://validation.linaro.org/scheduler/job/72876
>
> reports back with:
>
> - Dispatcher error or test failure
> + Dispatcher error or test failure
>
> - ErrorMessage: None
> + ErrorMessage: None
>
> The job itself eventually runs the `jtreg' tests which can be seen in
> the output starting at:
>
> - 11:50:45 AM INFO: [ACTION-B] lava_test_shell is started with
> + 11:50:45 AM INFO: [ACTION-B] lava_test_shell is started with
> {u'timeout': 180000, u'testdef_repos': [{u'git-repo': u'git...
>
> and the timeout for that test is 50 hours which, as an aside, I'm
> suspecting is not long enough for this particular test/job. I would have
> expected LAVA to fail with a `timeout error' but I don't see that.
>
> And looking through the test runs I see that test 'lava_test_shell #13'
> fails with:
>
> + https://validation.linaro.org/dashboard/streams/public/team/linaro/pre-
> + built-vexpress64/bundles/0a75f2ca9c78f77fbadf327f8371ea7e5f7287c5
> + /344362fe-1dd9-11e3-8b3b-001e671608cd/result/13/
> +
> Lava failed at action lava_test_shell with error:None
> Traceback (most recent call last):
> - File
> "/srv/lava<email address hidden>/lava_dispatcher/job.py",
> line 254, in run
> - action.run(**params)
> - File
> "/srv/lava<email address hidden>/lava_dispatcher/actions/lava_test_shell.py",
> line 559, in run
> - while self._keep_running(runner, timeout, signal_director):
> - File
> "/srv/lava<email address hidden>/lava_dispatcher/actions/lava_test_shell.py",
> line 574, in _keep_running
> - idx = runner._connection.expect(patterns, timeout=timeout)
> - File
> "/srv/lava<email address hidden>/lava_dispatcher/utils.py",
> line 206, in expect
> - return super(logging_spawn, self).expect(*args, **kw)
> - File "/srv/lava/.cache/eggs/pexpect-2.4-py2.7.egg/pexpect.py", line
> 1316, in expect
> - return self.expect_list(compiled_pattern_list, timeout,
> searchwindowsize)
> - File "/srv/lava/.cache/eggs/pexpect-2.4-py2.7.egg/pexpect.py", line
> 1330, in expect_list
> - return self.expect_loop(searcher_re(pattern_list), timeout,
> searchwindowsize)
> - File "/srv/lava/.cache/eggs/pexpect-2.4-py2.7.egg/pexpect.py", line
> 1383, in expect_loop
> - c = self.read_nonblocking (self.maxread, timeout)
> - File "/srv/lava/.cache/eggs/pexpect-2.4-py2.7.egg/pexpect.py", line
> 820, in read_nonblocking
> - r,w,e = self.__select([self.child_fd], [], [], timeout)
> - File "/srv/lava/.cache/eggs/pexpect-2.4-py2.7.egg/pexpect.py", line
> 1554, in __select
> - return se...

Read more...

Revision history for this message
Tyler Baker (tyler-baker) wrote :
Download full text (13.7 KiB)

It appears the job is timing out. We will investigate.

On 16 September 2013 07:47, Andrew Johnson <email address hidden> wrote:

> I can't find anywhere in the message which particular target is in effect
> when this happens. Any clue?
>
>
> On 16 September 2013 08:20, Andrew McDermott
> <email address hidden>wrote:
>
> > ** Description changed:
> >
> > The following job:
> >
> > - https://validation.linaro.org/scheduler/job/72876
> > + https://validation.linaro.org/scheduler/job/72876
> >
> > reports back with:
> >
> > - Dispatcher error or test failure
> > + Dispatcher error or test failure
> >
> > - ErrorMessage: None
> > + ErrorMessage: None
> >
> > The job itself eventually runs the `jtreg' tests which can be seen in
> > the output starting at:
> >
> > - 11:50:45 AM INFO: [ACTION-B] lava_test_shell is started with
> > + 11:50:45 AM INFO: [ACTION-B] lava_test_shell is started with
> > {u'timeout': 180000, u'testdef_repos': [{u'git-repo': u'git...
> >
> > and the timeout for that test is 50 hours which, as an aside, I'm
> > suspecting is not long enough for this particular test/job. I would
> have
> > expected LAVA to fail with a `timeout error' but I don't see that.
> >
> > And looking through the test runs I see that test 'lava_test_shell #13'
> > fails with:
> >
> > +
> https://validation.linaro.org/dashboard/streams/public/team/linaro/pre-
> > + built-vexpress64/bundles/0a75f2ca9c78f77fbadf327f8371ea7e5f7287c5
> > + /344362fe-1dd9-11e3-8b3b-001e671608cd/result/13/
> > +
> > Lava failed at action lava_test_shell with error:None
> > Traceback (most recent call last):
> > - File
> >
> "/srv/lava<email address hidden>
> /lava_dispatcher/job.py",
> > line 254, in run
> > - action.run(**params)
> > - File
> >
> "/srv/lava<email address hidden>
> /lava_dispatcher/actions/lava_test_shell.py",
> > line 559, in run
> > - while self._keep_running(runner, timeout, signal_director):
> > - File
> >
> "/srv/lava<email address hidden>
> /lava_dispatcher/actions/lava_test_shell.py",
> > line 574, in _keep_running
> > - idx = runner._connection.expect(patterns, timeout=timeout)
> > - File
> >
> "/srv/lava<email address hidden>
> /lava_dispatcher/utils.py",
> > line 206, in expect
> > - return super(logging_spawn, self).expect(*args, **kw)
> > - File "/srv/lava/.cache/eggs/pexpect-2.4-py2.7.egg/pexpect.py", line
> > 1316, in expect
> > - return self.expect_list(compiled_pattern_list, timeout,
> > searchwindowsize)
> > - File "/srv/lava/.cache/eggs/pexpect-2.4-py2.7.egg/pexpect.py", line
> > 1330, in expect_list
> > - return self.expect_loop(searcher_re(pattern_list), timeout,
> > searchwindowsize)
> > - File "/srv/lava/.cache/eggs/pexpect-2.4-py2.7.egg/pexpect.py", line
> > 1383, in expect_loop
> > - c = self.read_nonblocking (self.ma...

Revision history for this message
Andrew McDermott (frobware) wrote :

AndyJ:

When you say target, that to me is a device. :)

So, this is running on the ARMv8 foundation model.

If your question is about which Java test is in effect then its the complete jtreg test that fails.

Revision history for this message
Andrew McDermott (frobware) wrote :

Tyler: when you say it appears the job is timing out, do you mean it is timing out, but on the 24 hour boundary?

Revision history for this message
Andrew McDermott (frobware) wrote :
Download full text (7.0 KiB)

Andy J:

Looking through the complete log output at:

  https://validation.linaro.org/scheduler/job/72876/log_file

the tests that were running or about to run were jdk_security:

FAILED: sample/mergesort/MergeSortTest.java
 TEST STATS: name=jdk_other run=75 pass=67 fail=8 excluded=2
 + rm -rf /tmp/jtreg/jdk_other.986
 + exit 0
 + test_dir=/lava/tests/0_jtreg/openjdk8-aarch64-test-archive/openjdk8-aarch64-port-snapshot/jdk/test
 + test_to_run=jdk_security1
 + t=/tmp/jtreg/jdk_security1.2541
 + rm -rf /tmp/jtreg/jdk_security1.2541
 + mkdir -p /tmp/jtreg/jdk_security1.2541
 + to=/tmp/jtreg/jdk_security1.2541/testoutput
 + '[' -n '' ']'
 + lava-test-case jdk_security1 --shell make -C /lava/tests/0_jtreg/openjdk8-aarch64-test-archive/openjdk8-aarch64-port-snapshot/jdk/test PRODUCT_HOME=/usr/lib/jvm/java-8-openjdk JTREG=/lava/tests/0_jtreg/jtreg-bin/jtreg/linux/bin/jtreg-lava JT_HOME=/lava/tests/0_jtreg/jtreg-bin/jtreg ALT_OUTPUTDIR=/tmp/jtreg/jdk_security1.2541 jdk_security1
 <LAVA_SIGNAL_STARTTC jdk_security1>
 echo LAVA_ACK
 make: Entering directory `/lava/tests/0_jtreg/openjdk8-aarch64-test-archive/openjdk8-aarch64-port-snapshot/jdk/test'
 echo "Running tests in agentvm mode: java/security"
 Running tests in agentvm mode: java/security
 make TEST_DEPENDENCIES="java/security" TESTDIRS="java/security" USE_JTREG_AGENTVM=true UNIQUE_DIR=jdk_security1 jtreg_tests
 make[1]: Entering directory `/lava/tests/0_jtreg/openjdk8-aarch64-test-archive/openjdk8-aarch64-port-snapshot/jdk/test'
 make[1]: Warning: File `/lava/tests/0_jtreg/jtreg-bin/jtreg/linux/bin/jtreg-lava' has modification time 278411 s in the future
 Excluding list contains 0 items
 ( ( JT_HOME=/lava/tests/0_jtreg/jtreg-bin/jtreg; export JT_HOME; /lava/tests/0_jtreg/jtreg-bin/jtreg/linux/bin/jtreg-lava -agentvm -a -ea -esa -v:fail,error,time -retain:fail,error -ignore:quiet -timeoutFactor:4 -J-Xmx512m -r:/tmp/jtreg/jdk_security1.2541/testoutput/jdk_security1/JTreport -w:/tmp/jtreg/jdk_security1.2541/testoutput/jdk_security1/JTwork -jdk:/usr/lib/jvm/java-8-openjdk -exclude:/tmp/jtreg/jdk_security1.2541/testoutput/jdk_security1/excludelist.txt java/security ) ; ( jtregExitCode=$? && _summary="/tmp/jtreg/jdk_security1.2541/testoutput/jdk_security1/JTreport/text/summary.txt"; rm -f -f /tmp/jtreg/jdk_security1.2541/testoutput/jdk_security1/Stats.txt /tmp/jtreg/jdk_security1.2541/testoutput/jdk_security1/runlist.txt /tmp/jtreg/jdk_security1.2541/testoutput/jdk_security1/passlist.txt /tmp/jtreg/jdk_security1.2541/testoutput/jdk_security1/faillist.txt /tmp/jtreg/jdk_security1.2541/testoutput/jdk_security1/exitcode.txt; echo "${jtregExitCode}" > /tmp/jtreg/jdk_security1.2541/testoutput/jdk_security1/exitcode.txt; if [ -r "${_summary}" ] ; then echo "Summary: jdk_security1" > /tmp/jtreg/jdk_security1.2541/testoutput/jdk_security1/Stats.txt; expand ${_summary} | egrep -v ' Not run.' > /tmp/jtreg/jdk_security1.2541/testoutput/jdk_security1/runlist.txt; egrep ' Passed.' /tmp/jtreg/jdk_security1.2541/testoutput/jdk_security1/runlist.txt | egrep -v ' Error.' | egrep -v ' Failed.' > /tmp/jtreg/jdk_security1.2541/testoutput/jdk_security1/passlist.txt; ( egrep ' Failed.' /tmp/jtreg/jdk_security1.2541/test...

Read more...

Revision history for this message
Dave Pigott (dpigott) wrote :

We really just need a more useful error message when the timeout occurs

Changed in lava-dispatcher:
importance: Undecided → Medium
status: New → Confirmed
assignee: nobody → Dave Pigott (dpigott)
Revision history for this message
Andrew McDermott (frobware) wrote :

Dave, it's odd that the job runs for exactly 24 hours though, irrespective of the error message.

Changed in linaro-networking:
status: New → Confirmed
importance: Undecided → Medium
Changed in lava-dispatcher:
assignee: Dave Pigott (dpigott) → Tyler Baker (tyler-baker)
Revision history for this message
vishal (vishalbhoj) wrote :

Any update on this issue. Looks like this particular issue is again affected by the same issue:
http://validation.linaro.org/scheduler/job/88719

Changed in linaro-networking:
status: Confirmed → Won't Fix
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.