Slave went offline during log upload

Bug #1268732 reported by Steven Hardy on 2014-01-13
This bug affects 2 people
Affects: OpenStack Core Infrastructure
Status: Fix Released
Importance: High
Assigned to: Jeremy Stanley

Bug Description

Encountered this issue, where it seems like a node went offline during log upload:

https://review.openstack.org/#/c/66302/

http://logs.openstack.org/02/66302/1/check/gate-tempest-dsvm-neutron-large-ops/5d47a5c/console.html

2014-01-13 14:50:16.795 | PASSED (id=0)
2014-01-13 14:50:16.795 | Slowest Tests
2014-01-13 14:50:16.795 | Test id Runtime (s)
2014-01-13 14:50:16.795 | ------------------------------------------------------------------------------------------- -----------
2014-01-13 14:50:16.796 | tempest.scenario.test_large_ops.TestLargeOpsScenario.test_large_ops_scenario[compute,image] 97.862
2014-01-13 14:50:16.835 | ___________________________________ summary ____________________________________
2014-01-13 14:50:16.835 | large-ops: commands succeeded
2014-01-13 14:50:16.835 | congratulations :)
2014-01-13 14:50:16.836 | /opt/stack/new/devstack-gate/devstack-vm-gate.sh: line 349: [[: stable/grizzly: division by 0 (error token is "grizzly")
2014-01-13 14:50:42.221 | Process leaked file descriptors. See http://wiki.jenkins-ci.org/display/JENKINS/Spawning+processes+from+build for more information
2014-01-13 14:50:42.789 | [gate-tempest-dsvm-neutron-large-ops] $ /bin/sh /tmp/hudson4062194587775363172.sh
2014-01-13 14:50:42.824 | Detailed logs: http://logs.openstack.org/02/66302/1/check/gate-tempest-dsvm-neutron-large-ops/5d47a5c/
2014-01-13 14:50:43.237 | Looks like the node went offline during the build. Check the slave log for the details.FATAL: /var/lib/jenkins/logs/slaves/devstack-precise-hpcloud-az2-1073362/slave.log (No such file or directory)
2014-01-13 14:50:43.237 | java.io.FileNotFoundException: /var/lib/jenkins/logs/slaves/devstack-precise-hpcloud-az2-1073362/slave.log (No such file or directory)
2014-01-13 14:50:43.237 | at java.io.RandomAccessFile.open(Native Method)
2014-01-13 14:50:43.238 | at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
2014-01-13 14:50:43.238 | at org.kohsuke.stapler.framework.io.LargeText$FileSession.<init>(LargeText.java:397)
2014-01-13 14:50:43.238 | at org.kohsuke.stapler.framework.io.LargeText$2.open(LargeText.java:120)
2014-01-13 14:50:43.238 | at org.kohsuke.stapler.framework.io.LargeText.writeLogTo(LargeText.java:210)
2014-01-13 14:50:43.238 | at hudson.console.AnnotatedLargeText.writeHtmlTo(AnnotatedLargeText.java:159)
2014-01-13 14:50:43.238 | at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:599)
2014-01-13 14:50:43.239 | at hudson.model.Run.execute(Run.java:1593)
2014-01-13 14:50:43.239 | at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
2014-01-13 14:50:43.239 | at hudson.model.ResourceController.execute(ResourceController.java:88)
2014-01-13 14:50:43.239 | at hudson.model.Executor.run(Executor.java:247)
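As an aside, the odd "division by 0" line from devstack-vm-gate.sh in the console above is bash evaluating a branch name in a numeric context: inside `[[ ... ]]`, operators like `-gt` arithmetically evaluate their operands, so the `/` in `stable/grizzly` parses as division and the unset name `grizzly` evaluates to 0. A minimal reproduction (illustrative only; `ZUUL_BRANCH` here is an example variable, not the actual code from the script):

```shell
#!/usr/bin/env bash
# Feeding a branch name to a numeric comparison: bash evaluates the
# string arithmetically, parsing "stable/grizzly" as stable / grizzly,
# where both unset names evaluate to 0 -- hence "division by 0".
ZUUL_BRANCH="stable/grizzly"

if [[ "$ZUUL_BRANCH" -gt 0 ]]; then   # wrong: arithmetic test on a string
    echo "numeric branch"
fi
# stderr: [[: stable/grizzly: division by 0 (error token is "grizzly")

# A pattern match keeps the comparison in string context:
if [[ "$ZUUL_BRANCH" == stable/* ]]; then
    echo "stable branch"
fi
```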

Dolph Mathews (dolph) on 2014-01-23
Changed in openstack-ci:
status: New → Confirmed
Jeremy Stanley (fungi) on 2014-01-23
Changed in openstack-ci:
status: Confirmed → In Progress
importance: Undecided → Critical
assignee: nobody → Jeremy Stanley (fungi)
milestone: none → icehouse
Jeremy Stanley (fungi) wrote :

Caught one of these on a full tempest run for a heat change...

http://logs.openstack.org/35/68135/1/gate/gate-tempest-dsvm-full/b17cdc5/console.html#_2014-01-23_16_26_41_918

I tried to hold the offending node but deletion was already in progress so it got yanked out from under me before I could spot any obvious local issue on it. THEN I spotted this in the log of the corresponding Jenkins master...

Jan 23, 2014 4:25:17 PM hudson.node_monitors.AbstractDiskSpaceMonitor markNodeOfflineIfDiskspaceIsTooLow
WARNING: Making devstack-precise-hpcloud-az2-1187454 offline temporarily due to the lack of disk space

Since this is happening between tests completing and log collection, I think we're filling up the remaining space on the very constrained root filesystem on these nodes when copying logs from /opt to the workspace in /home/jenkins. I'm trying to address it with...

https://review.openstack.org/68706

Jeremy Stanley (fungi) wrote :

Thinking through this, we don't make use of the ephemeral disk until the job starts, and by then it's too late to move the workspace. We'll need larger root filesystems or to use them less... or start using the ephemeral disk earlier (but still after node launch since we can't make use of it during image prep).

Changed in openstack-ci:
status: In Progress → Triaged
importance: Critical → High
assignee: Jeremy Stanley (fungi) → nobody

Fix proposed to branch: master
Review: https://review.openstack.org/68782

Changed in openstack-ci:
assignee: nobody → Jeremy Stanley (fungi)
status: Triaged → In Progress

Reviewed: https://review.openstack.org/68782
Committed: https://git.openstack.org/cgit/openstack-infra/devstack-gate/commit/?id=db80d838770915b44df9c0a822ef9204b0b494f1
Submitter: Jenkins
Branch: master

commit db80d838770915b44df9c0a822ef9204b0b494f1
Author: Jeremy Stanley <email address hidden>
Date: Thu Jan 23 21:52:14 2014 +0000

    Keep logs in $BASE instead of $WORKSPACE

    Avoid moving logs into the jenkins homedir unnecessarily, since it
    has a tendency to fill up the root filesystem and the SCP publisher
    can follow symlinks out of the workspace anyway. We've seen slaves
    go offline due to full root filesystems recently, so this should
    hopefully reduce the incidence of that problem.

    Change-Id: I490ed53989137218f14464708a65969033c108d6
    Closes-Bug: #1268732
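The mechanism behind this fix can be sketched with throwaway directories standing in for $BASE (under /opt) and the Jenkins workspace (under /home/jenkins); this is an illustration of the symlink technique the commit message describes, not the devstack-gate script itself:

```shell
#!/usr/bin/env bash
# Temporary directories stand in for $BASE and the Jenkins workspace.
BASE=$(mktemp -d)
WORKSPACE=$(mktemp -d)

mkdir "$BASE/logs"
echo "console output" > "$BASE/logs/console.txt"

# Before the fix: cp -a "$BASE/logs" "$WORKSPACE/logs" duplicated every
# log file onto the already-constrained root filesystem.
# After the fix: a symlink costs no space, and the SCP publisher can
# follow it out of the workspace.
ln -s "$BASE/logs" "$WORKSPACE/logs"

cat "$WORKSPACE/logs/console.txt"   # logs readable through the symlink
```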

Changed in openstack-ci:
status: In Progress → Fix Released
Steven Hardy (shardy) wrote :

Looks like this (or something very similar) is still happening:

http://logs.openstack.org/32/67532/3/check/gate-heat-python27/2173da6/console.html

[gate-heat-python27] $ /bin/bash /tmp/hudson8031709905825416665.sh
2014-01-28 12:44:39.587 | The following un-ignored files were created during the test:
2014-01-28 12:44:39.587 | subunit_log.txt.gz
2014-01-28 12:44:39.589 | testr_results.html.gz
2014-01-28 12:44:39.978 | Looks like the node went offline during the build. Check the slave log for the details.FATAL: /var/lib/jenkins/logs/slaves/bare-precise-hpcloud-az2-1223783/slave.log (No such file or directory)
2014-01-28 12:44:39.978 | java.io.FileNotFoundException: /var/lib/jenkins/logs/slaves/bare-precise-hpcloud-az2-1223783/slave.log (No such file or directory)
2014-01-28 12:44:39.978 | at java.io.RandomAccessFile.open(Native Method)
2014-01-28 12:44:39.979 | at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
2014-01-28 12:44:39.979 | at org.kohsuke.stapler.framework.io.LargeText$FileSession.<init>(LargeText.java:397)
2014-01-28 12:44:39.979 | at org.kohsuke.stapler.framework.io.LargeText$2.open(LargeText.java:120)
2014-01-28 12:44:39.979 | at org.kohsuke.stapler.framework.io.LargeText.writeLogTo(LargeText.java:210)
2014-01-28 12:44:39.979 | at hudson.console.AnnotatedLargeText.writeHtmlTo(AnnotatedLargeText.java:156)
2014-01-28 12:44:39.979 | at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:579)
2014-01-28 12:44:39.979 | at hudson.model.Run.execute(Run.java:1678)
2014-01-28 12:44:39.979 | at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
2014-01-28 12:44:39.979 | at hudson.model.ResourceController.execute(ResourceController.java:88)
2014-01-28 12:44:39.979 | at hudson.model.Executor.run(Executor.java:231)

Jeremy Stanley (fungi) wrote :

The console log says that job ran on bare-precise-hpcloud-az2-1223783 (a single-use slave) assigned to jenkins04, and in the jenkins.log on that master I find:

WARNING: Making bare-precise-hpcloud-az2-1223783 offline temporarily due to the lack of disk space

So it looks like gate-heat-python27 jobs are sometimes generating gigabytes of files on disk, enough to use up the available disk space on a bare-precise slave (hovers around 3.5GiB presently on the provider where this one ran). Any guesses as to how or why this is happening, and how we might trim that down?

Obviously the fixes we applied for devstack-based jobs to use /opt for git repository caches and log collection don't apply in this case.

Steven Hardy (shardy) wrote :

Thanks for the info - at the moment I have no idea why our gate job could produce that much data. Do we have any way of instrumenting the job so we can see which files are taking up all the space?
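One generic way to get that visibility, assuming the job can run an extra shell step before it exits (an illustrative sketch, not an existing devstack-gate or Jenkins hook): dump the largest files and directories so a full disk can be diagnosed even after a single-use slave has been deleted.

```shell
#!/usr/bin/env bash
# Report the 20 largest files/directories under a target path.
# TARGET defaults to the current directory; pass e.g. / on the slave.
TARGET="${1:-.}"

# -a counts individual files, not just directories; -x stays on one
# filesystem so other mounts don't distort the numbers.
du -ax "$TARGET" 2>/dev/null | sort -rn | head -n 20
```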

Thomas Herve (therve) wrote :

I've looked at logstash, and it seems that heat is only one of the projects getting this. It seems to happen even more with nova.

If we can access a machine in that state we could get a better idea of what's going on.

Reviewed: https://review.openstack.org/80295
Committed: https://git.openstack.org/cgit/openstack-infra/elastic-recheck/commit/?id=de34e5ca6ca3be8b433d57d691016dc249b5b6e5
Submitter: Jenkins
Branch: master

commit de34e5ca6ca3be8b433d57d691016dc249b5b6e5
Author: Sean Dague <email address hidden>
Date: Thu Mar 13 10:16:18 2014 -0400

    remove duplicate bug

    This is really a duplicate signature for 1284371

    Change-Id: I9c67d85dc7c5988665af87487353081feac37754
    Related-Bug: #1268732

no longer affects: heat