Recurrent jenkins slave agent failures

Bug #1267364 reported by Sergey Lukjanov on 2014-01-09
This bug affects 5 people

Affects: OpenStack Core Infrastructure
Status: Fix Released
Importance: Critical
Assigned to: Jeremy Stanley

Bug Description

Jenkins slaves sometimes fail due to an agent initialization error.

Several different tracebacks can appear in this case:

* Caused by: java.lang.NoClassDefFoundError: Could not initialize class jenkins.model.Jenkins$MasterComputer (full stack trace: http://paste.openstack.org/show/60883/)
* Caused by: java.lang.NoClassDefFoundError: Could not initialize class hudson.Util (full stack trace: http://paste.openstack.org/show/60885/)
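The "Could not initialize class" form of NoClassDefFoundError is the JVM's signal that a class's static initializer already failed on an earlier attempt, after which the class is permanently marked erroneous. A minimal, self-contained sketch (generic JVM behavior, not Jenkins or OpenStack code; the class names and the simulated failure are hypothetical) shows why the same class keeps producing this error for the rest of the agent's lifetime:

```java
// Generic JVM illustration: a class whose static initializer throws is
// marked erroneous, and every later use fails with
// "NoClassDefFoundError: Could not initialize class ..." -- the same
// message seen in the pasted Jenkins agent tracebacks.
public class InitFailureDemo {
    static class Fragile {
        // Hypothetical flag standing in for a bad environment at class-load time.
        static boolean notReady = true;
        static {
            if (notReady) {
                throw new RuntimeException("agent environment not ready");
            }
        }
        static String util() { return "ok"; }
    }

    public static void main(String[] args) {
        try {
            Fragile.util(); // first use runs the static initializer, which throws
        } catch (ExceptionInInitializerError e) {
            System.out.println("first access: " + e);
        }
        try {
            Fragile.util(); // class is now permanently erroneous for this JVM
        } catch (NoClassDefFoundError e) {
            System.out.println("later access: " + e.getMessage());
        }
    }
}
```

The practical consequence for this bug is that once an agent JVM hits the initialization failure, it cannot recover without a restart (or, under the single-use nodepool model discussed below, destruction of the node), because the erroneous class state persists for the life of the JVM.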

Here is a list of slaves affected by this bug:

* precise37 (disabled, https://jenkins01.openstack.org/computer/precise37)
* precise34 (disabled, https://jenkins02.openstack.org/computer/precise34)
* precise39 (disabled, https://jenkins01.openstack.org/computer/precise39)

This bug can be referenced when performing recheck/reverify on changes with failed jobs.

Sergey Lukjanov (slukjanov) wrote :

Ihar, precise34 is already disabled too, you can use this bug for recheck/reverify.

summary: - precise37 failing due to the jenkins init
+ Recurrent jenkins slave agent failures
Jeremy Stanley (fungi) wrote :

We think this may be https://issues.jenkins-ci.org/browse/JENKINS-19453 (the tracebacks look identical to me) but we're also seeing it for slaves attached to jenkins02 which was recently upgraded to a version (1.543) which should carry the fix for that (remoting 2.33 is supposedly included starting with 1.540).

We're making inroads toward no longer reusing general-purpose slaves. We've been running some openstack-infra jobs on single-use "bare" Ubuntu Precise nodepool servers (similar to how we run devstack-gate jobs, but with a simpler image), and a couple of hours ago began running the openstack/nova pep8, docs and python27 jobs on them as well, starting with the https://review.openstack.org/65620 change. This is showing promise--jobs are running successfully--so I'll continue getting us ready to shove more projects over to the same model.

Changed in openstack-ci:
status: New → In Progress
importance: Undecided → Critical
assignee: nobody → Jeremy Stanley (fungi)
milestone: none → icehouse
Jeremy Stanley (fungi) wrote :

The majority of the current change volume in the gate pipeline looks like it would be covered by moving the following to nodepool nodes as well (changes already proposed):

    https://review.openstack.org/65732 cinder
    https://review.openstack.org/65733 glance
    https://review.openstack.org/65734 keystone
    https://review.openstack.org/65735 heat
    https://review.openstack.org/65736 horizon
    https://review.openstack.org/65737 ceilometer
    https://review.openstack.org/65738 swift

Jeremy Stanley (fungi) wrote :

We're seeing this far less often after merging the above changes. It does still happen with some frequency to nodepool-managed nodes, but when it does it now only affects one job run before the broken slave is destroyed.

James E. Blair (corvus) wrote :

Hrm, it should not affect nodepool-managed nodes; we should look into that if it does.

Jeremy Stanley (fungi) wrote :

Agreed--I haven't seen any recently, so those may have been mischaracterizations of the variability in how Jenkins deals with aborted jobs.

Changed in openstack-ci:
status: In Progress → Fix Released
