Comment 3 for bug 1260654

Jeremy Stanley (fungi) wrote:

devstack-precise-hpcloud-az2-850860 would have been deleted right away after this impacted it; chances are it was reused after the jenkins01 crash and restart.

As for precise14 and precise20, they were exhibiting Jenkins slave agent communication issues. Most likely agent communication between those two slaves and jenkins02 broke after we performed a planned restart of jenkins02 to prevent the JVM out-of-memory condition which had caused jenkins01 to shoot itself in the head.

I took both affected slaves out of service in jenkins02, rebooted them for good measure, then disconnected and relaunched the slave agent on each, making sure it succeeded on both. Then I watched a job run to completion successfully on each, so they should be okay at this point.
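For anyone repeating this recovery, the disconnect/relaunch part of the cycle can also be driven through Jenkins's HTTP interface instead of the web UI (I did the out-of-service step through the UI first). The Python sketch below is illustrative only: it assumes the stock /computer/<name>/doDisconnect and /launchSlaveAgent actions and a user with node administration rights; the master URL, credentials and polling interval are placeholders, and newer Jenkins releases would additionally want a CSRF crumb on the POSTs.

    #!/usr/bin/env python
    # Illustrative sketch only: disconnect and relaunch a wedged Jenkins
    # slave agent over HTTP. Master URL and credentials are placeholders;
    # endpoints are the stock /computer/<name>/doDisconnect and
    # /launchSlaveAgent actions (newer Jenkins also wants a CSRF crumb).
    import time
    import requests

    JENKINS = "https://jenkins02.example.org"  # placeholder master URL
    AUTH = ("admin", "api-token")              # placeholder credentials

    def cycle_agent(node):
        base = "%s/computer/%s" % (JENKINS, node)
        # Drop the existing (possibly wedged) agent connection.
        requests.post(base + "/doDisconnect",
                      params={"offlineMessage": "bug 1260654 recovery"},
                      auth=AUTH).raise_for_status()
        # Ask the master to relaunch the slave agent...
        requests.post(base + "/launchSlaveAgent",
                      auth=AUTH).raise_for_status()
        # ...then poll until the node reports itself back online.
        for _ in range(30):
            info = requests.get(base + "/api/json", auth=AUTH).json()
            if not info.get("offline"):
                print("%s reconnected" % node)
                return
            time.sleep(10)
        raise RuntimeError("%s did not reconnect" % node)

    for slave in ("precise14", "precise20"):
        cycle_agent(slave)

After the agents reconnect, watching a job run to completion on each node is still the real confirmation that they are healthy.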

Near-term preventive measures are already underway: we are migrating our current long-term-slave jobs to single-use bare (non-devstack) slaves managed by nodepool. We have already moved some infra jobs to them as dogfood, so hopefully this issue of long-term slaves going into rapid-fire job failure will soon be behind us.