functional-api job causes the jenkins slave agent to die
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Magnum | Fix Released | Critical | Corey O'Brien |
Bug Description
Quoted from @Corey:
When the functional-api job runs, it frequently destroys the VM, causing the Jenkins slave agent to die. Example: http://
When this happens, zuul re-queues a new build from the start on a new VM. This can happen many times in a row before the job completes.
I chatted with openstack-infra about this, and after taking a look at one of the VMs, it appears that memory over-consumption leading to thrashing was a likely culprit. The sshd daemon was also dead, and the console showed messages like "INFO: task kswapd0:77 blocked for more than 120 seconds". A cursory glance at the jobs suggests that this does not happen on RAX VMs, which have swap devices, unlike the OVH VMs.
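The symptoms above (no swap device, dead sshd, kswapd hung-task warnings) can be checked on a live node with standard Linux tools. A minimal diagnostic sketch, using only stock util-linux/procps commands:

```shell
# Does this node have a swap device? (OVH gate nodes reportedly had none;
# no output here means no swap is configured)
swapon --show

# Memory and swap usage in MiB; near-zero free memory combined with heavy
# swap use is a classic sign of thrashing
free -m

# Kernel hung-task warnings like the one seen on the dead node's console
# (dmesg may need elevated privileges, hence the fallback)
dmesg 2>/dev/null | grep -i "blocked for more than" || true
```

On a healthy node with swap, `swapon --show` lists the device and `free -m` shows a non-zero swap total; on the failing OVH nodes one would expect an empty swap listing.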
Changed in magnum:
importance: Undecided → Critical

Changed in magnum:
assignee: nobody → hongbin (hongbin034)
status: New → In Progress

Changed in magnum:
assignee: hongbin (hongbin034) → Corey O'Brien (coreypobrien)
Reviewed: https://review.openstack.org/276958
Committed: https://git.openstack.org/cgit/openstack/magnum/commit/?id=dca02f1d98c82df7cfcd43786ea97210b9136389
Submitter: Jenkins
Branch: master
commit dca02f1d98c82df7cfcd43786ea97210b9136389
Author: Hongbin Lu <email address hidden>
Date: Fri Feb 5 18:53:34 2016 -0500
Reduce memory consumption of gate tests
Recently, the gate jobs took too long to complete (between 2 and 8
hours). The reason is that the Jenkins slave agent dies during the test,
which causes the CI to restart the whole test on a new VM.
The failure mainly occurred in the magnum-api pipeline, but also occurred
in other pipelines. In terms of the distribution of test nodes, this
failure mainly occurred on OVH nodes, which have no dedicated swap
device. As a result, on OVH nodes the local disk is used for swap
when memory is over-consumed. It appears that this leads to resource
starvation, which causes the failure.
This patch attempts to reduce the memory consumption of the gate
tests. In the api test, the number of worker nodes was reduced from
2 to 1. In all tests (api/k8s/swarm/mesos), the memory of each worker
node was reduced from 1G to 512M.
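The shape of the change is a shrink of the test cluster's footprint. As a purely hypothetical illustration (the file and key names below are invented for this sketch, not taken from the actual patch), the functional test configuration would move in this direction:

```ini
# Hypothetical functional-test config fragment; real magnum test
# settings use different file/key names.
[cluster]
node_count = 1        ; was 2 in the api job
worker_memory_mb = 512 ; was 1024 per worker node
```

The trade-off is standard for gate jobs: a smaller cluster exercises less parallelism but keeps the whole test within the VM's physical memory, avoiding swap-induced thrashing on nodes without a swap device.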
Closes-Bug: #1542386
Change-Id: If7822d07f95ebc935a8763b92f038f10cf07b5ca