functional-api job causes the jenkins slave agent to die

Bug #1542386 reported by hongbin
This bug affects 1 person
Affects: Magnum
Status: Fix Released
Importance: Critical
Assigned to: Corey O'Brien

Bug Description

Quoted from @Corey:

When the functional-api job runs, it frequently destroys the VM causing the jenkins slave agent to die. Example: http://logs.openstack.org/03/275003/6/check/gate-functional-dsvm-magnum-api/a9a0eb9//console.html
When this happens, zuul re-queues a new build from the start on a new VM. This can happen many times in a row before the job completes.
I chatted with openstack-infra about this, and after taking a look at one of the VMs, it looks like memory over-consumption leading to thrashing was a possible culprit. The sshd daemon was also dead, but the console showed things like "INFO: task kswapd0:77 blocked for more than 120 seconds". A cursory glance and following some of the jobs seem to indicate that this doesn't happen on RAX VMs, which have swap devices, unlike the OVH VMs.
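The symptoms above (no swap device, kswapd hung-task warnings) can be checked on a suspect VM with a few commands. A minimal sketch, assuming a Linux gate node; it reads /proc directly so it needs no extra tools or root:

```shell
#!/bin/sh
# Sketch: quick checks for the memory-thrashing symptoms described above.

# Memory and swap sizes; SwapTotal of 0 kB means no swap is configured.
grep -E 'MemTotal|MemFree|SwapTotal' /proc/meminfo

# Active swap devices; only the header line in the output means no swap.
cat /proc/swaps

# Hung-task warnings like "task kswapd0:77 blocked for more than 120 seconds"
# appear in the kernel log (reading it may require root, hence the fallback).
dmesg 2>/dev/null | grep -i 'blocked for more than' || true
```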

hongbin (hongbin034)
Changed in magnum:
importance: Undecided → Critical
Changed in magnum:
assignee: nobody → hongbin (hongbin034)
status: New → In Progress
Changed in magnum:
assignee: hongbin (hongbin034) → Corey O'Brien (coreypobrien)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to magnum (master)

Reviewed: https://review.openstack.org/276958
Committed: https://git.openstack.org/cgit/openstack/magnum/commit/?id=dca02f1d98c82df7cfcd43786ea97210b9136389
Submitter: Jenkins
Branch: master

commit dca02f1d98c82df7cfcd43786ea97210b9136389
Author: Hongbin Lu <email address hidden>
Date: Fri Feb 5 18:53:34 2016 -0500

    Reduce memory consumption of gate tests

    Recently, the gate jobs took too long to complete (between 2 and 8
    hours). The reason is that the jenkins slave agent dies during the
    test, which causes the CI to restart the whole test on a new VM.

    The failure mainly occurred in the magnum-api pipeline, but also
    occurred in other pipelines. In terms of the distribution of test
    nodes, this failure mainly occurred on OVH nodes, which have no
    dedicated swap device. As a result, on OVH nodes, the local disk is
    used for swap when memory is over-consumed. It looks like this leads
    to resource starvation, which causes the failure.

    This patch attempts to reduce the memory consumption of the gate
    tests. In the api test, the number of worker nodes is reduced from
    2 to 1. In all tests (api/k8s/swarm/mesos), the memory of each
    worker node is reduced from 1G to 512M.

    Closes-Bug: #1542386
    Change-Id: If7822d07f95ebc935a8763b92f038f10cf07b5ca
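In Magnum, worker-node memory comes from the Nova flavor referenced by the baymodel, so shrinking the flavor shrinks each worker. An illustrative config sketch of the kind of change involved (the flavor name, sizes, and arguments here are hypothetical, not the literal contents of the patch):

```shell
# Illustrative only: create a 512M flavor and point a Magnum baymodel at it.
# "m1.magnum-small" is a hypothetical flavor name.
openstack flavor create --ram 512 --vcpus 1 --disk 8 m1.magnum-small
magnum baymodel-create --flavor-id m1.magnum-small ...  # other required args omitted
```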

Changed in magnum:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/magnum 2.0.0

This issue was fixed in the openstack/magnum 2.0.0 release.
