functional-api job causes the jenkins slave agent to die

Bug #1542386 reported by hongbin
This bug affects 1 person
Affects: Magnum
Status: Fix Released
Importance: Critical
Assigned to: Corey O'Brien

Bug Description

Quoted from @Corey:

When the functional-api job runs, it frequently destroys the VM causing the jenkins slave agent to die. Example: http://logs.openstack.org/03/275003/6/check/gate-functional-dsvm-magnum-api/a9a0eb9//console.html
When this happens, zuul re-queues a new build from the start on a new VM. This can happen many times in a row before the job completes.
I chatted with openstack-infra about this, and after taking a look at one of the VMs, it looks like memory over-consumption leading to thrashing was a possible culprit. The sshd daemon was also dead, but the console showed things like "INFO: task kswapd0:77 blocked for more than 120 seconds". A cursory glance and following some of the jobs seem to indicate that this doesn't happen on RAX VMs, which have swap devices, unlike the OVH VMs.
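The symptoms above (no swap device, kswapd hung-task warnings) can be checked on a suspect VM with a few commands. A minimal sketch, assuming a Linux gate node; it reads /proc directly so it needs no extra tools or root:

```shell
#!/bin/sh
# Sketch: quick checks for the memory-thrashing symptoms described above.

# Memory and swap sizes; SwapTotal of 0 kB means no swap is configured.
grep -E 'MemTotal|MemFree|SwapTotal' /proc/meminfo

# Active swap devices; only the header line in the output means no swap.
cat /proc/swaps

# Hung-task warnings like "task kswapd0:77 blocked for more than 120 seconds"
# appear in the kernel log (reading it may require root, hence the fallback).
dmesg 2>/dev/null | grep -i 'blocked for more than' || true
```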

hongbin (hongbin034)
Changed in magnum:
importance: Undecided → Critical
Changed in magnum:
assignee: nobody → hongbin (hongbin034)
status: New → In Progress
Changed in magnum:
assignee: hongbin (hongbin034) → Corey O'Brien (coreypobrien)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to magnum (master)

Reviewed: https://review.openstack.org/276958
Committed: https://git.openstack.org/cgit/openstack/magnum/commit/?id=dca02f1d98c82df7cfcd43786ea97210b9136389
Submitter: Jenkins
Branch: master

commit dca02f1d98c82df7cfcd43786ea97210b9136389
Author: Hongbin Lu <email address hidden>
Date: Fri Feb 5 18:53:34 2016 -0500

    Reduce memory consumption of gate tests

    Recently, the gate jobs took too long to complete (between 2 and 8
    hours). The reason is that the jenkins slave agent dies during the
    test, which causes the CI to restart the whole test on a new VM.

    The failure mainly occurred in the magnum-api pipeline, but also
    occurred in other pipelines. In terms of the distribution of test
    nodes, this failure mainly occurred on OVH nodes, which have no
    dedicated swap device. As a result, on OVH nodes, the local disk is
    used for swap when memory is over-consumed. It looks like this leads
    to resource starvation, which causes the failure.

    This patch attempts to reduce the memory consumption of the gate
    tests. In the api test, the number of worker nodes is reduced from
    2 to 1. In all tests (api/k8s/swarm/mesos), the memory of each
    worker node is reduced from 1G to 512M.

    Closes-Bug: #1542386
    Change-Id: If7822d07f95ebc935a8763b92f038f10cf07b5ca
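In Magnum, worker-node memory comes from the Nova flavor referenced by the baymodel, so shrinking the flavor shrinks each worker. An illustrative config sketch of the kind of change involved (the flavor name, sizes, and arguments here are hypothetical, not the literal contents of the patch):

```shell
# Illustrative only: create a 512M flavor and point a Magnum baymodel at it.
# "m1.magnum-small" is a hypothetical flavor name.
openstack flavor create --ram 512 --vcpus 1 --disk 8 m1.magnum-small
magnum baymodel-create --flavor-id m1.magnum-small ...  # other required args omitted
```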

Changed in magnum:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/magnum 2.0.0

This issue was fixed in the openstack/magnum 2.0.0 release.
