CI: low-memory template isn't applied and jobs are killed by the OOM killer

Bug #1642429 reported by Sagi (Sergey) Shnaidman
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

The last few jobs are failing because of the OOM killer. It seems the low-memory template isn't applied and the number of workers is huge.
http://logs.openstack.org/94/381094/37/check-tripleo/gate-tripleo-ci-centos-7-ovb-nonha-mitaka/e47d90b/console.html#_2016-11-16_18_48_05_529385

The low-memory environment is included in the arguments exported before running tripleo.sh:
++ /opt/stack/new/tripleo-ci/scripts/deploy.sh::L167: export 'OVERCLOUD_DEPLOY_ARGS= --libvirt-type=qemu -t 80 -e /opt/stack/new/tripleo-ci/test-environments/enable-tls.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml -e /opt/stack/new/tripleo-ci/test-environments/inject-trust-anchor-hiera.yaml --ceph-storage-scale 1 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /opt/stack/new/tripleo-ci/test-environments/worker-config.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml'

but the low-memory environments are missing from the command tripleo.sh actually runs:

tripleo.sh -- Deploy command arguments: --libvirt-type=qemu -t 80 -e /opt/stack/new/tripleo-ci/test-environments/enable-tls.yaml -e /opt/stack/new/tripleo-ci/test-environments/tls-endpoints-public-ip.yaml -e /opt/stack/new/tripleo-ci/test-environments/inject-trust-anchor.yaml --ceph-storage-scale 1 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml --templates --validation-errors-fatal --validation-warnings-fatal

Maybe the issue is that deploy.env is sourced each time we run tripleo.sh:

https://github.com/openstack-infra/tripleo-ci/commit/67deb95ed15870113ebb856629cbb0a1bd4eee55
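
A minimal sketch of what I suspect is happening (the deploy.env path and contents below are made up for illustration; only the OVERCLOUD_DEPLOY_ARGS variable name matches the job):

  #!/bin/bash
  # deploy.sh exports the full argument list, including the low-memory environments.
  export OVERCLOUD_DEPLOY_ARGS="--libvirt-type=qemu -e low-memory-usage.yaml -e worker-config.yaml"

  # A deploy.env written earlier in the job (hypothetical contents) assigns the
  # same variable without those -e arguments.
  cat > /tmp/deploy.env <<'EOF'
  OVERCLOUD_DEPLOY_ARGS="--libvirt-type=qemu"
  EOF

  # When tripleo.sh sources deploy.env again, the earlier export is silently
  # overwritten and the low-memory environments disappear from the deploy command.
  source /tmp/deploy.env
  echo "$OVERCLOUD_DEPLOY_ARGS"    # prints: --libvirt-type=qemu

  # One possible guard: let deploy.env supply a value only when the variable is unset:
  # OVERCLOUD_DEPLOY_ARGS="${OVERCLOUD_DEPLOY_ARGS:---libvirt-type=qemu}"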

After we increased the number of CPUs on the nodes, the current worker counts became a problem.
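
To illustrate why more CPUs means more memory: most OpenStack services size their worker pools from the CPU count when the *Workers parameters are left unset, which is what the low-memory environments are meant to cap. A rough sketch (the OS_WORKERS variable and the per-worker figure are made up for illustration, not the real TripleO defaults):

  #!/bin/bash
  CPUS=$(nproc)
  WORKERS=${OS_WORKERS:-$CPUS}    # hypothetical default: one worker per CPU
  echo "CPUs: ${CPUS} -> workers per service: ${WORKERS}"
  # If every worker keeps ~300 MB resident (made-up figure), doubling the CPUs
  # roughly doubles memory use per service:
  echo "approx memory per service: $(( WORKERS * 300 )) MB"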

Tags: ci
description: updated
Revision history for this message
Ben Nemec (bnemec) wrote :

We might also need https://review.openstack.org/#/c/398652/ to get this working on mitaka again. I pushed it in series so they'll be tested together.

Changed in tripleo:
importance: Undecided → Critical
status: New → Triaged
Revision history for this message
Ben Nemec (bnemec) wrote :

Alternate patch that should also allow the correct params to be passed: https://review.openstack.org/#/c/399146/

Revision history for this message
Ben Nemec (bnemec) wrote :

We've also dropped the CPUs on the baremetal nodes from 8 to 4 in an attempt to reduce memory usage until the workers patch can merge.

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

I wonder if we really need to increase resources that much. If we do, we won't detect certain kinds of problems, like the Redis bug, Gnocchi bombing the logs and reconnecting 10 times per millisecond, memory leaks, and other issues that could be critical. Moderate resources help us discover such problems in CI, and the sooner we detect a problem, the cheaper the fix will be.

tags: removed: alert
Ben Nemec (bnemec)
Changed in tripleo:
status: Triaged → Fix Released