[gates] OOM failures on CI

Bug #1623394 reported by Timur Nurlygayanov
This bug affects 7 people
Affects              Status         Importance  Assigned to      Milestone
Fuel for OpenStack   Fix Committed  High        Fuel Sustaining
  Mitaka             Fix Committed  High        Fuel Sustaining

Bug Description

The gates are broken; an example:

https://review.openstack.org/#/c/369427/3/

In the logs we can see the following error:

2016-09-14T06:26:54.030000+00:00 warning: [ 2687.871970] nova-api invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
2016-09-14T06:26:54.030367+00:00 err: [ 2687.872346] Out of memory: Kill process 411 (mysqld) score 45 or sacrifice child
2016-09-14T06:26:54.030367+00:00 err: [ 2687.872419] Killed process 411 (mysqld) total-vm:2547180kB, anon-rss:115368kB, file-rss:0kB
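For reference, OOM events like these can be confirmed on a node with standard kernel-log tooling (a minimal sketch, independent of our CI scripts):

# List OOM-killer events from the kernel ring buffer with readable timestamps:
dmesg -T | grep -iE 'oom-killer|Out of memory|Killed process'

# The same records usually land in syslog as well:
grep -iE 'oom-killer|Out of memory' /var/log/syslog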

The root of the issue:
We discussed the issue with Maksim Malchuk, and he suggested increasing the amount of RAM on the controller nodes in this job to make the gate more stable.

Note:
We need to review all gates where we run BVT/SWARM tests and make sure we increase the RAM size for all of them, to avoid random false-negative failures. A sketch of what such a change involves follows.
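fuel-qa reads the slave VM sizing from environment variables (SLAVE_NODE_MEMORY and ADMIN_NODE_MEMORY are assumed here to be the relevant names; they should be confirmed against fuel-qa's settings), so a job could export larger values before launching the system tests:

# Assumed variable names; 3072 MiB is believed to be the current slave default.
export SLAVE_NODE_MEMORY=4096   # RAM per slave VM, in MiB
export ADMIN_NODE_MEMORY=4096   # RAM for the Fuel master VM, in MiB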

Other tests that failed due to the out-of-memory issue:

https://ci.fuel-infra.org/job/master.puppet-openstack.fuel-library.pkgs.ubuntu.smoke_neutron/4628

https://product-ci.infra.mirantis.net/view/10.0/job/10.0.main.ubuntu.smoke_neutron/656/
https://product-ci.infra.mirantis.net/view/10.0/job/10.0.main.ubuntu.smoke_neutron/652/

https://ci.fuel-infra.org/job/master.puppet-openstack.fuel-library.pkgs.ubuntu.smoke_neutron/4678/
https://ci.fuel-infra.org/job/master.puppet-openstack.fuel-library.pkgs.ubuntu.smoke_neutron/4683/
https://ci.fuel-infra.org/job/master.puppet-openstack.fuel-library.pkgs.ubuntu.smoke_neutron/4681/
https://ci.fuel-infra.org/job/master.puppet-openstack.fuel-library.pkgs.ubuntu.smoke_neutron/4679/

https://ci.fuel-infra.org/job/master.puppet-openstack.fuel-library.pkgs.ubuntu.smoke_neutron/4695/

Tags: area-ci
description: updated
tags: added: area-ci
Changed in fuel:
assignee: nobody → Fuel CI (fuel-ci)
importance: Undecided → High
Revision history for this message
Nastya Urlapova (aurlapova) wrote :

We must not increase RAM for the tests! The first step is to check that the gate uses the *23 version of MySQL; the second is to investigate the real cause of the failure.
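(A minimal sketch of that first check, using standard Debian packaging tools on a controller node:)

# Show which MySQL packages and versions the ISO actually installed:
dpkg -l | grep -i mysql

# Cross-check the running server binary:
mysqld --version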

Revision history for this message
Roman Vyalov (r0mikiam) wrote :

Please update the ISO on fuel-ci.

summary: - The gates are broken: need to increase the size of RAM, MySQL crashed
+ [gates] Cluster is not deployed: some nodes are in the Error state
Revision history for this message
Roman Vyalov (r0mikiam) wrote : Re: [gates] Cluster is not deployed: some nodes are in the Error state

For stable/mitaka we downgraded the MySQL version yesterday, and we are now updating the ISO on fuel-ci.

For master, you can discuss the hardware memory configuration and the like with the QA team.

Revision history for this message
Nastya Urlapova (aurlapova) wrote :

Roman, we should downgrade the package version on both master and stable.

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Nastya, the *23 version is needed only for Mitaka, to solve the split-brain issues; master is affected by a different problem: the OOM killer kills not only mysqld.

no longer affects: fuel/newton
Revision history for this message
Roman Vyalov (r0mikiam) wrote :

We cannot increase the RAM for the tests! Please solve the problem with MySQL.
Also, FYI, the configuration for the environments (RAM, etc.) is managed in the fuel-devops/fuel-qa code.
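(For reference: since fuel-devops drives these environments through libvirt, the memory actually given to each node VM can be verified on the CI host with standard virsh commands; a minimal sketch:)

# Print the configured memory for every running libvirt domain:
virsh list --name | xargs -r -n1 virsh dominfo | grep -E '^(Name|Max memory|Used memory)'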

Changed in fuel:
assignee: Dmitry Kaigarodеsev (dkaiharodsev) → nobody
status: Confirmed → New
Changed in fuel:
assignee: nobody → Fuel QA Team (fuel-qa)
status: New → Confirmed
Revision history for this message
Nastya Urlapova (aurlapova) wrote :

What is the real reason for this behavior? I am really sorry, but it is absolutely unclear from the description.

Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Maksim Malchuk (mmalchuk)
status: Confirmed → Won't Fix
status: Won't Fix → Incomplete
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

OK, I will update the description and add the list of failed tests killed due to out of memory.

description: updated
Changed in fuel:
status: Incomplete → Confirmed
assignee: Maksim Malchuk (mmalchuk) → Dmitry Kaigarodеsev (dkaiharodsev)
Revision history for this message
Georgy Kibardin (gkibardin) wrote :
Revision history for this message
Roman Vyalov (r0mikiam) wrote :

Please discuss increasing RAM in the tests with the QA team (the code lives in fuel-qa/fuel-devops)!

Changed in fuel:
assignee: Dmitry Kaigarodеsev (dkaiharodsev) → Maksim Malchuk (mmalchuk)
status: Confirmed → New
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-qa (master)

Fix proposed to branch: master
Review: https://review.openstack.org/370174

Changed in fuel:
status: New → In Progress
Revision history for this message
Nikita Karpin (mkarpin) wrote : Re: [gates] Cluster is not deployed: some nodes are in the Error state
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Nikita, this fix is not for one single job; the memory is increased for all slaves in the default configuration, for almost all tests.

Revision history for this message
Vasyl Saienko (vsaienko) wrote :

According to the node-1 logs from https://ci.fuel-infra.org/job/master.fuel-library.pkgs.ubuntu.smoke_neutron/7842/:

The server was swapping: in the atop logs we can see that, starting from 2016/09/19 11:52:31, all swap was used. https://paste.mirantis.net/show/2659/
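(Swap pressure like this is easy to confirm live with standard tools; a minimal sketch:)

# Current memory/swap totals:
free -h

# Watch the si/so columns for ongoing swap-in/swap-out activity:
vmstat 1 5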

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Yes, Vasyl, swapping also degrades I/O, which caused further issues with the MySQL services.

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :
Revision history for this message
Roman Vyalov (r0mikiam) wrote :

Please investigate where the problem with the 9.1 deployment is. The new version of MySQL was removed, and the ISO was updated. The problem is not related to the infra team.

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Roman, JFYI, we now have yet another new version of MySQL, and this bug is not related to the split-brain issues; it is related to the lack of memory on CI during deployment.

Let's wait for the results of the latest SWARM run; if this issue reproduces again, the bug will be reassigned.

Revision history for this message
Nikita Karpin (mkarpin) wrote :

So I did a small investigation of this:

1) Locally deployed a BVT-like environment (3 controllers, 3 computes + Ceph, 2 CPUs, 3 GB RAM, radosgw)
ISO - http://seed.fuel-infra.org/fuelweb-iso/fuel-10.0-community-704-2016-09-22_23-14-00.iso
   Ran OSTF tests on it (only one test failed: "Update stack actions: inplace, replace and update whole template"), and the bug reproduced successfully: after OSTF, mysqld had been explicitly killed on two controllers, and on the third it was not running at all (there is no message in the logs that it was killed); moreover, on one of the controllers even nova-api was killed:

<3>Sep 26 13:00:04 node-5 kernel: [ 9758.142888] Killed process 5669 (mysqld) total-vm:2533636kB, anon-rss:102244kB, file-rss:0kB
<3>Sep 26 13:30:05 node-5 kernel: [11558.395347] Killed process 11566 (nova-api) total-vm:762948kB, anon-rss:138896kB, file-rss:1984kB
<3>Sep 26 13:00:12 node-4 kernel: [ 9766.350272] Killed process 18205 (nova-api) total-vm:761180kB, anon-rss:141460kB, file-rss:3276kB
<3>Sep 26 13:03:08 node-4 kernel: [ 9942.369298] Killed process 21134 (mysqld) total-vm:2570044kB, anon-rss:106276kB, file-rss:0kB
<3>Sep 26 13:30:04 node-4 kernel: [11558.377802] Killed process 18209 (nova-api) total-vm:759576kB, anon-rss:140708kB, file-rss:1604kB

   Resource usage before OSTF (it is similar on all controllers):
                total   used   free   shared  buff/cache  available
   Mem:          2.9G   2.6G   114M      69M        252M       105M
   Swap:         3.0G   2.3G   766M

   Resource usage during OSTF (it is similar on all controllers):
                total   used   free   shared  buff/cache  available
   Mem:          2.9G   2.7G    79M      64M        153M        22M
   Swap:         3.0G   3.0G     0B

2) Locally deployed a smoke_neutron-like environment (1 controller, 2 computes + Cinder, 2 CPUs, 3 GB RAM)
ISO - http://seed.fuel-infra.org/fuelweb-iso/fuel-10.0-community-704-2016-09-22_23-14-00.iso
   Ran OSTF tests on it and reproduced this bug: mysqld was killed, and all the symptoms are the same as in the BVT-like case, except that more OSTF tests failed.

Summary:

In both cases I also checked whether some process was eating an anomalous amount of memory, and I could not find such a process; it looks like all processes together are eating it [1]. If you look at MySQL's memory usage, it is not high: the memory actually allocated is the RSS, which is always around 100 MB, or about 150 MB during OSTF. It is killed only because of the OOM-killer selection algorithm.

So it looks like all these issues are indeed caused by a lack of RAM on the Fuel slave nodes. Currently these issues keep being reproduced (on different CIs [2,3], mostly in non-HA cases). It is not a permanent but a floating issue; I guess high I/O speed and an overall low load on the CI slave sometimes allow the issue to be avoided. I have not yet had a chance to look at the 9.1 case, but in 10.0 we are using Ubuntu 16.04, while in 9.1 we are still on Ubuntu 14.04, which can make a great difference in OS memory usage. I will try to take a look at 9.1 soon.

[1] - http://paste.openstack.org/show/582970/
[2] - https://product-ci.infra.mirantis.net/job/10.0.main.ubuntu.smoke_neutron/712/
[3] - https:/...
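A quick way to double-check the "no single culprit" observation is to sum RSS across all processes, in the same ps/awk style used elsewhere in this report (a minimal sketch):

# Total RSS over every process on the node, in MB:
ps -eo rss= | awk '{ sum += $1 } END { printf "Total RSS = %.0f MB\n", sum/1024 }'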


summary: - [gates] Cluster is not deployed: some nodes are in the Error state
+ [gates] OOM failures on CI
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Yes, Nikita, you are right: the algorithm kills the one big process instead of reducing the usage of the forked processes, which together use much more memory.

As an example, this is from the node which was deployed hours ago and actually stays in idle state:

root@node-1:~# ps -C neutron-server -orss= | awk '{ count ++; size += $1 }; END {print "Number of processes =",count; print "Memory usage per process =",size/1024/count, "MB"; print "Total memory usage =", size/1024, "MB"}'
Number of processes = 4
Memory usage per process = 72.3037 MB
Total memory usage = 289.215 MB

root@node-1:~# ps -C mysqld -orss= | awk '{ count ++; size += $1 }; END {print "Number of processes =",count; print "Memory usage per process =",size/1024/count, "MB"; print "Total memory usage =", size/1024, "MB"}'
Number of processes = 1
Memory usage per process = 189 MB
Total memory usage = 189 MB

As you can see, Neutron, for example, eats more memory in total and has more processes, but the single mysqld process is much bigger than any one of them; since it is only one process, the OOM killer picks it, and killing it leads to the failures we see.
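A direct way to confirm this selection behavior is to read the kernel's per-process badness score from /proc (a minimal sketch using the standard procfs interface):

# Compare the OOM badness scores of mysqld and the neutron-server workers;
# the process with the highest score is the OOM killer's preferred victim:
for pid in $(pidof mysqld neutron-server); do
    printf '%-16s pid=%-6s oom_score=%s\n' \
        "$(ps -o comm= -p "$pid")" "$pid" "$(cat /proc/"$pid"/oom_score)"
done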

Revision history for this message
Nikita Karpin (mkarpin) wrote :

Today I was unable to successfully upgrade from 9.0 to 9.1, so for now here is just a comparison of 9.0 versus 10.0 resource usage:

MOS 10.0, smoke_neutron-like environment:
                total   used   free   shared  buff/cache  available
   Mem:          2.9G   2.6G   118M      76M        276M       124M
   Swap:         3.0G   2.9G   138M

   Memory and Swap usage per process:
   http://paste.openstack.org/show/583069/

MOS 9.0, smoke_neutron-like environment, OSTF passed.

   Before OSTF:
                        total   used   free   shared  buffers  cached
   Mem:                  2.9G   2.8G   176M      16M     5.2M     93M
   -/+ buffers/cache:            2.7G   274M
   Swap:                 3.0G   1.5G   1.5G

   During OSTF:
                        total   used   free   shared  buffers  cached
   Mem:                  2.9G   2.8G   137M     8.8M     3.6M     73M
   -/+ buffers/cache:            2.7G   214M
   Swap:                 3.0G   2.3G   751M

   Memory and Swap usage per process:
   http://paste.openstack.org/show/583070/

So currently, from my POV, we need to at least increase the memory of the Fuel slaves for all 10.0 jobs, as it looks like OpenStack Newton simply eats a bit more memory than Mitaka.

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Nikita, as you can see, the fix for 10.0 has been on review since September 14th. Feel free to raise the problem with the infra team.

Revision history for this message
Nikita Karpin (mkarpin) wrote :

Basically we are lucky that BVT passes, but it could stop passing at any time, because analysis of a BVT run shows that memory usage is near critical and we have 2 GB swapped (https://product-ci.infra.mirantis.net/job/10.0.main.ubuntu.bvt_2/722/console):

2016-09-28 00:59:48 - INFO fuel_web_client.py:938 -- Node status:
 slave-03:
    Host node-3
    Roles:
       Hiera:
           - controller
       Nailgun:
           - controller
    Memory:
       RAM:
          used 2697
          cached 79
          free 106
          shared 58
          total 3008
          buffers 203
       SWAP:
          total 3071
          free 885
          used 2186

Changed in fuel:
assignee: Maksim Malchuk (mmalchuk) → Fuel Sustaining (fuel-sustaining-team)
status: In Progress → Confirmed
Revision history for this message
Alisa Tselovalnikova (atselovalnikova) wrote :
Revision history for this message
Alexey Shtokolov (ashtokolov) wrote :

The amount of RAM was increased for OpenStack nodes.

Changed in fuel:
status: Confirmed → Fix Committed
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on fuel-infra/jenkins-jobs (master)

Change abandoned by Vladimir Khlyunev <email address hidden> on branch: master
Review: https://review.fuel-infra.org/27152
Reason: already implemented

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-qa (master)

Change abandoned by Vladimir Khlyunev (<email address hidden>) on branch: master
Review: https://review.openstack.org/370174
Reason: not needed anymore
