neutron-ns-meta invokes oom-killer during gate runs

Bug #1362347 reported by Matthew Treinish on 2014-08-27
This bug affects 1 person
Affects: neutron
Status: Invalid
Importance: High
Assigned to: Salvatore Orlando
Milestone: none

Bug Description

Occasionally a neutron gate job fails because the node runs out of memory. The oom-killer is invoked and starts killing processes to save the node, which only causes cascading failures. The kernel logs show that the oom-killer is being invoked by neutron-ns-meta.

An example of one such failure is:

http://logs.openstack.org/75/116775/2/check/check-tempest-dsvm-neutron-full/ab17a70/

With the kernel log:

http://logs.openstack.org/75/116775/2/check/check-tempest-dsvm-neutron-full/ab17a70/logs/syslog.txt.gz#_Aug_26_04_59_03
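
For anyone who wants to check a downloaded syslog for the same signature, here is a rough sketch that counts which process names show up in the kernel's "invoked oom-killer" lines. The function names and the sample lines are made up for illustration, and the regex assumes the usual kernel wording. The kernel truncates process names to 15 characters, so neutron-ns-meta is likely neutron-ns-metadata-proxy.

```python
import gzip
import re
from collections import Counter

# Assumed kernel wording: "<process name> invoked oom-killer: gfp_mask=..."
OOM_RE = re.compile(r"(\S+) invoked oom-killer")


def count_oom_invokers(lines):
    """Count how often each process name triggered the oom-killer."""
    counts = Counter()
    for line in lines:
        match = OOM_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_oom_invokers_in_gzip(path):
    """Same as above, reading a gzip-compressed syslog (e.g. syslog.txt.gz)."""
    with gzip.open(path, "rt", errors="replace") as handle:
        return count_oom_invokers(handle)


if __name__ == "__main__":
    # Hypothetical sample lines, not copied from the real gate log.
    sample = [
        "Aug 26 04:59:03 host kernel: [ 2500.1] neutron-ns-meta invoked oom-killer: gfp_mask=0x201da, order=0",
        "Aug 26 04:59:03 host kernel: [ 2500.2] Out of memory: Kill process 1234 (mysqld)",
        "Aug 26 04:59:10 host kernel: [ 2507.3] neutron-ns-meta invoked oom-killer: gfp_mask=0x201da, order=0",
    ]
    for name, hits in count_oom_invokers(sample).most_common():
        print(name, hits)
```

Running `count_oom_invokers_in_gzip` on a locally saved copy of the syslog would show whether neutron-ns-meta is the only process triggering the oom-killer or just one of several.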

Using logstash, this failure can be isolated to neutron gate jobs only. So something is probably triggering neutron to occasionally make the job consume more than 8 GB of RAM.

I also noted in the neutron svc log that the first out-of-memory error came from keystone-middleware:

http://logs.openstack.org/75/116775/2/check/check-tempest-dsvm-neutron-full/ab17a70/logs/screen-q-svc.txt.gz#_2014-08-26_04_56_39_602

but that may just be a red herring.

tags: added: gate-failure
summary: - neutron-ns-meta invokes oom-killer during gate runs
+ gneutron-ns-meta invokes oom-killer during gate runs
summary: - gneutron-ns-meta invokes oom-killer during gate runs
+ neutron-ns-meta invokes oom-killer during gate runs
Changed in neutron:
assignee: nobody → Salvatore Orlando (salvatore-orlando)
milestone: none → juno-3

Failure analysis here: http://blog.kortar.org/?p=52

While I was looking at these failures, I noticed that they only occurred with MySQL as the DB backend.
There are not enough data points to confirm this, but the root cause may have something to do with the DBMS itself, with python-mysqldb, or with the SQLAlchemy backend for MySQL.

Or this might be a red herring and it's just neutron exhausting memory.

However, the OOMs usually occur 30 to 40 minutes into the test runs, and Tempest runs usually last longer than that. Considering that this is not a frequent failure, if it were a progressive memory leak due to system load, the OOM messages should have occurred closer to the end of the test run.

Changed in neutron:
importance: Undecided → Medium
importance: Medium → High
Thierry Carrez (ttx) on 2014-09-03
Changed in neutron:
milestone: juno-3 → juno-rc1
Changed in neutron:
milestone: juno-rc1 → kilo-1
Matt Riedemann (mriedem) wrote :

We haven't seen this in 10 days.

Changed in neutron:
status: New → Incomplete
Joe Gordon (jogo) wrote :

Since we haven't seen this in a while, closing the bug. If this is seen again please re-open.

Changed in neutron:
status: Incomplete → Invalid
Thierry Carrez (ttx) on 2014-11-25
Changed in neutron:
milestone: kilo-1 → none