Fuel for OpenStack

nova-api consumes more memory which causes OOM on controller nodes

Bug #1822388 reported by Jan Wasilewski on 2019-03-29

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	Fix Released	Medium	Oleksiy Molchanov	Fuel for OpenStack 9.2-mu-12

Bug Description

With the provided nova packages, version 13.0.0-7~u14.04+mos0-<customer-specific>, the customer is experiencing the issues below, which stem from nova's soft-delete database design:
1. Soft-deleted instances are kept in the database, so the DB grows very large and has to be cleaned using this solution: https://access.redhat.com/solutions/3239481 . The customer claims that, based on current observations, the cleanup has to be run every two minutes, so it may be better to provide an automated mechanism for this cleanup.
2. nova-api memory grows over time. Even after applying the recommendation from point 1, the memory does not seem to be reclaimed, and nova-api eventually crashes with OOM once host memory is exhausted. The customer considers this a blocker because it can disrupt nova operations that are already in progress. For the same reason, restarting nova-api from time to time is not acceptable to the customer, even as part of a workaround.
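To make the automated cleanup suggested in point 1 more concrete, here is a minimal sketch of a batched purge of soft-deleted rows. It is not the procedure from the linked solution and not nova's own tooling. It runs against an in-memory SQLite table and assumes that a soft-deleted row is marked by a non-zero `deleted` column. The helper names `purge_soft_deleted` and `run_demo`, and the three-column `instances` table, are hypothetical.

```python
import sqlite3


def purge_soft_deleted(conn, table, batch_size=1000):
    """Delete rows with a non-zero `deleted` column in batches.

    Assumption: `table` is a trusted, hard-coded name (it is interpolated
    into the SQL text) and soft-deleted rows have deleted != 0.
    Returns the total number of rows removed.
    """
    total = 0
    while True:
        cur = conn.execute(
            f"DELETE FROM {table} WHERE rowid IN "
            f"(SELECT rowid FROM {table} WHERE deleted != 0 LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # commit per batch so each transaction stays short
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total


def run_demo():
    """Build a toy table, purge it, and return (removed, remaining)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE instances ("
        "id INTEGER PRIMARY KEY, uuid TEXT, deleted INTEGER DEFAULT 0)"
    )
    # Rows with an odd id are marked as soft-deleted, the rest are live.
    rows = [(f"uuid-{i}", i if i % 2 else 0) for i in range(1, 11)]
    conn.executemany(
        "INSERT INTO instances (uuid, deleted) VALUES (?, ?)", rows
    )
    removed = purge_soft_deleted(conn, "instances", batch_size=3)
    remaining = conn.execute("SELECT COUNT(*) FROM instances").fetchone()[0]
    return removed, remaining
```

Working in small batches keeps each transaction short, which matters if such a job has to run as often as every two minutes. A real deployment would more likely schedule the cleanup tooling that ships with nova than use hand-written SQL.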

If a fix has already been released in a newer Mitaka version of Nova, it would be helpful to point to it. I have not been able to find one so far, but the customer's nova version is fairly old. Additionally, the customer runs a customer-specific version that they prepared themselves.

Oleksiy Molchanov (omolchanov) on 2019-03-31

Changed in fuel:
assignee:	nobody → Oleksiy Molchanov (omolchanov)

Oleksiy Molchanov (omolchanov) on 2019-04-02

Changed in fuel:
status:	New → In Progress
importance:	Undecided → Medium
milestone:	none → 9.2-mu-12

Denis Meltsaykin (dmeltsaykin) wrote on 2019-04-10:

We need to consider backporting of https://review.openstack.org/#/c/409943/

Denis Meltsaykin (dmeltsaykin) wrote on 2019-04-10:

Backport: https://review.fuel-infra.org/#/c/40949/

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2019-04-16: Fix merged to openstack/nova (9.0/mitaka)

Reviewed: https://review.fuel-infra.org/40949
Submitter: Pkgs Jenkins <email address hidden>
Branch: 9.0/mitaka

Commit: fe1f542a476b929375585b39ff7cb268ff20fc85
Author: Jay Pipes <email address hidden>
Date: Wed Apr 10 15:45:43 2019

Only return latest instance fault for instances

This patch addresses slowness that can occur when doing a list servers
API operation when there are many thousands of records in the
instance_faults table.

Previously, in the Instance.fill_faults() method, we were getting all
instance fault records for a set of instances having one of a set of
supplied instance UUIDs and then iterating over those faults and
returning a dict of instance UUID to the first fault returned (which
happened to be the latest fault because of ordering the SQL query by
created_at).

This patch adds a new InstanceFaultList.get_latest_by_instance_uuids()
method that does some SQL-fu to only return the latest fault records for
each instance being inspected.

Closes-Bug: #1822388

Co-Authored-By: Roman Podoliaka <email address hidden>
Change-Id: I8f2227b3969791ebb2d04d74a316b9d97a4b1571
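
The difference between the two approaches described in the commit message can be sketched as follows. This is an illustration only, not the code from the patch. It uses an in-memory SQLite table named `instance_faults` with a reduced, hypothetical set of columns, and it assumes that the fault with the highest `id` for an instance is also its latest one. The function names are hypothetical as well.

```python
import sqlite3


def latest_faults_in_python(conn, uuids):
    """Old style: fetch every fault row, keep the first one per instance."""
    if not uuids:
        return {}
    marks = ",".join("?" for _ in uuids)
    rows = conn.execute(
        "SELECT instance_uuid, id, message FROM instance_faults "
        f"WHERE instance_uuid IN ({marks}) "
        "ORDER BY created_at DESC, id DESC",
        list(uuids),
    ).fetchall()
    latest = {}
    for uuid, fault_id, message in rows:
        # Rows arrive newest first, so only the first hit per UUID is kept.
        latest.setdefault(uuid, (fault_id, message))
    return latest


def latest_faults_in_sql(conn, uuids):
    """New style: let the database return one (latest) row per instance."""
    if not uuids:
        return {}
    marks = ",".join("?" for _ in uuids)
    rows = conn.execute(
        "SELECT f.instance_uuid, f.id, f.message "
        "FROM instance_faults AS f "
        "JOIN (SELECT instance_uuid, MAX(id) AS max_id "
        "      FROM instance_faults "
        f"      WHERE instance_uuid IN ({marks}) "
        "      GROUP BY instance_uuid) AS latest "
        "ON f.id = latest.max_id",
        list(uuids),
    ).fetchall()
    return {uuid: (fault_id, message) for uuid, fault_id, message in rows}


def build_demo_db():
    """Create a toy instance_faults table with a few rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE instance_faults ("
        "id INTEGER PRIMARY KEY, instance_uuid TEXT, "
        "created_at TEXT, message TEXT)"
    )
    conn.executemany(
        "INSERT INTO instance_faults (instance_uuid, created_at, message) "
        "VALUES (?, ?, ?)",
        [
            ("a", "2019-04-01 10:00:00", "first"),
            ("a", "2019-04-01 11:00:00", "second"),
            ("a", "2019-04-01 12:00:00", "third"),
            ("b", "2019-04-01 09:00:00", "only"),
            ("z", "2019-04-01 08:00:00", "not requested"),
        ],
    )
    conn.commit()
    return conn
```

Both functions return the same mapping, but the first one transfers every fault row to the caller before discarding most of them, while the second transfers at most one row per instance. With many thousands of rows in the table, that difference affects both the response time of a list servers call and the amount of data the API process has to hold while handling it.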

Changed in fuel:
status:	In Progress → Fix Committed

Mikhail Samoylov (msamoylov) wrote on 2019-05-24:

Verified.
Connection to node-3 closed.
[root@nailgun ~]# fuel nodes
id | status | name | cluster | ip | mac | roles | pending_roles | online | group_id
---+--------+---------------------------+---------+------------+-------------------+-------------------+---------------+--------+---------
2 | ready | slave-04_compute_ceph-osd | 1 | 10.109.0.6 | 64:f1:86:74:e3:d0 | ceph-osd, compute | | 1 | 1
1 | ready | slave-01_controller | 1 | 10.109.0.3 | 64:55:a9:ec:15:64 | controller | | 1 | 1
6 | ready | slave-02_controller | 1 | 10.109.0.4 | 64:70:75:42:22:8a | controller | | 1 | 1
4 | ready | slave-03_controller | 1 | 10.109.0.5 | 64:01:57:e9:9d:cd | controller | | 1 | 1
3 | ready | slave-05_compute_ceph-osd | 1 | 10.109.0.7 | 64:90:24:71:e0:2b | ceph-osd, compute | | 1 | 1
5 | ready | slave-06_compute_ceph-osd | 1 | 10.109.0.8 | 64:57:05:15:69:fb | ceph-osd, compute | | 1 | 1

root@node-3:~# grep 'def instance_fault_get_by_instance_uuids(context, instance_uuids,' /usr/lib/python2.7/dist-packages/nova/db/api.py
def instance_fault_get_by_instance_uuids(context, instance_uuids,
root@node-3:~# exit
logout

Changed in fuel:
status:	Fix Committed → Fix Released
