nova-api consumes more memory which causes OOM on controller nodes

Bug #1822388 reported by Jan Wasilewski
This bug affects 1 person
Affects: Fuel for OpenStack
Status: Fix Released
Importance: Medium
Assigned to: Oleksiy Molchanov

Bug Description

With the provided nova package version 13.0.0-7~u14.04+mos0-<customer-specific>, the customer is experiencing the issues below, which stem from nova's soft-delete database design:
1. `soft-deleted` instances are kept in the database, so the DB grows very large and has to be cleaned using this solution: https://access.redhat.com/solutions/3239481 . The customer claims that, based on current observations, the cleanup has to be executed every two minutes, so it may be better to provide some automatic functionality for that cleaning process.
2. nova-api memory grows over time, and even after applying the recommendation from point 1 the memory does not appear to be reclaimed. This leads to nova-api crashing from OOM once host memory is exhausted. From the customer's perspective this looks like a blocker, as it can disrupt nova operations that are already in progress. Periodic restarts are therefore not acceptable to the customer, even as part of a workaround.

If a fix has been released in a newer Mitaka version of Nova, it would be good to point to it. So far I have not been able to find one, but the customer's version of nova is quite old. Additionally, the customer runs a specific version which they compacted themselves.
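
For illustration only, here is a minimal sketch of what an automated cleanup of soft-deleted rows could look like. It is not nova code: it uses an in-memory SQLite database, and the table layout and the names `archive_deleted_rows` and `build_demo_db` are hypothetical. The one assumption it borrows from nova's design is that a soft-deleted row is marked by a non-zero `deleted` column and gets moved to a matching `shadow_` table in bounded batches.

```python
import sqlite3


def build_demo_db():
    """Create a toy schema (hypothetical, not nova's real schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE instances "
        "(id INTEGER PRIMARY KEY, uuid TEXT, deleted INTEGER)"
    )
    conn.execute(
        "CREATE TABLE shadow_instances "
        "(id INTEGER PRIMARY KEY, uuid TEXT, deleted INTEGER)"
    )
    # ids 1..5: odd ids are soft-deleted (deleted = id), even ids are live.
    conn.executemany(
        "INSERT INTO instances VALUES (?, ?, ?)",
        [(i, "uuid-%d" % i, i if i % 2 else 0) for i in range(1, 6)],
    )
    conn.commit()
    return conn


def archive_deleted_rows(conn, max_rows):
    """Move up to max_rows soft-deleted rows into the shadow table.

    Returns the number of rows moved, so a periodic job can keep calling
    it until it returns 0.
    """
    rows = conn.execute(
        "SELECT id, uuid, deleted FROM instances "
        "WHERE deleted != 0 ORDER BY id LIMIT ?",
        (max_rows,),
    ).fetchall()
    if not rows:
        return 0
    with conn:  # copy and delete commit together or not at all
        conn.executemany(
            "INSERT INTO shadow_instances (id, uuid, deleted) "
            "VALUES (?, ?, ?)",
            rows,
        )
        conn.executemany(
            "DELETE FROM instances WHERE id = ?",
            [(row[0],) for row in rows],
        )
    return len(rows)
```

Bounding each run with `max_rows` keeps individual transactions short, which matters if the job really has to run every couple of minutes on a busy controller.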

Changed in fuel:
assignee: nobody → Oleksiy Molchanov (omolchanov)
Changed in fuel:
status: New → In Progress
importance: Undecided → Medium
milestone: none → 9.2-mu-12
Denis Meltsaykin (dmeltsaykin) wrote :

We need to consider backporting https://review.openstack.org/#/c/409943/

Denis Meltsaykin (dmeltsaykin) wrote :
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/nova (9.0/mitaka)

Reviewed: https://review.fuel-infra.org/40949
Submitter: Pkgs Jenkins <email address hidden>
Branch: 9.0/mitaka

Commit: fe1f542a476b929375585b39ff7cb268ff20fc85
Author: Jay Pipes <email address hidden>
Date: Wed Apr 10 15:45:43 2019

Only return latest instance fault for instances

This patch addresses slowness that can occur when doing a list servers
API operation when there are many thousands of records in the
instance_faults table.

Previously, in the Instance.fill_faults() method, we were getting all
instance fault records for a set of instances having one of a set of
supplied instance UUIDs and then iterating over those faults and
returning a dict of instance UUID to the first fault returned (which
happened to be the latest fault because of ordering the SQL query by
created_at).

This patch adds a new InstanceFaultList.get_latest_by_instance_uuids()
method that does some SQL-fu to only return the latest fault records for
each instance being inspected.

Closes-Bug: #1822388

Co-Authored-By: Roman Podoliaka <email address hidden>
Change-Id: I8f2227b3969791ebb2d04d74a316b9d97a4b1571
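
To illustrate the difference the commit message describes, here is a rough sketch, not the actual nova implementation. It uses an in-memory SQLite table, and the function names `faults_all_then_filter` and `faults_latest_only` are hypothetical. The second query also assumes that the highest `id` per instance is the latest fault, which holds only if ids increase with creation time.

```python
import sqlite3


def build_demo_db():
    """Toy instance_faults table (hypothetical columns, not nova's schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE instance_faults (id INTEGER PRIMARY KEY, "
        "instance_uuid TEXT, created_at TEXT, message TEXT)"
    )
    conn.executemany(
        "INSERT INTO instance_faults (instance_uuid, created_at, message) "
        "VALUES (?, ?, ?)",
        [
            ("uuid-a", "2019-04-01 10:00:00", "old fault a"),
            ("uuid-a", "2019-04-02 10:00:00", "new fault a"),
            ("uuid-b", "2019-04-01 09:00:00", "only fault b"),
            ("uuid-c", "2019-04-03 08:00:00", "fault c, not requested"),
        ],
    )
    conn.commit()
    return conn


def faults_all_then_filter(conn, uuids):
    """Old approach: fetch every fault, keep the first one seen per instance."""
    marks = ",".join("?" * len(uuids))
    rows = conn.execute(
        "SELECT instance_uuid, message FROM instance_faults "
        "WHERE instance_uuid IN (%s) "
        "ORDER BY created_at DESC, id DESC" % marks,
        list(uuids),
    ).fetchall()
    latest = {}
    for uuid, message in rows:
        latest.setdefault(uuid, message)  # first row per uuid is the newest
    return latest, len(rows)


def faults_latest_only(conn, uuids):
    """New approach: let SQL return a single (latest) fault per instance."""
    marks = ",".join("?" * len(uuids))
    rows = conn.execute(
        "SELECT f.instance_uuid, f.message FROM instance_faults AS f "
        "JOIN (SELECT MAX(id) AS max_id FROM instance_faults "
        "      WHERE instance_uuid IN (%s) "
        "      GROUP BY instance_uuid) AS latest "
        "ON f.id = latest.max_id" % marks,
        list(uuids),
    ).fetchall()
    return dict(rows), len(rows)
```

Both functions return the same mapping of instance UUID to fault message, but the first transfers every matching fault row to the caller while the second transfers at most one row per instance.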

Changed in fuel:
status: In Progress → Fix Committed
Mikhail Samoylov (msamoylov) wrote :

Verified.
Connection to node-3 closed.
[root@nailgun ~]# fuel nodes
id | status | name | cluster | ip | mac | roles | pending_roles | online | group_id
---+--------+---------------------------+---------+------------+-------------------+-------------------+---------------+--------+---------
 2 | ready | slave-04_compute_ceph-osd | 1 | 10.109.0.6 | 64:f1:86:74:e3:d0 | ceph-osd, compute | | 1 | 1
 1 | ready | slave-01_controller | 1 | 10.109.0.3 | 64:55:a9:ec:15:64 | controller | | 1 | 1
 6 | ready | slave-02_controller | 1 | 10.109.0.4 | 64:70:75:42:22:8a | controller | | 1 | 1
 4 | ready | slave-03_controller | 1 | 10.109.0.5 | 64:01:57:e9:9d:cd | controller | | 1 | 1
 3 | ready | slave-05_compute_ceph-osd | 1 | 10.109.0.7 | 64:90:24:71:e0:2b | ceph-osd, compute | | 1 | 1
 5 | ready | slave-06_compute_ceph-osd | 1 | 10.109.0.8 | 64:57:05:15:69:fb | ceph-osd, compute | | 1 | 1

root@node-3:~# grep 'def instance_fault_get_by_instance_uuids(context, instance_uuids,' /usr/lib/python2.7/dist-packages/nova/db/api.py
def instance_fault_get_by_instance_uuids(context, instance_uuids,
root@node-3:~# exit
logout

Changed in fuel:
status: Fix Committed → Fix Released