Fuel for OpenStack

Make OpenvSwitch more protected from OOM-killer

Bug #1814046 reported by Alexander Rubtsov on 2019-01-31

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	New	Medium	MOS Maintenance	Fuel for OpenStack 9.x-updates

Bug Description

MOS: 9.2

When a memory leak occurs on a compute node, the OOM killer is invoked and sometimes kills the ovs-vswitchd process even though it is not the source of the leak. This impacts all VM instances running on that compute node.

Is it possible to adjust the OOM score (e.g. oom_score_adj) of Open vSwitch to reduce the chances that it will be killed?

Alexander Rubtsov (arubtsov) wrote on 2019-01-31:

sla2 for 9.0-updates

Changed in fuel:
importance:	Undecided → Medium
assignee:	nobody → MOS Maintenance (mos-maintenance)
milestone:	none → 9.x-updates
tags:	added: customer-found sla2

Roman Lubianyi (rlubianyi) wrote on 2019-01-31:

Hi Alexander,

You can write a large negative value to the /proc/[ovs-vswitchd-PID]/oom_score_adj file so that the process is less likely to be picked and terminated by the OOM killer, e.g. "echo -500 > /proc/[ovs-vswitchd-PID]/oom_score_adj". The oom_score_adj value can range from -1000 to 1000. If you set it to -1000, the process can use 100% of memory and still avoid being terminated by the OOM killer. Be aware that this change is valid only until a reboot or service restart. If you want the oom_score_adj value to persist across reboots and service restarts, then add "oom score -500" to the /etc/init/openvswitch-switch.conf file and restart the openvswitch-switch service.
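
For illustration only, the manual adjustment described above could be scripted roughly as follows. This is a minimal sketch, not something shipped with MOS or Open vSwitch: the helper names `set_oom_score_adj` and `read_oom_score_adj` and the `proc_root` parameter are hypothetical, introduced here so that the range check can be exercised without touching the real /proc.

```python
from pathlib import Path

# Valid range for oom_score_adj, as stated in the comment above.
OOM_SCORE_ADJ_MIN = -1000
OOM_SCORE_ADJ_MAX = 1000


def set_oom_score_adj(pid: int, score: int, proc_root: str = "/proc") -> Path:
    """Write `score` to <proc_root>/<pid>/oom_score_adj and return that path.

    `proc_root` is a hypothetical parameter that exists only so the helper
    can be pointed at a scratch directory when testing.
    """
    if not OOM_SCORE_ADJ_MIN <= score <= OOM_SCORE_ADJ_MAX:
        raise ValueError(f"score must be within [-1000, 1000], got {score}")
    target = Path(proc_root) / str(pid) / "oom_score_adj"
    target.write_text(f"{score}\n")
    return target


def read_oom_score_adj(pid: int, proc_root: str = "/proc") -> int:
    """Read back the current adjustment for `pid` as an integer."""
    target = Path(proc_root) / str(pid) / "oom_score_adj"
    return int(target.read_text().strip())
```

On a real node this would be called as `set_oom_score_adj(pid, -500)` with root privileges, where `pid` is the ovs-vswitchd PID obtained separately. As with the echo command, the value is lost when the process is restarted.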

Alexander Rubtsov (arubtsov) wrote on 2019-02-04:

Hi Roman,

The purpose of this bug report is to have an optimal score value out of the box (in MOS generally, rather than as a customization of particular environments), so that Mirantis OpenStack deploys environments with this score by default.

Denis Meltsaykin (dmeltsaykin) wrote on 2019-02-04:

Alexander, setting any OOM-killer value is tightly coupled to the characteristics of the real environment. Since we consider memory leaks a rare-to-nonexistent event, we don't see any reason to change anything there. Moreover, neither Controller nor Compute nodes have any sacrificial processes that are unimportant for proper OpenStack operation. You just cannot save everything: prioritizing every process effectively means that no process has priority (exactly what we have now). Additionally, a memory leak should be fixed, not worked around or masked; otherwise it may stay hidden until it's too late and all the data is lost. OpenStack is designed to fail over when one of the nodes fails, which allows cloud operators to troubleshoot the failed node, learn the lesson, and avoid its recurrence in the future.
