Low-latency worker node reboots when pods under heavy load

Bug #1830297 reported by Brent Rowsell on 2019-05-24
Affects: StarlingX
Status: Triaged
Importance: Medium
Assigned to: Jim Gauld

Bug Description

Brief Description
-----------------
This is likely related to
https://bugs.launchpad.net/starlingx/+bug/1830296

I was running a pod with exclusive CPUs, with a process running in a busy loop. After a few minutes, the node rebooted.
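
For illustration, the busy-loop workload can be as trivial as the C sketch below (hypothetical; the actual container image and command used in the test are not recorded in this report). With the kubelet static CPU manager policy, a Guaranteed pod requesting an integer number of CPUs is given exclusive cores and its cpuset is restricted to them, so the process only needs to pin itself to one of those cores and spin:

    /* busy_loop.c -- hypothetical reproducer, not the original test code.
     * Pins itself to the first CPU in its affinity mask and spins at 100%. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);

        /* Inside a pod with exclusive CPUs, the cpuset cgroup already limits
         * the affinity mask to the exclusive set; pick the first CPU in it. */
        if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_getaffinity");
            return 1;
        }
        for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
            if (CPU_ISSET(cpu, &mask)) {
                cpu_set_t one;
                CPU_ZERO(&one);
                CPU_SET(cpu, &one);
                if (sched_setaffinity(0, sizeof(one), &one) != 0)
                    perror("sched_setaffinity");
                break;
            }
        }

        /* Busy loop: consume 100% of the pinned CPU indefinitely. */
        for (;;)
            ;
    }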

Severity
--------
Critical

Steps to Reproduce
------------------
See above

Expected Behavior
------------------
No reboot

Actual Behavior
----------------
Reboot

Reproducibility
---------------
100%

System Configuration
--------------------
Standard config

Branch/Pull Time/Commit
-----------------------
2019-05-22 17:57:16 -0400

Last Pass
---------
Don't know

Timestamp/Logs
--------------
There were no useful logs. All we see is the loss of mgmt/cluster network heartbeat on the controller. It appears the worker simply stopped responding.

Test Activity
-------------
Other


Changed in starlingx:
status: New → Triaged
importance: Undecided → Critical
Ghada Khalil (gkhalil) wrote:

Marking as release gating; container testing is taking the node down.

Changed in starlingx:
assignee: nobody → Brent Rowsell (brent-rowsell)
tags: added: stx.2.0 stx.containers
summary: - Low-latency worker node reboots
+ Low-latency worker node reboots when pods under heavy load
Ghada Khalil (gkhalil) on 2019-05-24
Changed in starlingx:
assignee: Brent Rowsell (brent-rowsell) → Jim Gauld (jgauld)
Ghada Khalil (gkhalil) on 2019-05-24
Changed in starlingx:
importance: Critical → High
Jim Gauld (jgauld) wrote:

There are a few changes to kubelet that require upstream fixes. Will open upstream issues to track, e.g., preventing throttling of CFS shares for Guaranteed pods, and the ability to isolate the Linux platform from kubepods. Usage of 'isolcpus' has already been disabled for low-latency, so we no longer get tasks stuck on specific cores.
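
For reference on the CFS mechanism mentioned above: under cgroup v1 the kubelet (with CFS quota enabled, which is its default) derives cpu.cfs_quota_us for a pod from its CPU limit, and any positive quota lets the scheduler throttle the group even when the pod holds exclusive CPUs. The C sketch below shows how to inspect those settings; the pod cgroup path is a hypothetical placeholder, not taken from this report. A quota of -1 means the group is never throttled, which is presumably the behavior the upstream fix would give Guaranteed pods:

    /* cfs_inspect.c -- sketch for inspecting CFS bandwidth settings. */
    #include <stdio.h>

    /* Read a single long from a cgroup file; returns -2 if unreadable so
     * that the valid value -1 ("no quota") stays distinguishable. */
    static long read_long(const char *path)
    {
        long v;
        FILE *f = fopen(path, "r");
        if (!f || fscanf(f, "%ld", &v) != 1)
            v = -2;
        if (f)
            fclose(f);
        return v;
    }

    int main(void)
    {
        /* Hypothetical cgroup-v1 path for a Guaranteed pod; substitute a
         * real pod UID from /sys/fs/cgroup/cpu/kubepods/. */
        const char *base = "/sys/fs/cgroup/cpu/kubepods/pod<uid>";
        char path[256];

        snprintf(path, sizeof(path), "%s/cpu.cfs_quota_us", base);
        long quota = read_long(path);
        snprintf(path, sizeof(path), "%s/cpu.cfs_period_us", base);
        long period = read_long(path);

        if (quota == -1)
            printf("quota = -1: the group is never CFS-throttled\n");
        else if (quota > 0)
            printf("throttled to %ld us of CPU time per %ld us period\n",
                   quota, period);
        else
            printf("could not read cgroup files (check the path)\n");
        return 0;
    }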

Frank Miller (sensfan22) wrote:

After discussion with Brent (containers TL), we agreed to re-gate this issue to stx.3.0 as the required fixes will need to be implemented in the upstream kubernetes package.

tags: added: stx.3.0
removed: stx.2.0
Frank Miller (sensfan22) wrote:

Lowered priority to medium as the issue is only seen under very heavy load.

Changed in starlingx:
importance: High → Medium
Frank Miller (sensfan22) wrote:

This issue has not yet been addressed in the kubernetes package. Removing the stx.3.0 tag.

tags: added: stx.4.0
removed: stx.3.0