Low-latency worker node reboots when pods under heavy load

Bug #1830297 reported by Brent Rowsell
This bug affects 1 person
Affects: StarlingX
Status: Won't Fix
Importance: Low
Assigned to: Jim Gauld

Bug Description

Brief Description
-----------------
This is likely related to
https://bugs.launchpad.net/starlingx/+bug/1830296

I was running a pod with exclusive CPUs, with a process inside it spinning in a busy loop. After a few minutes the worker node rebooted.
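
The report does not include the workload itself. As a rough sketch of what "a process in a busy loop" could look like, the following Python script spins on whatever CPUs it is allowed to run on. The function names and the 600-second duration are illustrative assumptions, not details from the original reproduction.

```python
import os
import time


def busy_loop(duration_s):
    """Spin on the CPU for roughly duration_s seconds; return the iteration count."""
    deadline = time.monotonic() + duration_s
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1  # pure CPU work: no sleep, no I/O
    return iterations


def allowed_cpus():
    """Return the CPUs this process may run on (empty list if unsupported)."""
    # os.sched_getaffinity is only available on some platforms (e.g. Linux).
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))
    return []


if __name__ == "__main__":
    # In a pod with exclusive CPUs, this should list only the assigned cores.
    print("allowed CPUs:", allowed_cpus())
    busy_loop(600.0)  # hypothetical duration: "a few minutes"
```

Run inside a container with exclusive CPUs, a script like this would keep one core fully busy; one copy per assigned core would be needed to load all of them.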

Severity
--------
Critical

Steps to Reproduce
------------------
See above

Expected Behavior
------------------
No reboot

Actual Behavior
----------------
Reboot

Reproducibility
---------------
100%

System Configuration
--------------------
Standard config

Branch/Pull Time/Commit
-----------------------
2019-05-22 17:57:16 -0400

Last Pass
---------
Don't know

Timestamp/Logs
--------------
There were no useful logs. All we see is the loss of mgmt/cluster network heartbeat on the controller. It appears the worker simply stopped responding.

Test Activity
-------------
Other


Changed in starlingx:
status: New → Triaged
importance: Undecided → Critical
Ghada Khalil (gkhalil) wrote :

Marking as release gating; container testing is taking the node down.

Changed in starlingx:
assignee: nobody → Brent Rowsell (brent-rowsell)
tags: added: stx.2.0 stx.containers
summary: - Low-latency worker node reboots
+ Low-latency worker node reboots when pods under heavy load
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: Brent Rowsell (brent-rowsell) → Jim Gauld (jgauld)
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Critical → High
Jim Gauld (jgauld) wrote :

There are a few changes to kubelet that require upstream fixes. I will open upstream issues to track them, e.g., preventing throttling of CFS shares for Guaranteed pods, and the ability to isolate the Linux platform from kubepods. Usage of 'isolcpus' has already been disabled for low-latency, so we no longer get tasks stuck on specific cores.
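
As an aside on the throttling mentioned above: with cgroup v1, the CPU controller exposes a `cpu.stat` file containing `nr_periods`, `nr_throttled`, and `throttled_time` counters. A minimal sketch of parsing such a file is shown below. The sample values are made up for illustration and are not measurements from this bug; on a real node the text would be read from the pod's cgroup directory, whose path depends on the deployment.

```python
def parse_cpu_stat(text):
    """Parse 'key value' lines from a cgroup cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_fraction(stats):
    """Fraction of CFS enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


# Made-up sample contents, for illustration only (not taken from this bug).
SAMPLE = "nr_periods 1200\nnr_throttled 300\nthrottled_time 4500000000\n"

if __name__ == "__main__":
    stats = parse_cpu_stat(SAMPLE)
    print(stats, throttled_fraction(stats))
```

A non-zero and growing `nr_throttled` for a Guaranteed pod would suggest that CFS bandwidth control is limiting it.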

Frank Miller (sensfan22) wrote :

After discussion with Brent (containers TL), we agreed to re-gate this issue to stx.3.0 as the fixes required will need to be implemented in the upstream kubernetes package.

tags: added: stx.3.0
removed: stx.2.0
Frank Miller (sensfan22) wrote :

Lowered priority to medium as the issue is only seen under very heavy load.

Changed in starlingx:
importance: High → Medium
Frank Miller (sensfan22) wrote :

This issue has not yet been addressed in the kubernetes package. Removing the stx.3.0 tag.

tags: added: stx.4.0
removed: stx.3.0
Ghada Khalil (gkhalil) wrote :

Lowering the priority as this may require k8s upstream fixes. We will not hold up stx.4.0 for this given that the issue was reported a year ago and is present in previous releases.

tags: removed: stx.4.0
Changed in starlingx:
importance: Medium → Low
Ramaswamy Subramanian (rsubrama) wrote :

No progress on this bug for more than 2 years. Candidate for closure.

If there is no update, this issue is targeted to be closed as 'Won't Fix' in 2 weeks.

Ramaswamy Subramanian (rsubrama) wrote :

Changing the status to 'Won't Fix' as there is no activity.

Changed in starlingx:
status: Triaged → Won't Fix