Low-latency worker node reboots when pods under heavy load

Bug #1830297 reported by Brent Rowsell on 2019-05-24
Affects: StarlingX
Status: Triaged
Importance: Medium
Assigned to: Jim Gauld

Bug Description

Brief Description
-----------------
This is likely related to
https://bugs.launchpad.net/starlingx/+bug/1830296

I was running a pod with exclusive CPUs, with a process running in a busy loop. After a few minutes, the node rebooted.
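
For illustration, the busy-loop workload can be as trivial as the C sketch below (hypothetical; the actual container image and command used in the test are not recorded in this report). With the kubelet static CPU manager policy, a Guaranteed pod requesting an integer number of CPUs is given exclusive cores and its cpuset is restricted to them, so the process only needs to pin itself to one of those cores and spin:

    /* busy_loop.c -- hypothetical reproducer, not the original test code.
     * Pins itself to the first CPU in its affinity mask and spins at 100%. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);

        /* Inside a pod with exclusive CPUs, the cpuset cgroup already limits
         * the affinity mask to the exclusive set; pick the first CPU in it. */
        if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_getaffinity");
            return 1;
        }
        for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
            if (CPU_ISSET(cpu, &mask)) {
                cpu_set_t one;
                CPU_ZERO(&one);
                CPU_SET(cpu, &one);
                if (sched_setaffinity(0, sizeof(one), &one) != 0)
                    perror("sched_setaffinity");
                break;
            }
        }

        /* Busy loop: consume 100% of the pinned CPU indefinitely. */
        for (;;)
            ;
    }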

Severity
--------
Critical

Steps to Reproduce
------------------
See above

Expected Behavior
------------------
No reboot

Actual Behavior
----------------
Reboot

Reproducibility
---------------
100%

System Configuration
--------------------
Standard config

Branch/Pull Time/Commit
-----------------------
2019-05-22 17:57:16 -0400

Last Pass
---------
Don't know

Timestamp/Logs
--------------
There were no useful logs. All we see is the loss of mgmt/cluster network heartbeat on the controller. It appears the worker simply stopped responding.

Test Activity
-------------
Other


Changed in starlingx:
status: New → Triaged
importance: Undecided → Critical
Ghada Khalil (gkhalil) wrote:

Marking as release gating; container testing is taking the node down.

Changed in starlingx:
assignee: nobody → Brent Rowsell (brent-rowsell)
tags: added: stx.2.0 stx.containers
summary: - Low-latency worker node reboots
+ Low-latency worker node reboots when pods under heavy load
Ghada Khalil (gkhalil) on 2019-05-24
Changed in starlingx:
assignee: Brent Rowsell (brent-rowsell) → Jim Gauld (jgauld)
Ghada Khalil (gkhalil) on 2019-05-24
Changed in starlingx:
importance: Critical → High
Jim Gauld (jgauld) wrote:

There are a few changes to kubelet that require upstream fixes. Will open upstream issues to track, e.g., preventing throttling of CFS shares for Guaranteed pods, and the ability to isolate the Linux platform from kubepods. Usage of 'isolcpus' has already been disabled for low-latency, so we no longer get tasks stuck on specific cores.
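
For reference on the CFS mechanism mentioned above: under cgroup v1 the kubelet (with CFS quota enabled, which is its default) derives cpu.cfs_quota_us for a pod from its CPU limit, and any positive quota lets the scheduler throttle the group even when the pod holds exclusive CPUs. The C sketch below shows how to inspect those settings; the pod cgroup path is a hypothetical placeholder, not taken from this report. A quota of -1 means the group is never throttled, which is presumably the behavior the upstream fix would give Guaranteed pods:

    /* cfs_inspect.c -- sketch for inspecting CFS bandwidth settings. */
    #include <stdio.h>

    /* Read a single long from a cgroup file; returns -2 if unreadable so
     * that the valid value -1 ("no quota") stays distinguishable. */
    static long read_long(const char *path)
    {
        long v;
        FILE *f = fopen(path, "r");
        if (!f || fscanf(f, "%ld", &v) != 1)
            v = -2;
        if (f)
            fclose(f);
        return v;
    }

    int main(void)
    {
        /* Hypothetical cgroup-v1 path for a Guaranteed pod; substitute a
         * real pod UID from /sys/fs/cgroup/cpu/kubepods/. */
        const char *base = "/sys/fs/cgroup/cpu/kubepods/pod<uid>";
        char path[256];

        snprintf(path, sizeof(path), "%s/cpu.cfs_quota_us", base);
        long quota = read_long(path);
        snprintf(path, sizeof(path), "%s/cpu.cfs_period_us", base);
        long period = read_long(path);

        if (quota == -1)
            printf("quota = -1: the group is never CFS-throttled\n");
        else if (quota > 0)
            printf("throttled to %ld us of CPU time per %ld us period\n",
                   quota, period);
        else
            printf("could not read cgroup files (check the path)\n");
        return 0;
    }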

Frank Miller (sensfan22) wrote:

After discussion with Brent (containers TL), we agreed to re-gate this issue to stx.3.0 as the required fixes will need to be implemented in the upstream kubernetes package.

tags: added: stx.3.0
removed: stx.2.0
Frank Miller (sensfan22) wrote:

Lowered priority to medium as the issue is only seen under very heavy load.

Changed in starlingx:
importance: High → Medium
Frank Miller (sensfan22) wrote:

This issue has not yet been addressed in the kubernetes package. Removing the stx.3.0 tag.

tags: added: stx.4.0
removed: stx.3.0