Low-latency worker node reboots when pods are under heavy load
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Won't Fix | Low | Jim Gauld |
Bug Description
Brief Description
-----------------
This is likely related to
https:/
I was running a pod with exclusive CPUs, executing a process in a busy loop. After a few minutes the node rebooted.
Severity
--------
Critical
Steps to Reproduce
------------------
Run a pod with exclusive CPUs executing a process in a busy loop (see Brief Description); the node reboots after a few minutes.
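
A minimal reproduction sketch, assuming a Kubernetes cluster where the kubelet runs with the static CPU manager policy (so a Guaranteed QoS pod with integer CPU requests gets exclusive cores). The pod name, image, and CPU count below are hypothetical, not taken from the original report:

```yaml
# Hypothetical reproduction manifest. Assumes the kubelet is configured with
# --cpu-manager-policy=static so that a Guaranteed pod requesting an integer
# number of CPUs is pinned to exclusive cores.
apiVersion: v1
kind: Pod
metadata:
  name: busy-loop            # hypothetical name
spec:
  containers:
  - name: spin
    image: busybox           # assumed image
    command: ["sh", "-c", "while true; do :; done"]   # CPU busy loop
    resources:
      requests:
        cpu: "2"             # integer CPU request -> exclusive pinning
        memory: "256Mi"
      limits:
        cpu: "2"             # limits equal requests -> Guaranteed QoS class
        memory: "256Mi"
```

With this pod running, the busy loop keeps the pinned cores fully occupied, matching the load pattern described above.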
Expected Behavior
------------------
No reboot
Actual Behavior
----------------
Reboot
Reproducibility
---------------
100%
System Configuration
--------------------
Standard config
Branch/Pull Time/Commit
-----------------------
2019-05-22 17:57:16 -0400
Last Pass
---------
Don't know
Timestamp/Logs
--------------
There were no useful logs. All we see is the loss of mgmt/cluster network heartbeat on the controller. It appears the worker simply stopped responding.
Test Activity
-------------
Other
Changed in starlingx:
status: New → Triaged
importance: Undecided → Critical

Changed in starlingx:
assignee: Brent Rowsell (brent-rowsell) → Jim Gauld (jgauld)

Changed in starlingx:
importance: Critical → High
Marking as release gating; container testing is taking the node down.