crash kernel fails to boot with ice network hw
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
StarlingX |
Fix Released
|
High
|
Jim Somerville |
Bug Description
Brief Description
-----------------
On a kernel crash, such as the watchdog timer firing, kexec tries booting the crash recovery kernel in order to capture a vmcore so that the issue can be debugged. This normally succeeds unless the platform has ice network hardware. Why? Because the crash recovery kernel has only a small amount of memory set aside for it, and the ice driver allocates enough memory to cause memory exhaustion. This causes the crash recovery kernel's startup to fail, leading to complete platform hang which requires a hardware reset or power cycle to get out of.
Severity
--------
Critical, as a kernel crash/watchdog timeout can cause a complete hang of the node.
Steps to Reproduce
------------------
Boot system that has ice network hardware in at least one node. Login to that node, get a root shell via sudo -s, and then force a crash by doing echo c >/proc/
Expected Behavior
------------------
The node recovers and a vmcore is left in /var/log/crash/
Actual Behavior
----------------
Node hangs, no vmcore is seen in /var/log/crash after the node is recovered via reset/power cycle
Reproducibility
---------------
100% so far from what I've seen. It may depend however on having enough ice network devices and/or their specific pci-id versions to cause enough memory consumption to force the issue.
System Configuration
-------
Doesn't matter, you just need ice network hardware.
Branch/Pull Time/Commit
-------
N/A as a code submission didn't break this
Last Pass
---------
N/A as a code submission didn't break this
Timestamp/Logs
--------------
You have to see or capture the console output while the crash recovery kernel boots. The cause is already known.
Test Activity
-------------
Seen by a customer when the hostwd timed out on a node with ice hardware.
Workaround
----------
sudo bash
echo 'dracut_args --omit-drivers "ice"' >> /etc/kdump.conf
systemctl restart kdump
exit
CVE References
tags: | added: stx.5.0 stx.config |
Changed in starlingx: | |
importance: | Undecided → High |
assignee: | nobody → Jim Somerville (jsomervi) |
status: | New → Triaged |
tags: | added: in-r-stx50 |
stx.5.0 / high - issue results in a node crash for hardware w/ Columbiaville NICs. Issue is introduced by StoryBoard: https:/ /storyboard. openstack. org/#!/ story/2008436