crash kernel fails to boot with ice network hw

Bug #1923879 reported by Jim Somerville
This bug affects 1 person
Affects     Status         Importance   Assigned to      Milestone
StarlingX   Fix Released   High         Jim Somerville

Bug Description

Brief Description
-----------------
On a kernel crash, such as the watchdog timer firing, kexec tries to boot the crash recovery kernel in order to capture a vmcore so that the issue can be debugged. This normally succeeds unless the platform has ice network hardware. The crash recovery kernel has only a small amount of memory set aside for it, and the ice driver allocates enough memory to exhaust it. The crash recovery kernel then fails to start, which leaves the platform completely hung until a hardware reset or power cycle.

Severity
--------
Critical, as a kernel crash/watchdog timeout can cause a complete hang of the node.

Steps to Reproduce
------------------
Boot a system that has ice network hardware in at least one node. Log in to that node, get a root shell via sudo -s, and then force a crash by running: echo c > /proc/sysrq-trigger
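Before forcing the crash it may be worth confirming that memory is actually reserved for the crash kernel and that a crash kernel is loaded, otherwise the test says nothing about the ice driver. The following is a minimal sketch, not part of the original report: `crashkernel_arg` and `crash_kernel_loaded` are hypothetical helper names, and the optional path argument exists only so the helpers can be tried against sample files.

```shell
#!/bin/sh
# Sketch: pre-flight checks before forcing a crash.
# Both helpers default to the real kernel interfaces; the optional path
# argument is a hypothetical convenience for trying them on sample files.

# Print the crashkernel= argument from a kernel command line file, if any.
crashkernel_arg() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" | grep '^crashkernel='
}

# Succeed if a crash kernel is currently loaded for kexec.
crash_kernel_loaded() {
    [ "$(cat "${1:-/sys/kernel/kexec_crash_loaded}")" = "1" ]
}

# Intended use as root on the affected node. Left commented out because
# the final command crashes the machine immediately:
#   crashkernel_arg && crash_kernel_loaded && echo c > /proc/sysrq-trigger
```

If either check fails, kdump is not armed and the node would simply reboot or hang for reasons unrelated to this bug.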

Expected Behavior
------------------
The node recovers and a vmcore is left in /var/log/crash/

Actual Behavior
----------------
The node hangs. No vmcore is found in /var/log/crash after the node is recovered via reset or power cycle.

Reproducibility
---------------
100% in every attempt so far. It may, however, depend on having enough ice network devices, or specific PCI ID versions of them, to consume enough memory to trigger the failure.
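Because the failure may depend on how many ice devices are present, recording that count alongside a reproduction attempt could help. A minimal sketch, assuming the standard sysfs layout in which each bound PCI device appears under the driver directory as an entry named by its PCI address; `count_bound_devices` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: count PCI devices bound to a driver via sysfs.
# $1: driver directory (defaults to the ice driver's sysfs directory).
count_bound_devices() {
    dir="${1:-/sys/bus/pci/drivers/ice}"
    n=0
    # Entries named like 0000:3b:00.0 are bound devices; other entries
    # (bind, unbind, new_id, ...) do not match this pattern.
    for entry in "$dir"/????:??:??.?; do
        if [ -e "$entry" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Intended use on a node:
#   count_bound_devices
```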

System Configuration
--------------------
Doesn't matter, you just need ice network hardware.

Branch/Pull Time/Commit
-----------------------
N/A as a code submission didn't break this

Last Pass
---------
N/A as a code submission didn't break this

Timestamp/Logs
--------------
The failure is visible only in the console output captured while the crash recovery kernel boots. The cause is already known.

Test Activity
-------------
Seen by a customer when the hostwd timed out on a node with ice hardware.

Workaround
----------

sudo bash
echo 'dracut_args --omit-drivers "ice"' >> /etc/kdump.conf
systemctl restart kdump
exit
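Running the workaround more than once appends the same line to /etc/kdump.conf each time. The sketch below shows one way to make the edit idempotent; `omit_ice_from_kdump` is a hypothetical helper name, and the sketch does not try to merge with a dracut_args line that may already exist in the file.

```shell
#!/bin/sh
# Sketch: append the omit-drivers line only if it is not already present.
# $1: path to the config file (defaults to /etc/kdump.conf).
omit_ice_from_kdump() {
    conf="${1:-/etc/kdump.conf}"
    line='dracut_args --omit-drivers "ice"'
    # -x matches the whole line, -F treats the pattern as a fixed string.
    grep -qxF -- "$line" "$conf" 2>/dev/null || printf '%s\n' "$line" >> "$conf"
}

# Intended use as root, followed by the same restart as in the workaround:
#   omit_ice_from_kdump && systemctl restart kdump
```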

Ghada Khalil (gkhalil)
tags: added: stx.5.0 stx.config
Changed in starlingx:
importance: Undecided → High
assignee: nobody → Jim Somerville (jsomervi)
status: New → Triaged
Ghada Khalil (gkhalil) wrote :

stx.5.0 / high - the issue results in a node crash on hardware with Columbiaville NICs. The issue was introduced by this StoryBoard story: https://storyboard.openstack.org/#!/story/2008436

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/786329

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/786329
Committed: https://opendev.org/starlingx/stx-puppet/commit/f46c154188b5d90bdd19ba2a5952b4f8c565d5d3
Submitter: "Zuul (22348)"
Branch: master

commit f46c154188b5d90bdd19ba2a5952b4f8c565d5d3
Author: Jim Somerville <email address hidden>
Date: Wed Apr 14 17:13:59 2021 -0400

    kdump config remove intel eth drivers from ramdisk

    Problem:
    On a kernel crash, such as the watchdog timer firing, kexec
    tries booting the crash recovery kernel in order to capture
    a vmcore so that the issue can be debugged. This normally
    succeeds unless the platform has ice network hardware. Why?
    Because the crash recovery kernel has only a small amount of
    memory set aside for it, and the ice driver allocates enough
    memory to cause memory exhaustion. This causes the crash
    recovery kernel's startup to fail, leading to complete platform
    hang. In order to break out of the hang, one needs to manually
    do a hardware reset or power cycle.

    Solution:
    Change kdump.conf to leave the ice driver module out of the
    initramfs that is used by the crash recovery kernel. In
    fact, leave all of the intel ethernet drivers out since they
    are not needed and increase the risk of memory exhaustion.
    Upon changing kdump.conf, the kdump service is restarted to
    regenerate the initramfs.

    Verification:
    Install, check the kdump.conf file and unpack the initramfs file
    making sure that those modules are gone. Check controller,
    worker, and storage node types. Reboot node, make sure things
    behave as expected ie. no extra kdump.conf mangling and no
    unexpected kdump service restarts.
    Also crash a node with intel ethernet hardware on it and make
    sure it comes back up with a vmcore left in /var/log/crash.

    Change-Id: I9112f722cee8e199d94393bca887d3bb9bb89b39
    Closes-Bug: 1923879
    Signed-off-by: Jim Somerville <email address hidden>
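The verification step of unpacking the initramfs and checking that the modules are gone could be approximated by scanning a file listing instead. This is a minimal sketch, not the procedure used in the commit: `listing_has_module` is a hypothetical helper, and the image path in the usage comment is an assumption based on the usual CentOS naming.

```shell
#!/bin/sh
# Sketch: check an initramfs file listing for a kernel module.
# Reads a listing on stdin and succeeds if <name>.ko (optionally
# compressed as .xz or .gz) appears as a file name in it.
listing_has_module() {
    grep -Eq "/${1}"'\.ko(\.xz|\.gz)?$'
}

# Intended use on a node (the image path is an assumption and may
# differ on a given install):
#   if lsinitrd "/boot/initramfs-$(uname -r)kdump.img" | listing_has_module ice; then
#       echo "ice is still in the kdump initramfs"
#   fi
```

The same check can be repeated for any other driver name that is expected to be absent.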

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil) wrote :

@Jim Somerville, please cherrypick this change to the r/stx.5.0 release branch once it's open for submissions.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (r/stx.5.0)

Fix proposed to branch: r/stx.5.0
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/786885

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (r/stx.5.0)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/786885
Committed: https://opendev.org/starlingx/stx-puppet/commit/d5b4d570dc69e00c6818d0df8d9edc9e08808bb6
Submitter: "Zuul (22348)"
Branch: r/stx.5.0

commit d5b4d570dc69e00c6818d0df8d9edc9e08808bb6
Author: Jim Somerville <email address hidden>
Date: Wed Apr 14 17:13:59 2021 -0400

    kdump config remove intel eth drivers from ramdisk

    Change-Id: I9112f722cee8e199d94393bca887d3bb9bb89b39
    Closes-Bug: 1923879
    Signed-off-by: Jim Somerville <email address hidden>
    (cherry picked from commit f46c154188b5d90bdd19ba2a5952b4f8c565d5d3)

Ghada Khalil (gkhalil)
tags: added: in-r-stx50
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792009

OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792009

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792013

OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792013

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792018

OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792018

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792029

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/792029
Committed: https://opendev.org/starlingx/stx-puppet/commit/2b026190a3cb6d561b6ec4a46dfb3add67f1fa69
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 3e3940824dfb830ebd39fd93265b983c6a22fc51
Author: Dan Voiculeasa <email address hidden>
Date: Thu May 13 18:03:45 2021 +0300

    Enable kubelet support for pod pid limit

    Enable limiting the number of pids inside of pods.

    Add a default value to protect against a missing value.
    Default to 750 pids limit to align with service parameter default
    value for most resource consuming StarlingX optional app (openstack).
    In fact any value above service parameter minimum value is good for the
    default.

    Closes-Bug: 1928353
    Signed-off-by: Dan Voiculeasa <email address hidden>
    Change-Id: I10c1684fe3145e0a46b011f8e87f7a23557ddd4a

commit 0c16d288fbc483103b7ba5dad7782e97f59f4e17
Author: Jessica Castelino <email address hidden>
Date: Tue May 11 10:21:57 2021 -0400

    Safe restart of the etcd SM service in etcd upgrade runtime class

    While upgrading the central cloud of a DC system, activation failed
    because there was an unexpected SWACT to controller-1. This was due
    to the etcd upgrade script. Part of this script runs the etcd
    manifest. This triggers a reload/restart of the etcd service. As this
    is done outside of the sm, sm saw the process failure and triggered
    the SWACT.

    This commit modifies platform::etcd::upgrade::runtime puppet class
    to do a safe restart of the etcd SM service and thus, solve the
    issue.

    Change-Id: I3381b6976114c77ee96028d7d96a00302ad865ec
    Signed-off-by: Jessica Castelino <email address hidden>
    Closes-Bug: 1928135

commit eec3008f600aeeb69a42338ed44332228a862d11
Author: Mihnea Saracin <email address hidden>
Date: Mon May 10 13:09:52 2021 +0300

    Serialize updates to global_filter in the AIO manifest

    Right now, looking at the aio manifest:
    https://review.opendev.org/c/starlingx/stx-puppet/+/780600/15/puppet-manifests/src/manifests/aio.pp
    there are 3 classes that update
    in parallel the lvm global_filter:
    - include ::platform::lvm::controller
    - include ::platform::worker::storage
    - include ::platform::lvm::compute
    And this generates some errors.

    We fix this by adding dependencies between the above classes
    in order to update the global_filter in a serial mode.

    Closes-Bug: 1927762
    Signed-off-by: Mihnea Saracin <email address hidden>
    Change-Id: If6971e520454cdef41138b2f29998c036d8307ff

commit 97371409b9b2ae3f0db6a6a0acaeabd74927160e
Author: Steven Webster <email address hidden>
Date: Fri May 7 15:33:43 2021 -0400

    Add SR-IOV rate-limit dependency

    Currently, the binding of an SR-IOV virtual function (VF) to a
    driver has a dependency on platform::networking. This is needed
    to ensure that SR-IOV is enabled (VFs created) before actually
    doing the bind.

    This dependency does not exist for configuring the VF rate-limits
    however. There is a cha...

tags: added: in-f-centos8