Do not load unnecessary device drivers and do not start unnecessary services in the kdump initramfs

Bug #2038804 reported by Jiping Ma
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  Medium      Jiping Ma    -

Bug Description

Brief Description

Starting with Debian-based WRCP versions, the kdump initramfs has included device drivers such as ice, iavf, i40e, mlx5* and the like. Loading these drivers in a kdump environment is problematic, because these drivers are known to consume a lot of memory.

In addition, the systemd-sysctl.service unit applies the sysctl "vm.min_free_kbytes=1179648", which forces the kernel to keep more than 1 GiB of memory free and leads to out-of-memory conditions in the kdump environment.
Either problem can prevent a vmcore file from being collected after a kernel crash. This is a major issue.
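
To put the sysctl value in perspective, the following sketch converts it from KiB to MiB. The helper name kib_to_mib is illustrative and not part of any tool. The crashkernel= reservation size is not stated in this report, so the comparison against it is left to the reader.

```shell
#!/bin/sh
# Convert a vm.min_free_kbytes value (given in KiB) to whole MiB.
# kib_to_mib is a hypothetical helper name used only for this example.
kib_to_mib() {
    echo $(( $1 / 1024 ))
}

min_free_kib=1179648   # value quoted in this report
echo "vm.min_free_kbytes=${min_free_kib} asks the kernel to keep $(kib_to_mib "$min_free_kib") MiB free"
```

The result is 1152 MiB, or 1.125 GiB. If the memory reserved for the kdump kernel is smaller than or close to this amount, the kdump kernel has little or no memory left for anything else.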

CentOS-based WRCP behaved differently: it excluded unnecessary drivers from the kdump initramfs. As an example: https://opendev.org/starlingx/stx-puppet/src/commit/87229289856743c6c5314bc786a5be979cfa0068/puppet-manifests/src/modules/platform/manifests/config.pp#L342

There is a need to exclude unnecessary NIC drivers from the kdump environment to make vmcore collection more reliable. This bug is created to prevent such drivers from being loaded in the kdump initramfs.

Finally, the sysctl settings need to be prevented from being applied in the kdump environment.
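
One possible way to achieve both goals is through the kdump kernel's command line. This is a sketch under assumptions: the report does not say which mechanism the fix uses. module_blacklist= is a kernel parameter and systemd.mask= is a systemd kernel command-line option. The function name build_kdump_cmdline is hypothetical, and mlx5_core stands in for "mlx5*" because module_blacklist= takes exact module names.

```shell
#!/bin/sh
# Build extra kernel command-line options for the kdump kernel.
# build_kdump_cmdline is a hypothetical helper, not part of kdump-tools.
build_kdump_cmdline() {
    # $1: comma-separated list of module names to keep from loading
    # module_blacklist= blocks the listed modules;
    # systemd.mask= masks the named unit for this boot only.
    printf 'module_blacklist=%s systemd.mask=systemd-sysctl.service\n' "$1"
}

# Driver names taken from this report; mlx5_core is an assumed
# expansion of "mlx5*".
drivers="ice,iavf,i40e,mlx5_core"
build_kdump_cmdline "$drivers"
```

On Debian-based systems, a string like this could be appended to KDUMP_CMDLINE_APPEND in /etc/default/kdump-tools. Whether that matches the merged change should be checked against the review linked in this report.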

Severity

Major: vmcore generation is negatively impacted.

Steps to Reproduce

On systems that use the ice, i40e and/or mlx5_en drivers, triggering a kernel crash might fail to produce a vmcore file.

A crash can be generated with "sudo tee <<<c /proc/sysrq-trigger"
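
Before triggering a crash, it can help to confirm that a crash kernel is actually loaded. Otherwise a missing vmcore says nothing about this bug. The sketch below reads the kernel's /sys/kernel/kexec_crash_loaded flag. The function name crash_kernel_loaded is illustrative.

```shell
#!/bin/sh
# Succeed if the given flag file reports a loaded crash kernel.
# crash_kernel_loaded is a hypothetical helper name.
crash_kernel_loaded() {
    # $1: path to the flag file; the kernel exposes it at
    # /sys/kernel/kexec_crash_loaded ("1" means loaded).
    [ "$(cat "$1" 2>/dev/null)" = "1" ]
}

if crash_kernel_loaded /sys/kernel/kexec_crash_loaded; then
    echo "crash kernel loaded; a sysrq 'c' crash should boot into kdump"
else
    echo "no crash kernel loaded; a crash would not produce a vmcore"
fi
```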

Expected Behavior

No unnecessary drivers are loaded in the kdump environment.

Actual Behavior

ice, i40e, iavf and potentially other drivers are loaded in the kdump environment.
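
The actual behavior can be checked from a shell inside the kdump environment by looking for the drivers in /proc/modules. The following is a minimal sketch. The function name find_loaded is hypothetical, and mlx5_core is an assumed concrete name for "mlx5*".

```shell
#!/bin/sh
# Print each named module that appears in a /proc/modules-style listing.
# find_loaded is a hypothetical helper name.
# $1: path to the listing; remaining arguments: module names to look for
find_loaded() {
    listing=$1
    shift
    for mod in "$@"; do
        # The first field of each /proc/modules line is the module name.
        if awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }' "$listing"; then
            echo "$mod"
        fi
    done
}

if [ -r /proc/modules ]; then
    find_loaded /proc/modules ice iavf i40e mlx5_core
fi
```

Empty output means none of the listed drivers are loaded. Any name printed is a driver that is still being loaded in that environment.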

Reproducibility

Intermittent

System Configuration

Not applicable. The issue was found through a customer-reported problem.

Alarms

Not applicable.

Test Activity

Normal use.

Workaround

None.

Jiping Ma (jma11)
Changed in starlingx:
assignee: nobody → Jiping Ma (jma11)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/integ/+/897670

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/897670
Committed: https://opendev.org/starlingx/integ/commit/2b1651f1d54b3d576bbb39d151b1741246b03a29
Submitter: "Zuul (22348)"
Branch: master

commit 2b1651f1d54b3d576bbb39d151b1741246b03a29
Author: Jiping Ma <email address hidden>
Date: Mon Oct 9 01:06:22 2023 -0700

    kdump-tools: fix oom issue during kdump

    The kdump initramfs has included NIC device drivers such as ice, iavf,
    i40e, mlx5* and the like. Loading these drivers in a kdump environment
    is problematic, because these drivers are known to consume a lot of
    memory.

    In addition, the systemd-sysctl.service systemd unit causes the
    following sysctl to be set, which forces the kernel to keep more than
    1 GiB of free memory, which results in out-of-memory conditions:
    "vm.min_free_kbytes=1179648".

    All of these introduce a possibility of not being able to collect a
    vmcore file after a kernel crash. There is a need to exclude
    unnecessary NIC drivers from the kdump environment and prevent the
    sysctl settings from taking place to make vmcore collection more
    reliable.

    Verification:
    - build-pkgs; build-iso; install and boot up on aio-sx lab.
    - None of the blacklisted drivers are loaded in the kdump kernel.

    Closes-Bug: 2038804

    Signed-off-by: M. Vefa Bicakci <email address hidden>
    Signed-off-by: Jiping Ma <email address hidden>
    Change-Id: I820280b1674d09f42b6abfc25bfa07f44b4f7b44

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.distro.other