Some DUTs can't boot up after installing the proposed kernel on Mantic

Bug #2061940 reported by Kevin Yeh
This bug affects 1 person
Affects          Status         Importance  Assigned to  Milestone
linux (Ubuntu)   Invalid        Undecided   Ivan Hu
Mantic           Fix Committed  High        Unassigned

Bug Description

During SRU testing of the 6.5.0.33 kernel, I found that some machines freeze at a very early stage of the boot process.
The last messages displayed on the screen are:
EFI stub: Loaded initrd from LINUX_EFI_INITRD_MEDIA_GUID device path
EFI stub: Measured initrd data into PCR 9

No journal logs exist for boots with the 6.5.0-33 kernel.

Here is the list of impacted machines that I have found so far:
https://certification.canonical.com/hardware/202106-29206/
https://certification.canonical.com/hardware/202106-29207/
https://certification.canonical.com/hardware/201912-27623/
https://certification.canonical.com/hardware/202007-28059/
https://certification.canonical.com/hardware/202103-28762/
https://certification.canonical.com/hardware/202012-28510/

Changed in linux (Ubuntu):
assignee: nobody → Ivan Hu (ivan.hu)
Kevin Yeh (kevinyeh)
description: updated
Ivan Hu (ivan.hu) wrote :

Tested with Ubuntu-6.5.0-32-generic; it fails to boot as well.

Changed in linux (Ubuntu):
status: New → In Progress
importance: Undecided → Critical
Ivan Hu (ivan.hu) wrote :

Ubuntu-6.5.0-28-generic boots without this issue. Bisecting the kernel now.

Ivan Hu (ivan.hu) wrote :

Kernel bisect result: commit 2c7b4bfadef08cc0995c24a7b9eb120fe897165f causes this regression.

    thermal: core: Store trip pointer in struct thermal_instance

    Replace the integer trip number stored in struct thermal_instance with
    a pointer to the relevant trip and adjust the code using the structure
    in question accordingly.

    The main reason for making this change is to allow the trip point to
    cooling device binding code more straightforward, as illustrated by
    subsequent modifications of the ACPI thermal driver, but it also helps
    to clarify the overall design and allows the governor code overhead to
    be reduced (through subsequent modifications).

    The only case in which it adds complexity is trip_point_show() that
    needs to walk the trips[] table to find the index of the given trip
    point, but this is not a critical path and the interface that
    trip_point_show() belongs to is problematic anyway (for instance, it
    doesn't cover the case when the same cooling devices is associated
    with multiple trip points).

    This is a preliminary change and the affected code will be refined by
    a series of subsequent modifications of thermal governors, the core and
    the ACPI thermal driver.

    The general functionality is not expected to be affected by this change.

    Signed-off-by: Rafael J. Wysocki <email address hidden>
    Reviewed-by: Daniel Lezcano <email address hidden>
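
For illustration, the trade-off described in the commit message above (storing a trip pointer instead of an integer index, at the cost of walking trips[] when the index is needed) can be sketched in C roughly as follows. This is a simplified sketch and not the kernel code: the struct layouts are reduced stand-ins, and the helper name trip_index_of() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-ins for the kernel structures (assumed layout). */
struct thermal_trip {
    int temperature;                  /* millidegrees Celsius */
};

struct thermal_zone {
    struct thermal_trip *trips;       /* trips[] table */
    int num_trips;
};

struct thermal_instance {
    const struct thermal_trip *trip;  /* was an integer trip number before */
};

/*
 * Hypothetical helper: walk trips[] to recover the index of a trip.
 * Returns -1 when the pointer does not belong to this zone.
 */
int trip_index_of(const struct thermal_zone *tz,
                  const struct thermal_trip *trip)
{
    assert(tz != NULL);

    for (int i = 0; i < tz->num_trips; i++) {
        if (&tz->trips[i] == trip)
            return i;
    }
    return -1;
}
```

Code that only needs the trip data can dereference instance->trip directly; only callers that must report an index, as trip_point_show() does according to the commit message, pay for the linear walk.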

Timo Aaltonen (tjaalton) wrote :

6.6.x has this in addition

commit 3a3bbc6911f57e1c3b4eabf1d098cde7bf7baeb0
Author: Rafael J. Wysocki <email address hidden>
Date: Tue Sep 19 20:59:53 2023 +0200

    thermal: trip: Drop redundant trips check from for_each_thermal_trip()

    [ Upstream commit a15ffa783ea4210877886c59566a0d20f6b2bc09 ]

    It is invalid to call for_each_thermal_trip() on an unregistered thermal
    zone anyway, and as per thermal_zone_device_register_with_trips(), the
    trips[] table must be present if num_trips is greater than zero for the
    given thermal zone.

    Hence, the trips check in for_each_thermal_trip() is redundant and so it
    can be dropped.

    Signed-off-by: Rafael J. Wysocki <email address hidden>
    Acked-by: Daniel Lezcano <email address hidden>
    Stable-dep-of: e95fa7404716 ("thermal: gov_power_allocator: avoid inability to reset a cdev")
    Signed-off-by: Sasha Levin <email address hidden>

FYI, 6.1 didn't backport any of these, even though e95fa7404716 says "5.13+"

Ivan Hu (ivan.hu) wrote :

It seems 6.6.x also has the following commit, which is needed in order to apply "thermal: trip: Drop redundant trips check from for_each_thermal_trip()":

    thermal: core: Rework and rename __for_each_thermal_trip()

    Rework the currently unused __for_each_thermal_trip() to pass original
    pointers to struct thermal_trip objects to the callback, so it can be
    used for updating trip data (e.g. temperatures), rename it to
    for_each_thermal_trip() and make it available to modular drivers.

    Suggested-by: Daniel Lezcano <email address hidden>
    Signed-off-by: Rafael J. Wysocki <email address hidden>
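
As a rough illustration of the two commit messages above, a callback-based trip iterator that passes pointers to the original trip objects, without a separate check that trips[] is non-NULL, could look like the sketch below. The types are reduced stand-ins, and the names for_each_trip() and bump_temperature() are hypothetical; the actual kernel signature may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-ins for the kernel structures (assumed layout). */
struct thermal_trip {
    int temperature;                  /* millidegrees Celsius */
};

struct thermal_zone {
    struct thermal_trip *trips;
    int num_trips;
};

typedef int (*trip_cb)(struct thermal_trip *trip, void *data);

/*
 * Call cb for every trip, passing a pointer to the original object so
 * the callback can update it. Stops at the first non-zero return value.
 * There is no "trips != NULL" check: with num_trips == 0 the loop body
 * never runs, and a zone with num_trips > 0 is assumed to have a table.
 */
int for_each_trip(struct thermal_zone *tz, trip_cb cb, void *data)
{
    assert(tz != NULL && cb != NULL);

    for (int i = 0; i < tz->num_trips; i++) {
        int ret = cb(&tz->trips[i], data);
        if (ret)
            return ret;
    }
    return 0;
}

/* Example callback: shift a trip temperature by the delta in data. */
int bump_temperature(struct thermal_trip *trip, void *data)
{
    trip->temperature += *(int *)data;
    return 0;
}
```

The sketch also shows why dropping the check is only safe when the num_trips and trips[] invariant really holds: a zone with num_trips > 0 and a NULL table would dereference a NULL pointer in the loop.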

Ivan Hu (ivan.hu) wrote :

Tested the latest mantic master-next with the two patches below applied:

thermal: trip: Drop redundant trips check from for_each_thermal_trip()
thermal: core: Rework and rename __for_each_thermal_trip()

It still fails to boot.

Roxana Nicolescu (roxanan) wrote :

@ivan.hu Can you please test whether dropping these 2 patches works? This is a temporary solution to start the next cycle; we'll need to find a proper one though.

thermal: core: Store trip pointer in struct thermal_instance
thermal: trip: Drop lockdep assertion from thermal_zone_trip_id()

Ivan Hu (ivan.hu) wrote :

@Roxana,

I checked out tag Ubuntu-6.5.0-34.34 and reverted the two patches below; it boots without problems.

thermal: core: Store trip pointer in struct thermal_instance
thermal: trip: Drop lockdep assertion from thermal_zone_trip_id()

Ivan Hu (ivan.hu)
Changed in linux (Ubuntu):
status: In Progress → Triaged
Ivan Hu (ivan.hu) wrote :

Tested with the latest proposed kernel, Ubuntu-6.5.0-40.40; it boots without problems.

Ivan Hu (ivan.hu) wrote :

Tested with the test kernel 6.5.0-42 from
https://launchpad.net/~roxanan/+archive/ubuntu/thermal/+packages

Unfortunately, it fails to boot.

Ivan Hu (ivan.hu) wrote :

Bisecting the kernel from 6.5.0-41 to the latest master-next shows that the patch "x86/boot: Omit compression buffer from PE/COFF image memory footprint" also causes a boot failure.

03651c884514d311fa3e628ac073afbad04e5c57 is the first bad commit
commit 03651c884514d311fa3e628ac073afbad04e5c57
Author: Ard Biesheuvel <email address hidden>
Date: Tue Sep 12 09:00:56 2023 +0000

    x86/boot: Omit compression buffer from PE/COFF image memory footprint

    BugLink: https://bugs.launchpad.net/bugs/2061814

    commit 8eace5b3555606e684739bef5bcdfcfe68235257 upstream.

    Now that the EFI stub decompresses the kernel and hands over to the
    decompressed image directly, there is no longer a need to provide a
    decompression buffer as part of the .BSS allocation of the PE/COFF
    image. It also means the PE/COFF image can be loaded anywhere in memory,
    and setting the preferred image base is unnecessary. So drop the
    handling of this from the header and from the build tool.

    Signed-off-by: Ard Biesheuvel <email address hidden>
    Signed-off-by: Ingo Molnar <email address hidden>
    Link: https://<email address hidden>
    Signed-off-by: Greg Kroah-Hartman <email address hidden>
    Signed-off-by: Portia Stephens <email address hidden>
    Signed-off-by: Roxana Nicolescu <email address hidden>

Roxana Nicolescu (roxanan) wrote :

Reverting this commit on its own was not straightforward because it interfered with other commits. I tracked down the original submission and reverted all of its commits: https://<email address hidden>/

Branch with the reverts on top of master-next: https://code.launchpad.net/~roxanan/ubuntu/+source/linux/+git/mantic/+ref/cranky/master-next-reverts/

@Ivan tested it and confirmed the machine now boots properly.

Stefan Bader (smb)
Changed in linux (Ubuntu Mantic):
importance: Undecided → High
Changed in linux (Ubuntu):
importance: Critical → Undecided
status: Triaged → Invalid
Changed in linux (Ubuntu Mantic):
status: New → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/6.5.0-44.44 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-mantic-linux' to 'verification-done-mantic-linux'. If the problem still exists, change the tag 'verification-needed-mantic-linux' to 'verification-failed-mantic-linux'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-mantic-linux-v2 verification-needed-mantic-linux