Some DUTs can't boot up after installing the proposed kernel on Mantic

Bug #2061940 reported by Kevin Yeh
This bug affects 1 person
Affects         Status       Importance  Assigned to  Milestone
linux (Ubuntu)  In Progress  Critical    Ivan Hu

Bug Description

During SRU testing of the 6.5.0.33 kernel, I found that some machines freeze at a very early stage of the boot process.
The last messages displayed on the screen are:
EFI stub: Loaded initrd from LINUX_EFI_INITRD_MEDIA_GUID device path
EFI stub: Measured initrd data into PCR 9

No journal logs exist for boots with the 6.5.0-33 kernel.

Here is the list of impacted machines that I have found so far:
https://certification.canonical.com/hardware/202106-29206/
https://certification.canonical.com/hardware/202106-29207/
https://certification.canonical.com/hardware/201912-27623/
https://certification.canonical.com/hardware/202007-28059/
https://certification.canonical.com/hardware/202103-28762/
https://certification.canonical.com/hardware/202012-28510/

Changed in linux (Ubuntu):
assignee: nobody → Ivan Hu (ivan.hu)
Kevin Yeh (kevinyeh)
description: updated

Ivan Hu (ivan.hu) wrote :

Tested with Ubuntu-6.5.0-32-generic; it cannot boot either.

Changed in linux (Ubuntu):
status: New → In Progress
importance: Undecided → Critical

Ivan Hu (ivan.hu) wrote :

Testing with Ubuntu-6.5.0-28-generic does not show this issue. Bisecting the kernel...

Ivan Hu (ivan.hu) wrote :

Kernel bisect result: the patch 2c7b4bfadef08cc0995c24a7b9eb120fe897165f causes this regression.

    thermal: core: Store trip pointer in struct thermal_instance

    Replace the integer trip number stored in struct thermal_instance with
    a pointer to the relevant trip and adjust the code using the structure
    in question accordingly.

    The main reason for making this change is to allow the trip point to
    cooling device binding code more straightforward, as illustrated by
    subsequent modifications of the ACPI thermal driver, but it also helps
    to clarify the overall design and allows the governor code overhead to
    be reduced (through subsequent modifications).

    The only case in which it adds complexity is trip_point_show() that
    needs to walk the trips[] table to find the index of the given trip
    point, but this is not a critical path and the interface that
    trip_point_show() belongs to is problematic anyway (for instance, it
    doesn't cover the case when the same cooling devices is associated
    with multiple trip points).

    This is a preliminary change and the affected code will be refined by
    a series of subsequent modifications of thermal governors, the core and
    the ACPI thermal driver.

    The general functionality is not expected to be affected by this change.

    Signed-off-by: Rafael J. Wysocki <email address hidden>
    Reviewed-by: Daniel Lezcano <email address hidden>
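To make the change described in that commit message concrete, here is a minimal sketch of the idea: the instance holds a pointer to its trip, and the index is recovered only when needed by walking the trips[] table. The struct layouts and the `trip_index()` helper are simplified, hypothetical stand-ins written for illustration. They are not the kernel's actual definitions, and the sketch does not attempt to reproduce the regression.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins for the kernel structures. */
struct thermal_trip {
    int temperature;                 /* millidegrees Celsius */
};

struct thermal_zone {
    struct thermal_trip *trips;      /* trips[] table */
    int num_trips;
};

/* After the change: the instance stores a trip pointer, not an index. */
struct thermal_instance {
    const struct thermal_trip *trip;
};

/*
 * Walk trips[] to find the index of the given trip, as the commit
 * message says trip_point_show() has to do.
 * Returns -1 if the trip does not belong to this zone.
 */
int trip_index(const struct thermal_zone *tz, const struct thermal_trip *trip)
{
    assert(tz != NULL);
    for (int i = 0; i < tz->num_trips; i++) {
        if (&tz->trips[i] == trip)
            return i;
    }
    return -1;
}
```

With a three-entry table, an instance bound to `&trips[1]` yields index 1 from `trip_index()`, while a trip object outside the table yields -1.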

Timo Aaltonen (tjaalton) wrote :

6.6.x has this in addition

commit 3a3bbc6911f57e1c3b4eabf1d098cde7bf7baeb0
Author: Rafael J. Wysocki <email address hidden>
Date: Tue Sep 19 20:59:53 2023 +0200

    thermal: trip: Drop redundant trips check from for_each_thermal_trip()

    [ Upstream commit a15ffa783ea4210877886c59566a0d20f6b2bc09 ]

    It is invalid to call for_each_thermal_trip() on an unregistered thermal
    zone anyway, and as per thermal_zone_device_register_with_trips(), the
    trips[] table must be present if num_trips is greater than zero for the
    given thermal zone.

    Hence, the trips check in for_each_thermal_trip() is redundant and so it
    can be dropped.

    Signed-off-by: Rafael J. Wysocki <email address hidden>
    Acked-by: Daniel Lezcano <email address hidden>
    Stable-dep-of: e95fa7404716 ("thermal: gov_power_allocator: avoid inability to reset a cdev")
    Signed-off-by: Sasha Levin <email address hidden>

FYI, 6.1 didn't backport any of these, even though e95fa7404716 says "5.13+"

Ivan Hu (ivan.hu) wrote :

It seems 6.6.x also has the following patch, which is needed in order to apply "thermal: trip: Drop redundant trips check from for_each_thermal_trip()":

    thermal: core: Rework and rename __for_each_thermal_trip()

    Rework the currently unused __for_each_thermal_trip() to pass original
    pointers to struct thermal_trip objects to the callback, so it can be
    used for updating trip data (e.g. temperatures), rename it to
    for_each_thermal_trip() and make it available to modular drivers.

    Suggested-by: Daniel Lezcano <email address hidden>
    Signed-off-by: Rafael J. Wysocki <email address hidden>
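As a rough illustration of what the two for_each_thermal_trip() commit messages describe, the sketch below passes a pointer to each original trip object to a callback, so the callback can update trip data in place. The names `for_each_trip()`, `trip_cb` and `raise_trip()` and the struct layouts are hypothetical simplifications, not the kernel's real signatures. The sketch also shows why a separate check of the trips pointer can be skipped when num_trips is zero: the loop body never runs.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins for the kernel structures. */
struct thermal_trip {
    int temperature;                 /* millidegrees Celsius */
};

struct thermal_zone {
    struct thermal_trip *trips;      /* must be set if num_trips > 0 */
    int num_trips;
};

/* Callback type: receives the original trip object, not a copy. */
typedef int (*trip_cb)(struct thermal_trip *trip, void *data);

/*
 * Call cb for every trip in the zone. Iteration stops at the first
 * nonzero return value, which is passed back to the caller.
 * With num_trips == 0 the loop never dereferences tz->trips.
 */
int for_each_trip(struct thermal_zone *tz, trip_cb cb, void *data)
{
    assert(tz != NULL && cb != NULL);
    for (int i = 0; i < tz->num_trips; i++) {
        int ret = cb(&tz->trips[i], data);
        if (ret)
            return ret;
    }
    return 0;
}

/* Example callback: raise a trip temperature by the delta in *data. */
int raise_trip(struct thermal_trip *trip, void *data)
{
    trip->temperature += *(int *)data;
    return 0;
}
```

Calling `for_each_trip(&tz, raise_trip, &delta)` modifies every entry of the zone's table directly, which is the "updating trip data" use case the commit message mentions.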

Ivan Hu (ivan.hu) wrote :

Tested the latest mantic master-next with the two patches below applied:

thermal: trip: Drop redundant trips check from for_each_thermal_trip()
thermal: core: Rework and rename __for_each_thermal_trip()

It still failed.

Roxana Nicolescu (roxanan) wrote :

@ivan.hu Can you please test whether dropping these 2 patches works? This is a temporary solution so that we can start the next cycle; we'll still need to find a proper one.

thermal: core: Store trip pointer in struct thermal_instance
thermal: trip: Drop lockdep assertion from thermal_zone_trip_id()

Ivan Hu (ivan.hu) wrote :

@Roxana,

I checked out tag Ubuntu-6.5.0-34.34 and reverted the two patches below; with that, it boots without problems.

thermal: core: Store trip pointer in struct thermal_instance
thermal: trip: Drop lockdep assertion from thermal_zone_trip_id()
