FWTS reports CRITICAL and HIGH failures in klog on server

Bug #1257353 reported by Jeff Lane on 2013-12-03
This bug affects 1 person
Affects             | Status    | Importance | Assigned to | Milestone
Firmware Test Suite | Won't Fix | Medium     | Alex Hung   |

Bug Description

I ran this on a server, and the klog test shows several critical errors. We need to know whether these CRITICAL and HIGH errors should block certification of this machine.

This test run on 26/11/13 at 14:51:31 on host Linux ubuntu-test1
3.5.0-23-generic #35~precise1-Ubuntu SMP Fri Jan 25 17:13:26 UTC 2013 x86_64.

Command: "fwts -q --stdout-summary -r /home/srini/.checkbox/fwts_results.log
klog".
Running tests: klog.

klog: Scan kernel log for errors and warnings.
--------------------------------------------------------------------------------
Test 1 of 1: Kernel log error check.
Kernel message: [ 0.039347] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'

ADVICE: This is not exactly a failure but a warning from the kernel. The
MSR_IA32_ENERGY_PERF_BIAS was initialized and defaulted to a high performance
bias setting. The kernel has detected this and changed it down to a 'normal'
bias setting.

Kernel message: [ 0.880215] [Firmware Bug]: ACPI: BIOS _OSI(Linux) query ignored

ADVICE: This is not exactly a failure mode but a warning from the kernel. The
_OSI() method has implemented a match to the 'Linux' query in the DSDT and this
is redundant because the ACPI driver matches onto the Windows _OSI strings by
default.

FAILED [CRITICAL] KlogPciAcpiOscRequestFailed: Test 1, CRITICAL Kernel message:
[ 0.902020] pci0000:7f: ACPI _OSC request failed (AE_NOT_FOUND), returned
control mask: 0x1d

ADVICE: The _OSC method evaluation failed, which will result in disabling PCIe
functionality, for example, the Linux kernel has to disable Active State Power
Management (ASPM) which means that PCIe power management is not optimally
configured.

FAILED [CRITICAL] KlogPciAcpiOscRequestFailed: Test 1, CRITICAL Kernel message:
[ 0.906466] pci0000:ff: ACPI _OSC request failed (AE_NOT_FOUND), returned
control mask: 0x1d

ADVICE: The _OSC method evaluation failed, which will result in disabling PCIe
functionality, for example, the Linux kernel has to disable Active State Power
Management (ASPM) which means that PCIe power management is not optimally
configured.

FAILED [LOW] KlogAcpiSystemIOConflict: Test 1, LOW Kernel message: [ 4.240803]
ACPI Warning: 0x0000000000000460-0x000000000000047f SystemIO conflicts with
Region \PMIO 1 (20120320/utaddress-251)

ADVICE: A resource conflict between an ACPI OperationRegion and a native driver
has been detected. By default the kernel will use a strict policy and will not
allow this region to conflict and -EBUSY will be returned to the caller that was
trying to allocate the already claimed region. If an ACPI driver is available
for this device then this should be used instead of a native driver, so
disabling the native driver may help. (Note that the lpc_ich driver can trigger
these warnings, in which case they can generally be ignored). One can specify
kernel boot parameter acpi_enforce_resources=lax to disable these checks but it
may lead to random problems and system instability. Alternatively, one can
specify acpi_enforce_resources=no and ACPI Operation Region resources will not
be registered.

FAILED [LOW] KlogAcpiSystemIOConflict: Test 1, LOW Kernel message: [ 4.240816]
ACPI Warning: 0x0000000000000460-0x000000000000047f SystemIO conflicts with
Region \_SB_.PCI0.HEC1.TCOS 2 (20120320/utaddress-251)

ADVICE: A resource conflict between an ACPI OperationRegion and a native driver
has been detected. By default the kernel will use a strict policy and will not
allow this region to conflict and -EBUSY will be returned to the caller that was
trying to allocate the already claimed region. If an ACPI driver is available
for this device then this should be used instead of a native driver, so
disabling the native driver may help. (Note that the lpc_ich driver can trigger
these warnings, in which case they can generally be ignored). One can specify
kernel boot parameter acpi_enforce_resources=lax to disable these checks but it
may lead to random problems and system instability. Alternatively, one can
specify acpi_enforce_resources=no and ACPI Operation Region resources will not
be registered.

FAILED [LOW] KlogAcpiSystemIOConflict: Test 1, LOW Kernel message: [ 4.240818]
ACPI Warning: 0x0000000000000460-0x000000000000047f SystemIO conflicts with
Region \_SB_.PCI0.HEC2.TCOS 3 (20120320/utaddress-251)

ADVICE: A resource conflict between an ACPI OperationRegion and a native driver
has been detected. By default the kernel will use a strict policy and will not
allow this region to conflict and -EBUSY will be returned to the caller that was
trying to allocate the already claimed region. If an ACPI driver is available
for this device then this should be used instead of a native driver, so
disabling the native driver may help. (Note that the lpc_ich driver can trigger
these warnings, in which case they can generally be ignored). One can specify
kernel boot parameter acpi_enforce_resources=lax to disable these checks but it
may lead to random problems and system instability. Alternatively, one can
specify acpi_enforce_resources=no and ACPI Operation Region resources will not
be registered.

FAILED [LOW] KlogAcpiSystemIOConflictLpcIchWatchDogTimer: Test 1, LOW Kernel
message: [ 4.240822] lpc_ich: Resource conflict(s) found affecting iTCO_wdt

ADVICE: A resource conflict has occurred between ACPI OperationRegions and the
same I/O region used by the lpc_ich driver for the Intel TCO (Total Cost of
Ownership) timer (iTCO_wdt, this is a watchdog timer that will reboot the
machine after its second expiration). According to Intel Controller Hub (ICH)
specifications, the TCO watchdog has a 32 bytes I/O space resource. ACPI
OperationRegions in the DSDT frequently reserve this TCO I/O space because they
require access to bit 9 (DMISCI_STS) of the TCO1_STS register of the TCO,
however, this bit is never used by the lpc_ich driver, so there is no risk of
conflict. In the vast majority of cases this warning can be ignored as no harm
will occur.

FAILED [LOW] KlogAcpiSystemIOConflict: Test 1, LOW Kernel message: [ 4.240824]
ACPI Warning: 0x0000000000000428-0x000000000000042f SystemIO conflicts with
Region \PMIO 1 (20120320/utaddress-251)

ADVICE: A resource conflict between an ACPI OperationRegion and a native driver
has been detected. By default the kernel will use a strict policy and will not
allow this region to conflict and -EBUSY will be returned to the caller that was
trying to allocate the already claimed region. If an ACPI driver is available
for this device then this should be used instead of a native driver, so
disabling the native driver may help. (Note that the lpc_ich driver can trigger
these warnings, in which case they can generally be ignored). One can specify
kernel boot parameter acpi_enforce_resources=lax to disable these checks but it
may lead to random problems and system instability. Alternatively, one can
specify acpi_enforce_resources=no and ACPI Operation Region resources will not
be registered.

FAILED [LOW] KlogAcpiSystemIOConflict: Test 1, LOW Kernel message: [ 4.240829]
ACPI Warning: 0x0000000000000500-0x000000000000053f SystemIO conflicts with
Region \GINV 1 (20120320/utaddress-251)

ADVICE: A resource conflict between an ACPI OperationRegion and a native driver
has been detected. By default the kernel will use a strict policy and will not
allow this region to conflict and -EBUSY will be returned to the caller that was
trying to allocate the already claimed region. If an ACPI driver is available
for this device then this should be used instead of a native driver, so
disabling the native driver may help. (Note that the lpc_ich driver can trigger
these warnings, in which case they can generally be ignored). One can specify
kernel boot parameter acpi_enforce_resources=lax to disable these checks but it
may lead to random problems and system instability. Alternatively, one can
specify acpi_enforce_resources=no and ACPI Operation Region resources will not
be registered.

FAILED [LOW] KlogAcpiSystemIOConflictLpcIchWatchDogTimer: Test 1, LOW Kernel
message: [ 4.240832] lpc_ich: Resource conflict(s) found affecting gpio_ich

ADVICE: A resource conflict has occurred between ACPI OperationRegions and the
same I/O region used by the lpc_ich driver for the General Purpose I/O (GPIO)
region. Sometimes this GPIO region is used by the firmware for rfkill or LED
controls or very rarely for the Embedded Controller System Control Interrupt.
This may require deeper inspection to check if the conflict will lead to any
real issues. However, in the vast majority of cases this warning can be ignored
as no harm will occur.

FAILED [HIGH] KlogAcpiNoHandlerForRegion: Test 1, HIGH Kernel message: [
75.037261] ACPI Error: No handler for Region [POWS] (ffff881029494120) [IPMI]
(20120320/evregion-376)
Message repeated 7 times.

ADVICE: ACPI attempted to read or write to a region however the ACPI driver does
not have a handler implemented for this particular region space. The read/write
will fail and undefined behaviour will occur.

FAILED [HIGH] KlogAcpiKlogRegionNoHandler: Test 1, HIGH Kernel message: [
75.037267] ACPI Error: Region IPMI (ID=7) has no handler (20120320/exfldio-306)
Message repeated 7 times.

ADVICE: An access (read or write) to an operation region has been attempted and
a region handler for this has not been implemented, this may need to be
implemented to provide the expected behaviour. See acpi_ex_access_region().

FAILED [HIGH] KlogAcpiObjectDoesNotExist: Test 1, HIGH Kernel message: [
75.037270] ACPI Error: Method parse/execution failed [\_SB_.M111._GAI] (Node
ffff881029493960), AE_NOT_EXIST (20120320/psparse-536)
Message repeated 3 times.

ADVICE: The ACPI interpreter failed to execute or parse some AML because a
object or control did not exist. This normally occurs because of buggy firmware
and may lead to unexpected behaviour or loss of functionality.

FAILED [HIGH] KlogAcpiObjectDoesNotExist: Test 1, HIGH Kernel message: [
75.037317] ACPI Error: Method parse/execution failed [\_SB_.M111._PMM] (Node
ffff881029493910), AE_NOT_EXIST (20120320/psparse-536)
Message repeated 3 times.

ADVICE: The ACPI interpreter failed to execute or parse some AML because a
object or control did not exist. This normally occurs because of buggy firmware
and may lead to unexpected behaviour or loss of functionality.

Found 13 unique errors in kernel log.

================================================================================
0 passed, 13 failed, 0 warning, 0 aborted, 0 skipped, 0 info only.
================================================================================

0 passed, 13 failed, 0 warning, 0 aborted, 0 skipped, 0 info only.

Test Failure Summary
================================================================================

Critical failures: 2
 klog: CRITICAL Kernel message: [ 0.902020] pci0000:7f: ACPI _OSC request failed (AE_NOT_FOUND), returned control mask: 0x1d
 klog: CRITICAL Kernel message: [ 0.906466] pci0000:ff: ACPI _OSC request failed (AE_NOT_FOUND), returned control mask: 0x1d

High failures: 4
 klog: HIGH Kernel message: [ 75.037261] ACPI Error: No handler for Region [POWS] (ffff881029494120) [IPMI] (20120320/evregion-376)
 klog: HIGH Kernel message: [ 75.037267] ACPI Error: Region IPMI (ID=7) has no handler (20120320/exfldio-306)
 klog: HIGH Kernel message: [ 75.037270] ACPI Error: Method parse/execution failed [\_SB_.M111._GAI] (Node ffff881029493960), AE_NOT_EXIST (20120320/psparse-536)
 klog: HIGH Kernel message: [ 75.037317] ACPI Error: Method parse/execution failed [\_SB_.M111._PMM] (Node ffff881029493910), AE_NOT_EXIST (20120320/psparse-536)

Medium failures: NONE

Low failures: 7
 klog: LOW Kernel message: [ 4.240803] ACPI Warning: 0x0000000000000460-0x000000000000047f SystemIO conflicts with Region \PMIO 1 (20120320/utaddress-251)
 klog: LOW Kernel message: [ 4.240816] ACPI Warning: 0x0000000000000460-0x000000000000047f SystemIO conflicts with Region \_SB_.PCI0.HEC1.TCOS 2 (20120320/utaddress-251)
 klog: LOW Kernel message: [ 4.240818] ACPI Warning: 0x0000000000000460-0x000000000000047f SystemIO conflicts with Region \_SB_.PCI0.HEC2.TCOS 3 (20120320/utaddress-251)
 klog: LOW Kernel message: [ 4.240822] lpc_ich: Resource conflict(s) found affecting iTCO_wdt
 klog: LOW Kernel message: [ 4.240824] ACPI Warning: 0x0000000000000428-0x000000000000042f SystemIO conflicts with Region \PMIO 1 (20120320/utaddress-251)
 klog: LOW Kernel message: [ 4.240829] ACPI Warning: 0x0000000000000500-0x000000000000053f SystemIO conflicts with Region \GINV 1 (20120320/utaddress-251)
 klog: LOW Kernel message: [ 4.240832] lpc_ich: Resource conflict(s) found affecting gpio_ich

Other failures: NONE

Test           |Pass |Fail |Abort|Warn |Skip |Info |
---------------+-----+-----+-----+-----+-----+-----+
klog           |     |   13|     |     |     |     |
---------------+-----+-----+-----+-----+-----+-----+
Total:         |    0|   13|    0|    0|    0|    0|
---------------+-----+-----+-----+-----+-----+-----+
Results generated by fwts: Version V13.11.00 (2013-11-15 06:47:53).

Some of this work - Copyright (c) 1999 - 2013, Intel Corp. All rights reserved.
Some of this work - Copyright (c) 2010 - 2013, Canonical.

Jeff Lane (bladernr) on 2013-12-03
Changed in fwts (Ubuntu):
assignee: nobody → Firmware Testing Team (firmware-testing-team)
affects: fwts (Ubuntu) → fwts
Changed in fwts:
assignee: Firmware Testing Team (firmware-testing-team) → nobody
assignee: nobody → Firmware Testing Team (firmware-testing-team)
Keng-Yu Lin (lexical) on 2013-12-04
Changed in fwts:
assignee: Firmware Testing Team (firmware-testing-team) → Anthony Wong (anthonywong)
importance: Undecided → Critical
Keng-Yu Lin (lexical) on 2013-12-04
Changed in fwts:
assignee: Anthony Wong (anthonywong) → Alex Hung (alexhung)
Alex Hung (alexhung) wrote :

The errors are ACPI-related. Additional information, such as the output of "sudo acpidump > acpi.log", would be very helpful.

Changed in fwts:
status: New → In Progress
Srinivas (srinira) wrote :

Hello Alex,

I am attaching the acpidump of the machines.

Alex Hung (alexhung) wrote :

There is no acpidump for detailed analysis, but let me provide some basic information first.

From the description, there are three types of errors:

(1) SystemIO conflicts (the Low failures)
(2) No handler for IPMI (two of the High failures)
(3) AE_NOT_FOUND / AE_NOT_EXIST errors (the other High failures and the Critical failures)

(1) is not really an issue. The kernel is complaining that the I/O resources controlled by the BIOS overlap with those controlled by the kernel, which could potentially lead to race conditions on those resources; however, this is how Intel designs its chipsets.

(2) means the Linux kernel does not include an ACPI IPMI driver. This is not a problem if IPMI is not used or is controlled by other means, e.g. another IPMI driver.

(3) means there are errors in the BIOS AML. These may or may not be important, depending on whether the affected functions are used.
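A rough way to reproduce this three-way split is to bucket the klog messages by pattern. The following is a minimal Python sketch, not how fwts itself classifies messages; the category names and regular expressions are assumptions modelled on the kernel messages quoted in this report.

```python
import re
from collections import Counter

# Hypothetical categories mirroring (1)-(3) above. The patterns are
# assumptions based on the kernel messages quoted in this report,
# not the patterns fwts uses internally.
CATEGORIES = [
    ("systemio-conflict",
     re.compile(r"SystemIO conflicts with Region|lpc_ich: Resource conflict")),
    ("ipmi-no-handler",
     re.compile(r"No handler for Region .*\[IPMI\]|Region IPMI .* has no handler")),
    ("aml-error", re.compile(r"AE_NOT_FOUND|AE_NOT_EXIST")),
]


def classify(line: str) -> str:
    """Return the first category whose pattern matches the line, else 'other'."""
    for name, pattern in CATEGORIES:
        if pattern.search(line):
            return name
    return "other"


def summarize(lines):
    """Count how many klog lines fall into each category."""
    return Counter(classify(line) for line in lines)


if __name__ == "__main__":
    sample = [
        r"ACPI Warning: 0x0000000000000460-0x000000000000047f SystemIO conflicts with Region \PMIO 1 (20120320/utaddress-251)",
        "lpc_ich: Resource conflict(s) found affecting iTCO_wdt",
        "ACPI Error: No handler for Region [POWS] (ffff881029494120) [IPMI] (20120320/evregion-376)",
        "ACPI Error: Region IPMI (ID=7) has no handler (20120320/exfldio-306)",
        r"ACPI Error: Method parse/execution failed [\_SB_.M111._GAI] (Node ffff881029493960), AE_NOT_EXIST (20120320/psparse-536)",
        "pci0000:7f: ACPI _OSC request failed (AE_NOT_FOUND), returned control mask: 0x1d",
    ]
    print(summarize(sample))
```

The order of CATEGORIES matters only if a line could match more than one pattern; with the messages in this report, each line matches at most one.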

_GAI and _PMM are control methods of "Power Meter" devices (ACPI Section 10.4):
_GAI: Gets the averaging interval used by the power meter.
_PMM: Returns the power consumption measured by the Power Meter.

If any tool uses them, it will fail; otherwise, no misbehaviour will be seen.

_OSC, on the other hand, is more generic and may have a wider impact.

ACPI 6.2.10 _OSC (Operating System Capabilities)

This optional object is a control method that is used by OSPM to communicate to the platform the feature support or capabilities provided by a device’s driver. This object is a child object of a device and may also exist in the \_SB scope, where it can be used to convey platform wide OSPM capabilities. When supported, _OSC is invoked by OSPM immediately after placing the device in the D0 power state. Device specific objects are evaluated after _OSC invocation. This allows the values returned from other objects to be predicated on the OSPM feature support / capability information conveyed by _OSC. OSPM may evaluate _OSC multiple times to indicate changes in OSPM capability to the device but this may be precluded by specific device requirements.

In summary, failing to evaluate _OSC may cause some features to be disabled. For instance, some advanced PCIe features such as ASPM or PCIe hotplug will be disabled.

While this may not cause obvious failures if no advanced features are needed on this platform, it is still advisable to fix the AE_NOT_FOUND errors if possible. The acpidump output may help identify what is missing and causing them.
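The kernel messages in this report also print the control mask involved in the failed _OSC request (0x1d). A small Python sketch can decode it, assuming the Control field bit layout of the PCI Firmware Specification. Whether the printed mask reflects what the OS requested or what the firmware returned depends on the kernel code path, so treat the decoded list as the set of PCIe features whose native control was at stake, not as a definitive diagnosis. The function name is hypothetical.

```python
# Assumed bit layout of the PCIe _OSC Control field (PCI Firmware
# Specification). Later spec revisions define further bits, which this
# sketch reports as unknown; verify against the revision your platform targets.
OSC_CONTROL_BITS = {
    0: "PCIe Native Hot Plug control",
    1: "SHPC Native Hot Plug control",
    2: "PCIe Native Power Management Events (PME) control",
    3: "PCIe Advanced Error Reporting (AER) control",
    4: "PCIe Capability Structure control",
}


def decode_osc_control(mask: int) -> list:
    """Name each set bit of an _OSC control mask, lowest bit first."""
    names = []
    bit = 0
    while mask >> bit:
        if mask & (1 << bit):
            names.append(OSC_CONTROL_BITS.get(bit, f"unknown bit {bit}"))
        bit += 1
    return names


if __name__ == "__main__":
    # Mask taken from the "returned control mask: 0x1d" kernel messages above.
    for name in decode_osc_control(0x1D):
        print(name)
```

ASPM itself is not one of these bits; as the fwts advice above describes, the kernel disables ASPM as a consequence of the _OSC failure rather than through this mask.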

Changed in fwts:
status: In Progress → Incomplete

On 05/12/13 10:07, Alex Hung wrote:
> There is no acpidump for detailed analysis, but let me provide some
> basic information first.
>
> From the description there are three types of errors:
>
> (1) SystemIO conflicts (Low failures)
> (2). No handler for IPMI (one High failures)
> (3) AE_NOT_FOUND errors (other High failures and Critical failures)
>
> (1) is not really an issue. It is kernel complains there are overlaps on
> what BIOS and kernel can controls and there can potentially race
> conditions on the IO resources - however it is how Intel designs their
> chipset.
>
> (2) means Linux kernel does not include an ACPI IPMI driver. This is not
> a program if IPMI is not used or IPMI is controlled by other methods,
> i.e. another IPMI driver.
>
> (3) means there are errors in BIOS AML. This may or may not be
> important, depending on whether such functions are used.
>
> _GAI and _PMM are control method of "Power Meters" (ACPI Section 10.4) as below:
> _GAI: Gets the averaging interval used by the power meter.
> _PMM: Returns the power consumption measured by the Power Meter.
>
> If any tool is using them, it will fail; otherwise, no misbehaviours
> will been seen.

I believe these aren't really used much nowadays, but I guess we should
check this someday to see if this assumption is correct or not.

>
> _OSC, on the other hand, is more generic and may have more impacts.
>
> ACPI 6.2.10 _OSC (Operating System Capabilities)
>
> This optional object is a control method that is used by OSPM to
> communicate to the platform the feature support or capabilities provided
> by a device’s driver. This object is a child object of a device and may
> also exist in the \_SB scope, where it can be used to convey platform
> wide OSPM capabilities. When supported, _OSC is invoked by OSPM
> immediately after placing the device in the D0 power state. Device
> specific objects are evaluated after _OSC invocation. This allows the
> values returned from other objects to be predicated on the OSPM feature
> support / capability information conveyed by _OSC. OSPM may evaluate
> _OSC multiple times to indicate changes in OSPM capability to the device
> but this may be precluded by specific device requirements.
>
> In summary, failing to evaluate _OSC may cause some features to be
> disabled. For instance, some advanced PCIe features such as ASPM or PCIE
> hotplug will be disabled.

Is the fwts _OSC failure advice sufficient, or should we make it more
verbose to explain what's going on here?
>
> While it may not cause obvious failure if no advanced features are
> needed for this platform, it is still advised to fix AE_NOT_FOUND errors
> if possible. acpidump may help what is missing to cause AE_NOT_FOUND
> errors.

Alex Hung (alexhung) wrote :

The ACPI dump and fwts.log are attached here.

Alex Hung (alexhung) wrote :

Attached dmesg for c240

Alex Hung (alexhung) wrote :

Attached dmesg for c220

Alex Hung (alexhung) wrote :

From #6, we found that the HIGH failures (no handler for IPMI, and AE_NOT_EXIST) are actually the same bug, as they occur together as shown below:

[ 76.681758] ACPI Error: No handler for Region [POWS] (ffff881029494120) [IPMI] (20120320/evregion-376)
[ 76.681769] ACPI Error: Region IPMI (ID=7) has no handler (20120320/exfldio-306)
[ 76.681778] ACPI Error: Method parse/execution failed [\_SB_.M111._GAI] (Node ffff881029493960), AE_NOT_EXIST
[ 76.681795] ACPI Exception: AE_NOT_EXIST, Evaluating _GAI (20120320/power_meter-130)
[ 76.681884] ACPI Error: No handler for Region [POWS] (ffff881029494120) [IPMI] (20120320/evregion-376)
[ 76.681890] ACPI Error: Region IPMI (ID=7) has no handler (20120320/exfldio-306)
[ 76.681897] ACPI Error: Method parse/execution failed [\_SB_.M111._PMM] (Node ffff881029493910), AE_NOT_EXIST (20120320/psparse-536)
[ 76.681910] ACPI Exception: AE_NOT_EXIST, Evaluating _PMM (20120320/power_meter-340)
[ 76.685294] ACPI Error: No handler for Region [POWS] (ffff881029494120) [IPMI] (20120320/evregion-376)
[ 76.685306] ACPI Error: Region IPMI (ID=7) has no handler (20120320/exfldio-306)
[ 76.685315] ACPI Error: Method parse/execution failed [\_SB_.M111._GAI] (Node ffff881029493960), AE_NOT_EXIST (20120320/psparse-536)
[ 76.685332] ACPI Exception: AE_NOT_EXIST, Evaluating _GAI (20120320/power_meter-130)
[ 76.685431] ACPI Error: No handler for Region [POWS] (ffff881029494120) [IPMI] (20120320/evregion-376)
[ 76.685438] ACPI Error: Region IPMI (ID=7) has no handler (20120320/exfldio-306)
[ 76.685444] ACPI Error: Method parse/execution failed [\_SB_.M111._PMM] (Node ffff881029493910), AE_NOT_EXIST (20120320/psparse-536)
[ 76.685457] ACPI Exception: AE_NOT_EXIST, Evaluating _PMM (20120320/power_meter-340)

After inspecting the DSDT (in #5), we found that M111 is a "Power Meter" device with no ACPI handler installed for its OpRegion; therefore, _GAI and _PMM fail when they try to access the OpRegion.

The same applies to the c220 (#7).
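The grouping described above can be checked mechanically by splitting the dmesg excerpt into chains that each end at an "ACPI Exception ... Evaluating <method>" line. This is a minimal Python sketch, assuming the messages appear in the order shown in #6; the function name is hypothetical and nothing here comes from fwts.

```python
import re

# Matches the line that closes each error chain, e.g.
# "ACPI Exception: AE_NOT_EXIST, Evaluating _GAI (20120320/power_meter-130)".
EXCEPTION_RE = re.compile(r"ACPI Exception: (AE_\w+), Evaluating (\w+)")


def group_cascades(lines):
    """Split dmesg lines into (method, status, lines) chains.

    A chain ends at an "ACPI Exception ... Evaluating <method>" line. Any
    lines after the last such line are returned as a final chain whose
    method and status are None.
    """
    chains, current = [], []
    for line in lines:
        current.append(line)
        match = EXCEPTION_RE.search(line)
        if match:
            chains.append((match.group(2), match.group(1), current))
            current = []
    if current:
        chains.append((None, None, current))
    return chains


if __name__ == "__main__":
    excerpt = [
        "[ 76.681758] ACPI Error: No handler for Region [POWS] (ffff881029494120) [IPMI] (20120320/evregion-376)",
        "[ 76.681769] ACPI Error: Region IPMI (ID=7) has no handler (20120320/exfldio-306)",
        r"[ 76.681778] ACPI Error: Method parse/execution failed [\_SB_.M111._GAI] (Node ffff881029493960), AE_NOT_EXIST",
        "[ 76.681795] ACPI Exception: AE_NOT_EXIST, Evaluating _GAI (20120320/power_meter-130)",
    ]
    for method, status, chain in group_cascades(excerpt):
        # Each chain should start with the IPMI "No handler" error.
        print(method, status, "[IPMI]" in chain[0])
```

If every chain begins with the IPMI "No handler" error, the AE_NOT_EXIST failures are downstream of the missing IPMI region handler rather than separate firmware bugs.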

Conclusion: I talked to Ike, who commented that IPMI is not really controlled by the BIOS. Colin also suggested the same thing in #4. I think IPMI-related errors should not be a big concern.

This leaves the _OSC errors, which require further investigation.

Alex Hung (alexhung) wrote :

After checking #6 and #7, I found that only the c240 reports "ACPI _OSC request failed (AE_NOT_FOUND)". I don't see any other obvious errors, and checking the kernel code shows that the only effect is that ASPM is disabled, as dmesg shows. However, ASPM would be disabled anyway, because dmesg also shows "ACPI FADT declares the system doesn't support PCIe ASPM, so disable it".

As a result, I don't think there is any impact.
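The argument above rests on two independent reasons for ASPM being off. The Python sketch below models that: it reads the FADT flag the kernel checks before printing "ACPI FADT declares the system doesn't support PCIe ASPM" and combines it with the _OSC outcome. This is a simplified model, not the kernel's actual logic; the offset and bit assume the ACPI 2.0+ FADT layout, and the function names are hypothetical.

```python
import struct
import sys

# IA-PC Boot Architecture Flags (IAPC_BOOT_ARCH): assumed to be a 2-byte
# little-endian field at byte offset 109 of an ACPI 2.0+ FADT.
IAPC_BOOT_ARCH_OFFSET = 109
# Bit 4, "PCIe ASPM Controls": when set, OSPM must not enable ASPM.
FADT_NO_ASPM = 1 << 4


def fadt_forbids_aspm(fadt: bytes) -> bool:
    """Return True if the FADT's IAPC_BOOT_ARCH flags forbid ASPM."""
    (boot_arch,) = struct.unpack_from("<H", fadt, IAPC_BOOT_ARCH_OFFSET)
    return bool(boot_arch & FADT_NO_ASPM)


def aspm_disabled(fadt: bytes, osc_failed: bool) -> bool:
    """Simplified model: ASPM ends up off if the FADT forbids it or _OSC failed."""
    return fadt_forbids_aspm(fadt) or osc_failed


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: sudo python3 aspm_check.py /sys/firmware/acpi/tables/FACP
    with open(sys.argv[1], "rb") as f:
        table = f.read()
    print("FADT forbids ASPM:", fadt_forbids_aspm(table))
```

On the c240, the FADT flag alone already forces ASPM off, so in this model the _OSC failure does not change the outcome for ASPM.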

Alex Hung (alexhung) wrote :

@Colin,

I am going to look for previous bugs related to hotplug and _OSC errors. It is definitely a good idea to add more information about the impact to the fwts advice.

Jeff Lane (bladernr) wrote :

"Conclusion: I talked to Ike who commented that IPMI is not really controlled by BIOS. Colin also suggested the same thing in #4. I think IPMI-related errors should not be a big concern."

If that's the case, could we get these errors bumped down to Medium to prevent false positives?

And thank you all for the quick investigation and turnaround :)

Alex Hung (alexhung) wrote :

I will set this bug to Won't Fix and will keep an eye on upstream for IPMI patches.

Changed in fwts:
importance: Critical → Medium
status: Incomplete → Won't Fix