DMAR initialization errors with newer Intel BIOSes
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Jim Somerville
Bug Description
Brief Description
-----------------
We are experiencing errors when trying to start ovs-dpdk on some wolfpass systems with newer Intel BIOSes. The errors manifest as a PCI bind error when running a command such as:
controller-1:~$ sudo /usr/share/
Password:
Error: bind failed for 0000:19:00.0 - Cannot bind to driver vfio-pci
Wolfpass systems with older Intel BIOS are not experiencing the same errors.
While investigating possible reasons for this error we realized that the IOMMU was not configured properly for this device (0000:19:00.0). (An IOMMU configuration is a requirement of the vfio-pci driver.) We noticed that the dmesg log on the system was missing entries when compared to working systems with older BIOSes. The following logs were missing:
controller-0:~$ dmesg |grep -i mmu|grep 19:00
[ 6.718559] iommu: Adding device 0000:19:00.0 to group 24
[ 6.723989] iommu: Adding device 0000:19:00.1 to group 25
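The missing IOMMU group assignment can also be checked directly in sysfs. A minimal sketch, assuming the standard Linux sysfs layout (the helper name is my own, not part of any tool used above):

```python
import os

# Sketch: vfio-pci can only bind a device the kernel has placed in an
# IOMMU group, which appears as an "iommu_group" symlink under the
# device's sysfs node (standard layout assumed).
def iommu_group(pci_addr, sysfs="/sys/bus/pci/devices"):
    link = os.path.join(sysfs, pci_addr, "iommu_group")
    if not os.path.islink(link):
        # No symlink: DMAR/IOMMU initialization never covered this device,
        # matching the missing "iommu: Adding device ..." dmesg lines.
        return None
    # The symlink target ends in the group number, e.g. .../iommu_groups/24
    return os.path.basename(os.readlink(link))
```

On a working system this would report group 24 for 0000:19:00.0, while on the broken system the symlink is absent and the check returns None.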
Looking at the full dmesg log I noticed the following entry, which appears when DMAR initialization starts. After this log, none of the normal DMAR initialization output appears, unlike on the other boards.
[ 3.680989] DMAR: Device scope type does not match for 0000:17:00.0
Therefore, it looks like DMAR initialization is bailing out rather than continuing. I tracked that log down to the following kernel patch, which confirms that initialization aborts when this log is output. I am guessing we do not have this patch in our kernel; otherwise DMAR initialization should continue normally even though there is a scope mismatch.
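The effect of that patch can be illustrated with a toy model. This is a simplified sketch in Python, not the actual kernel code; the entry values mirror the DMAR dump shown later in this report:

```python
# Simplified model of DMAR device-scope matching: the unpatched logic
# aborts on the first scope-type mismatch, while the patched logic skips
# the mismatching entry and keeps going.

BRIDGE, ENDPOINT = 2, 1  # DMAR device scope types (bridge / endpoint)

def match_scopes_unpatched(entries, want_type):
    matched = []
    for bus, path, scope_type in entries:
        if scope_type != want_type:
            # Pre-patch behavior: initialization bails out here.
            raise RuntimeError(
                f"Device scope type does not match for {bus:02x}:{path}")
        matched.append((bus, path))
    return matched

def match_scopes_patched(entries, want_type):
    matched = []
    for bus, path, scope_type in entries:
        if scope_type != want_type:
            continue  # Patched behavior: log and move to the next entry.
        matched.append((bus, path))
    return matched

# Entries resembling the broken BIOS: bus 0x17 listed both as an
# endpoint (in an RMRR) and as a bridge (in an ATSR).
entries = [(0x17, "00,00", ENDPOINT), (0x17, "00,00", BRIDGE)]
```

With these entries, the unpatched matcher raises on the endpoint entry, while the patched matcher skips it and still finds the bridge.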
https:/
To confirm that there was truly a mismatch on that device I used the following command to dump the DMAR table on both nodes.
cat /sys/firmware/
Then I copied those files from both systems to my own Linux machine and used the following commands to parse the DMAR tables; the decoded output is written to the same filename with a ".dsl" extension instead of ".raw".
iasl -d new-system.dmar.raw
iasl -d old-system.dmar.raw
The older (good) system reports the following entry:
[1B0h 0432 2] Subtable Type : 0002 [Root Port ATS Capability]
[1B2h 0434 2] Length : 0030
[1B4h 0436 1] Flags : 00
[1B5h 0437 1] Reserved : 00
[1B6h 0438 2] PCI Segment Number : 0000
[1B8h 0440 1] Device Scope Type : 02 [PCI Bridge Device]
[1B9h 0441 1] Entry Length : 08
[1BAh 0442 2] Reserved : 0000
[1BCh 0444 1] Enumeration ID : 00
[1BDh 0445 1] PCI Bus Number : 17
[1BEh 0446 2] PCI Path : 00,00
while the newer (broken) system reports multiple entries for bus 17 path 00,00, which is probably what causes the DMAR initialization to error out:
[1B0h 0432 2] Subtable Type : 0001 [Reserved Memory Region]
[1B2h 0434 2] Length : 0020
[1B4h 0436 2] Reserved : 0000
[1B6h 0438 2] PCI Segment Number : 0000
[1B8h 0440 8] Base Address : 0000000052CC8000
[1C0h 0448 8] End Address (limit) : 000000005ACCFFFF
[1C8h 0456 1] Device Scope Type : 01 [PCI Endpoint Device]
[1C9h 0457 1] Entry Length : 08
[1CAh 0458 2] Reserved : 0000
[1CCh 0460 1] Enumeration ID : 00
[1CDh 0461 1] PCI Bus Number : 17
[1CEh 0462 2] PCI Path : 00,00
[1D0h 0464 2] Subtable Type : 0002 [Root Port ATS Capability]
[1D2h 0466 2] Length : 0030
[1D4h 0468 1] Flags : 00
[1D5h 0469 1] Reserved : 00
[1D6h 0470 2] PCI Segment Number : 0000
[1D8h 0472 1] Device Scope Type : 02 [PCI Bridge Device]
[1D9h 0473 1] Entry Length : 08
[1DAh 0474 2] Reserved : 0000
[1DCh 0476 1] Enumeration ID : 00
[1DDh 0477 1] PCI Bus Number : 17
[1DEh 0478 2] PCI Path : 00,00
Comparing the two decoded files, I noticed that the newer system has an additional entry for bus number 17 path 00,00, reported as a "PCI Endpoint Device" in addition to a "PCI Bridge Device", whereas the older system reports only a single "PCI Bridge Device" entry.
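One way to spot such duplicates mechanically is to scan the disassembled dump for bus/path pairs that appear in more than one device-scope entry. A rough sketch, with the regexes assuming the iasl output format shown above (the function name is my own):

```python
import re
from collections import Counter

# Sketch: find PCI bus/path pairs that occur in multiple device-scope
# entries of an iasl-disassembled DMAR table. Pairing by position assumes
# each scope entry contributes exactly one bus line and one path line.
def duplicate_scopes(dsl_text):
    buses = re.findall(r"PCI Bus Number : ([0-9A-Fa-f]+)", dsl_text)
    paths = re.findall(r"PCI Path : ([0-9A-Fa-f,]+)", dsl_text)
    counts = Counter(zip(buses, paths))
    return {pair: n for pair, n in counts.items() if n > 1}

# Trimmed-down excerpt of the broken system's dump:
sample = """\
[1C8h 0456 1] Device Scope Type : 01 [PCI Endpoint Device]
[1CDh 0461 1] PCI Bus Number : 17
[1CEh 0462 2] PCI Path : 00,00
[1D8h 0472 1] Device Scope Type : 02 [PCI Bridge Device]
[1DDh 0477 1] PCI Bus Number : 17
[1DEh 0478 2] PCI Path : 00,00
"""
print(duplicate_scopes(sample))  # → {('17', '00,00'): 2}
```

Run against the two dumps, this would flag bus 17 path 00,00 only on the newer system.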
The only significant difference related to the PCI setup that I could find is that the BIOS and IFWI versions differ between the two systems:
older system:
BIOS: SE5C620.
IFWI: 2018.10.
newer system:
BIOS: SE5C620.
IFWI: 2019.12.
We downgraded the broken system to the same BIOS and firmware version as the good system and the problem went away, so this is clearly an incompatibility between the newer BIOS and our older kernel. I am opening this LP to track the porting of the aforementioned kernel patch to our kernel (if not already present).
Severity
--------
Critical, hosts won't unlock if running latest BIOS on wolfpass hardware.
Steps to Reproduce
------------------
Install a system onto wolfpass hardware with the latest BIOS, installing both ovs-dpdk and OpenStack.
Expected Behavior
------------------
The ovs-dpdk application should start and the hosts should unlock.
Actual Behavior
----------------
See error above.
Reproducibility
---------------
100%
System Configuration
-------
AIO-DX, but likely all systems with wolfpass and latest BIOS.
Branch/Pull Time/Commit
-------
2019/10/04
Last Pass
---------
Unknown
Timestamp/Logs
--------------
See above
Test Activity
-------------
Feature testing
Changed in starlingx: importance: Medium → Low
Marking as stx.3.0 / medium priority - issue impacts updating to the latest NIC firmware