IOMMU error loop early in boot

Bug #894070 reported by C de-Avillez on 2011-11-23
This bug affects 9 people
Affects: Fedora | Status: Fix Released | Importance: Medium
Affects: linux (Ubuntu) | Importance: High | Assigned to: Leann Ogasawara

Bug Description

After updating to linux 3.2, all boots go into a loop of DMAR error messages (fault reason 02). This loop seems unending, and requires a power-cycle.

After some experiments I found I can only boot 3.2 by passing 'intel_iommu=off' as a boot parameter. Up to and including 3.1, the IOMMU did not show any visible issue.

ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: linux-image-3.2.0-1-generic 3.2.0-1.3
ProcVersionSignature: Ubuntu 3.2.0-1.3-generic 3.2.0-rc2
Uname: Linux 3.2.0-1-generic x86_64
NonfreeKernelModules: wl
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.24.
ApportVersion: 1.26-0ubuntu1
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC2: cerdea 2911 F.... pulseaudio
 /dev/snd/controlC0: cerdea 2911 F.... pulseaudio
 /dev/snd/controlC1: cerdea 2911 F.... pulseaudio
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info:
 Card hw:0 'Intel'/'HDA Intel at 0xe9660000 irq 44'
   Mixer name : 'IDT 92HD81B1C5'
   Components : 'HDA:111d76d5,1028040a,00100104'
   Controls : 15
   Simple ctrls : 10
Card1.Amixer.info:
 Card hw:1 'Set'/'C-Media USB Headphone Set at usb-0000:00:1d.0-1.4.2, full speed'
   Mixer name : 'USB Mixer'
   Components : 'USB0d8c:000c'
   Controls : 7
   Simple ctrls : 3
Card2.Amixer.info:
 Card hw:2 'NVidia'/'HDA NVidia at 0xe3080000 irq 17'
   Mixer name : 'Nvidia GPU 0b HDMI/DP'
   Components : 'HDA:10de000b,10de0101,00100200'
   Controls : 20
   Simple ctrls : 4
CheckboxSubmission: c8a7d84e13c3b258e707f056604eb0e0
CheckboxSystem: d00f84de8a555815fa1c4660280da308
Date: Wed Nov 23 10:48:51 2011
HibernationDevice: RESUME=UUID=5aeaf922-8187-4663-b93d-08b2df7b025e
MachineType: Dell Inc. Latitude E6410
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-3.2.0-1-generic root=/dev/mapper/hostname--vg-hostname--root ro intel_iommu=off
RelatedPackageVersions:
 linux-restricted-modules-3.2.0-1-generic N/A
 linux-backports-modules-3.2.0-1-generic N/A
 linux-firmware 1.62
SourcePackage: linux
UpgradeStatus: Upgraded to precise on 2011-11-06 (16 days ago)
dmi.bios.date: 05/26/2011
dmi.bios.vendor: Dell Inc.
dmi.bios.version: A09
dmi.board.name: 0K42JR
dmi.board.vendor: Dell Inc.
dmi.board.version: A01
dmi.chassis.type: 9
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvrA09:bd05/26/2011:svnDellInc.:pnLatitudeE6410:pvr0001:rvnDellInc.:rn0K42JR:rvrA01:cvnDellInc.:ct9:cvr:
dmi.product.name: Latitude E6410
dmi.product.version: 0001
dmi.sys.vendor: Dell Inc.

Description of problem:
/var/log/messages fills all disk space with error messages from DMAR.
HW is Lenovo T410 with Intel CORE i5 cpu.

How reproducible:
100% failure when feature activated in BIOS before booting Fedora.
No failure when feature deactivated in BIOS before booting Fedora.

Steps to Reproduce:
1. Power off system
2. Power on system and enter BIOS setup
3. From CPU option, enable VT-d feature
4. Save BIOS settings & boot system

Actual results:
Error messages flood /var/log/messages (1+ set of messages follows):
Jun 19 00:38:11 PLS-T410 kernel: DMAR:[fault reason 02] Pce [0d:00.0] fault addr fce [0d:00.0] fault addr fffDMAR:[fault reace [0d:00.0] fault addr ce [0d:00.0] fault addr fffffDMAR:[fault reasce [0d:00.0] fault addr fffff000
Jun 19 00:38:11 PLS-T410 kernel: DMAR:[fault reason 02] Prce [0d:00.0] fault addr fffff000
Jun 19 00:38:11 PLS-T410 kernel: DMAR:[fault reasonce [0d:00.0] fault addr DMAR:[fault reasce [0d:00.0] fault addr fffff000
Jun 19 00:38:11 PLS-T410 kernel: DMAR:[fault reason 02]ce [0d:00.0] fault addr fffff000
Jun 19 00:38:11 PLS-T410 kernel: DMAR:[fault reason 02] Present bice [0d:00.0] fault adce [0d:00.0] fault addr fffff000 DMAR:[fault reasce [0d:00.0] fault addr fffff000DMAR:[fault rece [0d:00.0] fault addr fffff000
Jun 19 00:38:11 PLS-T410 kernel: DMAR:[fault reason 02] Present bit in contexce [0d:00.0] fault addr fffff000
Jun 19 00:38:11 PLS-T410 kernel: DMARce [0d:00.0] fault addr fffff000
Jun 19 00:38:11 PLS-T410 kernel: DMAR:[fault reason 0ce [0d:00.0] fault addr fffff000
Jun 19 00:38:11 PLS-T410 kernel: DMAR:[fault reason 02] Prce [0d:00.0] fault addr fffff000
Jun 19 00:38:11 PLS-T410 kernel: DMAR:[fault reasonce [0d:00.0] fault addr fffff000

Expected results:
No error messages from DMAR

Additional info:
from messages file:
Linux version 2.6.33.3-85.fc13.x86_64 (<email address hidden>) (gcc version 4.4.4 20100503 (Red Hat 4.4.4-2) (GCC) ) #1 SMP Thu May 6 18:09:49 UTC 2010

kernel: CPU0: Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz stepping 02
kernel: Booting Node 0, Processors #1
kernel: CPU 1 MCA banks SHD:2 SHD:3 SHD:5 SHD:6
kernel: #2
kernel: CPU 2 MCA banks CMCI:2 CMCI:3 CMCI:5 SHD:6
kernel: #3 Ok.
kernel: CPU 3 MCA banks SHD:2 SHD:3 SHD:5 SHD:6
kernel: Brought up 4 CPUs
kernel: Total of 4 processors activated (19150.66 BogoMIPS).

Temporary resolution:
1. Start system from power off state and enter BIOS setup
2. Deactivate VT-d feature
3. Save configuration and reboot.

(In reply to comment #1)
> Linux version 2.6.33.3-85.fc13.x86_64 (<email address hidden>)
> (gcc version 4.4.4 20100503 (Red Hat 4.4.4-2) (GCC) ) #1 SMP Thu May 6 18:09:49
> UTC 2010
>

There have been three kernel updates since that version. Did you try any of them before reporting this problem?

I have the same problem here on 2.6.33.5-124.fc13.i686.PAE.

Having the same problem on a ThinkPad T410 with kernel-2.6.33.5-124.fc13.x86_64 (latest bios 1.25)

This is the PCI device in question

# lspci -vvnn -s 0d:00.0
0d:00.0 SD Host controller [0805]: Ricoh Co Ltd Device [1180:e822] (rev 01)
 Subsystem: Lenovo Device [17aa:2133]
 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ >SERR- <PERR- INTx-
 Latency: 0, Cache Line Size: 64 bytes
 Interrupt: pin A routed to IRQ 16
 Region 0: Memory at f2500000 (32-bit, non-prefetchable) [size=256]
 Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
  Address: 0000000000000000 Data: 0000
 Capabilities: [78] Power Management version 3
  Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
  Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=2 PME-
 Capabilities: [80] Express (v1) Endpoint, MSI 00
  DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
   ExtTag- AttnBtn+ AttnInd+ PwrInd+ RBE+ FLReset-
  DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
   RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
   MaxPayload 128 bytes, MaxReadReq 512 bytes
  DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
  LnkCap: Port #1, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <4us, L1 <64us
   ClockPM+ Surprise- LLActRep- BwNot-
  LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+
   ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
  LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
 Capabilities: [100 v1] Virtual Channel
  Caps: LPEVC=0 RefClk=100ns PATEntrySize=0
  Arb: Fixed- WRR32- WRR64- WRR128- 100ns- - - onfig- TableOffset=0
  Ctrl: ArbSelect=Fixed
  Status: InProgress-
  VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
   Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256- Fixed- RR32-
   Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
   Status: NegoPending- InProgress-
 Capabilities: [800 v1] Advanced Error Reporting
  UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
  UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
  UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
  CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
  CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
  AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
 Kernel driver in use: sdhci-pci
 Kernel modules: sdhci-pci

I'm working on a quirk to handle this buggy device.

*** Bug 587178 has been marked as a duplicate of this bug. ***

This happens on at least two different kinds of Ricoh multifunction devices. In one case device 00.0 is a cardbus bridge and in the other case it's shown as an SD controller:

0d:00.0 SD Host controller [0805]: Ricoh Co Ltd Device [1180:e822] (rev 01)
0d:00.1 System peripheral [0880]: Ricoh Co Ltd Device [1180:e230] (rev 01)
0d:00.3 FireWire (IEEE 1394) [0c00]: Ricoh Co Ltd Device [1180:e832] (rev 01)

04:00.0 CardBus bridge [0607]: Ricoh Co Ltd Device [1180:e476] (rev 02)
04:00.4 FireWire (IEEE 1394) [0c00]: Ricoh Co Ltd Device [1180:e832] (rev 03)
(prog-if 10)

Just in case anybody else is in a hurry and needs a workaround, here's a patch (against 2.6.35-rc5) that kludges all the DMAR mappings to point to the first device. Tested with Cardbus (a USB controller) and Firewire, but don't be surprised if it glues your cat to the carpet or something.

--- a/drivers/pci/intel-iommu.c 2010-07-13 07:55:33.000000000 +1000
+++ b/drivers/pci/intel-iommu.c 2010-08-03 22:19:09.000000000 +1000
@@ -2560,10 +2560,12 @@
  return 0;
 }

+struct pci_dev *ricohdev = 0;
+
 static dma_addr_t __intel_map_single(struct device *hwdev, phys_addr_t paddr,
          size_t size, int dir, u64 dma_mask)
 {
- struct pci_dev *pdev = to_pci_dev(hwdev);
+ struct pci_dev *tmp, *pdev = to_pci_dev(hwdev);
  struct dmar_domain *domain;
  phys_addr_t start_paddr;
  struct iova *iova;
@@ -2574,6 +2576,17 @@

  BUG_ON(dir == DMA_NONE);

+ tmp = (pdev->vendor==0x1180) ? pdev : pdev->bus->self;
+ if (tmp && tmp->vendor==0x1180 &&
+ (tmp->device==0xe822 ||
+ tmp->device==0xe230 ||
+ tmp->device==0xe832 ||
+ tmp->device==0xe476)) {
+ if (!ricohdev)
+ ricohdev = pci_get_domain_bus_and_slot(0, tmp->bus->number, tmp->devfn & ~7);
+ pdev = ricohdev;
+ }
+
  if (iommu_no_mapping(hwdev))
   return paddr;

@@ -2716,7 +2729,7 @@
         size_t size, enum dma_data_direction dir,
         struct dma_attrs *attrs)
 {
- struct pci_dev *pdev = to_pci_dev(dev);
+ struct pci_dev *tmp, *pdev = to_pci_dev(dev);
  struct dmar_domain *domain;
  unsigned long start_pfn, last_pfn;
  struct iova *iova;
@@ -2724,6 +2737,15 @@

  if (iommu_no_mapping(dev))
   return;
+
+ tmp = (pdev->vendor==0x1180) ? pdev : pdev->bus->self;
+ if (tmp && tmp->vendor==0x1180 &&
+ (tmp->device==0xe822 ||
+ tmp->device==0xe230 ||
+ tmp->device==0xe832 ||
+ tmp->device==0xe476)) {
+ pdev = ricohdev;
+ }

  domain = find_domain(pdev);
  BUG_ON(!domain);
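
The `tmp->devfn & ~7` expression in the patch above relies on how PCI packs the slot and function numbers into a single byte. The following is a minimal sketch of that arithmetic only, not of the kernel code; the helper names (`make_devfn`, `function0_of`, `requester_id`) are hypothetical and chosen for illustration.

```c
#include <stdint.h>

/* PCI encodes slot (5 bits) and function (3 bits) in one byte:
 * slot in bits 7..3, function in bits 2..0. */
uint8_t make_devfn(uint8_t slot, uint8_t fn)
{
    return (uint8_t)(((slot & 0x1f) << 3) | (fn & 0x07));
}

/* Clear the function bits, i.e. the same mask the patch applies
 * to redirect a mapping to function 0 of the same slot. */
uint8_t function0_of(uint8_t devfn)
{
    return (uint8_t)(devfn & ~7);
}

/* 16-bit bus/devfn identifier, as printed in the form [bb:dd.f]. */
uint16_t requester_id(uint8_t bus, uint8_t devfn)
{
    return (uint16_t)((bus << 8) | devfn);
}
```

With these helpers, both 0d:00.1 and 0d:00.3 from the listing above reduce to 0d:00.0, which is the device named in the fault messages.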

ThinkPad T510 w/ i7 (common devel box) also has this device and issue.

Just verified that this problem still exists on my T510 with 2.6.34.6-47.fc13.x86_64.

This issue occurs on my T410 with Core i5 540m.

Also a problem on my ThinkPad T510 with i7 processor.

This is a hardware bug, and there's not really any good way to work around it.

The 'workaround' for now is to disable DMAR by either disabling VT-d in your BIOS, or by booting with intel_iommu=off.

Thanks, Kyle
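
To confirm that the boot-parameter workaround is actually in effect, one can look for the token on the kernel command line. The sketch below is a user-space illustration under the assumption that the command line has already been read into a string (for example from /proc/cmdline); `cmdline_has_param` is a hypothetical helper, not a kernel or libc function.

```c
#include <string.h>

/* Returns 1 if `param` appears as a whole whitespace-separated
 * token in `cmdline`, 0 otherwise. Substring matches do not count. */
int cmdline_has_param(const char *cmdline, const char *param)
{
    size_t plen = strlen(param);
    const char *p = cmdline;

    while (*p != '\0') {
        p += strspn(p, " \t\n");            /* skip separators */
        size_t tok = strcspn(p, " \t\n");   /* length of next token */
        if (plen > 0 && tok == plen && strncmp(p, param, plen) == 0)
            return 1;
        p += tok;
    }
    return 0;
}
```

For the command line quoted in the Ubuntu report (`... ro intel_iommu=off`), the call `cmdline_has_param(line, "intel_iommu=off")` returns 1.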

After disabling Firewire in the BIOS, the errors ceased. I left the memory card reader on, but don't have a card to test it at the moment.

(In reply to comment #14)
> After disabling Firewire in the BIOS, the errors ceased. I left the memory
> card reader on, but don't have a card to test it at the moment.

Where did you find the option to disable Firewire? I just tried to do this on my T510 (BIOS 1.27), and I can't find anywhere to turn it off.

I am running BIOS 1.24 according to dmidecode, and it was located in the Security Section under I/O Port Access. I just checked that with my older T61 and the setting seems to be in the same place.

(In reply to comment #16)
> I am running BIOS 1.24 according to dmidecode, and it was located in the
> Security Section under I/O Port Access. I just checked that with my older T61
> and the setting seems to be in the same place.

Because obviously, that should be under Security, not Config. :-(

On the bright side, I'm seeing the same thing. Once IEEE1394 is disabled, I can run with VT-d enabled without any error messages in the log.

*** Bug 634135 has been marked as a duplicate of this bug. ***

This bug is severe enough to cause hard lockups under some conditions.
David, did you get anywhere on writing a quirk? My patch does what's needed, but feels like an ugly patch on a critical path...

Same issue here, 2.6.34.6-54.fc13.x86_64 on a brand new Lenovo Thinkpad T510+.

*** Bug 635678 has been marked as a duplicate of this bug. ***

I also hit this problem on a T410. It took me almost a day to understand that the problem is in the kernel, after trying different installations (F13 and F14 live/DVD).

I used intel_iommu=off. Now it feels so good to see the real dmesg output after starting my laptop; before, it was filled with those 3 debug lines repeated over and over.

If this is not cloned for Fedora 14, then we should open a new bug and mark it as F14Target.

We had this problem with a number of T510s in the office. As well as the issue with flooding /var/log/messages it can also manifest during the installation phase where disks are being fscked.

On investigation we have found that, on these machines at least, turning off VT-d in the BIOS corrected the problem. The other virtualization-related options in the BIOS can, however, be left on with no apparent ill effects.

kernel-2.6.35.6-45.fc14 has been submitted as an update for Fedora 14.
https://admin.fedoraproject.org/updates/kernel-2.6.35.6-45.fc14

kernel-2.6.34.7-61.fc13 has been submitted as an update for Fedora 13.
https://admin.fedoraproject.org/updates/kernel-2.6.34.7-61.fc13

I installed kernel-2.6.34.7-61.fc13 on my T410, and enabled VT-d in the BIOS. With this, I am no longer seeing the DMAR faults.

Will this fix also be pushed for RHEL6?

kernel-2.6.35.6-45.fc14 has been pushed to the Fedora 14 stable repository. If problems still persist, please make note of it in this bug report.

kernel-2.6.34.7-61.fc13 has been pushed to the Fedora 13 stable repository. If problems still persist, please make note of it in this bug report.

Will this fix also be pushed for RHEL6?


Upgraded firmware on Mellanox InfiniBand ConnectX hardware and got the following spam in the system log.

[ 1680.962538] DRHD: handling fault status reg 302
[ 1680.967048] DMAR:[DMA Read] Request device [04:00.6] fault addr f647a000
[ 1680.967049] DMAR:[fault reason 02] Present bit in context entry is clear
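
When the log is flooded with these messages, it helps to pull out which device and address each fault names. The following is a small sketch of a parser for the "Request device" line format shown above; `struct dmar_fault` and `parse_dmar_fault` are hypothetical names introduced for illustration, and the format string assumes the exact wording seen in this log.

```c
#include <stdio.h>
#include <string.h>

/* Fields of a "DMAR:[...] Request device [bb:dd.f] fault addr X" line. */
struct dmar_fault {
    unsigned int bus;
    unsigned int dev;
    unsigned int fn;
    unsigned long long addr;
};

/* Returns 1 and fills *out if the line carries a request-device record,
 * 0 otherwise (for example for the "fault reason" line that follows it). */
int parse_dmar_fault(const char *line, struct dmar_fault *out)
{
    const char *p = strstr(line, "Request device [");
    if (p == NULL)
        return 0;
    /* All four fields are printed in hexadecimal. */
    return sscanf(p, "Request device [%x:%x.%x] fault addr %llx",
                  &out->bus, &out->dev, &out->fn, &out->addr) == 4;
}
```

Counting the parsed records per bus/device/function is then enough to see whether all faults come from a single device.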

This also broke infiniband on f13/f14 machines. I used the iommu workaround described above, and infiniband now works.

RHEL6.0 doesn't exhibit this problem with the updated firmware:

Linux mrg-03.mpc.lab.eng.bos.redhat.com 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue Mx

Is this a kernel.org regression?

lspci results:
[root@mrg-04 ~]# lspci
00:00.0 Host bridge: Intel Corporation 5520 I/O Hub to ESI Port (rev 13)
00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Po)
00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Po)
00:04.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 4 )
00:05.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 5 )
00:06.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 6 )
00:07.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Po)
00:09.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Po)
00:14.0 PIC: Intel Corporation 5520/5500/X58 I/O Hub System Management Register)
00:14.1 PIC: Intel Corporation 5520/5500/X58 I/O Hub GPIO and Scratch Pad Regis)
00:14.2 PIC: Intel Corporation 5520/5500/X58 I/O Hub Control Status and RAS Reg)
00:1a.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Control)
00:1a.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Control)
00:1a.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Contro)
00:1d.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Control)
00:1d.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Control)
00:1d.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Contro)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 92)
00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller ()
00:1f.2 IDE interface: Intel Corporation 82801IB (ICH9) 2 port SATA IDE Control)
01:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit )
01:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit )
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit )
02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit )
03:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 0)
04:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s )
08:03.0 VGA compatible controller: Matrox Graphics, Inc. MGA G200eW WPCM450 (re)
fe:00.0 Host bridge: Intel Corporation Xeon 5500/Core i7 QuickPath Architecture)
fe:00.1 Host bridge: Intel Corporation Xeon 5500/Core i7 QuickPath Architecture)
fe:02.0 Host bridge: Intel Corporation Xeon 5500/Core i7 QPI Link 0 (rev 05)
fe:02.1 Host bridge: Intel Corporation Xeon 5500/Core i7 QPI Physical 0 (rev 05)
fe:02.4 Host bridge: Intel Corporation Xeon 5500/Core i7 QPI Link 1 (rev 05...


This message is a reminder that Fedora 13 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 13. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora
'version' of '13'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version prior to Fedora 13's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that
we may not be able to fix it before Fedora 13 is end of life. If you
would still like to see this bug fixed and are able to reproduce it
against a later version of Fedora please change the 'version' of this
bug to the applicable version. If you are unable to change the version,
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

The process we are following is described here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

(In reply to comment #30)
> Upgraded firmware on mellanox infiniband connectx hardware and got following
> spam in system log.
>
> [ 1680.962538] DRHD: handling fault status reg 302
> [ 1680.967048] DMAR:[DMA Read] Request device [04:00.6] fault addr f647a000
> [ 1680.967049] DMAR:[fault reason 02] Present bit in context entry is clear
>

This has absolutely nothing to do with workarounds needed for buggy Ricoh multifunction devices. Please open a new bug...

C de-Avillez (hggdh2) wrote :
Brad Figg (brad-figg) on 2011-11-23
Changed in linux (Ubuntu):
status: New → Confirmed
C de-Avillez (hggdh2) wrote :
Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Triaged
tags: added: kernel-da-key
Changed in linux (Ubuntu):
importance: Medium → High
tags: added: kernel-key
Joseph Salisbury (jsalisbury) wrote :

Hi Carlos,

I've added this bug to the kernel team hotlist:
http://reports.qa.ubuntu.com/reports/kernel-bugs/reports/_kernel_hot_.html

Just adding upstream thread for additional information.

https://lkml.org/lkml/2010/5/22/69

For now I suggest you continue to use the workaround of 'intel_iommu=off' until we are able to investigate further.

Joseph Salisbury (jsalisbury) wrote :

Hi Carlos,

I'm going to do some bisecting to identify when the bug was introduced. Before starting, could you test the 3.2.0-2.5 kernel just to see if it happens to be fixed?

C de-Avillez (hggdh2) wrote :

(for the record, already answered on IRC)

Tested, still fails. Interesting that the RH bugs talk about VT-d. I do not remember if this laptop's BIOS has VT-d as an option (I am sure it has a VT option, probably VT-x). OTOH, it is debatable whether the options displayed in the BIOS have a direct relation with reality. Will check later.

Also for the record, this machine runs under an encrypted LVM, with multiple filesystems under the LVM.

Joseph Salisbury (jsalisbury) wrote :

Hi Carlos,

Can you also confirm that this was not an issue with the Ubuntu 3.1.0-1.3 kernel? That way we have a starting point for the bisect.

Joseph Salisbury (jsalisbury) wrote :

In my prior comment I meant the 3.1.0-2.3 kernel:
https://launchpad.net/ubuntu/+source/linux/3.1.0-2.3/+build/2885386

I would also like to have you test a couple of mainline kernels, to rule out an Ubuntu patch as a cause. Can you test with the upstream v3.1 kernel and upstream v3.2-rc2 kernel?

v3.1:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.1-oneiric/

v3.2-rc2:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.2-rc2-oneiric/

We would expect v3.1 to work and v3.2-rc2 to exhibit the issue.

So to summarize, can you please test the following kernels and report the results:
1. Ubuntu 3.1.0-2.3
2. Mainline 3.1
3. Mainline 3.2-rc2

Once we have this data, we will know the best starting version to start the bisect.

C de-Avillez (hggdh2) wrote :

1. Ubuntu 3.1.0-2.3 -- works. This is the last 3.1 kernel I installed on this machine
2. Mainline 3.2-rc1 -- works
3. Mainline 3.2-rc2 -- works.

OTOH I lost wireless on the 3.2 mainline.

Still to test 3.1 mainline

Robert Hooker (sarvatt) wrote :

04:00.0 maps to your sd reader. The mainline kernels were using older configs, and 3.2 in precise enabled CONFIG_MEMSTICK_R592 which hggdh's SD reader device on 04:00.0 is using. It looks like this patch may fix it.

 http://pkgs.fedoraproject.org/gitweb/?p=kernel.git;a=blob;f=dmar-disable-when-ricoh-multifunction.patch;h=a4528617ecfdc437072e96c47cafbe10b5e5478f;hb=HEAD

Robert Hooker (sarvatt) wrote :

Test kernel with that patch applied. What it does is disable intel_iommu when it finds one of the 4 ricoh devices because it leads to problems like this.

http://kernel.ubuntu.com/~sarvatt/lp894070/
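
The decision this comment describes (turn the IOMMU off when one of the known Ricoh devices is present) can be sketched as a table lookup over the detected PCI IDs. This is a user-space illustration, not the Fedora patch itself; the table reuses the four IDs from the earlier workaround patch in this thread, and it is an assumption that the Fedora patch matches exactly the same set. `is_listed_ricoh` and `should_disable_iommu` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

struct pci_id {
    uint16_t vendor;
    uint16_t device;
};

/* Vendor 0x1180 is Ricoh; device IDs taken from the earlier patch
 * in this thread (assumed, not confirmed, to match the Fedora patch). */
static const struct pci_id ricoh_ids[] = {
    { 0x1180, 0xe822 },
    { 0x1180, 0xe230 },
    { 0x1180, 0xe832 },
    { 0x1180, 0xe476 },
};

/* Returns 1 if the vendor/device pair is in the table above. */
int is_listed_ricoh(uint16_t vendor, uint16_t device)
{
    size_t i;

    for (i = 0; i < sizeof(ricoh_ids) / sizeof(ricoh_ids[0]); i++) {
        if (ricoh_ids[i].vendor == vendor && ricoh_ids[i].device == device)
            return 1;
    }
    return 0;
}

/* Returns 1 if any device in the list is a listed Ricoh function. */
int should_disable_iommu(const struct pci_id *devs, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (is_listed_ricoh(devs[i].vendor, devs[i].device))
            return 1;
    }
    return 0;
}
```

The trade-off of this approach, compared with remapping requests to function 0, is that affected machines lose the IOMMU entirely rather than only for the Ricoh functions.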

Joseph Salisbury (jsalisbury) wrote :

git commit that caused issue:
fe763ab898670195870b889e145483ce5c8997d5 UBUNTU: [Config] CONFIG_MEMSTICK_R592=m

C de-Avillez (hggdh2) wrote :

Sarvatt's test kernel works. We have a deal :-)

tags: added: rls-p-tracking
Changed in linux (Ubuntu):
assignee: nobody → Leann Ogasawara (leannogasawara)
milestone: none → precise-alpha-2
tags: removed: kernel-da-key kernel-key
2pac (la-tupac) wrote :

Good job! Works fine on i7-950/ASUS P6X58D-E and GTX-260.

Changed in linux (Ubuntu):
status: Triaged → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.2.0-4.10

---------------
linux (3.2.0-4.10) precise; urgency=low

  [ Kyle McMartin ]

  * SAUCE: dmar: disable if ricoh multifunction detected
    - LP: #894070

  [ Seth Forshee ]

  * SAUCE: dell-wmi: Demote unknown WMI event message to pr_debug
    - LP: #581312

  [ Tim Gardner ]

  * Start new release, Bump ABI, rebase to 3.2-rc5

  [ Leann Ogasawara ]

  * [Config] Enable CONFIG_SENSORS_AK8975=m
 -- Tim Gardner <email address hidden> Sat, 10 Dec 2011 08:57:04 -0700

Changed in linux (Ubuntu):
status: Fix Committed → Fix Released

Hi,

I've built an alternative test kernel which reverts the out-of-tree patch we applied and instead disables CONFIG_INTEL_IOMMU_DEFAULT_ON. Could you please test this kernel and let me know your results, i.e. confirm that the loop of DMAR error messages does not reappear? Thanks in advance.

http://people.canonical.com/~ogasawara/lp894070/amd64/

C de-Avillez (hggdh2) wrote :

@Leann: no loop, no excessive DMAR messages being issued:

[cerdea-aws]cerdea@xango3$ cat /proc/version_signature
Ubuntu 3.2.0-9.17~lp894070v1-generic 3.2.1
[cerdea-aws]cerdea@xango3$ dmesg|grep DMAR
[ 0.000000] ACPI: DMAR 00000000cf34ff18 00080 (v01 INTEL CP_DALE 00000001 INTL 00000001)
[ 0.017156] DMAR: Host address width 36
[ 0.017162] DMAR: DRHD base: 0x000000fed90000 flags: 0x0
[ 0.017171] DMAR: DRHD base: 0x000000fed93000 flags: 0x1
[ 0.017180] DMAR: RMRR base: 0x000000cf3d7000 end: 0x000000cf3e6fff
[ 1.492476] DMAR: BIOS has allocated no shadow GTT; disabling IOMMU for graphics
[cerdea-aws]cerdea@xango3$

Thanks for the testing, much appreciated. I intend to apply this alternative solution and upload shortly. Thanks.

quequotion (quequotion) wrote :

so i finally get a computer with an IOMMU switch in BIOS, but turning it "on"
causes ubuntu to go to sleep at boot.

things go wrong between plymouth and lightdm.
the only way out is [SYSRQ]+R,E,I,S,U,B or cutting the power.

Neco (extraordinario007) wrote :

Ubuntu beta2, latest 3.0.2.21 and pre3.0.2.22: still the same problem.

The screen goes black with no login screen; after 30 seconds I am able to switch to a tty.

Solution for now: disable the IOMMU in my GA-990FXA-UD3 motherboard BIOS.

Not a solution at all if I want to continue using PCI passthrough, etc.

Jade Lacosse (sprucegum) wrote :

I believe I'm also suffering from this issue; however, I've found a workaround: boot the kernel with the iommu=pt parameter. From what I can tell, this allows the iommu to function in the presence of an NVidia graphics card.

Changed in fedora:
importance: Unknown → Medium
status: Unknown → Fix Released