Xen 3.1/Gutsy: PCI-DMA: Out of SW-IOMMU space

Bug #162147 reported by Alvin Cura
Affects: xen-meta (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Binary package hint: ubuntu-xen-server

I had been commenting on Bug #135818, however it is marked as fixed. This issue seems related, but is not identical:

//**********//

 Alvin Cura wrote on 2007-11-05: (permalink)

I'm afraid I'm still seeing this bug. Or something similar.

I did not get a kernel panic, but my network died and stayed dead. I was, however, able to reboot.

Nov 4 16:44:43 xen1 -- MARK --
Nov 4 17:04:36 xen1 kernel: [19235.154900] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:00.0
Nov 4 17:04:36 xen1 kernel: [19235.157617] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:00.0
Nov 4 17:04:36 xen1 kernel: [19235.160469] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:00.0
Nov 4 17:04:36 xen1 kernel: [19235.163254] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:00.0
Nov 4 17:04:36 xen1 kernel: [19235.166076] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:00.0
Nov 4 17:04:36 xen1 kernel: [19235.168897] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:00.0
Nov 4 17:04:36 xen1 kernel: [19235.171727] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:00.0
Nov 4 17:04:36 xen1 kernel: [19235.174519] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:00.0
Nov 4 17:04:36 xen1 kernel: [19235.177666] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:00.0

7222 bytes is likely dictated by my ethernet interface being set for jumbo frames:

11: eth0: <BROADCAST,MULTICAST,UP,10000> mtu 7200 qdisc pfifo_fast qlen 1000
    link/ether 00:1b:fc:1f:33:40 brd ff:ff:ff:ff:ff:ff
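
One plausible way to account for the 7222 figure, assuming the driver sizes each receive buffer as the MTU plus a VLAN-tagged Ethernet header (18 bytes) plus the frame check sequence (4 bytes). The constant names below mirror common kernel naming, but the formula itself is an assumption and has not been checked against the driver shipped with this kernel:

```python
# Assumption: per-frame receive buffer = MTU + VLAN Ethernet header + FCS.
VLAN_ETH_HLEN = 18  # 14-byte Ethernet header + 4-byte 802.1Q tag
ETH_FCS_LEN = 4     # trailing frame check sequence (CRC)


def rx_buffer_size(mtu: int) -> int:
    """Return the DMA buffer size implied by an MTU under the assumption above."""
    return mtu + VLAN_ETH_HLEN + ETH_FCS_LEN


print(rx_buffer_size(7200))  # 7222
```

If this accounting holds, every receive buffer the NIC maps at this MTU would request the same 7222 bytes, which matches the repeated identical size in the log.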

this is on the following controller:

03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 01)

I will proceed with some tests of using different NICs to try to isolate the behaviour.
 Alvin Cura wrote 1 seconds ago: (permalink)

More bad news. The problem persists even with a different NIC (also a Realtek, but this one is a PCI card instead of the onboard controller):

syslog:Nov 11 23:22:17 xen1 kernel: [ 8113.207809] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:05.0
syslog:Nov 11 23:22:40 xen1 kernel: [ 8136.691486] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:05.0
syslog:Nov 11 23:23:17 xen1 kernel: [ 8173.209884] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:05.0
syslog:Nov 11 23:24:17 xen1 kernel: [ 8233.214080] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:05.0
syslog:Nov 11 23:24:32 xen1 kernel: [ 8248.385552] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:05.0

03:05.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10)

-----

Further research yielded the following thread in xen-devel:

http://lists.xensource.com/archives/html/xen-devel/2007-09/msg00138.html

However, more bad news: even with swiotlb set at boot time, the problem still occurs.

Frank Abel (frankabel) wrote :

I can confirm this. To me it does not seem related to the NIC at all (https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.22/+bug/135818/comments/7). In my case it seems related to heavy hard disk use: the errors appear after running the command "sudo xen-create-image --dir/home/xen...." on Dom0 to build the first DomU. I am using files as disks. After hitting ENTER, many lines with this message appear on my console:

"
[ 5345.211424] PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:01.0
"

In some lines the 65536 figure changes to 53248, 32768, etc.
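
Since the failing size and device differ between reports in this thread, a tally per device and size can help tell NIC traffic apart from disk I/O. A minimal sketch, assuming the message format quoted in this report; the function name `tally_failures` and the sample list are illustrative only:

```python
import re
from collections import Counter

# Matches the kernel message format quoted in this report.
PATTERN = re.compile(
    r"PCI-DMA: Out of SW-IOMMU space for (\d+) bytes at device (\S+)"
)


def tally_failures(lines):
    """Count failed DMA mappings per (device, size) pair."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            size, device = int(match.group(1)), match.group(2)
            counts[(device, size)] += 1
    return counts


# Sample lines taken from the logs in this thread.
sample = [
    "[ 5345.211424] PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:01.0",
    "[ 5345.302257] (scsi4:A:0:0): data overrun detected in Data-out phase. Tag == 0x2.",
    "Nov 4 17:04:36 xen1 kernel: [19235.154900] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:00.0",
]

for (device, size), count in sorted(tally_failures(sample).items()):
    print(f"{device} {size:>6} bytes x{count}")
```

To run it against a real log, pass an open file object instead of `sample`; the log location varies by system.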

In addition, dmesg shows lines like:

"
[ 5345.302257] (scsi4:A:0:0): data overrun detected in Data-out phase. Tag == 0x2.
[ 5345.302263] (scsi4:A:0:0): Have seen Data Phase. Length = 0. NumSGs = 0.
"

Konstantin Sharlaimov (konstantin-sharlaimov) wrote :

I can confirm this bug. To me it seems related to the IOMMU implementation in the kernel. I experienced it on my company's server whenever disk usage is high, whether in dom0 or domU. This is very disturbing: Xen is not stable at all, and I am forced to run all services on one shared server instead of a VPS solution.

mikmak (mikmak) wrote :

An (apparently known) workaround is to add something like
swiotlb=128 to your dom0 kernel line. This increases the memory allocated to swiotlb and avoids these errors.
I can confirm it works for me.

Mik
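
Before judging whether the workaround helps, it is worth confirming that the parameter actually reached the running dom0 kernel. A minimal sketch that extracts `swiotlb=` from a kernel command line string; the helper name `swiotlb_setting` and the example command line are hypothetical:

```python
def swiotlb_setting(cmdline: str):
    """Return the value of swiotlb= from a kernel command line, or None."""
    for token in cmdline.split():
        if token.startswith("swiotlb="):
            return token.split("=", 1)[1]
    return None


# Hypothetical command line for illustration; on a live dom0, read the
# contents of /proc/cmdline and pass them in instead.
example = "root=/dev/sda1 ro console=tty0 swiotlb=128"
print(swiotlb_setting(example))  # 128
```

A result of `None` means the option is missing from the running kernel's command line, which would point at the boot loader entry rather than at swiotlb itself.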

Konstantin Sharlaimov (konstantin-sharlaimov) wrote :

Unfortunately, the swiotlb setting didn't work for me. It apparently postponed the problem, but after some heavy disk activity the errors showed up again. Is there a way to completely disable SW-IOMMU until this bug (apparently a dom0 kernel bug) is resolved?

Rick (rick2001) wrote :

I can also confirm that swiotlb didn't work for me on my PowerEdge 2850. The system will crash/segfault if I try to do anything big, like create a 40GB Xen VM.

Scott Wimer (scott-wimer) wrote :

Rick,

Does your PowerEdge server have a PERC controller?

I'm getting this on Ubuntu 8.04 on a Dell 2950 that has the PERC 5/i controller.

This bug is happening now that I have taken the megaraid scsi driver code from RHEL 5 and modified it to compile in Ubuntu 8.04. With the Ubuntu driver the Xen dom0 would just panic whenever I tried to write a lot to the disk, where "a lot" was dd'ing 100MB from /dev/zero followed by a call to sync.

Trying to get the full stack trace requires a serial console, which for the life of me I can't seem to figure out how to attach to Xen's dom0.

If you're having the problem and you're using the megaraid_sas scsi controller, maybe you'll be able to test out the driver code I'm modifying, assuming I can get it to be stable on my system. At the moment, the box isn't falling over, but I'm getting gobs of these messages:

May 12 18:56:20 phaethon kernel: [ 222.605912] sd 0:2:0:0: [sda] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
May 12 18:56:20 phaethon kernel: [ 222.605917] end_request: I/O error, dev sda, sector 93696298
May 12 18:56:20 phaethon kernel: [ 222.618742] sd 0:2:0:0: [sda] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
May 12 18:56:20 phaethon kernel: [ 222.618747] end_request: I/O error, dev sda, sector 93708338
May 12 18:56:20 phaethon kernel: [ 222.619380] sd 0:2:0:0: [sda] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
May 12 18:56:20 phaethon kernel: [ 222.619384] end_request: I/O error, dev sda, sector 93708858

As well as the PCI-DMA: Out of SW-IOMMU space errors on the console.

Scott

Alessandro Bono (a.bono) wrote :

I have hit the same problem on 64-bit Hardy with a 3ware 9650SE RAID controller.
On the 3ware site I found a (non-optimal) workaround for my controller, and on the Xen site a link to a patch (attached):

http://lists.xensource.com/archives/html/xen-changelog/2008-04/msg00008.html
http://www.3ware.com/kb/article.aspx?id=15345

Nov 12 14:41:14 fico kernel: [610002.809638] PCI-DMA: Out of SW-IOMMU space for 61440 bytes at device 0000:0a:00.0
Nov 12 14:41:14 fico kernel: [610002.809715] 3w-9xxx: scsi0: ERROR: (0x06:0x001C): Failed to map scatter gather list.
Nov 12 14:41:14 fico kernel: [610002.809765] PCI-DMA: Out of SW-IOMMU space for 61440 bytes at device 0000:0a:00.0
Nov 12 14:41:14 fico kernel: [610002.809821] 3w-9xxx: scsi0: ERROR: (0x06:0x001C): Failed to map scatter gather list.
Nov 12 14:41:14 fico kernel: [610002.809879] PCI-DMA: Out of SW-IOMMU space for 61440 bytes at device 0000:0a:00.0
Nov 12 14:41:14 fico kernel: [610002.809937] 3w-9xxx: scsi0: ERROR: (0x06:0x001C): Failed to map scatter gather list.
Nov 12 14:41:14 fico kernel: [610002.809988] PCI-DMA: Out of SW-IOMMU space for 61440 bytes at device 0000:0a:00.0
Nov 12 14:41:14 fico kernel: [610002.810037] 3w-9xxx: scsi0: ERROR: (0x06:0x001C): Failed to map scatter gather list.
Nov 12 14:41:14 fico kernel: [610002.810085] PCI-DMA: Out of SW-IOMMU space for 61440 bytes at device 0000:0a:00.0
Nov 12 14:41:14 fico kernel: [610002.810134] 3w-9xxx: scsi0: ERROR: (0x06:0x001C): Failed to map scatter gather list.
Nov 12 14:41:14 fico kernel: [610002.810182] PCI-DMA: Out of SW-IOMMU space for 61440 bytes at device 0000:0a:00.0
Nov 12 14:41:14 fico kernel: [610002.810231] 3w-9xxx: scsi0: ERROR: (0x06:0x001C): Failed to map scatter gather list.
Nov 12 14:41:14 fico kernel: [610002.810278] PCI-DMA: Out of SW-IOMMU space for 61440 bytes at device 0000:0a:00.0
Nov 12 14:41:14 fico kernel: [610002.810327] 3w-9xxx: scsi0: ERROR: (0x06:0x001C): Failed to map scatter gather list.
Nov 12 14:41:14 fico kernel: [610002.810374] PCI-DMA: Out of SW-IOMMU space for 61440 bytes at device 0000:0a:00.0
Nov 12 14:41:14 fico kernel: [610002.810434] 3w-9xxx: scsi0: ERROR: (0x06:0x001C): Failed to map scatter gather list.
Nov 12 14:41:14 fico kernel: [610002.810485] PCI-DMA: Out of SW-IOMMU space for 61440 bytes at device 0000:0a:00.0
Nov 12 14:41:14 fico kernel: [610002.810534] 3w-9xxx: scsi0: ERROR: (0x06:0x001C): Failed to map scatter gather list.
Nov 12 14:41:14 fico kernel: [610002.810581] PCI-DMA: Out of SW-IOMMU space for 61440 bytes at device 0000:0a:00.0
Nov 12 14:41:14 fico kernel: [610002.810630] 3w-9xxx: scsi0: ERROR: (0x06:0x001C): Failed to map scatter gather list.
Nov 12 14:41:14 fico kernel: [610002.810677] PCI-DMA: Out of SW-IOMMU space for 61440 bytes at device 0000:0a:00.0
Nov 12 14:41:14 fico kernel: [610002.810725] 3w-9xxx: scsi0: ERROR: (0x06:0x001C): Failed to map scatter gather list.
Nov 12 14:41:14 fico kernel: [610002.810772] PCI-DMA: Out of SW-IOMMU space for 61440 bytes at device 0000:0a:00.0
Nov 12 14:41:14 fico kernel: [610002.810820] 3w-9xxx: scsi0: ERROR: (0x06:0x001C): Failed to map scatter gather list.
Nov 12 14:41:14 fico kernel: [610002.810827] sd 0:0:0:0: [sda] Result: hostbyte=DID_...
