Xen 3.1/Gutsy: PCI-DMA: Out of SW-IOMMU space

Bug #162147 reported by Alvin Cura
Affects            Status  Importance  Assigned to  Milestone
xen-meta (Ubuntu)  New     Undecided   Unassigned

Bug Description

Binary package hint: ubuntu-xen-server

I had been commenting on Bug #135818, however it is marked as fixed. This issue seems related, but is not identical:

//**********//

 Alvin Cura wrote on 2007-11-05: (permalink)

I'm afraid I'm still seeing this bug. Or something similar.

I did not get a kernel panic, but my network died and stayed dead. I was, however, able to reboot.

Nov 4 16:44:43 xen1 -- MARK --
Nov 4 17:04:36 xen1 kernel: [19235.154900] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:00.0
Nov 4 17:04:36 xen1 kernel: [19235.157617] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:00.0
Nov 4 17:04:36 xen1 kernel: [19235.160469] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:00.0
Nov 4 17:04:36 xen1 kernel: [19235.163254] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:00.0
Nov 4 17:04:36 xen1 kernel: [19235.166076] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:00.0
Nov 4 17:04:36 xen1 kernel: [19235.168897] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:00.0
Nov 4 17:04:36 xen1 kernel: [19235.171727] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:00.0
Nov 4 17:04:36 xen1 kernel: [19235.174519] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:00.0
Nov 4 17:04:36 xen1 kernel: [19235.177666] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:00.0

7222 bytes is likely dictated by my ethernet interface being set for jumbo frames:

11: eth0: <BROADCAST,MULTICAST,UP,10000> mtu 7200 qdisc pfifo_fast qlen 1000
    link/ether 00:1b:fc:1f:33:40 brd ff:ff:ff:ff:ff:ff
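
The 22-byte gap between the MTU (7200) and the failing mapping size (7222) can be checked with a quick calculation. The sketch below assumes the extra bytes are the Ethernet header (14), one VLAN tag (4) and the frame check sequence (4). That breakdown is an assumption for illustration; the logs themselves only give the total.

```python
# Assumed per-frame overhead on top of the MTU (not stated in the logs).
ETH_HEADER = 14  # destination MAC + source MAC + EtherType
VLAN_TAG = 4     # one 802.1Q tag
ETH_FCS = 4      # frame check sequence


def max_frame_bytes(mtu: int) -> int:
    """Return the largest frame size for a given MTU under the assumptions above."""
    return mtu + ETH_HEADER + VLAN_TAG + ETH_FCS


print(max_frame_bytes(7200))  # 7222, the size reported in the log messages
```

If that assumption holds, every failed mapping corresponds to one full-size receive or transmit buffer for a jumbo frame.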

this is on the following controller:

03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 01)

I will proceed with some tests of using different NICs to try to isolate the behaviour.
 Alvin Cura wrote 1 seconds ago: (permalink)

More bad news. Problem is persisting, even with a different NIC (although also a Realtek, this one a PCI card instead of the onboard):

syslog:Nov 11 23:22:17 xen1 kernel: [ 8113.207809] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:05.0
syslog:Nov 11 23:22:40 xen1 kernel: [ 8136.691486] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:05.0
syslog:Nov 11 23:23:17 xen1 kernel: [ 8173.209884] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:05.0
syslog:Nov 11 23:24:17 xen1 kernel: [ 8233.214080] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:05.0
syslog:Nov 11 23:24:32 xen1 kernel: [ 8248.385552] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:05.0

03:05.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10)

-----

Further research yielded the following thread in xen-devel:

http://lists.xensource.com/archives/html/xen-devel/2007-09/msg00138.html

However, more bad news: even with swiotlb set at boot time, the problem still occurs.

Frank Abel (frankabel) wrote :

I can confirm this. To me it does not seem related to the NIC at all (https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.22/+bug/135818/comments/7). In my case it seems related to intensive hard disk use: the errors appear after running the command "sudo xen-create-image --dir/home/xen...." on Dom0 to try to build the first DomU. I am using files as disks. After hitting ENTER, many lines with this message appear on my console:

"
[ 5345.211424] PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:01.0
"

In some lines the number 65536 changes to 53248, 32768, etc.

In addition, lines like these appear in dmesg:
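
Since the failing sizes and devices differ between reports, it can help to tally them from a saved log before comparing notes. The following is a minimal sketch that only assumes the message format quoted in this report; the sample lines are taken from the logs above, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches the kernel message format quoted in this report.
PATTERN = re.compile(
    r"PCI-DMA: Out of SW-IOMMU space for (\d+) bytes at device ([0-9a-fA-F:.]+)"
)


def tally_failures(lines):
    """Count failed mappings per (device, size in bytes)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match.group(2), int(match.group(1)))] += 1
    return counts


sample = [
    "[19235.154900] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:00.0",
    "[19235.157617] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:00.0",
    "[ 8113.207809] PCI-DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:03:05.0",
]
for (device, size), count in sorted(tally_failures(sample).items()):
    print(device, size, count)
```

To run it on a real log, pass an open file object (for example `open("/var/log/syslog")`) instead of `sample`.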

"
[ 5345.302257] (scsi4:A:0:0): data overrun detected in Data-out phase. Tag == 0x2.
[ 5345.302263] (scsi4:A:0:0): Have seen Data Phase. Length = 0. NumSGs = 0.
"

Konstantin Sharlaimov (konstantin-sharlaimov) wrote :

I can confirm this bug. To me it seems related to the IOMMU implementation in the kernel. I experienced it on my company's server whenever disk usage is high, whether in dom0 or domU. This is very disturbing: Xen is not stable at all, and I am forced to cope with one server shared by all services instead of a VPS solution.

mikmak (mikmak) wrote :

An (apparently known) workaround is to add something like swiotlb=128 to your dom0 kernel line. This increases the memory allocated to swiotlb and avoids these errors.
I can confirm it works for me.
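
To check whether the running dom0 kernel was actually booted with the option, one can look for it in /proc/cmdline. The sketch below is a minimal example: the sample command line in it is hypothetical, and the unit of the number depends on the kernel in use, which this thread does not state.

```python
def swiotlb_option(cmdline: str):
    """Return the value of the swiotlb= boot parameter, or None if absent."""
    prefix = "swiotlb="
    for token in cmdline.split():
        if token.startswith(prefix):
            return token[len(prefix):]
    return None


# Hypothetical dom0 kernel command line, for illustration only.
example = "root=/dev/sda1 ro console=tty0 swiotlb=128"
print(swiotlb_option(example))

# On a live system, read the real command line instead:
# with open("/proc/cmdline") as f:
#     print(swiotlb_option(f.read()))
```

A result of None after a reboot would suggest the parameter was added to the wrong line of the boot loader configuration.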

Mik

Konstantin Sharlaimov (konstantin-sharlaimov) wrote :

Unfortunately, the swiotlb setting didn't work for me. It apparently postponed the problem, but after some heavy disk activity the errors showed up again. Is there a way to completely disable SW-IOMMU until this bug (apparently a dom0 kernel bug) is completely resolved?

Rick (rick2001) wrote :

I can also confirm that swiotlb didn't work for me on my PowerEdge 2850. The system will crash/segfault if I try to do anything big, like create a 40GB Xen VM.

Scott Wimer (scott-wimer) wrote :

Rick,

Does your PowerEdge server have a PERC controller?

I'm getting this on Ubuntu 8.04 on a Dell 2950 that has the PERC 5/i controller.

This bug is happening now that I have taken the megaraid SCSI driver code from RHEL 5 and modified it to compile on Ubuntu 8.04. With the Ubuntu driver, the Xen dom0 would just panic whenever I tried to write a lot to the disk, where "a lot" means dd'ing 100MB from /dev/zero followed by a call to sync.
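
For anyone who wants to script that kind of write load, here is a rough sketch of the same idea: write a block of zeros to a file, then force everything to disk. It is not the exact dd invocation used above; the function name, the chunk size and the output path in the usage note are placeholders.

```python
import os


def write_zeros_and_sync(path: str,
                         total_bytes: int = 100 * 1024 * 1024,
                         chunk_bytes: int = 1024 * 1024) -> int:
    """Write total_bytes of zeros to path, then flush all buffers to disk."""
    chunk = bytes(chunk_bytes)  # a buffer of zero bytes
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            f.write(chunk[:n])
            written += n
    os.sync()  # ask the kernel to write out all dirty buffers, like the sync command
    return written
```

Calling `write_zeros_and_sync("/tmp/zeros.bin")` (a hypothetical path) writes the default 100MB. Only run it on a machine you can afford to have panic.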

Trying to get the full stack trace requires a serial console, which for the life of me I can't seem to figure out how to attach to Xen's dom0.

If you're having the problem and you're using the megaraid_sas SCSI controller, maybe you'll be able to test out the driver code I'm modifying, assuming I can get it to be stable on my system. At the moment, the box isn't falling over, but I'm getting gobs of these messages:

May 12 18:56:20 phaethon kernel: [ 222.605912] sd 0:2:0:0: [sda] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
May 12 18:56:20 phaethon kernel: [ 222.605917] end_request: I/O error, dev sda, sector 93696298
May 12 18:56:20 phaethon kernel: [ 222.618742] sd 0:2:0:0: [sda] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
May 12 18:56:20 phaethon kernel: [ 222.618747] end_request: I/O error, dev sda, sector 93708338
May 12 18:56:20 phaethon kernel: [ 222.619380] sd 0:2:0:0: [sda] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
May 12 18:56:20 phaethon kernel: [ 222.619384] end_request: I/O error, dev sda, sector 93708858

As well as the PCI-DMA: Out of SW-IOMMU space errors on the console.

Scott

Alessandro Bono (a.bono) wrote :

I have hit the same problem on a 64-bit Hardy system with a 3ware 9650SE RAID controller.
On the 3ware site I found a (non-optimal) workaround for my controller, and a link to a patch on the Xen site (attached):

http://lists.xensource.com/archives/html/xen-changelog/2008-04/msg00008.html
http://www.3ware.com/kb/article.aspx?id=15345

Nov 12 14:41:14 fico kernel: [610002.809638] PCI-DMA: Out of SW-IOMMU space for 61440 bytes at device 0000:0a:00.0
Nov 12 14:41:14 fico kernel: [610002.809715] 3w-9xxx: scsi0: ERROR: (0x06:0x001C): Failed to map scatter gather list.
Nov 12 14:41:14 fico kernel: [610002.809765] PCI-DMA: Out of SW-IOMMU space for 61440 bytes at device 0000:0a:00.0
Nov 12 14:41:14 fico kernel: [610002.809821] 3w-9xxx: scsi0: ERROR: (0x06:0x001C): Failed to map scatter gather list.
Nov 12 14:41:14 fico kernel: [610002.809879] PCI-DMA: Out of SW-IOMMU space for 61440 bytes at device 0000:0a:00.0
Nov 12 14:41:14 fico kernel: [610002.809937] 3w-9xxx: scsi0: ERROR: (0x06:0x001C): Failed to map scatter gather list.
Nov 12 14:41:14 fico kernel: [610002.809988] PCI-DMA: Out of SW-IOMMU space for 61440 bytes at device 0000:0a:00.0
Nov 12 14:41:14 fico kernel: [610002.810037] 3w-9xxx: scsi0: ERROR: (0x06:0x001C): Failed to map scatter gather list.
Nov 12 14:41:14 fico kernel: [610002.810085] PCI-DMA: Out of SW-IOMMU space for 61440 bytes at device 0000:0a:00.0
Nov 12 14:41:14 fico kernel: [610002.810134] 3w-9xxx: scsi0: ERROR: (0x06:0x001C): Failed to map scatter gather list.
Nov 12 14:41:14 fico kernel: [610002.810182] PCI-DMA: Out of SW-IOMMU space for 61440 bytes at device 0000:0a:00.0
Nov 12 14:41:14 fico kernel: [610002.810231] 3w-9xxx: scsi0: ERROR: (0x06:0x001C): Failed to map scatter gather list.
Nov 12 14:41:14 fico kernel: [610002.810278] PCI-DMA: Out of SW-IOMMU space for 61440 bytes at device 0000:0a:00.0
Nov 12 14:41:14 fico kernel: [610002.810327] 3w-9xxx: scsi0: ERROR: (0x06:0x001C): Failed to map scatter gather list.
Nov 12 14:41:14 fico kernel: [610002.810374] PCI-DMA: Out of SW-IOMMU space for 61440 bytes at device 0000:0a:00.0
Nov 12 14:41:14 fico kernel: [610002.810434] 3w-9xxx: scsi0: ERROR: (0x06:0x001C): Failed to map scatter gather list.
Nov 12 14:41:14 fico kernel: [610002.810485] PCI-DMA: Out of SW-IOMMU space for 61440 bytes at device 0000:0a:00.0
Nov 12 14:41:14 fico kernel: [610002.810534] 3w-9xxx: scsi0: ERROR: (0x06:0x001C): Failed to map scatter gather list.
Nov 12 14:41:14 fico kernel: [610002.810581] PCI-DMA: Out of SW-IOMMU space for 61440 bytes at device 0000:0a:00.0
Nov 12 14:41:14 fico kernel: [610002.810630] 3w-9xxx: scsi0: ERROR: (0x06:0x001C): Failed to map scatter gather list.
Nov 12 14:41:14 fico kernel: [610002.810677] PCI-DMA: Out of SW-IOMMU space for 61440 bytes at device 0000:0a:00.0
Nov 12 14:41:14 fico kernel: [610002.810725] 3w-9xxx: scsi0: ERROR: (0x06:0x001C): Failed to map scatter gather list.
Nov 12 14:41:14 fico kernel: [610002.810772] PCI-DMA: Out of SW-IOMMU space for 61440 bytes at device 0000:0a:00.0
Nov 12 14:41:14 fico kernel: [610002.810820] 3w-9xxx: scsi0: ERROR: (0x06:0x001C): Failed to map scatter gather list.
Nov 12 14:41:14 fico kernel: [610002.810827] sd 0:0:0:0: [sda] Result: hostbyte=DID_...
