Bug #709245 “ARM SMP scheduler performance bug” : Bugs : Linaro Linux

Revision history for this message

Dr. David Alan Gilbert (davidgil-uk) wrote on 2011-03-03:

#1

Slower on 2.6.38-1000:

/dev/sda:
Timing cached reads: 436 MB in 2.00 seconds = 218.03 MB/sec
Timing buffered disk reads: 20 MB in 3.23 seconds = 6.19 MB/sec

Mounir Bsaibes (mounir-bsaibes) on 2011-03-21

Changed in linux-linaro:
importance:	Undecided → Medium

Mounir Bsaibes (mounir-bsaibes) on 2011-03-24

Changed in linux-linaro:
assignee:	nobody → Mian Yousaf Kaukab (mian-yousaf-kaukab)

Revision history for this message

Mian Yousaf Kaukab (mian-yousaf-kaukab) wrote on 2011-03-29:

#2

In Omap3xxx EHCI functional clock “hs_fck” is clocked at 120MHz. Where as rate for this clock has been reduced to 60MHz for Omap4xxx. This can be a potential reason for the reduced USB bandwidth available on omap4 platforms. How the reduced clock frequency effects the bandwidth, if at all, depends upon the way this clock is used inside EHCI core.

Can someone from TI take a look at this and confirm if the reduced USB bandwidth on Omap4 as compared to Omap3 is expected behavior?

Thanks,
Mian Yousaf

Revision history for this message

Mian Yousaf Kaukab (mian-yousaf-kaukab) wrote on 2011-03-29:

#3

Following patch can be used to log the EHCI clock frequencies for Omap3 and Omap4 platforms.

diff --git a/drivers/usb/host/ehci-omap.c b/drivers/usb/host/ehci-omap.c
index f784ceb..a886e82 100644
--- a/drivers/usb/host/ehci-omap.c
+++ b/drivers/usb/host/ehci-omap.c
@@ -325,6 +325,8 @@ static int omap_start_ehc(struct ehci_hcd_omap *omap, struct usb_hcd *hcd)
                goto err_host_ick;
        }
        clk_enable(omap->usbhost_ick);
+ printk(KERN_INFO"ehci clock usbhost_ick rate = %lu Hz\n",
+ clk_get_rate(omap->usbhost_ick));

        omap->usbhost_hs_fck = clk_get(omap->dev, "hs_fck");
        if (IS_ERR(omap->usbhost_hs_fck)) {
@@ -332,6 +334,8 @@ static int omap_start_ehc(struct ehci_hcd_omap *omap, struct usb_hcd *hcd)
                goto err_host_120m_fck;
        }
        clk_enable(omap->usbhost_hs_fck);
+ printk(KERN_INFO"ehci clock hs_fck rate = %lu Hz\n",
+ clk_get_rate(omap->usbhost_hs_fck));

        omap->usbhost_fs_fck = clk_get(omap->dev, "fs_fck");
        if (IS_ERR(omap->usbhost_fs_fck)) {
@@ -339,6 +343,8 @@ static int omap_start_ehc(struct ehci_hcd_omap *omap, struct usb_hcd *hcd)
                goto err_host_48m_fck;
        }
        clk_enable(omap->usbhost_fs_fck);
+ printk(KERN_INFO"ehci clock fs_fck rate = %lu Hz\n",
+ clk_get_rate(omap->usbhost_fs_fck));

if (omap->phy_reset) {
/* Refer: ISSUE1 */

Jamie Bennett (jamiebennett) on 2011-04-08

Changed in linux-linaro:
status:	New → Confirmed

Revision history for this message

warmcat (andy-warmcat) wrote on 2011-06-05: Re: panda: USB disk IO slow

#5

I meddled around with this last week via the Fedora ARM list, it was simple to reproduce that pinging the Panda Ethernet from another machine on the local network with -i0.001 yielded an effective 50% increase in USB storage throughput at the Panda compared to it being left idle.

Neither RUNTIME_PM (suggested by Felipe Balbi at TI) nor USB_SUSPEND were to blame since the issue stayed the same with those deconfigured.

It seems the issue may be to do with the Panda's hub / ethernet chip idle / wakeup characteristics but since there are no hub registers in there settable from Panda side AFAIK (there are some settable in the missing EEPROM that can configure the hub / Ethernet bridge) I am out of ideas to follow it up.

Revision history for this message

mahmoh (mahmoh) wrote on 2011-07-19:

#6

Download full text (3.9 KiB)

Ran a quick benchmark with and without ping -f and 20x performance increase with ping vs. without (below); I'd like to see this bumped up to High since this poor performance is impacting testing using these boards. I also came across a similar problem on another board that may help with a fix (or not - https://lists.yoctoproject.org/pipermail/poky/2011-June/006649.html).

Linux panda-qa1 2.6.38-1309-omap4 #14-Ubuntu SMP PREEMPT Wed Jul 6 09:42:21 UTC 2011 armv7l armv7l armv7l GNU/Linux

Description: Ubuntu oneiric (development branch)
Release: 11.10

# phoronix-test-suite run pts/tiobench

Phoronix Test Suite v3.2.0
Threaded I/O Tester Test Configuration

Test:

1: Write
2: Random Write
3: Read
4: Random Read
5: Test All Options

Enter Your Choice: 5

Size Per Thread:

1: 32MB
2: 64MB
3: 128MB
4: 256MB
5: Test All Options

Enter Your Choice: 1

Thread Count:

1: 4
2: 8
3: 16
4: 32
5: Test All Options

Enter Your Choice: 1
Would you like to save these test results (Y/n): n

Estimated Run-Time: 2 Hours, 31 Minutes

Threaded I/O Tester:
    pts/tiobench-1.1.0 [Test: Write - Size Per Thread: 32MB - Thread Count: 4]
    Test 1 of 4
    Estimated Time Remaining: 2 Hours, 31 Minutes
    Estimated Test Run-Time: 38 Minutes
    Expected Trial Run Count: 4
        Started Run 1 @ 17:28:21
        Started Run 2 @ 17:28:45
        Started Run 3 @ 17:29:09
        Started Run 4 @ 17:29:32 [Std. Dev: 2.16%]

    Test Results:
        6.273
        6.195
        6.388
        6.508

Average: 6.34 MB/s

Threaded I/O Tester:
    pts/tiobench-1.1.0 [Test: Random Write - Size Per Thread: 32MB - Thread Count: 4]
    Test 2 of 4
    Estimated Time Remaining: 1 Hour, 53 Minutes
    Estimated Test Run-Time: 38 Minutes
    Expected Trial Run Count: 4
        Started Run 1 @ 17:29:59
        Started Run 2 @ 17:31:40
        Started Run 3 @ 17:33:08

        Started Run 4 @ 17:33:26 [Std. Dev: 104.51%]
        Started Run 5 @ 17:33:40 [Std. Dev: 84.05%]
        Started Run 6 @ 17:33:55 [Std. Dev: 71.87%]
        Started Run 7 @ 17:34:09 [Std. Dev: 63.69%]
        Started Run 8 @ 17:34:24 [Std. Dev: 61.29%]

    Test Results:
        0.206
        0.241
        4.493
        4.476
        4.623
        4.461
        4.534
        2.534

Average: 3.20 MB/s

^C
# uname -a
Linux panda-qa1 2.6.38-1309-omap4 #14-Ubuntu SMP PREEMPT Wed Jul 6 09:42:21 UTC 2011 armv7l armv7l armv7l GNU/Linux
root@panda-qa1:/home/u# phoronix-test-suite run pts/tiobench

Phoronix Test Suite v3.2.0
Threaded I/O Tester Test Configuration

Test:

1: Write
2: Random Write
3: Read
4: Random Read
5: Test All Options

Enter Your Choice: 5

Size Per Thread:

1: 32MB
2: 64MB
3: 128MB
4: 256MB
5: Test All Options

Enter Your Choice: 1

Thread Count:

1: 4
2: 8
3: 16
4: 32
5: Test All Options

Enter Your Choice: 1
Would you like to save these test results (Y/n): n

Estimated Run-Time: 2 Hours, 9 Minutes

Threaded I/O Tester:
    pts/tiobench-1.1.0 [Test: Write - Size Per Thread: 32MB - Thread Count: 4]
    Test 1 of 4
    Estimated Time Remaining: 2 Hours, 9 Minutes
    Estimated Test Run-Time: 33 Minutes
    Expected Trial Run Count: 4
     ...

Ran a quick benchmark with and without ping -f and 20x performance increase with ping vs. without (below); I'd like to see this bumped up to High since this poor performance is impacting testing using these boards.  I also came across a similar problem on another board that may help with a fix (or not - https://lists.yoctoproject.org/pipermail/poky/2011-June/006649.html).

Linux panda-qa1 2.6.38-1309-omap4 #14-Ubuntu SMP PREEMPT Wed Jul 6 09:42:21 UTC 2011 armv7l armv7l armv7l GNU/Linux

Description:    Ubuntu oneiric (development branch)
Release:        11.10

# phoronix-test-suite run pts/tiobench

Phoronix Test Suite v3.2.0
Threaded I/O Tester Test Configuration

Test:

1: Write
2: Random Write
3: Read
4: Random Read
5: Test All Options

Enter Your Choice: 5

Size Per Thread:

1: 32MB
2: 64MB
3: 128MB
4: 256MB
5: Test All Options

Enter Your Choice: 1

Thread Count:

1: 4
2: 8
3: 16
4: 32
5: Test All Options

Enter Your Choice: 1
    Would you like to save these test results (Y/n): n

Estimated Run-Time: 2 Hours, 31 Minutes

Threaded I/O Tester:
    pts/tiobench-1.1.0 [Test: Write - Size Per Thread: 32MB - Thread Count: 4]
    Test 1 of 4
    Estimated Time Remaining: 2 Hours, 31 Minutes
    Estimated Test Run-Time: 38 Minutes
    Expected Trial Run Count: 4
        Started Run 1 @ 17:28:21
        Started Run 2 @ 17:28:45
        Started Run 3 @ 17:29:09
        Started Run 4 @ 17:29:32  [Std. Dev: 2.16%]

Test Results:
        6.273
        6.195
        6.388
        6.508

Average: 6.34 MB/s

Threaded I/O Tester:
    pts/tiobench-1.1.0 [Test: Random Write - Size Per Thread: 32MB - Thread Count: 4]
    Test 2 of 4
    Estimated Time Remaining: 1 Hour, 53 Minutes
    Estimated Test Run-Time: 38 Minutes
    Expected Trial Run Count: 4
        Started Run 1 @ 17:29:59
        Started Run 2 @ 17:31:40
        Started Run 3 @ 17:33:08

Started Run 4 @ 17:33:26  [Std. Dev: 104.51%]
        Started Run 5 @ 17:33:40  [Std. Dev: 84.05%]
        Started Run 6 @ 17:33:55  [Std. Dev: 71.87%]
        Started Run 7 @ 17:34:09  [Std. Dev: 63.69%]
        Started Run 8 @ 17:34:24  [Std. Dev: 61.29%]

Test Results:
        0.206
        0.241
        4.493
        4.476
        4.623
        4.461
        4.534
        2.534

Average: 3.20 MB/s

^C
# uname -a
Linux panda-qa1 2.6.38-1309-omap4 #14-Ubuntu SMP PREEMPT Wed Jul 6 09:42:21 UTC 2011 armv7l armv7l armv7l GNU/Linux
root@panda-qa1:/home/u# phoronix-test-suite run pts/tiobench

Phoronix Test Suite v3.2.0
Threaded I/O Tester Test Configuration

Test:

1: Write
2: Random Write
3: Read
4: Random Read
5: Test All Options

Enter Your Choice: 5

Size Per Thread:

1: 32MB
2: 64MB
3: 128MB
4: 256MB
5: Test All Options

Enter Your Choice: 1

Thread Count:

1: 4
2: 8
3: 16
4: 32
5: Test All Options

Enter Your Choice: 1
    Would you like to save these test results (Y/n): n

Estimated Run-Time: 2 Hours, 9 Minutes

Threaded I/O Tester:
    pts/tiobench-1.1.0 [Test: Write - Size Per Thread: 32MB - Thread Count: 4]
    Test 1 of 4
    Estimated Time Remaining: 2 Hours, 9 Minutes
    Estimated Test Run-Time: 33 Minutes
    Expected Trial Run Count: 4
        Started Run 1 @ 17:46:16
        Started Run 2 @ 17:46:26
        Started Run 3 @ 17:46:36
        Started Run 4 @ 17:46:46  [Std. Dev: 0.58%]

Test Results:
        19.276
        19.526
        19.311
        19.323

Average: 19.36 MB/s

Threaded I/O Tester:
    pts/tiobench-1.1.0 [Test: Random Write - Size Per Thread: 32MB - Thread Count: 4]
    Test 2 of 4
    Estimated Time Remaining: 1 Hour, 37 Minutes
    Estimated Test Run-Time: 33 Minutes
    Expected Trial Run Count: 4
        Started Run 1 @ 17:47:00
        Started Run 2 @ 17:47:14
        Started Run 3 @ 17:47:28
        Started Run 4 @ 17:47:42  [Std. Dev: 1.58%]

Test Results:
        4.905
        4.72
        4.815
        4.799

Average: 4.81 MB/s

Tobin Davis (gruemaster) on 2011-07-19

Changed in linux-ti-omap4 (Ubuntu):
importance:	Undecided → High
status:	New → Confirmed

Revision history for this message

Tobin Davis (gruemaster) wrote on 2011-07-19:

#7

bonnie.html Edit (4.0 KiB, text/html)

I have been able to reproduce this bug using bonnie++. The process I used was to install Oneiric using the netboot installer to an external USB SATA drive and boot with that as my root filesystem (SD only has bootloaders & kernel).

Steps to reproduce on a running system:

On panda (no ping):
sudo apt-get install bonnie++
bonnie++ -n 0 -m "USB HD" -q 1>>bonnie.csv

After this runs, on a different system on the same network, start "sudo ping -i 0.001 panda" (assuming the panda is in the host lookup), then on the panda run:
bonnie++ -n 0 -m "USB HD with ping" -q 1>>bonnie.csv

This will produce the results attached (bonnie.csv). Bonnie.html is an html formatted version of the output (with non-run tests removed).

Revision history for this message

Tobin Davis (gruemaster) wrote on 2011-07-20:

#8

bonnie-maverick.html Edit (4.0 KiB, text/html)

I was able to reproduce this in Maverick as well.

Revision history for this message

Ricardo Salveti (rsalveti) wrote on 2011-07-21:

#9

Would be good to check if you're able to reproduce this issue with the 3.0 based kernel.

Changed in linaro-ubuntu:
status:	New → Confirmed
importance:	Undecided → High

Revision history for this message

mahmoh (mahmoh) wrote on 2011-07-21:

#10

Appears to still be a problem with 3.0.0-1:
...
Threaded I/O Tester:
    pts/tiobench-1.1.0 [Test: Write - Size Per Thread: 32MB - Thread Count: 4]
    Test 1 of 4
    Estimated Time Remaining: 4 Hours, 57 Minutes
    Estimated Test Run-Time: 28 Minutes
    Expected Trial Run Count: 4
        Started Run 1 @ 16:35:46
        Started Run 2 @ 16:36:28
<PING-START>
        Started Run 3 @ 16:37:06
        Started Run 4 @ 16:37:17 [Std. Dev: 73.18%]
        Started Run 5 @ 16:37:28 [Std. Dev: 61.73%]
        Started Run 6 @ 16:37:39 [Std. Dev: 54.21%]
<PING-STOP>
        Started Run 7 @ 16:37:50 [Std. Dev: 63.24%]
        Started Run 8 @ 16:38:31 [Std. Dev: 68.34%]

    Test Results:
        3.686
        3.746
        16.491
        16.656
        16.796
        16.874
        3.349
        3.933

Average: 10.19 MB/s

^C
# uname -a
Linux panda-qa1 3.0.0-1-omap4 #0~dd1d9db SMP PREEMPT Wed Jul 20 19:41:32 CEST 2011 armv7l armv7l armv7l GNU/Linux

Ricardo Salveti (rsalveti) on 2011-07-27

Changed in linaro-ubuntu:
milestone:	none → 11.07

Tobin Davis (gruemaster) on 2011-07-29

tags:

added: armel kernel-bug

Revision history for this message

Jani Monoses (jani) wrote on 2011-08-01:

#11

I see the same behaviour without ethernet plugged in, and pinging the panda via wlan0. Is that chip too on the same USB controller?

5x speed increase on write when pinging with -i.0001

jani@panda:~$ time dd if=/dev/zero of=alma bs=4M count=1024
1024+0 records in
1024+0 records out
4294967296 bytes (4.3 GB) copied, 1451.47 s, 3.0 MB/s

real 25m7.935s
user 0m0.023s
sys 1m24.266s

jani@panda:~$ time dd if=/dev/zero of=alma2 bs=4M count=1024
1024+0 records in
1024+0 records out
4294967296 bytes (4.3 GB) copied, 267.908 s, 16.0 MB/s

real 4m30.361s
user 0m0.016s
sys 1m15.844s

Revision history for this message

warmcat (andy-warmcat) wrote on 2011-08-01:

#12

Hi Jani -

No, Panda WLAN is on a 4-bit SDIO interface. So it's completely different from ping on Ethernet, I don't know what that means that it still changes USB throughput.

Also when I reproduced this via pinging ethernet, I only saw a 50% increase in throughput; maybe it's to do with the block size.

Revision history for this message

Michael Sparmann (theseven) wrote on 2011-08-01:

#13

Guess what, this issue indeed seems to be completely unrelated to USB.
Something similar is going on for SD card access (see Bug #787246), and accessing LAN/WLAN has only an indirect influence.

Look at this: http://paste.ubuntu.com/656410/
Seems like the CPU entering sleep states or something similar is the culprit?

Revision history for this message

Nicolas Pitre (npitre) wrote on 2011-08-01: Re: [Bug 709245] Re: panda: USB disk IO slow

#14

On Mon, 1 Aug 2011, Michael Sparmann wrote:

> Guess what, this issue indeed seems to be completely unrelated to USB.
> Something similar is going on for SD card access (see Bug #787246), and accessing LAN/WLAN has only an indirect influence.
>
> Look at this: http://paste.ubuntu.com/656410/
> Seems like the CPU entering sleep states or something similar is the culprit?

Interesting.

To confirm this, please add nohlt to your kernel cmdline string and
retest.

Revision history for this message

Michael Sparmann (theseven) wrote on 2011-08-01: Re: panda: USB disk IO slow

#15

No noticable improvement: http://paste.ubuntu.com/656498/

Revision history for this message

Nicolas Pitre (npitre) wrote on 2011-08-01: Re: [Bug 709245] Re: panda: USB disk IO slow

#16

On Mon, 1 Aug 2011, Michael Sparmann wrote:

> No noticable improvement: http://paste.ubuntu.com/656498/

The nohlt argument prevents the idle power saving code from being called
at all. So if that doesn't help the USB case either than it is unlikely
that anything related to sleep modes.

Revision history for this message

Michael Sparmann (theseven) wrote on 2011-08-01: Re: panda: USB disk IO slow

#17

Further testing points towards thread switching behavior being related:

Nothing running in parallel: <10MB/s at big block sizes
perl -e "while (1) { }": <5MB/s
perl -e "use Time::HiRes qw( usleep ); while (1) { usleep(1) }": 20-25MB/s
perl -e "use Time::HiRes qw( nanosleep ); while (1) { nanosleep(1) }": 5-25MB/s (varying widely over time)
perl -e "use Thread qw( yield ); while (1) { yield }": 10-20MB/s
sudo ping -q -i.001 localhost: 25MB/s (consistently)

Note that the sleep variants only reach ~25% CPU load (as seen in top), while the yield and ping variants use a full core.

Revision history for this message

Ming Lei (tom-leiming) wrote on 2011-08-02:

#18

I have added nohlt and seems no any changes wrt. usb storage performance.

But the following test is interesting, just run a 'busy' in background and increas usb
storage performance triple or 4 times, and killing the 'busy' task will make usb
storage performance poor like before.

see my test below:

root@tom-panda:~# ifconfig
lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:16436 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

root@tom-panda:~# dd if=/dev/sda iflag=direct of=/dev/null bs=128K count=32
32+0 records in
32+0 records out
4194304 bytes (4.2 MB) copied, 5.9715 s, 702 kB/s
root@tom-panda:~# cat /proc/cmdline
ro elevator=noop vram=32M mem=456M@0x80000000 mem=512M@0xA0000000 root=UUID=a957f519-7a09-48d9-
a60d-b15883901edf fixrtc quiet splash console=ttyO2,115200n8 serialtty=ttyO2 earlyprintk
root@tom-panda:~# ./busy&
[1] 1183
root@tom-panda:~# cat busy
#/bin/sh

x=1
while [ true ]
do
x=$(( $x + 1 ))
done
root@tom-panda:~# dd if=/dev/sda iflag=direct of=/dev/null bs=128K count=32
32+0 records in
32+0 records out
4194304 bytes (4.2 MB) copied, 0.771301 s, 5.4 MB/s
root@tom-panda:~# kill -9 1183
root@tom-panda:~# sleep 3
[1]+ Killed ./busy
root@tom-panda:~# dd if=/dev/sda iflag=direct of=/dev/null bs=128K count=32
32+0 records in
32+0 records out
4194304 bytes (4.2 MB) copied, 3.40616 s, 1.2 MB/s

Revision history for this message

Ming Lei (tom-leiming) wrote on 2011-08-02:

#19

Another interesting thing, enabling usbmon will increase performance a lot.

root@tom-panda:~# dd if=/dev/sda iflag=direct of=/dev/null bs=128K count=32
32+0 records in
32+0 records out
4194304 bytes (4.2 MB) copied, 3.4598 s, 1.2 MB/s
root@tom-panda:~# cat /sys/kernel/debug/usb/usbmon/1u > /tmp/usbmon-1u&
[1] 1485
root@tom-panda:~# dd if=/dev/sda iflag=direct of=/dev/null bs=128K count=32
32+0 records in
32+0 records out
4194304 bytes (4.2 MB) copied, 0.795455 s, 5.3 MB/s
root@tom-panda:~# dd if=/dev/sda iflag=direct of=/dev/null bs=128K count=32
32+0 records in
32+0 records out
4194304 bytes (4.2 MB) copied, 0.278628 s, 15.1 MB/s
root@tom-panda:~# killall -9 cat
[1]+ Killed cat /sys/kernel/debug/usb/usbmon/1u > /tmp/usbmon-1u
root@tom-panda:~# dd if=/dev/sda iflag=direct of=/dev/null bs=128K count=32
32+0 records in
32+0 records out
4194304 bytes (4.2 MB) copied, 5.0556 s, 830 kB/s
root@tom-panda:~# dd if=/dev/sda iflag=direct of=/dev/null bs=128K count=32
32+0 records in
32+0 records out
4194304 bytes (4.2 MB) copied, 3.2009 s, 1.3 MB/s

Revision history for this message

Michael Sparmann (theseven) wrote on 2011-08-02:

#20

When running just dd at 1MB block size, I get ~5.5MB/s.

When running with sudo ping -q -i.0100 localhost in the background, ping causes 1% CPU load, and I get ~10MB/s at 1MB block size and these ping statistics:
6228 packets transmitted, 6227 received, 0% packet loss, time 97329ms
rtt min/avg/max/mdev = 0.000/0.053/2.899/0.050 ms

When running with sudo ping -q -i.0099 localhost in the background, ping causes 100% (one core) CPU load, and I get ~20MB/s at 1MB block size and these ping statistics:
10339 packets transmitted, 10337 received, 0% packet loss, time 93032ms
rtt min/avg/max/mdev = 0.000/0.033/0.946/0.031 ms

Now why does ping process about twice as many packets but use 100 times more CPU (and get better response times) when I reduce the delay by just one percent? There's something fishy going on here...

Revision history for this message

warmcat (andy-warmcat) wrote on 2011-08-02:

#21

Maybe the 1ms threshold thing is just to do with a threshold heuristic where it stops using the scheduler and sleeping and instead spins udelay() style, hence the 100% CPU. That's not to write off its effect on the bug just a possible explanation of why it cares about 1ms or less.

Reading these new findings yesterday and today I wonder if it's somehow related to when the ehci stuff schedules the URB. Like if the CPU is totally up it can pack USB transactions to go out immediately, but somehow when "idle", although what exactly is in what kind of idle is unclear, it bumps it to the next frame or so.

Revision history for this message

Marcin Juszkiewicz (hrw) wrote on 2011-08-02:

#22

Can someone check does it happens on other multicore A9 board? Origen or Versatile Express (Snowball lacks usb host port).

Revision history for this message

Michael Sparmann (theseven) wrote on 2011-08-02:

#23

Let me quote from #pandaboard:
<prpplague> TheSeven: the direction the debugging is going is that this is a thread issue, i suspect that the .0099 is just under the thread timing
<prpplague> TheSeven: linaro folks and fedora folks are working on debugging this
<prpplague> TheSeven: it doesn't appear to be panda or omap4 specific
<prpplague> TheSeven: it has been recreated on the snowball and on a tegra platform

Revision history for this message

Michael Sparmann (theseven) wrote on 2011-08-02:

#24

I just tried the 3.0 kernel, and it behaves in exactly the same way.

Revision history for this message

Jani Monoses (jani) wrote on 2011-08-02:

#25

I tested some more today (Oneiric 3.0 kernel) and the external USB drive on a Panda.

With nosmp passed on the boot command line, the throughput is good with or without the ping and about the same as using both cores and pinging it.

Booting with both cores and putting cpu1 offline does not seem to affect the behaviour - pinging leads to 4x or 5x throughput over not-pinging.

So it may be related to SMP scheduling. Also nosmp and putting one core offline at runtime is apparently not the same ( I naively thought it should be)

Ubuntu QA Website (ubuntuqa) on 2011-08-02

tags:

added: iso-testing

Revision history for this message

Michael Sparmann (theseven) wrote on 2011-08-05:

#26

noop test run 1 Edit (3.9 KiB, text/plain)

I ran the tests again, with the noop, cfq and deadline I/O schedulers, and the results differed in interesting ways. See attached logs.

Revision history for this message

Michael Sparmann (theseven) wrote on 2011-08-05:

#27

cfq test run 1 Edit (3.9 KiB, text/plain)

Revision history for this message

Michael Sparmann (theseven) wrote on 2011-08-05:

#28

deadline test run 1 Edit (3.8 KiB, text/plain)

Revision history for this message

Michael Sparmann (theseven) wrote on 2011-08-05:

#29

noop test run 2 Edit (3.9 KiB, text/plain)

Revision history for this message

Michael Sparmann (theseven) wrote on 2011-08-05:

#30

cfq test run 2 Edit (3.8 KiB, text/plain)

Revision history for this message

Michael Sparmann (theseven) wrote on 2011-08-05:

#31

deadline test run 2 Edit (3.8 KiB, text/plain)

David Mandala (davidm) on 2011-08-09

description:	updated
summary:	- panda: USB disk IO slow + ARM SMP scheduler performance bug

Ricardo Salveti (rsalveti) on 2011-08-10

Changed in linaro-ubuntu:
milestone:	11.07 → 11.08

Tobin Davis (gruemaster) on 2011-08-10

Changed in linux-ti-omap4 (Ubuntu Natty):
status:	New → Confirmed
Changed in linux-ti-omap4 (Ubuntu Maverick):
status:	New → Confirmed
importance:	Undecided → High
Changed in linux-ti-omap4 (Ubuntu Natty):
importance:	Undecided → High

Revision history for this message

Stefan Bader (smb) wrote on 2011-08-16:

#32

I tried to gather more data on this and made several runs of "hdparm -Tt" for the good and bad case:
- For the good case (nosmp)
  - cached reads avg. 184.61MB/sec (MIN=181.02 MAX=186.30 STDDEV=1.62)
  - buffered reads avg. 3.64MB/sec (MIN=3.64 MAX=3.65 STDDEV=0)
- For the bad case
  - cached reads avg. 62.00MB/sec (MIN=21.98 MAX=111.55 STDDEV=27.97)
  - buffered reads avg. 1.35MB/sec (MIN=1.09 MAX=1.86 STDDEV=0.20)

So not only the SMP case is slower, there is also a high variation in the numbers. Whatever happens, it is not a linear slowdown.

When trying various options maxcpus=1 has the same effect as nosmp (disabling a core later does not improve things), there was one option (which right now slips my memory) causing a higher rate of timer interrupts which lead to a even worse buffered reads performance (as was disabling the irqbalance daemon). Using nohlt seemed to have no effect.

When looking at the interrupts, the main difference seemed to be the use of gp_timer when booted with nosmp or maxcpus=1 and LOCal timers in the other case. Booting with 2 CPUs and disabling one did not change that. I cannot recall whether the IPIs stopped being incremented completely or only on the other CPU. But LOC definitely was still used for CPU#0.

Revision history for this message

Nicolas Dechesne (ndec) wrote on 2011-08-16:

#33

the thing about the GPTimer is interesting... i believe the kernel has CONFIG_NOHZ defined, that might be interesting to try without that (e.g. with HZ). in fact it would be nice to make a custom kernel that uses GPTimer instead of local timer, but that might be a little more tricky (even though we had that for very early OMAP chips since local timer weren't functional).

Revision history for this message

Paolo Pisati (p-pisati) wrote on 2011-08-17:

#34

dmesg comparison: smp vs nosmp Edit (19.9 KiB, text/plain)

comparing the boot logs of an smp and a nosmp kernel (see attached file), what struck me was an error message about the smp_twd clock:

CPU: Testing write buffer coherency: ok
-twd_timer_setup: no clock found <<<<<
-Calibrating local timer... 492.00MHz. <<<<<
L310 cache controller enabled
l2x0: 16 ways, CACHE_ID 0x410000c4, AUX_CTRL 0x5e470000, Cache size: 1048576 B

looking closer at the code, i found that:

smp_twd.c::twd_timer_setup():

...
    if (twd_clk == NULL) {
        twd_clk = clk_get_sys("smp_twd", NULL);
        if (IS_ERR_OR_NULL(twd_clk))
            pr_warn("%s: no clock found\n", __func__);
    }
...

and clk_get_sys() prototype is:

struct clk *clk_get_sys(const char *dev_id, const char *con_id)

while clocks are defined as:

clock44xx_data.c in struct omap_clk omap44xx_clks[]:
...
CLK(NULL, "smp_twd", &smp_twd, CK_44XX)
...

#define CLK(dev, con, ck, cp) \
    { \
         .cpu = cp, \
        .lk = { \
            .dev_id = dev, \
            .con_id = con, \
            .clk = ck, \
        }, \
    }

so the parameters in clk_get_sys() are inverted:

--- a/arch/arm/kernel/smp_twd.c
+++ b/arch/arm/kernel/smp_twd.c
@@ -175,7 +175,8 @@ static void __cpuinit twd_calibrate_rate(void)
  */
void __cpuinit twd_timer_setup(struct clock_event_device *clk)
{
        if (twd_clk == NULL) {
- twd_clk = clk_get_sys("smp_twd", NULL);
+ twd_clk = clk_get_sys(NULL, "smp_twd");
                if (IS_ERR_OR_NULL(twd_clk))
                        pr_warn("%s: no clock found\n", __func__);

this will make the error go away, but unfortunately won't fix the i/o problem.

hdparm -t /dev/sda1

smp:

/dev/sda1:
Timing buffered disk reads: 6 MB in 4.73 seconds = 1.27 MB/sec

nosmp:

/dev/sda1:
Timing buffered disk reads: 44 MB in 3.08 seconds = 14.27 MB/sec

Steve Magoun (smagoun) on 2011-08-18

Changed in oem-priority:
importance:	Undecided → High

Revision history for this message

Jani Monoses (jani) wrote on 2011-08-24:

#35

I filed a bug for adding dbgsyms to the kernel so we can use systemtap easily in similar debugging scenarios
https://bugs.launchpad.net/ubuntu/+source/linux-ti-omap4/+bug/832606

David Mandala (davidm) on 2011-08-25

description:

updated

Revision history for this message

Ming Lei (tom-leiming) wrote on 2011-08-27: Re: [Bug 709245] Re: panda: USB disk IO slow

#36

Hi,

I think that the problem is caused by ehci driver instead of ARM SMP
scheduler, and I
have verified that patch below can fix the problem. And the patch has
been posted
on usb/arm/omap mail list for discussion.

But I am wondering that why no such problem after passing 'nosmp' to kernel.

From d85a08714ed23ec8688013b464dc90c6386db0d8 Mon Sep 17 00:00:00 2001
From: Ming Lei <email address hidden>
Date: Sat, 27 Aug 2011 22:29:15 +0800
Subject: [PATCH] usb: ehci: fix update qtd->token in qh_append_tds

This patch fixs one performance bug on ARM Cortex A9 dual core platform,
which has been reported on quite a few ARM machines(OMAP4, Tegra 2,
snowball...),
see details from link of https://bugs.launchpad.net/bugs/709245.

In fact, one mb() on ARM is enough to flush L2 cache, but
'dummy->hw_token = token;' after mb() is added just for obeying
correct mb() usage.

The patch has been tested ok on OMAP4 panda A1 board, the performance
of 'dd' over usb mass storage can be increased from 4~5MB/sec to
14~16MB/sec after applying this patch.

Signed-off-by: Ming Lei <email address hidden>
---
drivers/usb/host/ehci-q.c | 14 ++++++++++++++
1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c
index 0917e3a..65b5021 100644
--- a/drivers/usb/host/ehci-q.c
+++ b/drivers/usb/host/ehci-q.c
@@ -1082,6 +1082,20 @@ static struct ehci_qh *qh_append_tds (
wmb ();
dummy->hw_token = token;

+ /* The mb() below is added to make sure that
+ * 'token' can be writen into qtd, so that ehci
+ * HC can see the up-to-date qtd descriptor. On
+ * some archs(at least on ARM Cortex A9 dual core),
+ * writing into coherenet memory doesn't mean the
+ * value written can reach physical memory
+ * immediately, and the value may be buffered
+ * inside L2 cache. 'dummy->hw_token = token;'
+ * after mb() is added for obeying correct mb()
+ * usage.
+ * */
+ mb();
+ token = dummy->hw_token;
+
    urb->hcpriv = qh_get (qh);
   }
  }
--
1.7.4.1

Hi,

I think that the problem is caused by ehci driver instead of ARM SMP
scheduler, and I
have verified that patch below can fix the problem. And the patch has
been posted
on usb/arm/omap mail list for discussion.

But I am wondering that why no such problem after passing  'nosmp' to kernel.

From d85a08714ed23ec8688013b464dc90c6386db0d8 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@canonical.com>
Date: Sat, 27 Aug 2011 22:29:15 +0800
Subject: [PATCH] usb: ehci: fix update qtd->token in qh_append_tds

This patch fixs one performance bug on ARM Cortex A9 dual core platform,
which has been reported on quite a few ARM machines(OMAP4, Tegra 2,
snowball...),
see details from link of https://bugs.launchpad.net/bugs/709245.

In fact, one mb() on ARM is enough to flush L2 cache, but
'dummy->hw_token = token;' after mb() is added just for obeying
correct mb() usage.

The patch has been tested ok on OMAP4 panda A1 board, the performance
of 'dd' over usb mass storage can be increased from 4~5MB/sec to
14~16MB/sec after applying this patch.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 drivers/usb/host/ehci-q.c |   14 ++++++++++++++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c
index 0917e3a..65b5021 100644
--- a/drivers/usb/host/ehci-q.c
+++ b/drivers/usb/host/ehci-q.c
@@ -1082,6 +1082,20 @@ static struct ehci_qh *qh_append_tds (
 			wmb ();
 			dummy->hw_token = token;

+			/* The mb() below is added to make sure that
+			 * 'token' can be writen into qtd, so that ehci
+			 * HC can see the up-to-date qtd descriptor. On
+			 * some archs(at least on ARM Cortex A9 dual core),
+			 * writing into coherenet memory doesn't mean the
+			 * value written can reach physical memory
+			 * immediately, and the value may be buffered
+			 * inside L2 cache. 'dummy->hw_token = token;'
+			 * after mb() is added for obeying correct mb()
+			 * usage.
+			 * */
+			mb();
+			token = dummy->hw_token;
+
 			urb->hcpriv = qh_get (qh);
 		}
 	}
-- 
1.7.4.1

Revision history for this message

warmcat (andy-warmcat) wrote on 2011-08-28:

#37

Ming-Lei it sounds like a really good explanation. But when I test it, I don't see any real difference on USB memory stick before and after. Maybe my test is broken somehow?

8GB no-name USB stick

Before the patch -->

root@linaro:~# dd if=/dev/urandom of=/media/dump bs=40960 count=1000
1000+0 records in
1000+0 records out
40960000 bytes (41 MB) copied, 21.9735 s, 1.9 MB/s

root@linaro:~# dd if=/dev/urandom of=/media/dump bs=40960 count=1000
1000+0 records in
1000+0 records out
40960000 bytes (41 MB) copied, 21.3798 s, 1.9 MB/s

root@linaro:~# dd if=/dev/urandom of=/media/dump bs=40960 count=1000
1000+0 records in
1000+0 records out
40960000 bytes (41 MB) copied, 21.3694 s, 1.9 MB/s

After the patch -->

root@linaro:~# mount /dev/sda1 /media/
root@linaro:~# dd if=/dev/urandom of=/media/dump bs=40960 count=1000
1000+0 records in
1000+0 records out
40960000 bytes (41 MB) copied, 21.6143 s, 1.9 MB/s
root@linaro:~# dd if=/dev/urandom of=/media/dump bs=40960 count=1000
1000+0 records in
1000+0 records out
40960000 bytes (41 MB) copied, 21.6175 s, 1.9 MB/s
root@linaro:~# dd if=/dev/urandom of=/media/dump bs=40960 count=1000
1000+0 records in
1000+0 records out
40960000 bytes (41 MB) copied, 21.5983 s, 1.9 MB/s

I like the idea the delay is coming from L2 cache not being evicted until something else forces it a lot though.

Revision history for this message

Paolo Pisati (p-pisati) wrote on 2011-08-28:

#38

works here on an external usb hdd:

oneiric/ti-omap4:

flag@omap:~$ sudo hdparm -t /dev/sda1

/dev/sda1:
Timing buffered disk reads: 6 MB in 3.53 seconds = 1.70 MB/sec

oneiric/ti-omap4 + Ming's patch:

flag@omap:~$ sudo hdparm -t /dev/sda1

/dev/sda1:
Timing buffered disk reads: 60 MB in 3.09 seconds = 19.40 MB/sec

Revision history for this message

Ming Lei (tom-leiming) wrote on 2011-08-28: Re: [Bug 709245] Re: ARM SMP scheduler performance bug

#39

Hi warmcat,

On Sun, Aug 28, 2011 at 4:32 PM, warmcat <email address hidden> wrote:
> Ming-Lei it sounds like a really good explanation. But when I test it,
> I don't see any real difference on USB memory stick before and after.
> Maybe my test is broken somehow?

Could you check if you have applied patch, built kernel and update correctly?
What is your board and version? and could you post out your kernel config?

>
> 8GB no-name USB stick
>
> Before the patch -->
>
> root@linaro:~# dd if=/dev/urandom of=/media/dump bs=40960 count=1000
> 1000+0 records in
> 1000+0 records out
> 40960000 bytes (41 MB) copied, 21.9735 s, 1.9 MB/s
>
> root@linaro:~# dd if=/dev/urandom of=/media/dump bs=40960 count=1000
> 1000+0 records in
> 1000+0 records out
> 40960000 bytes (41 MB) copied, 21.3798 s, 1.9 MB/s
>
> root@linaro:~# dd if=/dev/urandom of=/media/dump bs=40960 count=1000
> 1000+0 records in
> 1000+0 records out
> 40960000 bytes (41 MB) copied, 21.3694 s, 1.9 MB/s
>
>
> After the patch -->
>
> root@linaro:~# mount /dev/sda1 /media/
> root@linaro:~# dd if=/dev/urandom of=/media/dump bs=40960 count=1000
> 1000+0 records in
> 1000+0 records out
> 40960000 bytes (41 MB) copied, 21.6143 s, 1.9 MB/s
> root@linaro:~# dd if=/dev/urandom of=/media/dump bs=40960 count=1000
> 1000+0 records in
> 1000+0 records out
> 40960000 bytes (41 MB) copied, 21.6175 s, 1.9 MB/s
> root@linaro:~# dd if=/dev/urandom of=/media/dump bs=40960 count=1000
> 1000+0 records in
> 1000+0 records out
> 40960000 bytes (41 MB) copied, 21.5983 s, 1.9 MB/s

I think the patch should have some changes on the test result.

>
> I like the idea the delay is coming from L2 cache not being evicted
> until something else forces it a lot though.

The delay is coming from the delayed ehci irq or lost ehci irq, and which
is coming from the not up-to-date qtd descriptor.

thanks,
--
Ming Lei

Revision history for this message

warmcat (andy-warmcat) wrote on 2011-08-28:

#40

Hi Ming-Lei -

What I like about your explanation is it allows cache status to be the communication path for network traffic to impact the issue, ie network activity finally spills the cache that does update the qtd descriptor and then we get less deadtime. That sounds very good to me.

I'm using this tree

http://git.linaro.org/gitweb?p=people/andygreen/kernel-tilt.git;a=patch;h=7021a2ea35d0b0f96e6c5a753312c26466919efd

Your patch

http://git.linaro.org/gitweb?p=people/andygreen/kernel-tilt.git;a=shortlog;h=refs/heads/tilt-linux-linaro-3.0

I realized I do have a USB HDD here. I will try that as well in case the USB flash drive is the limiting factor.

Revision history for this message

warmcat (andy-warmcat) wrote on 2011-08-28:

#41

Ah! /dev/urandom is the limiting factor ^^ USB HDD is also stuck at 1.9Mbps with the same test. Same test with of=/dev/null is only getting 2.0Mbps.

With the USB HDD and if=/dev/zero instead, and 10x count

root@linaro:~# dd if=/dev/zero of=/media/dump bs=40960 count=10000
10000+0 records in
10000+0 records out
409600000 bytes (410 MB) copied, 16.5168 s, 24.8 MB/s
root@linaro:~# dd if=/dev/zero of=/media/dump bs=40960 count=10000
10000+0 records in
10000+0 records out
409600000 bytes (410 MB) copied, 18.4294 s, 22.2 MB/s

With the patch out, this takes an insanely long time.

Against the USB stick, it rises to 7.1MBps with the patch in vs 3.5MBps with it out.

So nice job!

Fathi Boudra (fboudra) on 2011-08-29

tags:	added: patch
Changed in linaro-ubuntu:
milestone:	11.08 → 11.09

Revision history for this message

Fathi Boudra (fboudra) wrote on 2011-08-29:

#42

discussion ongoing: https://patchwork.kernel.org/patch/1103812/

Revision history for this message

Ming Lei (tom-leiming) wrote on 2011-08-31:

#43

new discussion:
https://patchwork.kernel.org/patch/1113332/

Changed in linux-ti-omap4 (Ubuntu Oneiric):
assignee:	nobody → Ming Lei (tom-leiming)

Ricardo Salveti (rsalveti) on 2011-09-01

Changed in linaro-ubuntu:
assignee:	nobody → John Rigby (jcrigby)

Brad Figg (brad-figg) on 2011-09-02

tags:

added: rls-mgr-o-tracking

Revision history for this message

Launchpad Janitor (janitor) wrote on 2011-09-05:

#44

This bug was fixed in the package linux-ti-omap4 - 3.0.0-1204.9

---------------
linux-ti-omap4 (3.0.0-1204.9) oneiric; urgency=low

[ Leann Ogasawara ]

* [Config] Update CONFIG_EFI_VARS enforcer check

[ Paolo Pisati ]

  * Revert "[Config] Temporarily turn off audio support on omap4."
  * [Config] update configs after BSP update
  * Rebased against Ubuntu-3.0.0-9.15
  * TI BSP update: kernel-tilt/tilt-3.0-nodspvideo1 @ a6871e8f1f3c906b37884fb35837546b7c1dc15e

[ Ricardo Salveti de Araujo ]

* [Config] Enabling DRM_OMAP as built-in

[ Upstream Kernel Changes ]

  * Revert "usb: ehci: fix update qtd->token in qh_append_tds"
  * usb: ehci: make HC see up-to-date qh/qtd descriptor ASAP
    - LP: #709245
-- Paolo Pisati <email address hidden> Thu, 01 Sep 2011 11:56:08 +0200

Changed in linux-ti-omap4 (Ubuntu Oneiric):
status:	Confirmed → Fix Released

Tim Gardner (timg-tpi) on 2011-09-14

Changed in linux-ti-omap4 (Ubuntu Maverick):
status:	Confirmed → Fix Committed
Changed in linux-ti-omap4 (Ubuntu Natty):
status:	Confirmed → Fix Committed

Revision history for this message

Herton R. Krzesinski (herton) wrote on 2011-09-21:

#45

This bug is awaiting verification that the kernel for Maverick in -proposed (2.6.35-903.25) solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-maverick' to 'verification-done-maverick'.

If verification is not done by one week from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags:

added: verification-needed-maverick

Chris Van Hoof (vanhoof) on 2011-09-26

Changed in oem-priority:
status:	New → Fix Committed

Ricardo Salveti (rsalveti) on 2011-09-27

Changed in linaro-ubuntu:
status:	Confirmed → Fix Committed

Revision history for this message

Tobin Davis (gruemaster) wrote on 2011-09-27:

#46

Tested the Maverick kernel in proposed. While I am not seeing much improvement, the kernel at least doesn't break other things.

The following is the results of testing on a PATA drive in an external enclosure:
-- BEFORE --
Linux panda5 2.6.35-903-omap4 #24-Ubuntu SMP PREEMPT Thu Sep 1 20:37:40 UTC 2011 armv7l GNU/Linux
/dev/sda:
Timing cached reads: 388 MB in 2.00 seconds = 193.66 MB/sec
Timing buffered disk reads: 54 MB in 3.10 seconds = 17.45 MB/sec
-- AFTER --
Linux panda5 2.6.35-903-omap4 #25-Ubuntu SMP PREEMPT Tue Sep 20 21:02:11 UTC 2011 armv7l GNU/Linux
/dev/sda:
Timing cached reads: 402 MB in 2.00 seconds = 200.71 MB/sec
Timing buffered disk reads: 70 MB in 3.07 seconds = 22.83 MB/sec

tags:

added: verification-done-maverick
removed: verification-needed-maverick

Nicolas Pitre (npitre) on 2011-09-27

Changed in linux-linaro:
status:	Confirmed → Fix Committed

Revision history for this message

Launchpad Janitor (janitor) wrote on 2011-09-29:

#47

This bug was fixed in the package linux-ti-omap4 - 2.6.35-903.25

---------------
linux-ti-omap4 (2.6.35-903.25) maverick-proposed; urgency=low

[ Ming Lei ]

* SAUCE: usb: ehci: make HC see up-to-date qh/qtd descriptor ASAP
- LP: #709245

[ Upstream Kernel Changes ]

  * cifs: fix possible memory corruption in CIFSFindNext, CVE-2011-3191
    - LP: #834135
    - CVE-2011-3191
  * befs: Validate length of long symbolic links, CVE-2011-2928
    - LP: #834124
    - CVE-2011-2928
  * gro: Only reset frag0 when skb can be pulled, CVE-2011-2723
    - LP: #844371
    - CVE-2011-2723
  * Validate size of EFI GUID partition entries, CVE-2011-1776
    - LP: #844365
    - CVE-2011-1776
  * inet_diag: fix inet_diag_bc_audit(), CVE-2011-2213
    - LP: #838421
    - CVE-2011-2213
  * si4713-i2c: avoid potential buffer overflow on si4713, CVE-2011-2700
    - LP: #844370
    - CVE-2011-2700
  * Bluetooth: Prevent buffer overflow in l2cap config request,
    CVE-2011-2497
    - LP: #838423
    - CVE-2011-2497
  * core: Fix memory leak/corruption on VLAN GRO_DROP, CVE-2011-1576
    - LP: #844361
    - CVE-2011-1576
  * crypto: Move md5_transform to lib/md5.c, CVE-2011-3188
    - LP: #834129
    - CVE-2011-3188
  * net: Compute protocol sequence numbers and fragment IDs using MD5,
    CVE-2011-3188
    - LP: #834129
    - CVE-2011-3188
-- Paolo Pisati <email address hidden> Tue, 20 Sep 2011 19:12:28 +0200

Changed in linux-ti-omap4 (Ubuntu Maverick):
status:	Fix Committed → Fix Released

Ricardo Salveti (rsalveti) on 2011-09-29

Changed in linaro-ubuntu:
status:	Fix Committed → Fix Released

Revision history for this message

Herton R. Krzesinski (herton) wrote on 2011-10-04:

#48

Natty 2.6.38-1209.16 release is now available in -proposed.

This bug is awaiting verification that this kernel for Natty in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-natty' to 'verification-done-natty'.

If verification is not done by one week from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags:

added: verification-needed-natty

Revision history for this message

Tobin Davis (gruemaster) wrote on 2011-10-09:

#49

Tested kernel 2.6.38-1209.16. Much improvement to write speeds.

Test used: dd bs=4M count=100 if=/dev/zero of=test.img

Before:
419430400 bytes (419 MB) copied, 50.2334 s, 8.3 MB/s

After:
419430400 bytes (419 MB) copied, 18.7676 s, 22.3 MB/s

tags:

added: verification-done-natty
removed: verification-needed-natty

Revision history for this message

Launchpad Janitor (janitor) wrote on 2011-10-12:

#50

This bug was fixed in the package linux-ti-omap4 - 2.6.38-1209.16

---------------
linux-ti-omap4 (2.6.38-1209.16) natty-proposed; urgency=low

* Release tracking bug
- LP: #862554

[ Ming Lei ]

* SAUCE: usb: ehci: make HC see up-to-date qh/qtd descriptor ASAP
- LP: #709245

[ Upstream Kernel Changes ]

  * cifs: fix possible memory corruption in CIFSFindNext, CVE-2011-3191
    - LP: #834135
    - CVE-2011-3191
  * befs: Validate length of long symbolic links, CVE-2011-2928
    - LP: #834124
    - CVE-2011-2928
  * gro: Only reset frag0 when skb can be pulled, CVE-2011-2723
    - LP: #844371
    - CVE-2011-2723
  * Validate size of EFI GUID partition entries, CVE-2011-1776
    - LP: #844365
    - CVE-2011-1776
  * inet_diag: fix inet_diag_bc_audit(), CVE-2011-2213
    - LP: #838421
    - CVE-2011-2213
  * si4713-i2c: avoid potential buffer overflow on si4713, CVE-2011-2700
    - LP: #844370
    - CVE-2011-2700
  * Bluetooth: Prevent buffer overflow in l2cap config request,
    CVE-2011-2497
    - LP: #838423
    - CVE-2011-2497
  * crypto: Move md5_transform to lib/md5.c, CVE-2011-3188
    - LP: #834129
    - CVE-2011-3188
  * net: Compute protocol sequence numbers and fragment IDs using MD5,
    CVE-2011-3188
    - LP: #834129
    - CVE-2011-3188
  * ext4: Fix max file size and logical block counting of extent format
    file, CVE-2011-2695
    - LP: #819574
    - CVE-2011-2695
-- Paolo Pisati <email address hidden> Fri, 30 Sep 2011 12:12:00 +0200

Changed in linux-ti-omap4 (Ubuntu Natty):
status:	Fix Committed → Fix Released

Steve Magoun (smagoun) on 2011-10-25

Changed in oem-priority:
status:	Fix Committed → Fix Released

Revision history for this message

Ming Lei (tom-leiming) wrote on 2012-01-09:

#51

Finally, Will Deacon figured out a workaround for the issue on arm code, see below upstream commit:
(CONFIG_PL310_ERRATA_769419 is needed for the fix)

commit 11ed0ba1754841316d4095478944300acf19acc3
Author: Will Deacon <email address hidden>
Date: Mon Nov 14 17:24:58 2011 +0100

ARM: 7161/1: errata: no automatic store buffer drain

    This patch implements a workaround for PL310 erratum 769419. On
    revisions of the PL310 prior to r3p2, the Store Buffer does not
    automatically drain. This can cause normal, non-cacheable writes to be
    retained when the memory system is idle, leading to suboptimal I/O
    performance for drivers using coherent DMA.

This patch adds an optional wmb() call to the cpu_idle loop. On systems
with an outer cache, this causes an explicit flush of the store buffer.

    Cc: <email address hidden>
    Acked-by: Catalin Marinas <email address hidden>
    Tested-by: Marc Zyngier <email address hidden>
    Signed-off-by: Will Deacon <email address hidden>
    Signed-off-by: Russell King <email address hidden>

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 44789ef..83aa746 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1352,6 +1352,18 @@ config ARM_ERRATA_764369
relevant cache maintenance functions and sets a specific bit
in the diagnostic control register of the SCU.

+config PL310_ERRATA_769419
+ bool "PL310 errata: no automatic Store Buffer drain"
+ depends on CACHE_L2X0
+ help
+ On revisions of the PL310 prior to r3p2, the Store Buffer does
+ not automatically drain. This can cause normal, non-cacheable
+ writes to be retained when the memory system is idle, leading
+ to suboptimal I/O performance for drivers using coherent DMA.
+ This option adds a write barrier to the cpu_idle loop so that,
+ on systems with an outer cache, the store buffer is drained
+ explicitly.
+
endmenu

source "arch/arm/common/Kconfig"
diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index 75316f0..3d0c6fb 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -192,6 +192,9 @@ void cpu_idle(void)
#endif

    local_irq_disable();
+#ifdef CONFIG_PL310_ERRATA_769419
+ wmb();
+#endif
    if (hlt_counter) {
     local_irq_enable();
     cpu_relax();

Finally, Will Deacon figured out a workaround for the issue on arm code, see below upstream commit:
(CONFIG_PL310_ERRATA_769419 is needed for the fix)

commit 11ed0ba1754841316d4095478944300acf19acc3
Author: Will Deacon <will.deacon@arm.com>
Date:   Mon Nov 14 17:24:58 2011 +0100

ARM: 7161/1: errata: no automatic store buffer drain
    
    This patch implements a workaround for PL310 erratum 769419. On
    revisions of the PL310 prior to r3p2, the Store Buffer does not
    automatically drain. This can cause normal, non-cacheable writes to be
    retained when the memory system is idle, leading to suboptimal I/O
    performance for drivers using coherent DMA.
    
    This patch adds an optional wmb() call to the cpu_idle loop. On systems
    with an outer cache, this causes an explicit flush of the store buffer.
    
    Cc: stable@vger.kernel.org
    Acked-by: Catalin Marinas <catalin.marinas@arm.com>
    Tested-by: Marc Zyngier <marc.zyngier@arm.com>
    Signed-off-by: Will Deacon <will.deacon@arm.com>
    Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 44789ef..83aa746 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1352,6 +1352,18 @@ config ARM_ERRATA_764369
 	  relevant cache maintenance functions and sets a specific bit
 	  in the diagnostic control register of the SCU.
 
+config PL310_ERRATA_769419
+	bool "PL310 errata: no automatic Store Buffer drain"
+	depends on CACHE_L2X0
+	help
+	  On revisions of the PL310 prior to r3p2, the Store Buffer does
+	  not automatically drain. This can cause normal, non-cacheable
+	  writes to be retained when the memory system is idle, leading
+	  to suboptimal I/O performance for drivers using coherent DMA.
+	  This option adds a write barrier to the cpu_idle loop so that,
+	  on systems with an outer cache, the store buffer is drained
+	  explicitly.
+
 endmenu
 
 source "arch/arm/common/Kconfig"
diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index 75316f0..3d0c6fb 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -192,6 +192,9 @@ void cpu_idle(void)
 #endif
 
 			local_irq_disable();
+#ifdef CONFIG_PL310_ERRATA_769419
+			wmb();
+#endif
 			if (hlt_counter) {
 				local_irq_enable();
 				cpu_relax();

Revision history for this message

Emeric Vigier (emeric-vigier) wrote on 2012-07-05:

#52

Hi,

I am having ethernet performance issue on panda rev A3. TX gets stuck while RX seems fine. At least pinging is ok...
I started a thread on various mailing-list (http://comments.gmane.org/gmane.linux.network/234335).
I made a lot of trials but no luck so far...

My problem may be related to the one described here for usb storage. My kernel already embeds Ming Lei's patch (https://patchwork.kernel.org/patch/1113332/), and CONFIG_ARM_DMA_MEM_BUFFERABLE and CONFIG_ARCH_HAS_BARRIERS are enabled. But still ethernet performance is really poor.

Smsc9514 device driver relies on usbnet. I lack usb knowledge, thus I cannot state whether Ming Lei's patch in ehci-q.c "should" fix my issue if they were related. Otherwise, did anyone find out that CONFIG_PL310_ERRATA_769419 patch above effectively fixes the usb storage issue?

Revision history for this message

Ming Lei (tom-leiming) wrote on 2012-07-06:

#53

On Fri, Jul 6, 2012 at 6:57 AM, Emeric Vigier <email address hidden> wrote:
> Hi,
>
> I am having ethernet performance issue on panda rev A3. TX gets stuck while RX seems fine. At least pinging is ok...

We are discussing a similar problem now, see below link:

http://marc.info/?t=134145311800001&r=1&w=2

I found that enabling DEBUG_SLAB may degrade TX performance a lot( ~35Mbps).
But if the option is disabled, the TX performance should be OK(> 90Mbps).

I am not sure if it is same with your problem.

> I started a thread on various mailing-list (http://comments.gmane.org/gmane.linux.network/234335).
> I made a lot of trials but no luck so far...
>
> My problem may be related to the one described here for usb storage. My
> kernel already embeds Ming Lei's patch
> (https://patchwork.kernel.org/patch/1113332/), and
> CONFIG_ARM_DMA_MEM_BUFFERABLE and CONFIG_ARCH_HAS_BARRIERS are enabled.
> But still ethernet performance is really poor.
>
> Smsc9514 device driver relies on usbnet. I lack usb knowledge, thus I
> cannot state whether Ming Lei's patch in ehci-q.c "should" fix my issue

> if they were related. Otherwise, did anyone find out that
> CONFIG_PL310_ERRATA_769419 patch above effectively fixes the usb storage
> issue?

Yes, it is fine for the problem, so the ehci patch isn't needed to upstream.

Thanks,
--
Ming Lei

Revision history for this message

Emeric Vigier (emeric-vigier) wrote on 2012-07-06:

#54

I don't think it is the same problem. My ethernet TX gets stuck stuck to 10-20KBps (80-160Kbps). When I force some activity (USB mouse events) it can reach ~1MBps (8Mbps). Also the interface is supposed to (and says to) be 100Mbps...

I opened a thread on various mailing-lists:
http://comments.gmane.org/gmane.linux.network/234335

I saw your ehci patch on this thread:
https://patchwork.kernel.org/patch/1113332/

The thing is, my kernel already includes your patch. I also included Will Deacon's patch in process.c (cpu_idle), the one you mentioned here.

But I have some reasons to believe that none of the patches are suited in my case because:
1. Ethernet TX performance is still very bad with these patches and corresponding configs,
2. smsc95xx driver relies on usbnet. I think that usbnet uses interrupt for RX but not for TX. Then flushing the caches with wmb() just before entering idle (WFI in my case) won't be efficient for me as ethernet TX has no way to wake-up the cpu, it has to wait for an external wakeup source. USB Mouse event improves TX performance.
3. I cannot state whether your patch in ehci-q.c "should" work for me, because I don't know if (TX) URB submission in usbnet finally goes through this file's functions or another.

Do you know if my statements above make sense? Do you know where I should put the memory barrier and test again in my particular case?

Revision history for this message

Ming Lei (tom-leiming) wrote on 2012-07-07:

#55

On Fri, Jul 6, 2012 at 9:57 PM, Emeric Vigier <email address hidden> wrote:
> I don't think it is the same problem. My ethernet TX gets stuck stuck to
> 10-20KBps (80-160Kbps). When I force some activity (USB mouse events) it
> can reach ~1MBps (8Mbps). Also the interface is supposed to (and says
> to) be 100Mbps...
>
> I opened a thread on various mailing-lists:
> http://comments.gmane.org/gmane.linux.network/234335
>
> I saw your ehci patch on this thread:
> https://patchwork.kernel.org/patch/1113332/
>
> The thing is, my kernel already includes your patch. I also included
> Will Deacon's patch in process.c (cpu_idle), the one you mentioned here.
>
> But I have some reasons to believe that none of the patches are suited in my case because:
> 1. Ethernet TX performance is still very bad with these patches and corresponding configs,
> 2. smsc95xx driver relies on usbnet. I think that usbnet uses interrupt for RX but not for TX. Then flushing the caches with wmb() just before entering idle (WFI in my case) won't be efficient for me as ethernet TX has no way to wake-up the cpu, it has to wait for an external wakeup source. USB Mouse event improves TX performance.
> 3. I cannot state whether your patch in ehci-q.c "should" work for me, because I don't know if (TX) URB submission in usbnet finally goes through this file's functions or another.
>
> Do you know if my statements above make sense? Do you know where I
> should put the memory barrier and test again in my particular case?

OK, I don't suggest that you do some blind test further without knowing
the reason.

Could you provide the trace result according to the guide in
Documentation/usb/usbmon.txt or the link[1]?

[1], http://www.mjmwired.net/kernel/Documentation/usb/usbmon.txt

thanks,
--
Ming Lei

Revision history for this message

Emeric Vigier (emeric-vigier) wrote on 2012-07-09:

#56

1. Here I booted Android without RJ45 cable plugged. I issued:
# cat /sys/kernel/debug/usb/usbmon/1u > /sdcard/1.mon.out.11

2. Then I plugged in the cable, got an IP address through DHCP:
(panda) # netcfg eth0 dhcp
eth0 UP 192.168.50.107/24 0x00001043 42:82:7e:fd:d9:71

3. Then I started a TX using nc command:
(panda) # nc -l -p 1032 < /dev/zero &
(desktop-pc) $ nc 192.168.50.107 1032 > /dev/null
(desktop-pc) $ sudo nethogs
nc eth0 0.486 12.532 KB/sec
Speed is very poor.

4. I wait for netdev watchdog to bark and stopped the nc on desktop-pc side. Then killed the nc on panda side. And eventually ctrl'ed-C the usbmon dump.

On my dmesg, you can see the slow path occurring after TX struggled for a while:
http://paste.ubuntu.com/1083276/

The usbmon trace on bus 1:
http://paste.ubuntu.com/1083284/

My usb connected devices:
http://paste.ubuntu.com/1083292/

Thanks for your kind support,
Emeric

Revision history for this message

Emeric Vigier (emeric-vigier) wrote on 2012-07-09:

#57

I ran iperf as well and got poor results with both 256KB and 16Bytes (2KB).

(desktop-pc) $ iperf -s -w 256K
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 256 KByte (WARNING: requested 256 KByte)
------------------------------------------------------------
[ 4] local 192.168.50.223 port 5001 connected with 192.168.50.107 port 39723

(panda) # iperf -c 192.168.50.223 -w 256K -t 10
------------------------------------------------------------
Client connecting to 192.168.50.223, TCP port 5001
TCP window size: 256 KByte (WARNING: requested 256 KByte)
------------------------------------------------------------
[ 3] local 192.168.50.107 port 39723 connected with 192.168.50.223 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-13.3 sec 200 KBytes 123 Kbits/sec

(desktop-pc) $ iperf -s -w 16
WARNING: TCP window size set to 16 bytes. A small window size
will give poor performance. See the Iperf documentation.
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 2.23 KByte (WARNING: requested 16.0 Byte)
------------------------------------------------------------
[ 4] local 192.168.50.223 port 5001 connected with 192.168.50.107 port 39724
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-18.3 sec 144 KBytes 64.5 Kbits/sec

(panda) # iperf -c 192.168.50.223 -w 16 -t 10
WARNING: TCP window size set to 16 bytes. A small window size
will give poor performance. See the Iperf documentation.
------------------------------------------------------------
Client connecting to 192.168.50.223, TCP port 5001
TCP window size: 2.00 KByte (WARNING: requested 16.0 Byte)
------------------------------------------------------------
[ 3] local 192.168.50.107 port 39724 connected with 192.168.50.223 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-18.3 sec 144 KBytes 64.5 Kbits/sec

I ran iperf as well and got poor results with both 256KB and 16Bytes (2KB).

(desktop-pc) $ iperf -s -w 256K
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:  256 KByte (WARNING: requested  256 KByte)
------------------------------------------------------------
[  4] local 192.168.50.223 port 5001 connected with 192.168.50.107 port 39723

(panda) # iperf -c 192.168.50.223 -w 256K -t 10                         
------------------------------------------------------------
Client connecting to 192.168.50.223, TCP port 5001
TCP window size:   256 KByte (WARNING: requested   256 KByte)
------------------------------------------------------------
[  3] local 192.168.50.107 port 39723 connected with 192.168.50.223 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-13.3 sec    200 KBytes    123 Kbits/sec

(desktop-pc) $ iperf -s -w 16
WARNING: TCP window size set to 16 bytes. A small window size
will give poor performance. See the Iperf documentation.
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 2.23 KByte (WARNING: requested 16.0 Byte)
------------------------------------------------------------
[  4] local 192.168.50.223 port 5001 connected with 192.168.50.107 port 39724
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-18.3 sec   144 KBytes  64.5 Kbits/sec

(panda) # iperf -c 192.168.50.223 -w 16 -t 10                           
WARNING: TCP window size set to 16 bytes. A small window size
will give poor performance. See the Iperf documentation.
------------------------------------------------------------
Client connecting to 192.168.50.223, TCP port 5001
TCP window size: 2.00 KByte (WARNING: requested 16.0 Byte)
------------------------------------------------------------
[  3] local 192.168.50.107 port 39724 connected with 192.168.50.223 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-18.3 sec    144 KBytes  64.5 Kbits/sec

Revision history for this message

Ming Lei (tom-leiming) wrote on 2012-07-10:

#58

Emeric，

From your usbmon trace, I can find that your TX is extremely slow:

It may take several seconds to complete one packet with 1514bytes.

Could you run upstream kernel(3.4 or 3.5-rcX) first to see if the hardware of
your smsc95xx NIC is OK? And you'd better to test a usb storage transfer
on the omap4 ehci host.

I guess the hardware of smsc95xx may be bad.

Thanks,

Revision history for this message

Emeric Vigier (emeric-vigier) wrote on 2012-07-10:

#59

Ming Lei,

Delay between URB submission and call-back for a given URB is huge indeed...
My hardware (panda rev A3) is fine: linaro-12.05 (kernel-3.4) and linaro-11.09's ethernet works just fine.
My aosp kernel is "based" on kernel-3.0.8 but has wide differences with linaro-11.09 (3.1) kernel, starting with PM_RUNTIME support.

I ran the mass storage test on a Kingston DT mini-fun 4GB usb stick, featuring one ext3 partition. On my panda, I ran:
# mount -o remount,rw /
# mkdir /mnt/msd
# mount -t ext3 /dev/block/sda1 /mnt/msd

# sync && echo 3 > /proc/sys/vm/drop_caches
# busybox dd bs=1M count=16 if=/dev/zero of=/mnt/msd/16MB-file conv=fsync conv=notrunc
16777216 bytes (16.0MB) copied, 144.580627 seconds, 113.3KB/s

# sync && echo 3 > /proc/sys/vm/drop_caches
# busybox dd bs=16M count=1 if=/dev/zero of=/mnt/msd/16MB-file conv=fsync conv=notrunc
16777216 bytes (16.0MB) copied, 137.893890 seconds, 118.8KB/s

Then I assume issue is in ehci stack, not the net stack.
Running "top" shows the cpus are "mostly" idle during the transfer.

I ran usbmon before pluging the usb stick, mounting the FS, and starting the transfer. I interrupted it few seconds after transfer started.
# cat /sys/kernel/debug/usb/usbmon/1u > /sdcard/1.mon.out.20
http://paste.ubuntu.com/1084832/

dmesg:
http://paste.ubuntu.com/1084838/

T: Bus=01 Lev=02 Prnt=02 Port=02 Cnt=03 Dev#= 5 Spd=480 MxCh= 0
D: Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs= 1
P: Vendor=0930 ProdID=6545 Rev= 1.00
S: Manufacturer=Kingston
S: Product=DT Mini Fun G2
S: SerialNumber=0060E049DF6FEB9080000883
C:* #Ifs= 1 Cfg#= 1 Atr=80 MxPwr=200mA
I:* If#= 0 Alt= 0 #EPs= 2 Cls=08(stor.) Sub=06 Prot=50 Driver=usb-storage
E: Ad=81(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E: Ad=02(O) Atr=02(Bulk) MxPS= 512 Ivl=0m

Revision history for this message

Ming Lei (tom-leiming) wrote on 2012-07-13:

#60

Emeric,

On Wed, Jul 11, 2012 at 1:20 AM, Emeric Vigier
<email address hidden> wrote:
> Ming Lei,
>
> Delay between URB submission and call-back for a given URB is huge indeed...
> My hardware (panda rev A3) is fine: linaro-12.05 (kernel-3.4) and linaro-11.09's ethernet works just fine.

Yes, so it may be caused by ehci driver or some ehci-omap low level
configuration.

But I have no idea about the root cause since there are very few
information about
the problem.

Maybe you need to use a USB analyzer to trace the bus traffic if you have it.

Or one easier way is to run 'git bisect' on upstream kernel or stable-tree to
look for the commit which fixes the problem.

Thanks,
--
Ming Lei

Revision history for this message

Emeric Vigier (emeric-vigier) wrote on 2012-07-13:

#61

I have no USB analyzer in my office. The 'git bisect' was one of the few items remaining in my todo list. Thanks for your time.

Linus Walleij (triad) on 2013-09-16

Changed in linux-linaro:
assignee:	Mian Yousaf Kaukab (mian-yousaf-kaukab) → nobody
status:	Fix Committed → Fix Released

Linaro Linux

ARM SMP scheduler performance bug

Bug Description

Related branches

CVE References

Duplicates of this bug

Other bug subscribers

Bug attachments

Remote bug watches

	Status	Importance	Assigned to	Milestone
Linaro Linux	Fix Released	Medium	Unassigned
Linaro Ubuntu	Fix Released	High	John Rigby	Linaro Ubuntu 11.09
OEM Priority Project	Fix Released	High	Unassigned
Maverick	Fix Released	Undecided	Unassigned
Natty	Fix Released	Undecided	Unassigned
Oneiric	Fix Released	Undecided	Unassigned
linux-ti-omap4 (Ubuntu)	Fix Released	High	Ming Lei
Maverick	Fix Released	High	Unassigned
Natty	Fix Released	High	Unassigned
Oneiric	Fix Released	High	Ming Lei