smart server speed regression on older (e.g. Ubuntu 6.06) kernels

Bug #234287 reported by James Troup
4
Affects Status Importance Assigned to Milestone
Bazaar
Fix Released
High
Andrew Bennetts

Bug Description

On a machine with an Ubuntu 6.06 or 6.10 machine (i.e. 2.6.15 or
2.6.17 kernel):

| $ ./bzr --no-plugins selftest 'clone_branch_parent.*Remote' -v
| No handlers could be found for logger "bzr"
| testing: /home/james/haha/home/+trunk/bzr
| /home/james/haha/home/+trunk/bzrlib (1.6dev python2.4.4rc1)
|
| running 1 tests...
| ...implementations.test_branch.TestBranch.test_clone_branch_parent(RemoteBranchFormat) OK 18005ms
|
| ----------------------------------------------------------------------
| Ran 1 test in 18.006s
|
| OK
| [ 2669] 2008-05-23 11:02:45.387 INFO: tests passed
| tests passed
| [ 2669] 2008-05-23 11:02:45.387 WARNING: bzrlib.tests.branch_implementations.test_branch.TestBranch.test_clone_branch_parent(RemoteBranchFormat) is leaking threads among 1 leaking tests
| bzrlib.tests.branch_implementations.test_branch.TestBranch.test_clone_branch_parent(RemoteBranchFormat) is leaking threads among 1 leaking tests
| $

On the same machine, running Ubuntu 6.06 but with an Ubuntu 8.04
(2.6.24) kernel:

| $ ./bzr --no-plugins selftest 'clone_branch_parent.*Remote' -v
| No handlers could be found for logger "bzr"
| testing: /home/james/haha/home/+trunk/bzr
| /home/james/haha/home/+trunk/bzrlib (1.6dev python2.4.3)
|
| running 1 tests...
| ...implementations.test_branch.TestBranch.test_clone_branch_parent(RemoteBranchFormat) OK 439ms
|
| ----------------------------------------------------------------------
| Ran 1 test in 0.440s
|
| OK
| [22234] 2008-05-23 11:03:31.060 INFO: tests passed
| tests passed
| [22234] 2008-05-23 11:03:31.061 WARNING: bzrlib.tests.branch_implementations.test_branch.TestBranch.test_clone_branch_parent(RemoteBranchFormat) is leaking threads among 1 leaking tests
| bzrlib.tests.branch_implementations.test_branch.TestBranch.test_clone_branch_parent(RemoteBranchFormat) is leaking threads among 1 leaking tests
| $

Also, if I drop back 100 revisions, the time for the runs becomes, on
a 2.6.17 kernel:

| $ bzr revert -r -100
 [...]
| $ ./bzr --no-plugins selftest 'clone_branch_parent.*Remote' -v
| No handlers could be found for logger "bzr"
| testing: /home/james/revert/trunk/bzr
| /home/james/revert/trunk/bzrlib (1.4dev python2.4.4rc1)
|
| running 1 tests...
| ...implementations.test_branch.TestBranch.test_clone_branch_parent(RemoteBranchFormat) OK 5213ms
|
| ----------------------------------------------------------------------
| Ran 1 test in 5.214s
|
| OK
| [ 3195] 2008-05-23 11:48:49.714 INFO: tests passed
| tests passed
| $

And on a 2.6.24 kernel:

| $ ./bzr --no-plugins selftest 'clone_branch_parent.*Remote' -v
| No handlers could be found for logger "bzr"
| testing: /home/james/haha-revert/home/+trunk/bzr
| /home/james/haha-revert/home/+trunk/bzrlib (1.4dev python2.4.3)
|
| running 1 tests...
| ...implementations.test_branch.TestBranch.test_clone_branch_parent(RemoteBranchFormat) OK 314ms
|
| ----------------------------------------------------------------------
| Ran 1 test in 0.316s
|
| OK
| [22972] 2008-05-23 11:50:34.750 INFO: tests passed
| tests passed
| $

So this appears to be a regression in bzr.

Revision history for this message
John A Meinel (jameinel) wrote :

Any chance you could run the comparison with '--lsprof' to see a breakdown of where the time is spent?

I wonder if it is actually our changes which track leaking threads.

Certainly there is somewhat of a kernel issue here, since even the old one is 5000ms => 314ms. Even if there is a regression of 5000 => 18000ms from bzr.

Revision history for this message
Andrew Bennetts (spiv) wrote :

I'm guessing the increase from 5000ms to 18000ms is because the protocol v3 code currently does many small writes when one larger write would do. I'm guessing that when https://code.launchpad.net/~spiv/bzr/hpss-buffering is merged it'll drop back to 5000ms on 2.6.17.

(I would have landed that branch already, but https://bugs.launchpad.net/bzr/+bug/225020 seems to be preventing it.)

Revision history for this message
James Troup (elmo) wrote : Re: [Bug 234287] Re: smart server speed regression on older (e.g. Ubuntu 6.06) kernels

John A Meinel <email address hidden> writes:

> Any chance you could run the comparison with '--lsprof' to see a
> breakdown of where the time is spent?

This is reproducible on tungsten, so would you mind checking this out
yourself? Should you need it, the specific branch I was checking is
available in ~james/haha/.

> Certainly there is somewhat of a kernel issue here, since even the
> old one is 5000ms => 314ms. Even if there is a regression of 5000 =>
> 18000ms from bzr.

Err, all things being equal, I think it's far more likely that bzr is
doing something wrong rather than the kernel. (I mean, our entire DC
ran 6.06/6.10 kernels for years and bzr is the only program I've seen
with this kind of networking problem.)

--
James

Revision history for this message
John A Meinel (jameinel) wrote :

Andrew's patch landed as bzr.dev 3452:
  Speed up HPSS v3 by buffering writes to the medium. (Andrew Bennetts)

This changes the time for the test from 18,500 ms down to 396 ms.

Changed in bzr:
assignee: nobody → spiv
importance: Undecided → High
milestone: none → 1.6
status: New → Fix Released
Revision history for this message
John A Meinel (jameinel) wrote :

Just as another datapoint, Andrew's V3 changes didn't cause all of the problem. If you revert to just before v3 landed, the test takes 9,000ms. Just after his patch it jumps 18,000 ms.
So the newest V3 patch is all-around better than the old V2 code.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.