small changes to dirstate are too slow

Bug #380202 reported by Ian Clatworthy on 2009-05-25
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Bazaar
Medium
John A Meinel

Bug Description

Looking at the profiling data for 'bzr st' after changing a single file on OOo, 47% of the time is spent just saving the dirstate. For 'bzr st file', saving takes 75% of the time. Note that this is the primary reason why 'hg status' is faster than 'bzr status' on OOo (for hg 1.2.1 vs bzr 1.15).

It appears that we deserialise and reserialise every dirstate line, whether it changed or not, as the "is modified" marker is dirstate-wide. Perhaps we should remember what lines changed and only deserialise/serialise those? Or something like that.

Related branches

Changed in bzr:
assignee: nobody → Ian Clatworthy (ian-clatworthy)

2009/5/25 Ian Clatworthy <email address hidden>:
> It appears that we deserialise and reserialise every dirstate line,
> whether it changed or not, as the "is modified" marker is dirstate-wide.
> Perhaps we should remember what lines changed and only
> deserialise/serialise those? Or something like that.

There are a few things we could do.

If there are no changes, we shouldn't even think about writing the
file. Do we get this right at the moment?

If there's very few changes, even just opening the file for writing
may be more work than is worthwhile. Updating it in place should be
possible, and it might be faster to seek to the write place and just
write that one line or section.

The other thing to consider is that the hash cache function is
generally only useful to tell if the contents of the file are
different to the basis tree. There are other cases like a file that's
changed from the parent but the same as some other tree the user's
diffing against, but I think they're uncommon. I think, further, it's
reasonable to assume that files will rarely be made the same as in the
parent except by bzr operations like commit, merge, and revert. So if
you accept this, we should probably never write the tree from bzr st,
but only when we've just updated the tree eg from building it, revert,
or commit.

There are three possible snags here: first, that reasoning may be
wrong. Second, because of the granularity of file timestamps, we may
not safely be able to record the timestamp when we first build the
tree. Third, I'm not sure that we always do update the hash cache
when we update the tree, and updating it on read operations will give
us some safety net against that. (But it's kind of hiding the real
problem, maybe we should rip off the bandaid.)

Doing this would have the advantage that we'd no longer be doing
physical writes from logical read operations which would be nice.

[Also, this bug may be a dupe.]

--
Martin <http://launchpad.net/~mbp/>

John A Meinel (jameinel) wrote :
Download full text (3.9 KiB)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Martin Pool wrote:
> 2009/5/25 Ian Clatworthy <email address hidden>:
>> It appears that we deserialise and reserialise every dirstate line,
>> whether it changed or not, as the "is modified" marker is dirstate-wide.
>> Perhaps we should remember what lines changed and only
>> deserialise/serialise those? Or something like that.
>
> There are a few things we could do.
>
> If there are no changes, we shouldn't even think about writing the
> file. Do we get this right at the moment?

I'm pretty sure we do. We keep an "IN_MEMORY_MODIFIED" flag to let us
know if there are changes versus "IN_MEMORY_UNMODIFIED".

>
> If there's very few changes, even just opening the file for writing
> may be more work than is worthwhile. Updating it in place should be
> possible, and it might be faster to seek to the write place and just
> write that one line or section.

So the hash state code was meant to be possible to be updated-in-place.
I believe when we *don't* cache the sha, we save a fixed width
placeholder "xxxxxxx" comes to mind.

Note that we need a placeholder for both the sha value and the stat
value (in the current form). I'm not 100% sure we got this right, but I
know we tried.

There is also working (and tested) code for bisecting the disk file
without having to read everything. That code should be easy enough to
leverage to find the bytes that need to be updated, and write them in place.

Which was the whole point of grabbing an OS lock, so we could
mutate-in-place. That said, if we are trying to move away from OS locks,
we should be careful about whether we want to allow that or not.

>
> The other thing to consider is that the hash cache function is
> generally only useful to tell if the contents of the file are
> different to the basis tree. There are other cases like a file that's
> changed from the parent but the same as some other tree the user's
> diffing against, but I think they're uncommon. I think, further, it's
> reasonable to assume that files will rarely be made the same as in the
> parent except by bzr operations like commit, merge, and revert. So if
> you accept this, we should probably never write the tree from bzr st,
> but only when we've just updated the tree eg from building it, revert,
> or commit.
>
> There are three possible snags here: first, that reasoning may be
> wrong. Second, because of the granularity of file timestamps, we may
> not safely be able to record the timestamp when we first build the
> tree. Third, I'm not sure that we always do update the hash cache
> when we update the tree, and updating it on read operations will give
> us some safety net against that. (But it's kind of hiding the real
> problem, maybe we should rip off the bandaid.)
>
> Doing this would have the advantage that we'd no longer be doing
> physical writes from logical read operations which would be nice.
>
> [Also, this bug may be a dupe.]
>

I think you have fleshed out the details fairly well. I think the
timestamp granularity is still a small issue. Given that you could have
a large number of your files modified by 'bzr merge' or 'bzr revert' and
that could take pl...

Read more...

Ian Clatworthy (ian-clatworthy) wrote :

See the linked branch for an improvement to status.

Changed in bzr:
status: Triaged → Fix Committed
Robert Collins (lifeless) wrote :

We could conceptually change the dirstate format to permit journalled changes too.

tags: added: dirstate2

I've found that running "bzr st" the first time on a branch takes forever (especially on windows machines.)
Whatever its doing could probably be done as part of the branch process.

Also some progress would improve the situation. The waiting period can sometimes take a few minutes, without progress it doesn't look good.

On Wed, 2009-09-30 at 08:54 +0000, Craig Hewetson wrote:
> I've found that running "bzr st" the first time on a branch takes forever (especially on windows machines.)
> Whatever its doing could probably be done as part of the branch process.

Please file a new bug to discuss this; its almost certainly /not/ the
same problem.

In general, we prefer new bug reports we can 'dupe' rather than
unrelated discussions on existing bugs - if in doubt, please file a new
bug.

Thanks,
Rob

Martin Pool (mbp) wrote :

The branch proposed for this was rejected.

Changed in bzr:
status: Fix Committed → Confirmed
status: Confirmed → In Progress
Ian Clatworthy (ian-clatworthy) wrote :

I'm not working on this and are unlikely to get back to this anytime soon.

Changed in bzr:
assignee: Ian Clatworthy (ian-clatworthy) → nobody
John A Meinel (jameinel) wrote :

This was an old branch from Ian which we all agreed on conceptually but needed a fair amount of polish to finish off. We intended that it would just go back to WorkInProgress, but obviously that stalled. I'm going to try to take over from here, and make sure we land it. It goes along well with the other little bits that I've been poking at.

Changed in bzr:
assignee: nobody → John A Meinel (jameinel)
John A Meinel (jameinel) on 2011-05-10
Changed in bzr:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers