Bug #184237 “autopacking should be optional” : Bugs : Bazaar

Robert Collins (lifeless) wrote on 2008-01-19: Re: [Bug 184237] autopacking should be optional

#1

On Sat, 2008-01-19 at 01:13 +0000, Paul Pelzl wrote:
> Public bug reported:
>
> Autopacking is a sensible default, but is not suitable for all use
> cases.
>
> I maintain a pack-0.92 repository which stores revision data for a
> number of unrelated projects. One of the projects, which is primarily
> composed of a large number of binary files, has accumulated a revision
> history of about 2GB. Consequently, autopacking can result in moving
> large amounts of data around at unpredictable times. The situation is
> made worse by the fact that the repository is located on a file server
> with limited bandwidth; when the autopack operation kicks in, it can
> take 5 minutes to do a commit.

This happens at exponentially backed-off intervals based on the number of
revisions in a pack; taking pack size into account is something I considered
but couldn't get a satisfactory model for the various tradeoffs.

An important thing to note is that the 2GB of history will /not/ be
moved by many autopacks.

Specifically, your 2GB of data will be moved once from a single pack to
a 10-pack, then once from a 10-pack to a 100-pack, and once from a
100-pack to a 1000-pack. (In other words, any one piece of data is moved
about log10(commits since it was introduced) times.)

5 minutes seems like a long time to do the autopack of a smaller amount
of data; what protocol are you using? If you are using bzr+ssh the
autopack will occur on the server itself with no network bandwidth use.

~/.bzr.log will have details on which packs were combined by autopack.

> I would very much like to be able to disable autopacking, and run "bzr
> pack" as part of a regular maintenance script.

I have no objection to this. You could just run bzr pack as part of a
regular maintenance script: autopack will then do nothing, as the
repository is already more tightly packed than autopack aims for.

> As a side note, I noticed that the 2GB autopack operation can kick in
> even when touching one of the smaller projects. My intuition may be
> wrong here, but that doesn't really feel right to me... is there any
> benefit to be had from packing unrelated projects into the same glob?

Yes. You can't tell what's in a pack without reading the index, so if you
had (say) 16000 projects you would have to read up to 16000 indices to
find a given revision. Detecting 'unrelated' requires total-history
analysis, so it's much more IO expensive to keep projects separate, and
there is little benefit in having them separate as you have efficient
access within any given single index: fewer indices is better than more,
regardless of project count.

-Rob

--
GPG key available at: <http://www.robertcollins.net/keys.txt>.

Paul Pelzl (pelzlpj) wrote on 2008-01-21:

#2

On Sat, Jan 19, 2008 at 01:41:25AM -0000, Robert Collins wrote:
> An important thing to note is that the 2GB of history will /not/ be
> moved by many autopacks.
> 
> Specifically, your 2GB of data will be moved once from a single pack to
> a 10-pack, then once from a 10-pack to a 100-pack, and once from a
> 100-pack to a 1000-pack. (In other words any one piece of data is moved
> log10(commits since it was introduced).

Well, the troublesome project has a large number of small binary files,
and each commit is typically a modify or add of some smaller set of
binary files.  So the 2GB is spread out over 100-ish commits.

If moving a piece of data between two packs implies rewriting both
packs, then my interpretation of your statement above is that *every*
autopack operation will mean reading and writing ~2GB of data; assuming
all commits are non-empty, there will *always* be at least one piece of
data migrating from an N-pack to a 10N-pack.  Is my understanding
correct?

In any case, I believe I've personally seen a large autopack operation
take place three times since upgrading the repository to pack-0.92, and
I'm not the only developer using bzr.  I would estimate that there have
been about 50 commits since that upgrade.

> 5 minutes seems like a long time to do the autopack of a smaller amount
> of data; what protocol are you using? If you are using bzr+ssh the
> autopack will occur on the server itself with no network bandwidth use.

The repository is on a Windows SMB shared drive.  I do understand that
this is suboptimal, and I would like to set up a bzr server when time
permits, but I anticipate that configuring it properly will be a
bit of work since we use a true Windows network.

> ~/.bzr.log will have details on which packs were combined by autopack.

Thanks for pointing that out, I will have a look.

> > I would very much like to be able to disable autopacking, and run "bzr
> > pack" as part of a regular maintenance script.
> 
> I have no objection to this. You could just run bzr pack as part of a
> regular maintenance script: this will cause autopack to do nothing as
> the repository is already more tightly packed than autopack aims for so
> it will do nothing.

That had occurred to me.  However, given the rate of large autopack
operations to date, it seems like the maintenance interval would have to
be relatively short to avoid autopacking.  I believe I read that an
autopack occurs every 10 commits; if this is the case, then for our use
case a daily "bzr pack" would not always be sufficient to prevent
hitting a poor performance case.

> > As a side note, I noticed that the 2GB autopack operation can kick in
> > even when touching one of the smaller projects.  My intuition may be
> > wrong here, but that doesn't really feel right to me... is there any
> > benefit to be had from packing unrelated projects into the same glob?
> 
> Yes. You can't tell whats in a pack without reading the index, so if you
> had (say) 16000 projects you would have to read up to 16000 indices to
> find a given revision. Detecting 'unrelated' requires total-history
> analysis, so its much more IO expensive to keep projects separate, and
> there is little benefit in having them separate as you have efficient
> access within any given single index: less indices is better than more,
> regardless of project count.

I see.  Thanks for correcting my intuition.

Paul

Robert Collins (lifeless) wrote on 2008-01-21:

#3

On Mon, 2008-01-21 at 18:14 +0000, Paul Pelzl wrote:
> On Sat, Jan 19, 2008 at 01:41:25AM -0000, Robert Collins wrote:

> Well, the troublesome project has a large number of small binary files,
> and each commit is typically a modify or add of some smaller set of
> binary files.  So the 2GB is spread out over 100-ish commits.

or 20MB/commit on average.

> If moving a piece of data between two packs implies rewriting both
> packs, then my interpretation of your statement above is that *every*
> autopack operation will mean reading and writing ~2GB of data; assuming
> all commits are non-empty, there will *always* be at least one piece of
> data migrating from an N-pack to a 10N-pack.  Is my understanding
> correct?

No, because until there are 10 1-packs, no migration occurs. When you
have 10 1-packs, they are migrated into a single 10-pack. For the next 9
commits there is no migration, and then on the 10th those 1-packs migrate.
Then on the 100th commit you get migration of the:
9 10-packs
10 1-packs
to 1x100pack.
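
The schedule described above can be sketched as a small simulation. This is only an illustrative model, not bzrlib's actual code: it assumes one revision per commit, a repository that starts empty, and a target layout with one pack per unit of each decimal digit of the revision count. The names `target_pack_sizes` and `simulate_autopack` are made up for this sketch.

```python
def target_pack_sizes(total_revisions):
    """Assumed target layout: one pack per unit of each decimal digit.

    For example 117 revisions -> one 100-pack, one 10-pack, seven 1-packs.
    """
    sizes = []
    for exponent, digit in enumerate(reversed(str(total_revisions))):
        sizes.extend([10 ** exponent] * int(digit))
    return sorted(sizes, reverse=True)


def simulate_autopack(num_commits):
    """Replay num_commits single-revision commits on an empty repository.

    Returns (packs, events): packs holds the revision count of each pack,
    largest first; events holds one (commit, packs_combined,
    revisions_rewritten) tuple per autopack.
    """
    packs = []
    events = []
    for commit in range(1, num_commits + 1):
        packs.append(1)  # every commit writes a new 1-pack
        target = target_pack_sizes(commit)
        if len(packs) <= len(target):
            continue  # pack count is within the limit: nothing to do
        # Packs that already fill their target slot are left untouched.
        keep = 0
        while keep < len(target) and packs[keep] >= target[keep]:
            keep += 1
        combined = packs[keep:]
        packs = packs[:keep] + [sum(combined)]
        events.append((commit, len(combined), sum(combined)))
    return packs, events


if __name__ == "__main__":
    packs, events = simulate_autopack(100)
    for commit, count, revisions in events:
        print(f"commit {commit}: combined {count} packs ({revisions} revisions)")
    print("final packs:", packs)
```

Under these assumptions, 100 commits produce nine combinations of ten 1-packs, followed by one combination at commit 100 that folds the nine 10-packs and ten 1-packs into a single 100-pack.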

> In any case, I believe I've personally seen a large autopack operation
> take place three times since upgrading the repository to pack-0.92, and
> I'm not the only developer using bzr.  I would estimate that there have
> been about 50 commits since that upgrade.

If you have 100 commits, I would expect:
9 operations of 10x1->1x10;
1 operation of 9x10+10x1 ->1x100

If you started at 50 commits, then you would see an initial threshold of
5x10 as the goal for the autopack code; so it should still have the
initial 50-pack that the upgrade created in place.

> > 5 minutes seems like a long time to do the autopack of a smaller amount
> > of data; what protocol are you using? If you are using bzr+ssh the
> > autopack will occur on the server itself with no network bandwidth use.
> 
> The repository is on a Windows SMB shared drive.  I do understand that
> this is suboptimal, and I would like to set up a bzr server when time
> permits, but I anticipate that that configuring it properly will be a
> bit of work since we use a true Windows network.

Is it on a LAN?

> 
> > ~/.bzr.log will have details on which packs were combined by autopack.
> 
> Thanks for pointing that out, I will have a look.

Please do.

> > > I would very much like to be able to disable autopacking, and run "bzr
> > > pack" as part of a regular maintenance script.
> > 
> > I have no objection to this. You could just run bzr pack as part of a
> > regular maintenance script: this will cause autopack to do nothing as
> > the repository is already more tightly packed than autopack aims for so
> > it will do nothing.
> 
> That had occurred to me.  However, given the rate of large autopack
> operations to date, it seems like the maintenance interval would have to
> be relatively short to avoid autopacking.  I believe I read that an
> autopack occurs every 10 commits; if this is the case, then for our use
> case a daily "bzr pack" would not always be sufficient to prevent
> hitting a poor performance case.

Well, the point of autopacking's logic is to make poor performance cases
occur with exponential backoff. I maintain that we need to analyse why
you are seeing poor performance when autopack kicks in: autopacking 20MB
of commits on a LAN should be a few seconds' work or so.

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.

Paul Pelzl (pelzlpj) wrote on 2008-01-21:

#4

On Mon, Jan 21, 2008 at 06:36:33PM -0000, Robert Collins wrote:
> > If moving a piece of data between two packs implies rewriting both
> > packs, then my interpretation of your statement above is that *every*
> > autopack operation will mean reading and writing ~2GB of data; assuming
> > all commits are non-empty, there will *always* be at least one piece of
> > data migrating from an N-pack to a 10N-pack. Is my understanding
> > correct?
>
> No, because until there are 10 1-packs, no migration occurs. When you
> have 10 1-packs, you get migration to 1-10pack. And for the next 9
> commits no migration. And then on the 10th a migration of those 1-packs.
> Then on the 100th commit you get migration of the:
> 9 10-packs
> 10 1-packs
> to 1x100pack.

OK, I follow you now.

> > In any case, I believe I've personally seen a large autopack operation
> > take place three times since upgrading the repository to pack-0.92, and
> > I'm not the only developer using bzr. I would estimate that there have
> > been about 50 commits since that upgrade.
>
> If you have 100 commits, I would expect:
> 9 operations of 10x1->1x10;
> 1 operation of 9x10+10x1 ->1x100
>
> If you started at 50 commits, then you would see an initial threshold of
> 5x10 as the goal for the autopack code; so it should still have the
> inital 50-pack that the upgrade created in place.

Right now I see a smallish pack (~200MB) which is datestamped from the
time of the upgrade, a 2GB pack which is datestamped from Friday (when I
filed this bug report), and five very small packs which would correspond
to five commits I did today.

The logfile from Friday says "Auto-packing repository
<bzrlib.repofmt.pack_repo.RepositoryPackCollection object at
0x01512C70>, which has 17 pack files, containing 2000 revisions into 2
packs."
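
For anyone wanting to pull these entries out of a log automatically, here is a minimal sketch. It assumes the message format is exactly the one quoted above (other bzr versions may word it differently), and the helper name `find_autopacks` is made up for this example.

```python
import re
from pathlib import Path

# Pattern based solely on the log message quoted in this thread (assumption:
# other bzr versions may format the message differently).
AUTOPACK_RE = re.compile(
    r"Auto-packing repository .*?, which has (\d+) pack files, "
    r"containing (\d+) revisions into (\d+) packs"
)


def find_autopacks(log_text):
    """Return (pack_files, revisions, target_packs) for each autopack entry."""
    # Collapse whitespace so a message wrapped across lines still matches.
    flat = " ".join(log_text.split())
    return [tuple(int(n) for n in m.groups()) for m in AUTOPACK_RE.finditer(flat)]


if __name__ == "__main__":
    # ~/.bzr.log is the location mentioned earlier in the thread; it may
    # differ on other platforms.
    log_path = Path.home() / ".bzr.log"
    if log_path.exists():
        for before, revisions, after in find_autopacks(
            log_path.read_text(errors="replace")
        ):
            print(f"{before} packs / {revisions} revisions -> {after} packs")
```

Applied to the entry quoted above, this yields 17 pack files, 2000 revisions and 2 target packs.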

> > > 5 minutes seems like a long time to do the autopack of a smaller amount
> > > of data; what protocol are you using? If you are using bzr+ssh the
> > > autopack will occur on the server itself with no network bandwidth use.
> >
> > The repository is on a Windows SMB shared drive. I do understand that
> > this is suboptimal, and I would like to set up a bzr server when time
> > permits, but I anticipate that that configuring it properly will be a
> > bit of work since we use a true Windows network.
>
> Is it on a LAN?

Sorry, I should have been more clear. Yes, 100Mbit LAN.
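
As a rough sanity check on the figures in this thread, the sketch below estimates how long a repack would take over that link. It assumes an ideal 100 Mbit/s throughput with no SMB or protocol overhead, that the data crosses the network twice (once to read, once to write), and that 2GB means 2000 MB; real transfers will be slower. The function name is made up for this example.

```python
def transfer_seconds(megabytes, link_mbit_per_s=100, passes=2):
    """Seconds to move `megabytes` across the link `passes` times.

    Assumes ideal throughput; passes=2 models one read plus one write
    over the network share.
    """
    megabits = megabytes * 8 * passes
    return megabits / link_mbit_per_s


if __name__ == "__main__":
    for label, size_mb in (("20MB of commits", 20), ("2GB pack", 2000)):
        seconds = transfer_seconds(size_mb)
        print(f"{label}: about {seconds:.1f} s ({seconds / 60:.1f} min)")
```

With these assumptions, 20MB comes out at about 3 seconds and 2GB at a little over 5 minutes, which would be consistent with the slow commits having rewritten the whole 2GB rather than a few recent packs.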

> Well, the point of autopackings logic is to make poor performance cases
> occur with exponential backoff. I maintain that we need to analyse why
> you are seeing poor performance when autopack kicks in: autopacking 20MB
> of commits on a LAN should be a few seconds work or so.

I agree. Now that I better understand the expected behavior, I will try
to collect more relevant information.

Paul

Robert Collins (lifeless) wrote on 2008-01-21:

#5

On Mon, 2008-01-21 at 19:45 +0000, Paul Pelzl wrote:
>
> Right now I see a smallish pack (~200MB) which is datestamped from the
> time of the upgrade, a 2GB pack which is datestamped from Friday (when I
> filed this bug report), and five very small packs which would correspond
> to five commits I did today.
>
> The logfile from Friday says "Auto-packing repository
> <bzrlib.repofmt.pack_repo.RepositoryPackCollection object at
> 0x01512C70>, which has 17 pack files, containing 2000 revisions into 2
> packs."

I think this is the key bit: that 2G pack will not be touched now until
you hit 10000 revisions in that repository.
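
To illustrate where the 10000 figure comes from, here is a rough sketch under two assumptions that are not confirmed by the log: that the target layout follows the decimal digits of the revision count (so 2000 revisions means two 1000-revision packs), and that the 2GB pack already fills one of those 1000-revision slots. The function names are illustrative, not bzrlib API.

```python
def max_pack_count(total_revisions):
    """Assumed pack-count limit: the sum of the decimal digits."""
    return sum(int(digit) for digit in str(total_revisions))


def largest_target_size(total_revisions):
    """Largest pack size in the assumed target layout: a power of ten."""
    return 10 ** (len(str(total_revisions)) - 1)


def next_top_level_repack(total_revisions):
    """Revision count at which the largest target pack size next grows."""
    return 10 ** len(str(total_revisions))


if __name__ == "__main__":
    total = 2000
    print("target pack count:", max_pack_count(total))
    print("largest target pack:", largest_target_size(total), "revisions")
    print("largest packs rewritten at:", next_top_level_repack(total), "revisions")
```

In this model a pack holding 1000 or more revisions is left alone for every revision count from 2000 up to 9999, and only becomes a candidate again once the largest target size grows to 10000.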

-Rob :)
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.

Paul Pelzl (pelzlpj) wrote on 2008-01-21:

#6

On Mon, Jan 21, 2008 at 09:42:50PM -0000, Robert Collins wrote:
> > The logfile from Friday says "Auto-packing repository
> > <bzrlib.repofmt.pack_repo.RepositoryPackCollection object at
> > 0x01512C70>, which has 17 pack files, containing 2000 revisions into 2
> > packs."
>
> I think this is the key bit: that 2G pack will not be touched now until
> you hit 10000 revisions in that repository.

Yeah, I may just have been caught off-guard by hitting this corner-case
so soon after upgrade. Still, the 2GB repack has happened more than
once, and I'm having difficulty coming up with a satisfactory
explanation. I'm starting to wonder if there's another factor in play,
perhaps something triggered by our usage model.

I'll monitor performance pretty closely in the coming weeks, and will
report back if I have anything interesting to add.

Paul

Robert Collins (lifeless) wrote on 2008-01-22:

#7

On Mon, 2008-01-21 at 22:33 +0000, Paul Pelzl wrote:
> On Mon, Jan 21, 2008 at 09:42:50PM -0000, Robert Collins wrote:
> > > The logfile from Friday says "Auto-packing repository
> > > <bzrlib.repofmt.pack_repo.RepositoryPackCollection object at
> > > 0x01512C70>, which has 17 pack files, containing 2000 revisions into 2
> > > packs."
> >
> > I think this is the key bit: that 2G pack will not be touched now until
> > you hit 10000 revisions in that repository.
>
> Yeah, I may just have been caught off-guard by hitting this corner-case
> so soon after upgrade. Still, the 2GB repack has happened more than
> once, and I'm having difficulty coming up with a satisfactory
> explanation.

That's unusual; I have to guess at the presence of bugs (or perhaps some
corner case in the number of commits at the time of conversion).

> I'm starting to wonder if there's another factor in play,
> perhaps something triggered by our usage model.
>
> I'll monitor performance pretty closely in the coming weeks, and will
> report back if I have anything interesting to add.

Please do - bugs in this area could cause poor performance, and while I
think the design is still solid, we should track down what is happening
as accurately as possible.

You can use -Dpack (e.g. bzr -Dpack commit) to enable extra logging about
packing actions, which may help somewhat in figuring out the actions
being taken.

Cheers,
Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.

Paul Pelzl (pelzlpj) wrote on 2008-03-03:

#8

I have not noticed any unusual performance in recent weeks; autopacking appears to be working as designed. This bug could be closed as far as I am concerned.

Thanks for helping me to understand the expected behavior.

Paul

Martin Albisetti (beuno) on 2008-04-10

Changed in bzr:
status:	New → Invalid
