"package lists" should be downloaded as diffs (Contents-i386.gz)

Bug #144083 reported by Bogdan Butnaru
6
Affects Status Importance Assigned to Milestone
apt (Ubuntu)
Fix Released
Wishlist
Unassigned
reprepro (Ubuntu)
Fix Released
Wishlist
Unassigned

Bug Description

Binary package hint: apt

Hello! This is a feature request for Ubuntu Gutsy, though it can apply to every dist that uses apt-get. I marked it as affecting apt-get, but I think it would affect lots of other things if implemented.

AFAIK, every "apt-get update" checks each repository in turn and downloads the package list from each one of those. For instance, http://archive.ubuntu.com/ubuntu/dists/gutsy/main/binary-i386/Packages.bz2 is one that is used on my system. If the last file downloaded is the unchanged on the server, this step is skipped, but otherwise the file is re-downloaded completely.

This is inefficient for most repositories. The uncompressed Packages file's contents is just a lot of text (6.3 MB last time I looked), and only a small part of that changes for the vast majority of "apt-get update" operations. It would be a very good candidate for diff-based compression. Even if in its compressed form it takes only around a MB, which is relatively small, it's still very inefficient to download it each time.

I suggest an optional extension to the apt protocol should be added to allow incremental updates: the server should keep in the same folder with the Packages.* files a set of diffs calculated from the last few versions of the Packages file. I suggest they be named "Packages-[hash]" (with the gz/bz2 compressions), where "[hash]" is the hash of the contents of the base file against which the diff is calculated. (This would eliminate any version-number complexity.) For instance, if the last three Packages files contained "1", "2" and "3", and the current file contains something else, the contents of the directory would be:

Packages.gz
Packages-c4ca4238a0b923820dcc509a6f75849b.diff.gz
Packages-c81e728d9d4c2f636f067f89cc14862c.diff.gz
Packages-eccbc87e4b5ce2fe28308fd9f2a7baf3.diff.gz

(I used MD5 here, we could use SHA256 or a better hash.) If the last package list my apt downloaded was the one containing "2", apt would hash it (it can do that when it's downloaded, anyway, for checking), see that its contains hash to "c4ca4238a0b923820dcc509a6f75849b", and try to download the corresponding file, if it finds it. The file would contain the diff between the "2" package list and the latest version on the server.

Because the diffs would be very small (and compressed), the server could remember lots of them (for a couple of days or so) without a lot of storage overhead.

The problem is updating the diffs each time the Packages file is changed. One way would be to simply remember the old versions of the Packages file, and re-calculate the diffs completely. That would be wasteful, however, both in space and in the amount of space it would take.
It could also keep only diffs between successive versions, but that's more complicated (apt would have to download the series, and apply them in turns).

There are also diff formats/algorithms that can be combined. (I think that's what it's called.) This means that for versions 1, 2 and three of a file, if you know diff1->2 and diff2->3 you can calculate diff1->3 directly, without looking at the complete files. If we used one of those algorithms, for each update the server would need to (1) calculate the diff between the last two versions and (2) combine that diff with the other diffs it knows. (And probably delete the oldest.)

A nice thing is that it doesn't have to do this at once. It can move everything to an "old" directory, put the new Package file in its place, and then start adding the updated diffs. If anyone tries to do an update in the mean-time, apt will look for the diff, won't find it, and it will fall back to the downloading the complete package file. (Alternatively, the server could put the diff calculation on hold, compute on-the-fly the diff needed---each pair of diffs is easy to combine, it's just that there's lots of them---send it, and resume with the others.) This also means that the system will be perfectly backwards-compatible, because old apt versions (or other programs) can just pick up the complete file.

Since many users have the Update Manager check for updates daily (it's a built-in option), this would reduce part of the bandwidth used significantly. It would also make manual updates speedier.

Revision history for this message
TDB (michael-baranov) wrote :

+1: look at gentoo: they use rsync and plain text files

Revision history for this message
Bogdan Butnaru (bogdanb) wrote :

That would work, too. I didn't look at their system in detail, but if I understand how it works I don't think rsync is the best idea.

The problem is that _each_time_ a file is re-synchronized, both the server (with the new file) and the client (the one updating) must read the file completely, hash it several times (it's a simple hash, but still), exchange the hashes and only then they transmit the diff. In a sense, the server must determine the diff for _every_ client connecting, without even seeing the client's file. Although, I think it _is_ technically possible to pre-compute the hashes and store them on the server (ie, you only need them once for each version of the file), but I'm not sure if there's any rsynch implementation that does this.

The only advantages I see are that (1) this would work with any old version of the Package file, and (2) it gets rid of the large list of diffs.

With straight rsync, there are disadvantages: (a) we'd need a completely new parallel protocol (rsync doesn't work through HTTP), (b) the synchronization is processing-intensive both on the server and the client, for _every_ connection (the client isn't the problem, but we'd be trading bandwidth for processor time).

If we do find something rsync-like that can pre-compute the server-side hashes (so that the client download them, and then the file differences selectively using HTTP ranges), that would be nice too.

Michael Vogt (mvo)
Changed in apt:
importance: Undecided → Wishlist
Revision history for this message
Scott Ritchie (scottritchie) wrote :

I'm going to bring this up at UDS. I don't have any specific statistics, but I imagine that daily updates place a pretty substantial burden on the mirrors.

Another thing worth considering is moving from .gz to .lzma (or better) compression of the Packages file itself. A simple test shows that Intrepid can save another 10% over .bz2 by using Packages.lzma

Changed in reprepro:
importance: Undecided → Wishlist
Changed in apt:
status: New → Confirmed
Changed in reprepro:
status: New → Triaged
Changed in reprepro (Ubuntu):
status: Triaged → Fix Released
Revision history for this message
Julian Andres Klode (juliank) wrote :

pdiffs have been in APT for quite some time.

Changed in apt (Ubuntu):
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.