"package lists" should be downloaded as diffs (Contents-i386.gz)
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
apt (Ubuntu) |
Fix Released
|
Wishlist
|
Unassigned | ||
reprepro (Ubuntu) |
Fix Released
|
Wishlist
|
Unassigned |
Bug Description
Binary package hint: apt
Hello! This is a feature request for Ubuntu Gutsy, though it can apply to every dist that uses apt-get. I marked it as affecting apt-get, but I think it would affect lots of other things if implemented.
AFAIK, every "apt-get update" checks each repository in turn and downloads the package list from each one of those. For instance, http://
This is inefficient for most repositories. The uncompressed Packages file's contents is just a lot of text (6.3 MB last time I looked), and only a small part of that changes for the vast majority of "apt-get update" operations. It would be a very good candidate for diff-based compression. Even if in its compressed form it takes only around a MB, which is relatively small, it's still very inefficient to download it each time.
I suggest an optional extension to the apt protocol should be added to allow incremental updates: the server should keep in the same folder with the Packages.* files a set of diffs calculated from the last few versions of the Packages file. I suggest they be named "Packages-[hash]" (with the gz/bz2 compressions), where "[hash]" is the hash of the contents of the base file against which the diff is calculated. (This would eliminate any version-number complexity.) For instance, if the last three Packages files contained "1", "2" and "3", and the current file contains something else, the contents of the directory would be:
Packages.gz
Packages-
Packages-
Packages-
(I used MD5 here, we could use SHA256 or a better hash.) If the last package list my apt downloaded was the one containing "2", apt would hash it (it can do that when it's downloaded, anyway, for checking), see that its contains hash to "c4ca4238a0b923
Because the diffs would be very small (and compressed), the server could remember lots of them (for a couple of days or so) without a lot of storage overhead.
The problem is updating the diffs each time the Packages file is changed. One way would be to simply remember the old versions of the Packages file, and re-calculate the diffs completely. That would be wasteful, however, both in space and in the amount of space it would take.
It could also keep only diffs between successive versions, but that's more complicated (apt would have to download the series, and apply them in turns).
There are also diff formats/algorithms that can be combined. (I think that's what it's called.) This means that for versions 1, 2 and three of a file, if you know diff1->2 and diff2->3 you can calculate diff1->3 directly, without looking at the complete files. If we used one of those algorithms, for each update the server would need to (1) calculate the diff between the last two versions and (2) combine that diff with the other diffs it knows. (And probably delete the oldest.)
A nice thing is that it doesn't have to do this at once. It can move everything to an "old" directory, put the new Package file in its place, and then start adding the updated diffs. If anyone tries to do an update in the mean-time, apt will look for the diff, won't find it, and it will fall back to the downloading the complete package file. (Alternatively, the server could put the diff calculation on hold, compute on-the-fly the diff needed---each pair of diffs is easy to combine, it's just that there's lots of them---send it, and resume with the others.) This also means that the system will be perfectly backwards-
Since many users have the Update Manager check for updates daily (it's a built-in option), this would reduce part of the bandwidth used significantly. It would also make manual updates speedier.
Changed in apt: | |
importance: | Undecided → Wishlist |
Changed in reprepro (Ubuntu): | |
status: | Triaged → Fix Released |
+1: look at gentoo: they use rsync and plain text files