On Sat, 2009-08-22 at 14:48 +0000, Fabien Tassin wrote:
> Here are some thoughts, hoping they will help make this bug move
> forward..
>
> I assume that the raw data is available somewhere. No one explained how the PPA
> files are spread to the world, but as the user URL is unique, it seems reasonable
> to assume that the data is in the form of httpd or web proxy logs somewhere.
> Then it's a matter of post-processing that. I also assume we can ignore all the direct
> downloads from the LP pages (librarian); focusing on what's available through APT
> should be enough.

I believe that all requests are currently served via Apache from one
server. The data should all be in Apache logs on that server.
Conveniently enough, most of the backend work is already done: the same
log parsing technique is used to count project downloads.

> The next problem is to interpret those data.
> The OP asked for some precise figures:
>
> 1/ Downloads stats for each package in the archive
>
> what do we want to know?
>
> Ideally, number of users for each version over time: if my assumption about the
> logs is correct, they only show downloads, with no way to distinguish between
> upgrades and new installs, so counting just the number of downloads will not
> give an accurate representation of the number of installations.

Correct. All we can see is the HTTP request; it could be an
installation, upgrade, reinstallation, or just somebody stuffing up the
stats!

> The information
> has to come from the user's machine, identified by a unique ID (like with
> popcon) - not the IP address - maybe transported in a (fake) http referrer. It
> will still not catch removals though..

And it would be a privacy violation. Making each installation send a
unique tracking number in apt requests is dodgy and would not be
accepted by the community.

> Number of downloads over time: this seems possible, but tricky to represent as
> there's an unknown (and increasing) number of versions.
> http://popcon.debian.org/stat/release.png is a good example as to why it is
> tricky. For fast moving PPAs, such as dailies, or trunk/tip builds, it's even
> worse.

I'm not sure that it's possible without privacy issues to reliably track
the number of users of a daily PPA, particularly since update-manager
now only pops up once a week for that sort of update. The raw download
numbers will certainly allow comparisons with other PPAs, though.

> 2/ Distribution release used
>
> this should be easy. I also find this info very valuable, as there's no point
> spending time maintaining backports for a distro used by no one.
> It should probably not be based on the indexes stats, as it's possible to have
> multiple versions of the same repository, esp. when a new Ubuntu is released,
> PPA maintainers often take time to start producing debs for the new version
> (debs are not copied like in the real archives).
> It should come from the download stats, aggregated by package numbers.

Fairly easily done, and rather important. But there are complications --
a particular binary package may live in multiple distroseries, and the
apt User-Agent doesn't include the distroseries, just the apt version.

> 3/ Number of users subscribed to the archive over time
>
> I don't think we'll ever get stats per user, it's always per machine (not to
> mention proxies/caches).

Right, that's not possible to do reliably.

> 4/ Number of download requests over time
>
> hm, this is 1/, sort of..

Sort of, but graphs are nice. Graphs by (bpn,), (archive, bpn) and maybe
(archive, bpn, version) would be interesting.

> 5/ Amount of data transferred over time
>
> this one should be trivial.

Indeed, trivial.

> In the meantime, what about giving the PPA owners access to their raw logs,
> properly anonymized, for ex by md5-ing IP addresses? The privacy risk will be
> the same as with popcon (i.e.
> if there's just 1 user for a given package, it's
> safe to assume it's the PPA maintainer, making him a target), but given a md5,
> finding the IP to exploit is, well, you know..
> This could allow users to experiment, and maybe find good ideas, create mockups..

Actually, yesterday I did a bit of a refactor of the existing Librarian
log parser, and implemented a basic one for PPA logs. Now the download
counts can be stored for each (archive, bpr, day, country), much like
the project release file counts. But somebody still needs to figure out
the UI.

--
William Grant
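
For illustration only, here is a minimal sketch of the kind of per-day counting described above. It is not the actual Launchpad parser. It assumes Apache "combined" format logs and a hypothetical URL layout of /<owner>/<ppa>/ubuntu/pool/.../<file>.deb, and it keys on the .deb filename rather than a real bpr record. The country dimension is left out because it would need a GeoIP lookup. The names LOG_RE, DEB_RE, count_downloads and SAMPLE are made up for this example, and the sample log lines are invented.

```python
import re
from collections import Counter
from datetime import datetime

# Apache "combined" log format: host ident user [time] "request" status bytes ...
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

# ASSUMPTION: hypothetical path layout /<owner>/<ppa>/ubuntu/pool/.../<file>.deb
DEB_RE = re.compile(
    r'^/(?P<owner>[^/]+)/(?P<ppa>[^/]+)/ubuntu/pool/.*/(?P<deb>[^/]+\.deb)$'
)


def count_downloads(lines):
    """Count successful .deb GET requests per (archive, deb filename, day)."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m is None or m.group("method") != "GET" or m.group("status") != "200":
            continue  # unparseable line, non-GET request, or failed request
        d = DEB_RE.match(m.group("path"))
        if d is None:
            continue  # index files, sources, anything that is not a .deb
        day = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z").date()
        archive = d.group("owner") + "/" + d.group("ppa")
        counts[(archive, d.group("deb"), day)] += 1
    return counts


# Invented sample data (documentation-range IPs, made-up owner and package).
SAMPLE = [
    '192.0.2.1 - - [22/Aug/2009:14:48:00 +0000] "GET /someowner/daily/ubuntu/pool/main/f/foo/foo_1.0-1_i386.deb HTTP/1.1" 200 12345 "-" "Ubuntu APT-HTTP/1.3"',
    '192.0.2.2 - - [22/Aug/2009:15:02:10 +0000] "GET /someowner/daily/ubuntu/pool/main/f/foo/foo_1.0-1_i386.deb HTTP/1.1" 200 12345 "-" "Ubuntu APT-HTTP/1.3"',
    '192.0.2.3 - - [22/Aug/2009:15:05:00 +0000] "GET /someowner/daily/ubuntu/pool/main/f/foo/foo_0.9-1_i386.deb HTTP/1.1" 404 300 "-" "Ubuntu APT-HTTP/1.3"',
    '192.0.2.1 - - [22/Aug/2009:14:47:58 +0000] "GET /someowner/daily/ubuntu/dists/karmic/main/binary-i386/Packages.gz HTTP/1.1" 200 2048 "-" "Ubuntu APT-HTTP/1.3"',
]

if __name__ == "__main__":
    for key, hits in sorted(count_downloads(SAMPLE).items()):
        print(key, hits)
```

As noted earlier in the thread, each counted hit is just an HTTP request, so these figures measure downloads, not installations or users.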