Not grouping dependent packages in Phased Updates changes the probability distribution for the entire group of dependent packages.

Bug #1214482 reported by Marek Wrzosek
This bug affects 1 person
Affects: update-manager (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

I assume that you chose a uniform distribution for the random number generator. If a group of dependent packages is installed only when the random values drawn for each package are all smaller than the percentage currently published on the server, the probability distribution for the whole group of packages is no longer uniform: the group upgrades only once the largest of its values is below the threshold.

Just run this:

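% Baseline: one package, one uniform draw per user (100000 users).
figure(1)
hist(rand(1,100000))
% i dependent packages: the largest of the i draws decides for each user.
for i = 2:10
 figure(i)
 hist(max(rand(i,100000)))
endfor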

in Octave and look at the histograms for i>1 (i is the number of dependent packages).
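For readers without Octave, the same experiment can be sketched in Python using only the standard library. The function name, the bin count and the fixed seed are illustrative choices, not something taken from update-manager; the sketch assumes one independent uniform draw per package.

```python
import random

def max_draw_histogram(i, users=100_000, bins=10, seed=0):
    """Normalized histogram of the largest of i uniform draws per user."""
    rng = random.Random(seed)
    counts = [0] * bins
    for _ in range(users):
        # The group upgrades only once its largest draw is below the
        # server percentage, so the largest draw is what matters.
        largest = max(rng.random() for _ in range(i))
        counts[min(int(largest * bins), bins - 1)] += 1
    return [c / users for c in counts]

for i in (1, 2, 7):
    print(i, [round(h, 3) for h in max_draw_histogram(i)])
```

For i = 1 the bins should come out roughly equal; for larger i the mass moves toward the last bins, which is what the Octave histograms show.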

Marek Wrzosek (marek-wrzosek) wrote :

For normalized histograms:

figure(1)
hist(rand(1,100000),10,1)
for i = 2:10
 figure(i)
 hist(max(rand(i,100000)),10,1)
endfor

If we're upgrading the kernel, then the packages are:
linux-generic
linux-headers-generic
linux-image-generic
linux-libc-dev
linux-headers-version
linux-headers-version-generic
linux-image-version-generic

so for this group i=7, and for a percentage below 40% there will be almost no bug reports ;)
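The "almost no bug reports" remark can be checked with a closed form: if the i draws are independent and uniform on [0, 1), the share of users whose draws are all below a percentage p is p^i. The helper names below are made up for this sketch.

```python
def group_share(percentage, i):
    """Share of users whose i independent uniform draws are all below percentage."""
    return percentage ** i

def percentage_for_share(share, i):
    """Server percentage needed so that `share` of such users get the update."""
    return share ** (1.0 / i)

for pct in (0.2, 0.4, 0.6, 0.8, 1.0):
    print(f"server at {pct:.0%}: {group_share(pct, 7):.4%} of 7-package users upgrade")

print(f"half of them upgrade at {percentage_for_share(0.5, 7):.1%}")
```

With i = 7 and the server at 40%, the share is 0.4^7, about 0.16% of users.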

Marek Wrzosek (marek-wrzosek) wrote :

The real probability distribution will be a weighted sum of separate distributions and will depend on the fractions of users who installed i of the n packages. Below is the formula, pasted from LibO Math:

P(x)= sum from {i=1} to {n} a_i P_i(x) newline
0<= a_i <= 1 newline
a_i - "a fraction of users who installed" i "packages from a group of "n" dependent packages" newline
"to total amount of users who installed any number of packages from the same group" newline
P_i(x) - "Probability distribution for" i "dependent packages"

The real probability distribution will be hard to predict, so leaving this "as is" goes against the KISS principle.
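A small sketch of the weighted sum above, under two assumptions that are not stated in the report: the draws are uniform and independent, and P_i(x) is read as the cumulative share x^i. The fractions a_i used here are invented purely for illustration.

```python
def mixture_share(percentage, fractions):
    """Overall share of upgraded users; fractions[k] is a_(k+1)."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("the fractions a_i must sum to 1")
    # P(x) = sum over i of a_i * x**i, with P_i(x) = x**i for uniform draws.
    return sum(a * percentage ** (k + 1) for k, a in enumerate(fractions))

# Hypothetical split: 60% of users have one package of the group,
# 30% have two and 10% have three.
a = [0.6, 0.3, 0.1]
for pct in (0.1, 0.5, 0.9):
    print(f"server at {pct:.0%}: {mixture_share(pct, a):.2%} of users upgrade")
```

The result depends on the a_i, which the server does not know in advance; that is the sense in which the real distribution is hard to predict.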

Michael Vogt (mvo) wrote :

Thanks for your bug report and sorry for the late reply.

Your analysis is correct: the probability changes for dependent packages.

However, because of the way we seed the random number generator, it's less of an issue in a lot of the practical cases. We use (source package name, update version, client-machine-id) as the input for the RNG. This means that in the linux example all packages have the same probability because they come from the same source package.
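A minimal sketch of that seeding idea, not the actual update-manager code. The thread names the three inputs but not how they are combined, so the seed string format, the package names and the machine id below are all hypothetical.

```python
import random

def phased_draw(source_package, version, machine_id):
    """Uniform draw in [0, 1) that depends only on the three seed values."""
    # Hypothetical seed format: the thread lists the inputs, not the encoding.
    rng = random.Random(f"{source_package}|{version}|{machine_id}")
    return rng.random()

MACHINE_ID = "0123456789abcdef0123456789abcdef"  # made-up client machine id

# Hypothetical binary packages mapped to the source package they are built from.
binary_to_source = {
    "foo-bin": "foo",
    "foo-data": "foo",
    "bar-driver": "bar",
}

for binary, source in binary_to_source.items():
    print(binary, phased_draw(source, "1.2-3", MACHINE_ID))
```

Binaries built from the same source get the same draw on a given machine, so they count as one random variable; only packages from different sources add independent draws.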

Marek Wrzosek (marek-wrzosek) wrote :

Yeah, I forgot about this report. If the (pseudo)random number generator is seeded that way, then in most cases this will be fine; my analysis applies only in the bad-luck cases where the source packages are different but the debs depend on each other. The good news is that, even if this situation occurs, the upgrade part of the process is fine; only the data analysis part must account for the different probability distribution (when mapping the number of bug reports to the percentage of users).
PS. Sorry for my English, I haven't used it for a while... in the last months.

Marek Wrzosek (marek-wrzosek) wrote :

Extending the Linux packages example: what about some proprietary device driver, call it X? This driver is not built from the Linux source package, but every update of Linux requires rebuilding X's deb packages. The sources are independent, but X's deb packages depend on Linux's deb packages. If you change the RNG's probability distribution from uniform to normal, then the distribution for the group keeps a roughly normal shape (it is no longer exactly normal), but its parameters (e.g. the mean) change.
You can compare these two m-files:
uniform.m:

figure(1)
hist(rand(1,100000),100,1)
for i = 2:10
 figure(i)
 hist(max(rand(i,100000)),100,1)
endfor

normal.m:

figure(1)
hist(randn(1,100000),100,1)
for i = 2:10
 figure(i)
 hist(max(randn(i,100000)),100,1)
endfor

If you combine the normal distribution with seeding the RNG with (source package name, update version, client-machine-id), the number of random variables per group stays low and the combined probability distribution will be closer to normal.
In this scenario the probability distribution for Linux's packages will be normal (because it is independent) and the one for X's packages will look like the i = 2 case from the 'normal.m' file. You can calculate the parameters of the probability distributions for dependent groups of deb packages (coming from the same source packages) using the lists of installed packages from users who report bugs.
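To see how the parameters shift in the normal case, the following sketch estimates the mean and standard deviation of the largest of i standard normal draws. It assumes independent draws with mean 0 and standard deviation 1; the function name, sample size and seed are illustrative.

```python
import random
import statistics

def max_of_normals(i, users=50_000, seed=0):
    """Sample the largest of i independent standard normal draws per user."""
    rng = random.Random(seed)
    return [max(rng.gauss(0.0, 1.0) for _ in range(i)) for _ in range(users)]

for i in (1, 2, 3):
    sample = max_of_normals(i)
    mean = statistics.fmean(sample)
    spread = statistics.pstdev(sample)
    print(f"i={i}: mean {mean:.3f}, standard deviation {spread:.3f}")
```

For i = 2 the mean should land near 1/sqrt(pi), about 0.564, with a standard deviation below 1, so the driver X from the example would see a shifted and narrower curve than Linux alone.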

Marek Wrzosek (marek-wrzosek) wrote :

After some consideration, maybe a real-life scenario will be better. Consider Linux, Nvidia, ATI and some DVB-T dongle with no driver in the current kernel. There will be 6 groups of users:
a) only Linux, (i=1)
b) Linux and Nvidia, (i=2)
c) Linux and ATI, (i=2)
d) Linux and DVB-T dongle, (i=2)
e) Linux, Nvidia and DVB-T dongle, (i=3)
f) Linux, ATI and DVB-T dongle. (i=3)
These are all the kernel or kernel modules, so they could share the same threshold value. You can calculate the percentage of users for Linux-only using the normal distribution and use it directly for group a). For the other groups it will be easy to calculate the parameters of their distribution using the i value (the number of random variables), recalculate the original threshold for all packages in groups b) to f) accordingly, and then compare it with the maximum of the random numbers generated for Linux, graphics and DVB-T. That way it will be possible to determine the percentage of users who installed the newer versions of the packages, and this percentage can be equal for all groups (from a) to f)). The value of i can be estimated from the list of installed packages of every user reporting a bug.
For other software packages it should be similar to the Linux example.
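One way to read the threshold recalculation is sketched below. It assumes that every source package in a group contributes one independent standard normal draw and that the group upgrades when its largest draw is below the threshold. The target share of 25% is an arbitrary example, and the group letters follow the list above.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # assumed distribution of each draw

def group_threshold(target_share, i):
    """Threshold t such that P(max of i independent draws < t) == target_share."""
    # P(max < t) = cdf(t)**i, so cdf(t) must equal target_share**(1/i).
    return STANDARD_NORMAL.inv_cdf(target_share ** (1.0 / i))

# Number of independent random variables per user group, a) to f).
groups = {"a": 1, "b": 2, "c": 2, "d": 2, "e": 3, "f": 3}

for name, i in groups.items():
    t = group_threshold(0.25, i)
    share = STANDARD_NORMAL.cdf(t) ** i
    print(f"group {name}) i={i}: threshold {t:.3f}, share {share:.3f}")
```

Groups with more independent draws get a higher threshold, so the same share of users upgrades in every group.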

I may be wrong, so correct me if you find any mistake.
