Comment 18 for bug 1926139

Mauricio Faria de Oliveira (mfo) wrote :

From the 'Where problems could occur' section.
Considerations on regressions and approaches.

[Where problems could occur]

isc-dhcp is a core package, and any change carries the risk that users will be unable to obtain DHCP leases with dhclient, leaving their systems without an IP address and unreachable. A regression could also cripple images that depend on dhclient. For example, Microsoft Azure calls dhclient from cloud-init instead of using systemd-networkd, so a regression could affect all Ubuntu users on Azure.

Additionally, the patched code is called whenever sockets are constructed, so isc-dhcp-server could also be affected.

We have mitigated the risk of regression as far as we can by documenting this Launchpad bug in as much detail as possible, so it is clear how the race operates and how the patch fixes it.

Mauricio has additionally added an environment variable and a kernel command line parameter that, when present, disable the fix. If a regression were to occur, users can add either of these to their deployments to work around it.
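As an illustration of how such an opt-out can work, here is a minimal sketch in Python. It is not the code from the patch (isc-dhcp and bind9-libs are written in C). The names EXAMPLE_DISABLE_FIX and example.disable_fix are hypothetical placeholders, since this comment does not give the real variable or parameter names, and checking the variable first is also an assumption.

```python
import os

# Hypothetical names: the real variable and parameter are not given here.
ENV_VAR = "EXAMPLE_DISABLE_FIX"
CMDLINE_PARAM = "example.disable_fix"


def cmdline_has_param(cmdline: str, param: str) -> bool:
    """Return True if param is a whole token, either bare or as param=value."""
    for token in cmdline.split():
        if token == param or token.startswith(param + "="):
            return True
    return False


def fix_disabled(environ=None, cmdline_path="/proc/cmdline") -> bool:
    """Return True if either opt-out switch is present."""
    environ = os.environ if environ is None else environ
    # Presence alone is enough; the value is ignored.
    if ENV_VAR in environ:
        return True
    try:
        with open(cmdline_path) as f:
            cmdline = f.read()
    except OSError:
        # No readable kernel command line: treat the parameter as absent.
        return False
    return cmdline_has_param(cmdline, CMDLINE_PARAM)
```

Matching whole tokens rather than substrings keeps an unrelated parameter with a similar name from disabling the fix by accident.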

Mauricio and Matthew have decided that the individual fix is the best route for minimising regression risk, as the alternative would be to disable threading in bind9-libs.

Disabling threading in bind9-libs is a complete solution, and it removes the risk of future regressions from thread concurrency issues that are currently undetected. However, it removes some publicly exported symbols from bind9-libs, adds others, and changes the entire library from multithreaded to single-threaded. Any users who use bind9-libs outside of isc-dhcp would see their applications either fail to work due to missing symbols or behave differently in terms of performance.

Disabling threading in bind9-libs is shelved for now, and can be revisited in the future if necessary.

Back to the individual fix: Chris Patterson has been testing it at scale on Azure and, across 13k instances, has not had a failure. With the gdb reproducer, we are confident that adding the mutex will not prevent other parts of the software from functioning correctly.