Comment 8 for bug 1926139

Mauricio Faria de Oliveira (mfo) wrote :

Hi Matthew,

Thanks for the excellent analysis and considerate fix proposal, as always!

I looked at this for the last couple of days, for potential sponsorship.

I have carefully gone through the SRU template and Other Info section,
and considered the proposal to switch bind9-libs to --disable-threads,
with the goal of not only addressing this issue, but also preventing others:

> So, we have two options for a fix for Focal and Jammy:
>
> 1) We disable threading for dhclient.
> 2) We add in a mutex to resolve this particular concurrency issue.
> [...]
> I think if we fix the problem, another issue will crop up in six months
> time, and it will be another concurrency issue.

...

I'm aware you realize such a change is concerning :) and thus explained it well.

Changing this in Focal (around for almost 3 years) brings a level of
regression risk I have the _impression_ the SRU team would not be okay with.

And even though I agree with your analysis, proposal and risk assessment,
I'm a bit concerned too, especially as this touches DHCP / IP addressing.

(I'm also very aware this is ultimately their call, not mine at all. :)

...

However, considering how much work and time have likely gone into this
(and internal status) I can't just say 'no' without trying to help out.

I'd like to offer a different opinion.

The reason it's concerning is the very same reason option 2) is reasonable:

This concurrency issue (and the potential for other concurrency issues)
has been around in Focal since 2020/04 (~3 years), and until now,
its impact does not seem to be statistically significant:

> This happens about once every 100 reboots on bare metal, or [...]
> affecting between ~0.3% to 2% of deployments on Microsoft Azure.

So, if there's a way to fix this particular concurrency issue with
less regression risk, that might be worth it, as it would build on
top of dhclient's life on Focal, instead of starting it over again.

...

So, while reviewing the source code for your analysis, I had ideas.

First, a synthetic reproducer with GDB that works every time.

Second, a patch that addresses the issue with the test above.
(It's not in final form; I'd like to add a way to turn it off.)

...

Could you please review and verify both, and share your
thoughts on possibly going with that proposal instead?

Of course, if you disagree with the argument or approach,
or if it turns out not to work in your tests, that's OK!

We would then defer this to the Foundations team and SRU team.

- Test steps in the next comment.
- Test packages in ppa:mfo/lp1926139 [1].
- Debdiff attached for reference (code has details).

(Right now only Focal patches/packages are available.
I can go look at Jammy depending on your feedback.)

Hope this helps, after all.
Thanks again!

[1] https://launchpad.net/~mfo/+archive/ubuntu/lp1926139