Comment 4 for bug 1983605

Sergio Durigan Junior (sergiodj) wrote:

Andreas and I have been spending time on and off investigating this issue. Up until today, we had determined that the failure happens even when we build exim4, gnutls28 and nettle with -O2 instead of -O3 (which is the default optimization level on ppc64el). Disabling -O3 was the very first thing that came to my mind when I saw the problem (I've been bitten by it a few times before...), so it was a bit surprising to see that, this time, it isn't the reason behind the bug.

We then looked at the autopkgtest logs for exim4 on ppc64el, and were able to determine that, most likely, the issue started happening around the time the package was updated from version 4.94 to 4.95 (4.94.2-7ubuntu2 to 4.95-2ubuntu2, according to https://autopkgtest.ubuntu.com/packages/exim4/jammy/ppc64el). This morning, Andreas did a rebuild of exim4 4.94.2-7ubuntu2 and was able to confirm that the problem did *not* manifest with this version. That's good because now we are able to narrow our investigation.

Today I had more time to work on this issue, and was able to make good progress. Here's what I tried and what I found:

- Knowing that the issue is most likely in exim4, I decided to build the software directly from upstream and check whether I could reproduce the problem there. The goal was to confirm that the bug is present upstream as well, and is not introduced by our downstream patches.

- I was able to successfully replicate the failure with upstream's 4.95 tag, and the non-failure with upstream's 4.94.

- This allowed me to do a bisect of the code, which resulted in:

f50a063dc0b96ac95b3a7bc0aebad3b3f2534c02 is the first bad commit
commit f50a063dc0b96ac95b3a7bc0aebad3b3f2534c02
Author: Jeremy Harris <email address hidden>
Date: Tue Jun 22 23:04:59 2021 +0100

    TLS: as server, reject connections with ALPN indicating non-smtp use

- That's great news. I double-checked and, indeed, the problem started happening after the commit above was pushed.

- I wanted to understand better what was going on, so I built exim4 (still from upstream) using "-O0 -g3". I was ready to start debugging it, when...

- ... I wasn't able to reproduce the problem anymore. My attention was once again turned to optimization.

- I then decided to take a closer look at the package again. I confirmed that I could still reproduce the issue using -O2 and -O1. When I tried using -O0, I initially wasn't able to build the package because -O0 interacts badly with "hardening=+all" (which is expected behaviour). And then it dawned on me that the problem could actually be caused by one of the hardening features we use, because they're much more likely to break something than the difference between -O0 and -O2.

- I had a feeling that the problem could be related to PIE, so I did a build only with PIE disabled, and voilà: I could not reproduce the bug anymore.
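
For reference, the bisect step described above can be automated with a small predicate for `git bisect run`. The following is only a sketch under stated assumptions: the `bisect_step` helper, the build command and the reproducer passed to it are hypothetical names for illustration, not the commands actually used in this investigation. It relies only on the documented exit-code convention of `git bisect run` (0 means good, 125 means skip, any other code from 1 to 127 means bad).

```shell
# Hypothetical helper for `git bisect run`; not the commands used in this investigation.
bisect_step() {
    build_cmd=$1   # assumed: command that builds the current checkout
    test_cmd=$2    # assumed: command that exits 0 only when the failure does NOT occur

    # A commit that cannot be built tells us nothing, so ask bisect to skip it.
    sh -c "$build_cmd" >/dev/null 2>&1 || return 125

    # The build worked: a failing reproducer marks this commit as bad.
    sh -c "$test_cmd" >/dev/null 2>&1 || return 1

    # Build and reproducer both succeeded: this commit is good.
    return 0
}

# Possible usage from an upstream checkout (tags and commands are placeholders):
#   git bisect start
#   git bisect bad  <tag-for-4.95>
#   git bisect good <tag-for-4.94>
#   git bisect run sh -c '. ./bisect-step.sh && bisect_step "<build command>" "<reproducer>"'
```

Returning 125 for unbuildable commits keeps a broken intermediate commit from being misreported as the first bad one.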

So that's it. Where do we go from here?

- On exim4, it's enough to disable PIE on ppc64el only. That's an easy workaround and, IMHO, should not cause much concern for the SRU team.

- Generally speaking, it might be interesting to continue investigating this issue a bit more. For example, I'm still unsure as to why we're not seeing (and, as best as I can tell, have never seen) this problem in Debian's exim4. Is it related to some Ubuntu-specific GCC modification? That remains to be seen.
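
To illustrate the workaround from the first point above, here is a rough sketch of how the hardening options could be selected per architecture. It assumes the package keeps building with "hardening=+all" everywhere else. The `hardening_options` function is a hypothetical name, and in the real package this logic would more likely live in debian/rules than in a shell script. The `+all,-pie` form is the standard dpkg-buildflags feature syntax for enabling all hardening features except PIE.

```shell
# Hypothetical helper: print the DEB_BUILD_MAINT_OPTIONS value for a given architecture.
hardening_options() {
    arch=$1
    case "$arch" in
        # Assumption: only ppc64el is affected, so PIE is dropped there and nowhere else.
        ppc64el) printf '%s\n' "hardening=+all,-pie" ;;
        *)       printf '%s\n' "hardening=+all" ;;
    esac
}

# Possible usage on a build host with dpkg-dev installed:
#   export DEB_BUILD_MAINT_OPTIONS="$(hardening_options "$(dpkg-architecture -qDEB_HOST_ARCH)")"
#   dpkg-buildflags --get CFLAGS    # inspect the resulting compiler flags
```

Keeping the exception tied to a single architecture leaves the hardening of every other build untouched.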