bind9 segfaults on certain stressful scenarios

Bug #1997375 reported by Maxxer
This bug affects 8 people

Affects          Status         Importance   Assigned to             Milestone
bind9 (Ubuntu)   Fix Released   Undecided    Sergio Durigan Junior
Focal            Fix Released   High         Sergio Durigan Junior

Bug Description

[ Impact ]

In certain scenarios where bind9's resolver is put under stress, a segmentation fault can happen in isc__nm_tcpdns_send/isc__nm_tcp_send. This happens because isc__nm_tcpdns_send is not asynchronous and accesses internal socket fields in an unsafe manner, leading to race conditions and the subsequent crash.
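
To make the failure mode concrete, here is a minimal sketch of the general pattern the fix relies on: a send requested from an arbitrary thread is queued to the thread that owns the socket, and only that thread inspects the socket's fields. This is not BIND's code. The `Worker` class, `send_async` method, and dictionary-based "socket" are hypothetical names chosen for illustration, and the sketch assumes that closing a socket would be routed through the same queue.

```python
import queue
import threading


class Worker:
    """Hypothetical stand-in for a network-manager worker thread.

    Only the worker thread reads or writes socket fields; callers on
    other threads hand work over through a queue instead.
    """

    def __init__(self):
        self._events = queue.Queue()
        self.sent = []  # (peer, payload) pairs, appended only by the worker
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            event = self._events.get()
            if event is None:  # shutdown marker
                return
            sock, payload = event
            # This check runs on the owning thread. It is race-free only
            # under the assumption that closing is queued here as well.
            if sock["closed"]:
                continue
            self.sent.append((sock["peer"], payload))

    def send_async(self, sock, payload):
        """Callable from any thread: queue the send, touch nothing else."""
        self._events.put((sock, payload))

    def stop(self):
        self._events.put(None)
        self._thread.join()


def demo():
    worker = Worker()
    sock = {"peer": "192.0.2.1", "closed": False}
    callers = [
        threading.Thread(target=worker.send_async, args=(sock, f"reply-{i}"))
        for i in range(8)
    ]
    for t in callers:
        t.start()
    for t in callers:
        t.join()
    worker.stop()
    return worker.sent
```

Calling `demo()` returns eight `(peer, payload)` pairs in whatever order the callers happened to enqueue them. The point of the design is that the callers never read `sock["closed"]` themselves, which is the kind of unsynchronized access the description above blames for the crash.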

[ Test Plan ]

Unfortunately, after several attempts I wasn't able to reproduce the issue reliably. For that reason, I have been relying on the community to perform tests and determine the right fix for the issue. Some members of the community have deployments where the segmentation fault occurs after some time (typically less than 1 month). Therefore, the test plan for this bug will involve asking these kind community members to help us by installing the bind9 package from focal-proposed and leaving it running for some time. The expectation here is that the segmentation fault will not manifest with the new package.

[ Where problems could occur ]

The backported patch is not entirely trivial, although it is well contained within the tcpdns code. The intention is to split tcpdns into a new, asynchronous thread which will ultimately make accessing internal socket fields safe. As is common with general code overhauls, this one also introduces a chance for some bad interaction between tcpdns and its users.

[ Other Info ]

The positive side here is that this code was incorporated into bind9 upstream 2 years ago, and there have been no regressions reported against it to the best of my knowledge. On top of that, at least 3 community members have extensively tested a PPA with this backport and all of them reported back saying that the issue has been fixed.

It's also important to note that this backport addresses solely the bug experienced by the community users. During the review of the MP to fix the bug, Andreas found another patch that looked like it should be backported as well, but we were not sure. I raised this with upstream here:

https://gitlab.isc.org/isc-projects/bind9/-/merge_requests/3721#note_345081

and, as can be seen, their reply was not very encouraging. Having in mind that (a) the backport in question does solve the problems experienced by the community, (b) we have been actively working to get an MRE for bind9 on Jammy and Focal, (c) when the MRE is in place we will be able to update bind9 and get the latest code that fixes this and many other issues, and (d) it'd be very risky and somewhat unfeasible to backport all of the related fixes pointed out by upstream, I decided to move forward with this SRU as is.

[ Original Description ]

The server acts as Samba AD DC

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: bind9 1:9.16.1-0ubuntu2.11
ProcVersionSignature: Ubuntu 5.4.0-122.138-generic 5.4.192
Uname: Linux 5.4.0-122-generic x86_64
ApportVersion: 2.20.11-0ubuntu27.24
Architecture: amd64
CasperMD5CheckResult: skip
Date: Tue Nov 22 14:05:57 2022
RelatedPackageVersions:
 bind9utils N/A
 apparmor 2.13.3-7ubuntu5.1
SourcePackage: bind9
UpgradeStatus: No upgrade log present (probably fresh install)
mtime.conffile..etc.bind.named.conf.local: 2022-07-19T06:39:58.037514
mtime.conffile..etc.bind.named.conf.options: 2022-08-12T09:04:29.109483
mtime.conffile..etc.default.named: 2022-07-15T15:04:10.495478

Revision history for this message
Maxxer (lorenzo-milesi) wrote :

Sergio Durigan Junior (sergiodj) wrote :

Thank you for taking the time to file a bug report.

Unfortunately it is not possible to determine the root cause of the issue you've experienced based only on the log file attached to the bug. We will need more information to proceed with the investigation. Can you reliably reproduce the problem? If yes, are you able to obtain a coredump when the problem happens? It would also be extremely valuable if you could provide a step-by-step procedure to trigger the issue.

Since there is not enough information in your report to begin triage or to
differentiate between a local configuration problem and a bug in Ubuntu, I
am marking this bug as "Incomplete". We would be grateful if you would:
provide a more complete description of the problem, explain why you
believe this is a bug in Ubuntu rather than a problem specific to your
system, and then change the bug status back to "New".

For local configuration issues, you can find assistance here:
http://www.ubuntu.com/support/community

Changed in bind9 (Ubuntu):
status: New → Incomplete

John Edwards (john-cornerstonelinux) wrote :

Since 8 Nov 2022 I've been experiencing similar problems with BIND named 9.16.1-0ubuntu2.11 having segmentation faults or aborting across 8 of approx 30 machines running Ubuntu 20.04.

Before that named seemed very solid. Machines are up to date with security and other updates so the last upgrade of bind9 would have been in Sep 2022.

It occurs intermittently every few days and I have not found a definite trigger. None of the machines run Samba AD. They do run ISC DHCPd with DynamicDNS talking to named, but there are 10 machines running similar configurations which have not reported any problems.

Kernels tried:
5.4.0-131-generic
5.4.0-132-generic
5.15.0-53-generic (2 machines moved to newer kernel as test, but one still segfaults)

CPUs are mostly i3-2120, with one i3-2100, one i3-2120T, one i3-4160 and one i3-7100. There are other servers with i3-2120 CPUs running similar configuration which have not shown problems.

Most of the syslog messages show SEGV segmentation fault:
Nov 24 14:57:19 hostname-removed named[819]: validating la1-c1-lo2.lo2.r.salesforceliveagent.com/A: no valid signature found
Nov 24 14:57:22 hostname-removed kernel: [107820.613513] isc-worker0002[965]: segfault at 8 ip 00007f31746fc166 sp 00007f316da28530 error 4 in libisc.so.1601.0.0[7f31746da000+46000]
Nov 24 14:57:22 hostname-removed kernel: [107820.613540] Code: 00 00 48 8d 3d ab b4 02 00 e8 66 39 fe ff 66 0f 1f 44 00 00 f3 0f 1e fa 41 57 41 56 41 55 41 54 55 53 48 83 ec 08 4c 8b 67 10 <41> 83 7c 24 08 02 0f 85 be 00 00 00 49 89 fd 49 8b 7c 24 10 48 89
Nov 24 14:57:38 hostname-removed systemd[1]: named.service: Main process exited, code=killed, status=11/SEGV

Although occasionally we get an ABRT abort instead:
Nov 23 13:27:03 hostname-removed named [875]: validating 0o4g4comidnn5vr4tkpu77jgtdrkdnrn.ia4.r.salesforceliveagent.com/NSEC3: no valid signature found
Nov 23 13:28:41 hostname-removed named [875]: netmgr.c:687: REQUIRE((__builtin_expect(!!((sock) != ((void *)0)), 1) && __builtin_expect(!!(((const isc__magic_t *)(sock))->magic == ((('N') << 24 | ('M') << 16 | ('S') << 8 | ('K')))), 1))) failed, back trace
Nov 23 13:28:41 hostname-removed named [875]: #0 0x565143fa9e43 in ??
Nov 23 13:28:41 hostname-removed named [875]: #1 0x7f3c78719ac0 in ??
Nov 23 13:28:41 hostname-removed named [875]: #2 0x7f3c7873178a in ??
Nov 23 13:28:41 hostname-removed named [875]: #3 0x7f3c78732240 in ??
Nov 23 13:28:41 hostname-removed named [875]: #4 0x7f3c7873618b in ??
Nov 23 13:28:41 hostname-removed named [875]: #5 0x7f3c789ff707 in ??
Nov 23 13:28:41 hostname-removed named [875]: #6 0x7f3c78a00fe9 in ??
Nov 23 13:28:41 hostname-removed named [875]: #7 0x7f3c78a0f9b0 in ??
Nov 23 13:28:41 hostname-removed named [875]: #8 0x7f3c78a179a7 in ??
Nov 23 13:28:41 hostname-removed named [875]: #9 0x7f3c78a1916e in ??
Nov 23 13:28:41 hostname-removed named [875]: #10 0x7f3c78a196cd in ??
Nov 23 13:28:41 hostname-removed named [875]: #11 0x7f3c78a1a3c9 in ??
Nov 23 13:28:41 hostname-removed named [875]: #12 0x7f3c78a204c6 in ??
Nov 23 13:28:41 hostname-removed named [875]: #13 0x7f3c78740fa1 in ??
Nov 23 13:28:41 hostname-removed named [875]: #14 0x7f3c78208609 in ??
Nov 23 13:28:41 hostn...

Changed in bind9 (Ubuntu):
status: Incomplete → New

John Edwards (john-cornerstonelinux) wrote :

Another bug report with similar symptoms is https://bugs.launchpad.net/ubuntu/+source/bind9/+bug/1954854 - although that was from Dec 2021 and occurred soon after an upgrade from Ubuntu 18.04 to 20.04, whereas most of my servers have been running Ubuntu 20.04 for 6 months to a year.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in bind9 (Ubuntu):
status: New → Confirmed

Chris Puttick (cputtick) wrote :

We have seen this error on 2 bind servers, both Samba AD DCs, both virtualised on kvm. They are both in the same organisation but not the same site, and the other 5 in the organisation have not been impacted. All are at the same version (9.16.1-0ubuntu2.11), both stayed functional following that update for >1 month, both failed with a similar message but at different dates/times:

Nov 08 12:33:30 <server> kernel: isc-worker0001[3857097]: segfault at 8 ip 00007f1b2d75f166 sp 00007f1b2932a530 error 4 in libisc.so.1601.0.0[7f1b2d73d000+46000]
Nov 08 12:33:30 <server> kernel: Code: 00 00 48 8d 3d ab b4 02 00 e8 66 39 fe ff 66 0f 1f 44 00 00 f3 0f 1e fa 41 57 41 56 41 55 41 54 55 53 48 83 ec 08 4c 8b 67 10 <41> 83 7c 24 08 02 0f 85 be 00 00 00 49 89 fd 49 8b 7c 24 10 48 89

Sergio Durigan Junior (sergiodj) wrote :

Thank you for the update. I see that this bug is affecting more people now.

Could any of you please run "apport-collect 1997375"? If that doesn't do anything, then could you verify whether you're able to obtain a coredump of the crash and attach it to this bug? This would help us investigate the issue further.

Thanks.

Sergio Durigan Junior (sergiodj) wrote :

If the coredump is too big to attach to the bug, could you drop it somewhere where we can fetch it?

John Edwards (john-cornerstonelinux) wrote :

Thanks very much for continuing to look into this Sergio.

I've sent the location of the core dump (~190MB) to you in a private message on Launchpad. Let me know when you've managed to successfully download it so that I can remove it.

We don't have apport installed on the servers but I'll look to see if I can run it safely.

Sergio Durigan Junior (sergiodj) wrote : Re: [Bug 1997375] Re: isc-worker0003 segfault at 8 ip 00007f2361995166 sp 00007f235b2da530 error 4 in libisc.so.1601.0.0[7f2361973000+46000]

On Friday, November 25 2022, John Edwards wrote:

> Thanks very much for continuing to look into this Sergio.
>
> I've sent the location of the core dump (~190MB) to you in a private
> message on Launchpad. Let me know when you've managed to successfully
> download it so that I can remove it.
>
> We don't have apport installed on the servers but I'll look to see if I
> can run it safely.

Hi John,

Thank you for providing the coredump. I've downloaded it successfully;
feel free to remove it.

As for apport, let me take a look and see if I can inspect the coredump
as is first; no need to fiddle with apport if you don't have it
available for now.

Meanwhile, it would be great if we could find a way to reproduce the
problem. I understand that it may be hard, but let me know if you find
anything.

Thanks,

--
Sergio
GPG key ID: E92F D0B3 6B14 F1F4 D8E0 EB2F 106D A1C8 C3CB BF14

John Edwards (john-cornerstonelinux) wrote : Re: isc-worker0003 segfault at 8 ip 00007f2361995166 sp 00007f235b2da530 error 4 in libisc.so.1601.0.0[7f2361973000+46000]

Unfortunately no apport possible from this machine because apport-cli is refusing to upload a crash file to an existing bug report (and I won't upload without checking and removing client info):

$ apport-cli --crash-file=apport.bind9._dwo7sph.apport --update-bug=1997375
Usage: apport-cli [options] [symptom|pid|package|program path|.apport/.crash file]

apport-cli: error: -u/--update-bug option cannot be used together with options for a new report

So what I've done is create a tarball with the report and the various named.conf.* files and uploaded as "apport-1997375.tar.bz2" to the same location as the core so Sergio can download it.

Sergio Durigan Junior (sergiodj) wrote :

Thanks, John. I grabbed the extra file already.

Sergio Durigan Junior (sergiodj) wrote :

I spent some time investigating this issue today and found a possible fix in an upstream Merge Request:

https://gitlab.isc.org/isc-projects/bind9/-/merge_requests/3721

I'm not entirely confident that this is the same issue because the MR talks about a crash that happens on isc__nm_tcpdns_send, while the coredump provided by John crashes on isc__nm_tcp_send. Either way, I went ahead and backported the patch to the current version of bind9 on Focal. You can find builds of the new package in the following PPA:

https://launchpad.net/~sergiodj/+archive/ubuntu/bind9-bug1997375/+packages

John et al: since you haven't provided instructions on how to reproduce the bug yet, would you be able to give the PPA above a try and report back with results, please?

Thank you in advance.

John Edwards (john-cornerstonelinux) wrote :

Hi Sergio. Thanks for continuing to look into this problem.

I have installed those test packages on one of the servers which is most affected, where BIND named usually suffers 2 or 3 segfaults a week. As we still haven't figured out a trigger for this problem I suspect we will have to wait and see how well the new packages behave.

Chris Smith (chris-smith8000) wrote :

I've also seen this issue occur across 6 servers running only bind9, and it also started occurring in the same week John described (8th Nov).

I will look into using the PPA too, see if it resolves the issue, and report back.

$ lsb_release -rd
Description: Ubuntu 20.04.3 LTS
Release: 20.04

$ apt-cache policy bind9
bind9:
  Installed: 1:9.16.1-0ubuntu2.11
  Candidate: 1:9.16.1-0ubuntu2.11

syslog output

Nov 28 21:23:03 userdns02 kernel: [1795011.607189] isc-worker0001[137345]: segfault at 118 ip 00007f38f2d7a0c7 sp 00007f38ee935a50 error 4 in libisc.so.1601.0.0[7f38f2d56000+46000]
Nov 28 21:23:03 userdns02 kernel: [1795011.607207] Code: 02 00 b9 df 01 00 00 be 38 00 00 00 48 8b 78 10 e8 6e 6a ff ff 66 0f ef c0 49 89 c4 48 8b 45 10 4c 89 e6 48 8b 80 f0 00 00 00 <48> 8b 80 18 01 00 00 41 0f 11 04 24 41 0f 11 44 24 20 41 0f 11 44
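
The kernel lines posted so far can be compared more easily once the instruction pointer is expressed relative to the start of the libisc mapping, since the absolute addresses differ on every machine. The sketch below parses three of the lines quoted in this thread. The function and field names are illustrative, and the result is an offset into the mapping shown in brackets, which is not necessarily the file offset a symbolizer expects. The error value is decoded using the x86 page-fault bits (bit 0: page present, bit 1: write, bit 2: user mode).

```python
import re

SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Extract the fault address, mapping-relative IP and error bits."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m["err"])
    return {
        "lib": m["lib"],
        "fault_addr": int(m["addr"], 16),
        "offset": int(m["ip"], 16) - int(m["base"], 16),
        "page_present": bool(err & 1),
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }


# Lines copied from comments above (timestamps and hostnames trimmed).
SAMPLES = [
    "isc-worker0002[965]: segfault at 8 ip 00007f31746fc166 "
    "sp 00007f316da28530 error 4 in libisc.so.1601.0.0[7f31746da000+46000]",
    "isc-worker0001[3857097]: segfault at 8 ip 00007f1b2d75f166 "
    "sp 00007f1b2932a530 error 4 in libisc.so.1601.0.0[7f1b2d73d000+46000]",
    "isc-worker0001[137345]: segfault at 118 ip 00007f38f2d7a0c7 "
    "sp 00007f38ee935a50 error 4 in libisc.so.1601.0.0[7f38f2d56000+46000]",
]

for sample in SAMPLES:
    info = parse_segfault(sample)
    print(hex(info["fault_addr"]), hex(info["offset"]), info["write"])
```

The first two reports resolve to the same offset, 0x22166, with a read fault at address 8, which suggests the same code path dereferencing a near-NULL pointer. The third resolves to 0x240c7 with a fault at 0x118, so it may be a different crash site in the same library.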

John Edwards (john-cornerstonelinux) wrote :

Quick update to say that I've seen no segmentation faults since 29 Nov on either those servers running the standard package 9.16.1-0ubuntu2.11, or running Sergio's patched 9.16.1-0ubuntu2.12~ppa1. Longest gap before that was 3 days without a segfault on any server.

The servers were upgraded over the weekend 3/4 Dec to Ubuntu kernel security update 5.4.0-135 https://launchpad.net/ubuntu/+source/linux/5.4.0-135.152

Currently I'm wondering if the trigger for the problem might be external. The servers are only accessible from the LAN, use different DNS servers (usually from an ISP) as forwarders, and the BIND querylog does not really show much of a pattern in what queries are made before the crash.

But the problem is intermittent so I will continue to monitor the situation. Has anyone else had a problem during December?

John Edwards (john-cornerstonelinux) wrote :

I spoke too soon. Just had a segmentation fault this morning on a server running the standard package 9.16.1-0ubuntu2.11 (so without Sergio's patches).

Syslog messages including recent DNS query log below:

Dec 5 11:13:02 ff0133 named[825]: client @0x7f487c0f8cc0 192.168.33.143#8873 (safebrowsing.googleapis.com): query: safebrowsing.googleapis.com IN TYPE65 + (192.168.33.3)
Dec 5 11:13:02 ff0133 named[825]: client @0x7f487c0f8cc0 192.168.33.105#55867 (star.c10r.facebook.com): query: star.c10r.facebook.com IN A + (192.168.33.3)
Dec 5 11:13:02 ff0133 named[825]: client @0x7f487816dcb0 192.168.33.117#53723 (lh3.googleusercontent.com): query: lh3.googleusercontent.com IN TYPE65 +T (192.168.33.3)
Dec 5 11:13:02 ff0133 named[825]: client @0x7f4874157130 192.168.33.117#53722 (lh3.googleusercontent.com): query: lh3.googleusercontent.com IN A +T (192.168.33.3)
Dec 5 11:13:02 ff0133 named[825]: client @0x7f4878010ef0 127.0.0.1#38013 (135.118.46.217.in-addr.arpa): query: 135.118.46.217.in-addr.arpa IN PTR + (127.0.0.1)
Dec 5 11:13:02 ff0133 named[825]: client @0x7f486c0e1920 192.168.33.134#50184 (tp.a40771798-frontier.imdb.com): query: tp.a40771798-frontier.imdb.com IN TYPE65 + (192.168.33.3)
Dec 5 11:13:02 ff0133 named[825]: client @0x7f486c0f4060 192.168.33.134#53467 (tp.a40771798-frontier.imdb.com): query: tp.a40771798-frontier.imdb.com IN A + (192.168.33.3)
Dec 5 11:13:02 ff0133 named[825]: client @0x7f487411adb0 192.168.33.134#53415 (d14x35054ycmgy.cloudfront.net): query: d14x35054ycmgy.cloudfront.net IN TYPE65 + (192.168.33.3)
Dec 5 11:13:03 ff0133 named[825]: client @0x7f487816dcb0 192.168.33.117#53725 (lh5.googleusercontent.com): query: lh5.googleusercontent.com IN TYPE65 +T (192.168.33.3)
Dec 5 11:13:03 ff0133 named[825]: client @0x7f4874157130 192.168.33.117#53724 (lh5.googleusercontent.com): query: lh5.googleusercontent.com IN A +T (192.168.33.3)
Dec 5 11:13:03 ff0133 named[825]: client @0x7f4878010ef0 127.0.0.1#38060 (135.118.46.217.in-addr.arpa): query: 135.118.46.217.in-addr.arpa IN PTR + (127.0.0.1)
Dec 5 11:13:03 ff0133 named[825]: client @0x7f4878010ef0 127.0.0.1#58461 (135.118.46.217.in-addr.arpa): query: 135.118.46.217.in-addr.arpa IN PTR + (127.0.0.1)
Dec 5 11:13:03 ff0133 named[825]: client @0x7f486c0af250 127.0.0.1#39097 (135.118.46.217.in-addr.arpa): query: 135.118.46.217.in-addr.arpa IN PTR + (127.0.0.1)
Dec 5 11:13:03 ff0133 named[825]: client @0x7f487816dcb0 192.168.33.117#53727 (www.rbs.co.uk): query: www.rbs.co.uk IN TYPE65 +T (192.168.33.3)
Dec 5 11:13:03 ff0133 named[825]: client @0x7f4874157130 192.168.33.117#53726 (www.rbs.co.uk): query: www.rbs.co.uk IN A +T (192.168.33.3)
Dec 5 11:13:03 ff0133 named[825]: client @0x7f487c178ea0 192.168.33.117#53729 (www.rbs.co.uk): query: www.rbs.co.uk IN TYPE65 +T (192.168.33.3)
Dec 5 11:13:03 ff0133 named[825]: client @0x7f487415f7d0 192.168.33.117#53728 (www.rbs.co.uk): query: www.rbs.co.uk IN A +T (192.168.33.3)
Dec 5 11:13:03 ff0133 kernel: [221093.288331] isc-worker0002[970]: segfault at 8 ip 00007f489451c166 sp 00007f488d848530 error 4 in libisc.so.1601.0.0[7f48944fa000+46000]
Dec 5 11:13:03 ff0133 kernel: [221093.288344] Code: 00 00 48 8d 3d ab b4 02 00 e8 66...

Sergio Durigan Junior (sergiodj) wrote :

Hi John,

Thank you very much for the update. So the package I provided via the PPA is still off the hook, right? That's good news; if the patch is indeed the one I backported, then we can think about proceeding with the SRU.

But I don't want to jinx it, so let's wait a couple more weeks before deciding.

Thanks again.

Benjamin Guebert (bguebert) wrote :

We have had this same problem on one of our servers. Version is BIND 9.16.1. It wasn't very common before, but we've had it happen 3 times this week. Here is the error from logs:

Dec 8 04:26:27 XXXXXXX kernel: [2824711.133384] isc-worker0007[710557]: segfault at 8 ip 00007f1c09ddb166 sp 00007f1bff192530 error 4 in libisc.so.1601.0.0[7f1c09db9000+46000]
Dec 8 04:26:27 XXXXXXX kernel: [2824711.133402] Code: 00 00 48 8d 3d ab b4 02 00 e8 66 39 fe ff 66 0f 1f 44 00 00 f3 0f 1e fa 41 57 41 56 41 55 41 54 55 53 48 83 ec 08 4c 8b 67 10 <41> 83 7c 24 08 02 0f 85 be 00 00 00 49 89 fd 49 8b 7c 24 10 48 89

To work around it for now we modified our systemd service to restart bind automatically with these changes:

Added to [Unit]:
StartLimitIntervalSec=500
StartLimitBurst=5

Added to [Service]:
Restart=on-failure
RestartSec=5s

Sergio Durigan Junior (sergiodj) wrote : Re: [Bug 1997375] Re: isc-worker0003 segfault at 8 ip 00007f2361995166 sp 00007f235b2da530 error 4 in libisc.so.1601.0.0[7f2361973000+46000]

On Thursday, December 08 2022, Benjamin Guebert wrote:

> We have had this same problem on one of our servers. Version is BIND
> 9.16.1. It wasn't very common before, but we've had it happen 3 times
> this week. Here is the error from logs:
>
> Dec 8 04:26:27 XXXXXXX kernel: [2824711.133384]
> isc-worker0007[710557]: segfault at 8 ip 00007f1c09ddb166 sp
> 00007f1bff192530 error 4 in libisc.so.1601.0.0[7f1c09db9000+46000]
> Dec 8 04:26:27 XXXXXXX kernel: [2824711.133402] Code: 00 00 48 8d 3d
> ab b4 02 00 e8 66 39 fe ff 66 0f 1f 44 00 00 f3 0f 1e fa 41 57 41 56
> 41 55 41 54 55 53 48 83 ec 08 4c 8b 67 10 <41> 83 7c 24 08 02 0f 85 be
> 00 00 00 49 89 fd 49 8b 7c 24 10 48 89
>
> To work around it for now we modified our systemd service to restart
> bind automatically with these changes:
>
> Added to [Unit]:
> StartLimitIntervalSec=500
> StartLimitBurst=5
>
> Added to [Service]:
> Restart=on-failure
> RestartSec=5s

Hello Benjamin,

Would it be possible for you to test the package from the PPA I provided
above? I backported a possible fix for this problem there, but since
this is a non-deterministic failure it'd be good to have more people
testing the package and reporting back the results.

Thanks,

--
Sergio

Benjamin Guebert (bguebert) wrote : Re: isc-worker0003 segfault at 8 ip 00007f2361995166 sp 00007f235b2da530 error 4 in libisc.so.1601.0.0[7f2361973000+46000]

Hi Sergio,

I set up your backport to run on our machine. I'll post back here if there are any problems, but it is working ok so far. Thank you for your help.

Ben

Benjamin Guebert (bguebert) wrote :

We've been running the package from the PPA for a week now and have not had the segfault come up again.

John Edwards (john-cornerstonelinux) wrote :

Quick summary of the past 2 weeks:

12 segmentation faults and 1 abort on the 8 Ubuntu 20.04 servers which are not running the patched packages.

No segmentation faults on the 4 Ubuntu 20.04 servers which are running the patched packages.

4 Ubuntu 22.04 servers which are running similar configurations have not yet shown any problems, but for the most part they are on the quieter networks.

No problems on any Ubuntu 18.04 server (but we only have 2 still in production). No Ubuntu 22.04 servers currently in production.

I've still not found a way to trigger the problem, nor a common factor in which stock Ubuntu 20.04 servers are affected or not - beyond the obvious thing that the busier servers seem more likely to exhibit the problem.

So the patched packages in the PPA look like they fix the problem to me, but I'm not enough of an expert on BIND to know if what they patch might cause a regression or problem elsewhere.

Sergio Durigan Junior (sergiodj) wrote : Re: [Bug 1997375] Re: isc-worker0003 segfault at 8 ip 00007f2361995166 sp 00007f235b2da530 error 4 in libisc.so.1601.0.0[7f2361973000+46000]

On Friday, December 16 2022, John Edwards wrote:

> Quick summary of the past 2 weeks:
>
> 12 segmentation faults and 1 abort on the 8 Ubuntu 20.04 servers which
> are not running the patched packages.
>
> No segmentation faults on the 4 Ubuntu 20.04 servers which are running
> the patched packages.
>
> 4 Ubuntu 22.04 servers which are running similar configurations have not
> yet shown any problems, but for the most part they are on the quieter
> networks.
>
> No problems on any Ubuntu 18.04 server (but we only have 2 still in
> production). No Ubuntu 22.04 servers currently in production.
>
> I've still not found a way to trigger the problem, nor a common factor
> in which stock Ubuntu 20.04 servers are affected or not - beyond the
> obvious thing that the busier servers seem more likely to exhibit the
> problem.
>
> So the patched packages in the PPA look like they fix the problem to me,
> but I'm not enough of an expert on BIND to know if what they patch might
> cause a regression or problem elsewhere.

Thank you very much for the status update, John.

It does indeed look like the backported patch is the solution to this
problem. It would have been great if we could figure out the steps to
reproduce this problem, but sometimes it's just too hard to do it. I
think we can proceed with the SRU and inform that it will be tested by
the community.

Today is my last day before the EOY break, so I won't be able to make
any progress here until I'm back in the beginning of January.
Meanwhile, feel free to keep me informed in case something unexpected
happens.

Thanks,

--
Sergio

Chris Smith (chris-smith8000) wrote : Re: isc-worker0003 segfault at 8 ip 00007f2361995166 sp 00007f235b2da530 error 4 in libisc.so.1601.0.0[7f2361973000+46000]

Just to add our experience - No crashes on any of the servers running the PPA over the last 25 days - whereas the servers on the original version still crash occasionally.

Thanks

Vincent Maroun (vmaroun) wrote :

I stumbled upon this thread on 15-Dec and installed the patch then. Previously was crashing several times a week after my upgrade from 18.04 LTS to 20.04 LTS. No crashes since patch installed.

Running Ubuntu in a BHyve VM. No obvious triggers but I was too focused on CPU/memory because of some other unrelated issues I was chasing.

John Edwards (john-cornerstonelinux) wrote :

I hope Sergio had a good break, as did most of our BIND servers.

No segfaults during the Christmas break (25 Dec to 2 Jan), but this morning (first day back at work for most people) 2 servers running the original unpatched version (9.16.1-0ubuntu2.11) each had a segfault.

Still no similar problems on machines running the patched version (9.16.1-0ubuntu2.12~ppa1), nor on the public BIND servers which are authoritative for our own domains and are still running the unpatched version.

So if there is a possible trigger for the problem it looks to be either related to the amount of requests from client machines, or something that those client machines (mostly Windows) are putting in their requests which the servers and network hardware which stayed running over the break do not (or maybe both together).

It might be interesting to hear from anyone who has this problem on a server where there are no client machines running Windows using it.

Sergio Durigan Junior (sergiodj) wrote :

Thank you John and everybody else who was kind enough to comment on this bug. I intend to start working on SRUing it tomorrow, and will let you know once it's time to help with the verification (since we've been unable to come up with a reproducer so far).

Changed in bind9 (Ubuntu):
assignee: nobody → Sergio Durigan Junior (sergiodj)
no longer affects: bind9 (Ubuntu Jammy)
Changed in bind9 (Ubuntu Focal):
assignee: nobody → Sergio Durigan Junior (sergiodj)
status: New → Confirmed
Changed in bind9 (Ubuntu):
status: Confirmed → Fix Released
tags: added: server-todo
Changed in bind9 (Ubuntu Focal):
importance: Undecided → High

Sergio Durigan Junior (sergiodj) wrote :
summary: - isc-worker0003 segfault at 8 ip 00007f2361995166 sp 00007f235b2da530
- error 4 in libisc.so.1601.0.0[7f2361973000+46000]
+ bind9 segfaults on certain stressful scenarios
no longer affects: bind
description: updated
description: updated
description: updated
Changed in bind9 (Ubuntu Focal):
status: Confirmed → In Progress
description: updated

John Edwards (john-cornerstonelinux) wrote :

Hi. Just a quick question - is this patch related in any way to CVE-2022-3094, which is patched in package 9.18.1-1ubuntu1.3?

https://launchpad.net/ubuntu/+source/bind9/1:9.18.1-1ubuntu1.3

Or would the patch to fix this problem need to be applied to that package as well?

I can test either the 9.18.1-1ubuntu1.3 security update for CVE-2022-3094, or a new patched package with both patches applied.

John Edwards (john-cornerstonelinux) wrote :

After upgrade to package version 9.16.1-0ubuntu2.12 the problem has returned:

Jan 26 13:37:03 ff0149 systemd[1]: named.service: Main process exited, code=killed, status=11/SEGV

John Edwards (john-cornerstonelinux) wrote :

Sorry I linked to the wrong package page for the security update in comment #30 yesterday (25 Jan).

For Ubuntu 20.04 (Focal Fossa) it should have been:

https://launchpad.net/ubuntu/+source/bind9/1:9.16.1-0ubuntu2.12

I incorrectly linked to the Ubuntu 22.04 (Jammy Jellyfish) package, which I am not in a position to test for this problem.

John Edwards (john-cornerstonelinux) wrote :

Package version 9.16.1-0ubuntu2.12 is causing segmentation faults on a couple of other servers as well (ones which were running OK with Sergio's patched package).

Chris Smith (chris-smith8000) wrote :

We're also having segfaults again on the newly released 1:9.16.1-0ubuntu2.12

Charlie Chapa (cwc102) wrote :

We are also seeing segfaults return after newly released 1:9.16.1-0ubuntu2.12.

Jan 30 11:24:32 {HOST NAME REMOVED} systemd[1]: named.service: Main process exited, code=killed, status=11/SEGV
Jan 30 11:24:32 {HOST NAME REMOVED} systemd[1]: named.service: Failed with result 'signal'.

John Edwards (john-cornerstonelinux) wrote :

One of the affected servers running bind9 package version 9.16.1-0ubuntu2.12 had a slightly different abort today (rather than the usual segmentation fault):

Feb 6 11:30:25 ff0148 named[2651528]: netmgr.c:687: REQUIRE((__builtin_expect(!!((sock) != ((void *)0)), 1) && __builtin_expect(!!(((const isc__magic_t *)(sock))->magic == ((('N') << 24 | ('M') << 16 | ('S') << 8 | ('K')))), 1))) failed, back trace
Feb 6 11:30:25 ff0148 named[2651528]: #0 0x55c71806be43 in ??
Feb 6 11:30:25 ff0148 named[2651528]: #1 0x7fb6e2b74ac0 in ??
Feb 6 11:30:25 ff0148 named[2651528]: #2 0x7fb6e2b8c78a in ??
Feb 6 11:30:25 ff0148 named[2651528]: #3 0x7fb6e2b8d240 in ??
Feb 6 11:30:25 ff0148 named[2651528]: #4 0x7fb6e2b9118b in ??
Feb 6 11:30:25 ff0148 named[2651528]: #5 0x7fb6e2e5a707 in ??
Feb 6 11:30:25 ff0148 named[2651528]: #6 0x7fb6e2e5bfe9 in ??
Feb 6 11:30:25 ff0148 named[2651528]: #7 0x7fb6e2e6a9b0 in ??
Feb 6 11:30:25 ff0148 named[2651528]: #8 0x7fb6e2e729a7 in ??
Feb 6 11:30:25 ff0148 named[2651528]: #9 0x7fb6e2e79de6 in ??
Feb 6 11:30:25 ff0148 named[2651528]: #10 0x7fb6e2e75936 in ??
Feb 6 11:30:25 ff0148 named[2651528]: #11 0x7fb6e2e7b4c6 in ??
Feb 6 11:30:25 ff0148 named[2651528]: #12 0x7fb6e2b9bfa1 in ??
Feb 6 11:30:25 ff0148 named[2651528]: #13 0x7fb6e2663609 in ??
Feb 6 11:30:25 ff0148 named[2651528]: #14 0x7fb6e2582133 in ??
Feb 6 11:30:25 ff0148 named[2651528]: exiting (due to assertion failure)
Feb 6 11:30:26 ff0148 whoopsie-upload-all[3207865]: /var/crash/_usr_sbin_named.107.crash already marked for upload, skipping
Feb 6 11:30:29 ff0148 whoopsie-upload-all[3207884]: /var/crash/_usr_sbin_named.107.crash already marked for upload, skipping
Feb 6 11:30:34 ff0148 systemd[1]: named.service: Main process exited, code=killed, status=6/ABRT
Feb 6 11:30:34 ff0148 systemd[1]: named.service: Failed with result 'signal'.
Feb 6 11:30:35 ff0148 systemd[1]: named.service: Scheduled restart job, restart counter is at 8.
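
For readers decoding the REQUIRE line in the log above: the expanded expression checks that the socket pointer is non-NULL and that its magic field equals four ASCII characters packed into one 32-bit integer. The following is a minimal sketch of that arithmetic only, not bind9 code; the function names and the dict standing in for a socket are hypothetical.

```python
def magic(a: str, b: str, c: str, d: str) -> int:
    """Pack four ASCII characters into a 32-bit tag, as in the logged expression."""
    return (ord(a) << 24) | (ord(b) << 16) | (ord(c) << 8) | ord(d)


def decode_magic(value: int) -> str:
    """Turn a 32-bit tag back into its four characters."""
    return value.to_bytes(4, "big").decode("ascii", errors="replace")


NMSK = magic("N", "M", "S", "K")


def is_valid_socket(sock) -> bool:
    """Hypothetical stand-in for the check: non-NULL and carrying the right tag."""
    return sock is not None and sock.get("magic") == NMSK


print(hex(NMSK))                        # 0x4e4d534b
print(decode_magic(NMSK))               # NMSK
print(is_valid_socket({"magic": 0}))    # False
```

A failed check of this kind means the code was handed something that is not (or is no longer) a live socket object, so the process aborts (status=6/ABRT) instead of crashing later with a segmentation fault.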

Revision history for this message
Benjamin Guebert (bguebert) wrote :

I am getting the segfault again after upgrading too. Same error message that Charlie posted.

Revision history for this message
Sergio Durigan Junior (sergiodj) wrote :

Thanks for the update. That's because my upload is still in the SRU queue and hasn't been accepted yet. Once it is, you will see a comment on this bug asking for help with testing and validating the fix.

Revision history for this message
Benjamin Guebert (bguebert) wrote :

Just in case it helps, I found some more info about the crash in kern.log for a segfault that happened this morning:

Feb 24 05:59:02 servername kernel: [6557584.979894] isc-worker0002[2036687]: segfault at 8 ip 00007f9d55371166 sp 00007f9d4cf2c530 error 4 in libisc.so.1601.0.0[7f9d5534f000+46000]
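
A kernel line like the one above can be broken down mechanically. The sketch below assumes the x86-64 message format shown in that line, where the addresses and the error code are hexadecimal; the function name and the returned keys are illustrative. It computes the offset of the faulting instruction inside the library and decodes the low three bits of the page-fault error code.

```python
import re

# Matches the "segfault at ..." format of the kernel line quoted above.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Return the key facts from a kernel segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip, base, err = (int(m[k], 16) for k in ("ip", "base", "err"))
    return {
        "thread": m["comm"],
        "fault_addr": int(m["addr"], 16),
        "object": m["obj"],
        "offset": ip - base,             # instruction offset within the mapping
        "page_present": bool(err & 1),   # bit 0: 0 means the page was not mapped
        "write": bool(err & 2),          # bit 1: 0 means the access was a read
        "user_mode": bool(err & 4),      # bit 2: 1 means the fault came from user space
    }


LINE = (
    "Feb 24 05:59:02 servername kernel: [6557584.979894] isc-worker0002[2036687]: "
    "segfault at 8 ip 00007f9d55371166 sp 00007f9d4cf2c530 error 4 "
    "in libisc.so.1601.0.0[7f9d5534f000+46000]"
)
info = parse_segfault(LINE)
print(info["object"], hex(info["offset"]))  # libisc.so.1601.0.0 0x22166
print(info["fault_addr"], info["write"])    # 8 False
```

Here the fault is a user-mode read of an unmapped page at address 8, which is consistent with reading a field through a NULL pointer. The offset is the value a symbolizer would need, given matching debug symbols for libisc.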

Revision history for this message
Benjamin Guebert (bguebert) wrote : Re: [Bug 1997375] Re: bind9 segfaults on certain stressful scenarios

Oh ok, sorry, I didn't realize that update wasn't the one you were working on.

Revision history for this message
Andreas Hasenack (ahasenack) wrote : Please test proposed package

Hello Maxxer, or anyone else affected,

Accepted bind9 into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/bind9/1:9.16.1-0ubuntu2.13 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in bind9 (Ubuntu Focal):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-focal
Revision history for this message
Sergio Durigan Junior (sergiodj) wrote :

Hello folks (Benjamin, John et al),

As you may have noticed, the package was finally accepted into -proposed and now it needs to be verified in order to fully migrate to the release pocket. As explained in comment #28, I will need your help to verify that the new package indeed fixes the failure. Comment #41 should have all the information you need to proceed with the verification, but let me know if you have any questions.

Thanks!

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Hello Maxxer, or anyone else affected,

Accepted bind9 into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/bind9/1:9.16.1-0ubuntu2.14 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Revision history for this message
Chris Smith (chris-smith8000) wrote :

We're now testing focal-proposed 1:9.16.1-0ubuntu2.14. It has been in production for 48 hours under a load of around 400 requests per second, with no issues seen so far.

Revision history for this message
Benjamin Guebert (bguebert) wrote :

We're running the proposed version now. No errors so far today after updating this morning. One thing to watch: this update also adds Restart=on-failure to the default service configuration. So if the segfaults still occur, you'll have to look for them in the log, since bind will likely restart and keep running on its own now. This is discussed in bug #2006054.
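
Since an automatic restart can hide a crash, one way to check is to scan the logs for systemd's exit messages. This is a minimal sketch, assuming log lines shaped like the systemd entries quoted earlier in this report (for example, saved output of journalctl -u named.service); the function name is illustrative and "host" is a placeholder host name.

```python
import re
from collections import Counter

# Matches systemd's report of how the named main process ended.
EXIT_RE = re.compile(
    r"named\.service: Main process exited, code=(\w+), status=(\d+)/(\w+)"
)


def count_named_crashes(lines):
    """Count signal-caused exits of named, keyed by signal name (SEGV, ABRT, ...)."""
    counts = Counter()
    for line in lines:
        m = EXIT_RE.search(line)
        # code=killed or code=dumped means the process died from a signal.
        if m and m.group(1) in ("killed", "dumped"):
            counts[m.group(3)] += 1
    return counts


SAMPLE = [
    "Jan 30 11:24:32 host systemd[1]: named.service: Main process exited, code=killed, status=11/SEGV",
    "Feb 6 11:30:34 ff0148 systemd[1]: named.service: Main process exited, code=killed, status=6/ABRT",
    "Feb 6 11:30:35 ff0148 systemd[1]: named.service: Scheduled restart job, restart counter is at 8.",
]
counts = count_named_crashes(SAMPLE)
print(counts["SEGV"], counts["ABRT"])  # 1 1
```

An empty result over the test period is the signal to look for; any SEGV or ABRT entry means the service crashed and was restarted.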

Revision history for this message
Chris Smith (chris-smith8000) wrote :

focal-proposed 1:9.16.1-0ubuntu2.14 has now been running for over a week under a load of around 600 requests per second with no issues. We have seen older versions crash within the same period.

Revision history for this message
Sergio Durigan Junior (sergiodj) wrote :

Thank you both for the feedback. I am tagging this bug as verification-done, then.

tags: added: verification-done verification-done-focal
removed: verification-needed verification-needed-focal
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package bind9 - 1:9.16.1-0ubuntu2.14

---------------
bind9 (1:9.16.1-0ubuntu2.14) focal; urgency=medium

  * d/bind9.named.service: restart the named service on failure.
    (LP: #2006054)

bind9 (1:9.16.1-0ubuntu2.13) focal; urgency=medium

  * d/p/lp1997375-segfault-isc-nm-tcp-send.patch: Fix segfault on
    isc__nm_tcpdns_send by moving the tcpdns processing to another
    thread. (LP: #1997375)

 -- Athos Ribeiro <email address hidden> Fri, 03 Mar 2023 12:37:25 -0300

Changed in bind9 (Ubuntu Focal):
status: Fix Committed → Fix Released
Revision history for this message
Andreas Hasenack (ahasenack) wrote : Update Released

The verification of the Stable Release Update for bind9 has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Benjamin Guebert (bguebert) wrote :

We've been running ours for a week now too and we haven't had the segfault problem. Thanks for your help.
