Comment 3 for bug 1997375

John Edwards (john-cornerstonelinux) wrote : Re: isc-worker0003 segfault at 8 ip 00007f2361995166 sp 00007f235b2da530 error 4 in libisc.so.1601.0.0[7f2361973000+46000]

Since 8 Nov 2022 I've been experiencing similar problems with BIND named 9.16.1-0ubuntu2.11 segfaulting or aborting on 8 of approximately 30 machines running Ubuntu 20.04.

Before that, named seemed very solid. The machines are kept up to date with security and other updates, so the last upgrade of bind9 would have been in Sep 2022.

It occurs intermittently every few days and I have not found a definite trigger. None of the machines run Samba AD. They do run ISC DHCPd with DynamicDNS talking to named, but there are 10 machines running similar configurations which have not reported any problems.

Kernels tried:
5.4.0-131-generic
5.4.0-132-generic
5.15.0-53-generic (2 machines moved to newer kernel as test, but one still segfaults)

CPUs are mostly i3-2120, with one i3-2100, one i3-2120T, one i3-4160 and one i3-7100. There are other servers with i3-2120 CPUs running similar configuration which have not shown problems.

Most of the syslog messages show SEGV segmentation fault:
Nov 24 14:57:19 hostname-removed named[819]: validating la1-c1-lo2.lo2.r.salesforceliveagent.com/A: no valid signature found
Nov 24 14:57:22 hostname-removed kernel: [107820.613513] isc-worker0002[965]: segfault at 8 ip 00007f31746fc166 sp 00007f316da28530 error 4 in libisc.so.1601.0.0[7f31746da000+46000]
Nov 24 14:57:22 hostname-removed kernel: [107820.613540] Code: 00 00 48 8d 3d ab b4 02 00 e8 66 39 fe ff 66 0f 1f 44 00 00 f3 0f 1e fa 41 57 41 56 41 55 41 54 55 53 48 83 ec 08 4c 8b 67 10 <41> 83 7c 24 08 02 0f 85 be 00 00 00 49 89 fd 49 8b 7c 24 10 48 89
Nov 24 14:57:38 hostname-removed systemd[1]: named.service: Main process exited, code=killed, status=11/SEGV
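
As a rough cross-check, the kernel segfault lines can be reduced to an offset inside the mapped libisc region by subtracting the mapping base (the number before the `+` in the square brackets) from the instruction pointer. The sketch below does this for the two lines quoted in this report. The regular expression and the helper name `fault_offset` are illustrative, and the offset is relative to the region the kernel reports, which is not necessarily the start of the library file.

```python
import re

# Kernel segfault lines quoted in this report (bug title and Nov 24 syslog).
LINES = [
    "isc-worker0003 segfault at 8 ip 00007f2361995166 sp 00007f235b2da530 "
    "error 4 in libisc.so.1601.0.0[7f2361973000+46000]",
    "isc-worker0002[965]: segfault at 8 ip 00007f31746fc166 sp 00007f316da28530 "
    "error 4 in libisc.so.1601.0.0[7f31746da000+46000]",
]

PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>\d+) in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def fault_offset(line):
    """Return (library, offset into mapping, fault address, error code)."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a kernel segfault line")
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    size = int(m["size"], 16)
    offset = ip - base
    if not 0 <= offset < size:
        raise ValueError("instruction pointer lies outside the reported mapping")
    return m["lib"], offset, int(m["addr"], 16), int(m["err"])


results = [fault_offset(line) for line in LINES]
for lib, offset, addr, err in results:
    print(f"{lib}+{offset:#x} fault address {addr:#x} error code {err}")
```

Both lines come out at the same offset, 0x22166, with fault address 0x8. That is consistent with the same instruction reading through a NULL pointer at a small field offset each time, although it does not prove it. Frame #0 of the gdb backtrace further down ends in the same 0x166 page offset.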

Although occasionally we get an ABRT abort instead:
Nov 23 13:27:03 hostname-removed named[875]: validating 0o4g4comidnn5vr4tkpu77jgtdrkdnrn.ia4.r.salesforceliveagent.com/NSEC3: no valid signature found
Nov 23 13:28:41 hostname-removed named[875]: netmgr.c:687: REQUIRE((__builtin_expect(!!((sock) != ((void *)0)), 1) && __builtin_expect(!!(((const isc__magic_t *)(sock))->magic == ((('N') << 24 | ('M') << 16 | ('S') << 8 | ('K')))), 1))) failed, back trace
Nov 23 13:28:41 hostname-removed named[875]: #0 0x565143fa9e43 in ??
Nov 23 13:28:41 hostname-removed named[875]: #1 0x7f3c78719ac0 in ??
Nov 23 13:28:41 hostname-removed named[875]: #2 0x7f3c7873178a in ??
Nov 23 13:28:41 hostname-removed named[875]: #3 0x7f3c78732240 in ??
Nov 23 13:28:41 hostname-removed named[875]: #4 0x7f3c7873618b in ??
Nov 23 13:28:41 hostname-removed named[875]: #5 0x7f3c789ff707 in ??
Nov 23 13:28:41 hostname-removed named[875]: #6 0x7f3c78a00fe9 in ??
Nov 23 13:28:41 hostname-removed named[875]: #7 0x7f3c78a0f9b0 in ??
Nov 23 13:28:41 hostname-removed named[875]: #8 0x7f3c78a179a7 in ??
Nov 23 13:28:41 hostname-removed named[875]: #9 0x7f3c78a1916e in ??
Nov 23 13:28:41 hostname-removed named[875]: #10 0x7f3c78a196cd in ??
Nov 23 13:28:41 hostname-removed named[875]: #11 0x7f3c78a1a3c9 in ??
Nov 23 13:28:41 hostname-removed named[875]: #12 0x7f3c78a204c6 in ??
Nov 23 13:28:41 hostname-removed named[875]: #13 0x7f3c78740fa1 in ??
Nov 23 13:28:41 hostname-removed named[875]: #14 0x7f3c78208609 in ??
Nov 23 13:28:41 hostname-removed named[875]: #15 0x7f3c78127133 in ??
Nov 23 13:28:41 hostname-removed named[875]: exiting (due to assertion failure)
Nov 23 13:28:50 hostname-removed systemd[1]: named.service: Main process exited, code=killed, status=6/ABRT
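
For readers unfamiliar with the expanded REQUIRE expression at netmgr.c:687: it checks that `sock` is non-NULL and that its `magic` field equals a 32-bit tag packed from the four characters N, M, S, K. A minimal sketch of that packing, where `pack_magic` is a hypothetical helper name used only for this illustration:

```python
def pack_magic(a, b, c, d):
    """Pack four ASCII characters into one 32-bit value, most significant first."""
    return (ord(a) << 24) | (ord(b) << 16) | (ord(c) << 8) | ord(d)


# Same shifts as in the logged expression: 'N' << 24 | 'M' << 16 | 'S' << 8 | 'K'
nmsk = pack_magic("N", "M", "S", "K")
print(f"{nmsk:#010x}")

# Read back as big-endian bytes, the value spells out the tag again.
tag = nmsk.to_bytes(4, "big")
print(tag)
```

An abort here therefore means the socket pointer passed to that function was either NULL or pointed at memory that does not carry this tag, which would fit a freed or otherwise invalid socket, though the log alone does not say which.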

I can get coredumps, but obviously they are rather big (100+ MB).

Using backtrace in gdb on one of them gives:
#0 0x00007fbfa0611166 in isc__nm_tcp_send ()
   from /usr/lib/x86_64-linux-gnu/libisc.so.1601
#1 0x00007fbfa08da707 in ?? () from /usr/lib/x86_64-linux-gnu/libns.so.1601
#2 0x00007fbfa08dbfe9 in ns_client_send ()
   from /usr/lib/x86_64-linux-gnu/libns.so.1601
#3 0x00007fbfa08ea9b0 in ?? () from /usr/lib/x86_64-linux-gnu/libns.so.1601
#4 0x00007fbfa08f29a7 in ns_query_done ()
   from /usr/lib/x86_64-linux-gnu/libns.so.1601
#5 0x00007fbfa08f416e in ?? () from /usr/lib/x86_64-linux-gnu/libns.so.1601
#6 0x00007fbfa08f46cd in ?? () from /usr/lib/x86_64-linux-gnu/libns.so.1601
#7 0x00007fbfa08f53c9 in ?? () from /usr/lib/x86_64-linux-gnu/libns.so.1601
#8 0x00007fbfa08fb4c6 in ?? () from /usr/lib/x86_64-linux-gnu/libns.so.1601
#9 0x00007fbfa061bfa1 in ?? () from /usr/lib/x86_64-linux-gnu/libisc.so.1601
#10 0x00007fbfa00e3609 in start_thread (arg=<optimised out>)
    at pthread_create.c:477
#11 0x00007fbfa0002133 in clone ()

The BIND named configuration is pretty similar across affected and unaffected machines. There are a few local domains, recursion is allowed for local networks, and forwarders are usually a couple of the ISP's DNS servers. The BIND config files would need to be scrubbed of any customer info before being supplied.

I've enabled logging of DNS queries using 'sudo rndc querylog on', and most of the traffic before the segfault looks routine, for example:
query: preview.web.skype.com IN TYPE65 +T
query: preview.web.skype.com IN A +T

Occasionally there are "no valid signature found" messages, for example:
Nov 24 14:52:47 hostname-removed named[819]: validating o115.p8.mailjet.com/A: no valid signature found
Nov 24 14:52:48 hostname-removed named[819]: validating o115.p8.mailjet.com/TXT: no valid signature found
Nov 24 14:53:55 hostname-removed named[819]: validating la1-c1-lo2.lo2.r.salesforceliveagent.com/A: no valid signature found
Nov 24 14:55:37 hostname-removed named[819]: validating la1-c1-lo2.lo2.r.salesforceliveagent.com/A: no valid signature found
Nov 24 14:57:19 hostname-removed named[819]: validating la1-c1-lo2.lo2.r.salesforceliveagent.com/A: no valid signature found

Lastly, I should mention that on Ubuntu 20.04 BIND named is not set to be restarted by systemd if it exits abnormally. After this had happened a couple of times, I added a local systemd config to restart it automatically, containing:
[Service]
Restart=on-failure

I also added a custom 'ExecStopPost' command to email us a warning and a summary of the log files.

Is there anything else I can provide to help?