[SRU] DHCP Cluster crashes after a few hours

Bug #1872118 reported by Andrew Welham on 2020-04-10
96
This bug affects 16 people
Affects Status Importance Assigned to Milestone
DHCP
New
Unknown
bind9-libs (Ubuntu)
Status tracked in Groovy
Focal
High
Jorge Niedbalski
Groovy
High
Jorge Niedbalski

Bug Description

[Description]

isc-dhcp-server uses libisc-export (coming from bind9-libs package) for handling the socket event(s) when configured in peer mode (master/secondary). It's possible that a sequence of messages dispatched by the master that requires acknowledgment from its peers holds a socket
in a pending to send state, a timer or a subsequent write request can be scheduled into this socket and the !sock->pending_send assertion
will be raised when trying to write again at the time data hasn't been flushed entirely and the pending_send flag hasn't been reset to 0 state.

If this race condition happens, the following stacktrace will be
hit:

(gdb) info threads
  Id Target Id Frame
* 1 Thread 0x7fb4ddecb700 (LWP 3170) __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
  2 Thread 0x7fb4dd6ca700 (LWP 3171) __lll_lock_wait (futex=futex@entry=0x7fb4de6d2028, private=0) at lowlevellock.c:52
  3 Thread 0x7fb4de6cc700 (LWP 3169) futex_wake (private=<optimized out>, processes_to_wake=1, futex_word=<optimized out>) at ../sysdeps/nptl/futex-internal.h:364
  4 Thread 0x7fb4de74f740 (LWP 3148) futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fb4de6cd0d0) at ../sysdeps/nptl/futex-internal.h:183

(gdb) frame 2
#2 0x00007fb4dec85985 in isc_assertion_failed (file=file@entry=0x7fb4decd8878 "../../../../lib/isc/unix/socket.c", line=line@entry=3361, type=type@entry=isc_assertiontype_insist,
    cond=cond@entry=0x7fb4decda033 "!sock->pending_send") at ../../../lib/isc/assertions.c:52
(gdb) bt
#1 0x00007fb4deaa7859 in __GI_abort () at abort.c:79
#2 0x00007fb4dec85985 in isc_assertion_failed (file=file@entry=0x7fb4decd8878 "../../../../lib/isc/unix/socket.c", line=line@entry=3361, type=type@entry=isc_assertiontype_insist,
    cond=cond@entry=0x7fb4decda033 "!sock->pending_send") at ../../../lib/isc/assertions.c:52
#3 0x00007fb4decc17e1 in dispatch_send (sock=0x7fb4de6d4990) at ../../../../lib/isc/unix/socket.c:4041
#4 process_fd (writeable=<optimized out>, readable=<optimized out>, fd=11, manager=0x7fb4de6d0010) at ../../../../lib/isc/unix/socket.c:4054
#5 process_fds (writefds=<optimized out>, readfds=0x7fb4de6d1090, maxfd=13, manager=0x7fb4de6d0010) at ../../../../lib/isc/unix/socket.c:4211
#6 watcher (uap=0x7fb4de6d0010) at ../../../../lib/isc/unix/socket.c:4397
#7 0x00007fb4dea68609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#8 0x00007fb4deba4103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

(gdb) frame 3
#3 0x00007fb4decc17e1 in dispatch_send (sock=0x7fb4de6d4990) at ../../../../lib/isc/unix/socket.c:4041
4041 in ../../../../lib/isc/unix/socket.c
(gdb) p sock->pending_send
$2 = 1

[TEST CASE]

1) Install isc-dhcp-server in 2 focal machine(s).
2) Configure peer/cluster mode as follows:
   Primary configuration: https://pastebin.ubuntu.com/p/XYj648MghK/
   Secondary configuration: https://pastebin.ubuntu.com/p/PYkcshZCWK/
2) Run dhcpd as follows in both machine(s)

# dhcpd -f -d -4 -cf /etc/dhcp/dhcpd.conf --no-pid ens4

3) Leave the cluster running for a long (2h) period until the crash/race condition is reproduced.

[REGRESSION POTENTIAL]

* The fix will prevent the assertion to happen in the dispatch_send
path, later versions of isch-dhcp upstream lack this logic and entirely removed the existence of this flag. Therefore, removing the need for this
assertion at process_fd shouldn't be problematic.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in isc-dhcp (Ubuntu):
status: New → Confirmed
Andrew Welham (andreww-ubuntu) wrote :

each server starts if the other is not running, but starting them together caused this issue

mm (mtl-0) wrote :

My current workaround: downgrade isc-dhcp-server to 18.04 package version.

Andrew Welham (andreww-ubuntu) wrote :

Thats not a bad idea, they It seems to be they are ignoring the critical issue. i.e you CAN@T RUN A PRODUCTION CLUSTER

Andrew Welham (andreww-ubuntu) wrote :

MM , Thanks I install the isc-dhcp-server from the Bionic-Security repository, and it appears to work.
For anyone else here is how to install an older version of the server,and then block updates to the focal version

PLEASE USE AT YOUR OWN RISK

#Backup your old config
sudo mkdir /etc/dhcp.20.04
sudo cp -r /etc/dhcp /etc/dhcp.20.04

#Remove your old isc-dhcp server
sudo apt remove isc-dhcp-server

#Add the old Bionic repository
add-apt-repository 'deb http://security.ubuntu.com/ubuntu bionic-security main'
sudo apt update

#Install isc-dhcp-server from the bionic-security repository
sudo apt-get install -t bionic-security/main isc-dhcp-server
sudo systemctl restart isc-dhcp-server

#Block upgrades of the package from the Focal Repository
edit /etc/apt/preferences.d/repo and add the following lines

Package: isc-dhcp-server
Pin: release a=bionic-security
Pin-Priority: 900

atom (adamcstephens) wrote :

Any update on this? I just spent a few hours troubleshooting this only to find this bug.

Andrew Welham (andreww-ubuntu) wrote :

No response, I actually wondering if I logged the bug in the wrong place. Anyone building a cluster of DHCP servers on Ubuntu 20.04 should run into this. The above or no one is using Ubuntu for DHCP clusters ?

Joakim Soendojo (yoakim) wrote :

I do have this problem running Ubuntu 20.04.
Have been fiddling around with config, starting and stopping, apparmor and what not.
Currently running it in single node with services stopped on the backup.

Not very keen on downgrading but might have to go that path or migrate to KEA.

A very big thanks for the downgrade fix Andrew.

Andrew Welham (andreww-ubuntu) wrote :

Thats what I did, but had a couple of outages, I had a choice of extending the lease time or downgrade. I did also try to disable apparmor, and dhcp gave an error (can't remember what though) and crashed I guess AppArmor just detected this and killed the process. So far about a week in ad every thing looks good.
The other option i guess (until they fix this) is to build the latest release from source. I did not do this due to no automatic patching

Peter A. Bigot (pab-u) wrote :

I just moved primary dns and dhcp from 16.04.6 LTS to 20.04 LTS and also observe this behavior.

Primary is isc-dhcp-server 4.4.1-2.1ubuntu5 amd64. Crashes regularly.

Secondary is isc-dhcp-server 4.3.5-3+deb9 armhf which runs fine.

tags: added: focal rls-ff-incoming
Andrew Welham (andreww-ubuntu) wrote :

Looking at the history around isc-dhcp, it appears if i read it properly that isc-dhcp sever for Ubuntu has been abandoned.
If this is the case should it even be in the focal build

Tomek Mrugalski (spam-ubuntu) wrote :

Can someone affected by this bug compile 4.4.2 and see if this problem persists? You can get the latest sources from here: https://ftp.isc.org/isc/dhcp/4.4.2/ Compilation is trivial (untar, ./configure && make). The failover is enabled by default, so no need for extra switches to configure.

If it does crash, please report it upstream here: https://gitlab.isc.org/isc-projects/dhcp/-/issues

Ubuntu 20.04 has packages for 4.4.1 (released Feb 2018). The latest released from ISC (upstream) is 4.4.2, which went out on Jan 2020.

Andrew Welham (andreww-ubuntu) wrote :

tried it on a single server in the cluster and it looks like it better, i.e it actually launches, however there is no -user or -group options, with different directories and it must be started as root (with no change of id_ to start, so not secure. I guess thee other changes are added by ubuntu

Tomek Mrugalski (spam-ubuntu) wrote :

Andrew, thanks for checking this up. Ok, so it looks like a problem
with Ubuntu patches.

Andrew Welham (andreww-ubuntu) wrote :

Tomek, Please remember I tried the version 4.4.2 as suggested and as you say Ubuntu used 4.4.1, so it could be something to do with the 4.4.1 release or the ubuntu changes. Either way it looks like its not going to be fixed. Given this how do you apply to have a ISC-DHCP_Server removed form the Ubuntu bundle for no maintenance.

Tomek Mrugalski (spam-ubuntu) wrote :

I don't have specific recommendation on this.

The best course of action would be to find a new maintainer that would update the packages to 4.4.2. The upstream (ISC) does very infrequent releases. We had 4.4.2 coming out in Jan 2020. We don't know when the next one (4.4.3) will come out, but we're no longer actively developing it. We're fixing serious bugs reported by customers (and addressing security issues if found), so the future releases are driven by that.

My point here is that this isn't that much work to do. Perhaps existing maintainers could be convinced to do the package updates?

Also, the alternative - ISC Kea - is much superior in many regards. It's modern, can (but doesn't have to) use MySQL, Postgres and other databases, has much better performance, is extensible with hooks, much cleaner configuration using JSON, ability to alter almost every aspect of its configuration during runtime, the dev version already support multi-threading etc. This is the ultimate solution to that problem: migrate to Kea and stop using isc-dhcp altogether.

Finally, ISC provides packages for Kea for many systems, including Ubuntu. We can work with package maintainers if there are specific patches that are needed. We don't and we don't be able to do that for isc-dhcp.

Richard Laager (rlaager) wrote :

I was able to reproduce this with 4.4.2 plus the Ubuntu packaging. I did not try with stock 4.4.2 from source.

Changed in dhcp:
status: Unknown → New
Jorge Niedbalski (niedbalski) wrote :

Hello,

I am trying to setup a reproducer for the mentioned issue. I have 2
machines acting as peers with the following versions:

# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.1 LTS
Release: 20.04
Codename: focal

# dpkg -l |grep -i isc-dh
ii isc-dhcp-client 4.4.1-2.1ubuntu5 amd64 DHCP client for automatically obtaining an IP address
ii isc-dhcp-common 4.4.1-2.1ubuntu5 amd64 common manpages relevant to all of the isc-dhcp packages
ii isc-dhcp-server 4.4.1-2.1ubuntu5 amd64 ISC DHCP server for automatic IP address assignment

=====

Primary configuration: https://pastebin.ubuntu.com/p/XYj648MghK/
Secondary configuration: https://pastebin.ubuntu.com/p/PYkcshZCWK/

Started with:

# dhcpd -f -d -4 -cf /etc/dhcp/dhcpd.conf --no-pid ens4

---> Raised some DHCP requests to these servers.

balanced pool 560b8c263f40 12 total 221 free 111 backup 110 lts 0 max-misbal 33
Sending updates to failover-partner.
failover peer failover-partner: peer moves from recover-done to normal
failover peer failover-partner: Both servers normal
DHCPDISCOVER from 52:54:00:2d:53:93 via ens4
DHCPOFFER on 10.19.101.120 to 52:54:00:2d:53:93 (glistening-elephant) via ens4
DHCPREQUEST for 10.19.101.120 (10.19.101.236) from 52:54:00:2d:53:93 (glistening-elephant) via ens4
DHCPACK on 10.19.101.120 to 52:54:00:2d:53:93 (glistening-elephant) via ens4

DHCPREQUEST for 10.19.101.120 from 52:54:00:2d:53:93 (glistening-elephant) via ens4
DHCPACK on 10.19.101.120 to 52:54:00:2d:53:93 (glistening-elephant) via ens4
DHCPREQUEST for 10.19.101.121 from 52:54:00:53:a3:d8 (valiant-motmot) via ens4
DHCPACK on 10.19.101.121 to 52:54:00:53:a3:d8 (valiant-motmot) via ens4

---

failover peer failover-partner: Both servers normal
balancing pool 5606b2c95f10 12 total 221 free 221 backup 0 lts -110 max-own (+/-)22
balanced pool 5606b2c95f10 12 total 221 free 221 backup 0 lts -110 max-misbal 33
balancing pool 5606b2c95f10 12 total 221 free 111 backup 110 lts 0 max-own (+/-)22
balanced pool 5606b2c95f10 12 total 221 free 111 backup 110 lts 0 max-misbal 33
DHCPDISCOVER from 52:54:00:2d:53:93 via ens4: load balance to peer failover-partner
DHCPREQUEST for 10.19.101.120 (10.19.101.236) from 52:54:00:2d:53:93 via ens4: lease owned by peer

So far (after 1.5h) no crash has been reported in any of the servers.

Questions:

1) Anything missed from the provided configuration?
2) Is this load or concurrency related? meaning a specific amount of leases needs to be allocated for this crash to happen?

I will take a look to an existing crash/coredump.

Jorge Niedbalski (niedbalski) wrote :
Download full text (3.2 KiB)

Hello,

I checked the backtrace of a crashed dhcpd running on 4.4.1-2.1ubuntu5.

(gdb) info threads
  Id Target Id Frame
* 1 Thread 0x7fb4ddecb700 (LWP 3170) __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
  2 Thread 0x7fb4dd6ca700 (LWP 3171) __lll_lock_wait (futex=futex@entry=0x7fb4de6d2028, private=0) at lowlevellock.c:52
  3 Thread 0x7fb4de6cc700 (LWP 3169) futex_wake (private=<optimized out>, processes_to_wake=1, futex_word=<optimized out>) at ../sysdeps/nptl/futex-internal.h:364
  4 Thread 0x7fb4de74f740 (LWP 3148) futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fb4de6cd0d0) at ../sysdeps/nptl/futex-internal.h:183

(gdb) frame 2
#2 0x00007fb4dec85985 in isc_assertion_failed (file=file@entry=0x7fb4decd8878 "../../../../lib/isc/unix/socket.c", line=line@entry=3361, type=type@entry=isc_assertiontype_insist,
    cond=cond@entry=0x7fb4decda033 "!sock->pending_send") at ../../../lib/isc/assertions.c:52
(gdb) bt
#1 0x00007fb4deaa7859 in __GI_abort () at abort.c:79
#2 0x00007fb4dec85985 in isc_assertion_failed (file=file@entry=0x7fb4decd8878 "../../../../lib/isc/unix/socket.c", line=line@entry=3361, type=type@entry=isc_assertiontype_insist,
    cond=cond@entry=0x7fb4decda033 "!sock->pending_send") at ../../../lib/isc/assertions.c:52
#3 0x00007fb4decc17e1 in dispatch_send (sock=0x7fb4de6d4990) at ../../../../lib/isc/unix/socket.c:4041
#4 process_fd (writeable=<optimized out>, readable=<optimized out>, fd=11, manager=0x7fb4de6d0010) at ../../../../lib/isc/unix/socket.c:4054
#5 process_fds (writefds=<optimized out>, readfds=0x7fb4de6d1090, maxfd=13, manager=0x7fb4de6d0010) at ../../../../lib/isc/unix/socket.c:4211
#6 watcher (uap=0x7fb4de6d0010) at ../../../../lib/isc/unix/socket.c:4397
#7 0x00007fb4dea68609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#8 0x00007fb4deba4103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

(gdb) frame 3
#3 0x00007fb4decc17e1 in dispatch_send (sock=0x7fb4de6d4990) at ../../../../lib/isc/unix/socket.c:4041
4041 in ../../../../lib/isc/unix/socket.c
(gdb) p sock->pending_send
$2 = 1

The code is crashing on this assertion: https://gitlab.isc.org/isc-projects/bind9/-/blob/v9_11_3/lib/isc/unix/socket.c#L3364
This was already reported and marked as fixed in debian (?) via [0]

""Now if a wakeup event occurres the socket would be dispatched for
processing regardless which kind of event (timer?) triggered the wakeup.
At least I did not find any sanity checks in process_fds() except
SOCK_DEAD(sock).

This leads to the following situation: The sock is not dead yet but it
is still pending when it is dispatched again.

I would now check sock->pending_send before calling dispatch_send().This
 would at least prevent the assertion failure - well knowing that the
situation described above ( not dead but still pending and alerting ) is
not a very pleasant one - until someone comes up with a better solution.
"""

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=430065#20

** Follow up questions:

0) The reproducer doesn't seems consistent and seems to be related to a race
condition associated with a internal timer/futex.
1) Can anyone confirm that a pristine...

Read more...

Jorge Niedbalski (niedbalski) wrote :

** Follow up from my previous comment, I've built a PPA with a fix similar to the one
pointed in 430065.

https://launchpad.net/~niedbalski/+archive/ubuntu/fix-1872118

* I'd appreciate if anyone can test that PPA with focal and report back if the
problem keeps reproducible with that version.

In case it does, please upload the crash file / coredump and the configuration
file used.

Thank you.

Andrew Welham (andreww-ubuntu) wrote :

As my systems are pinned not to accept isc-dhcp-server upgrades, as I run the 18.04 version as it works. I extracted the dhcpd file from your packages and dropped it in place.
It ran for about 5 hours with thousands of requests.
Then it crashed with

Aug 4 13:13:29 gw2-focal sh[63448]: ../../../../lib/isc/unix/socket.c:3361: INSIST(!sock->pending_send) failed, back trace
Aug 4 13:13:29 gw2-focal sh[63448]: #0 0x7f0ed6a02a4a in ??
Aug 4 13:13:29 gw2-focal sh[63448]: #1 0x7f0ed6a02980 in ??
Aug 4 13:13:29 gw2-focal sh[63448]: #2 0x7f0ed6a3e7e1 in ??
Aug 4 13:13:29 gw2-focal sh[63448]: #3 0x7f0ed67e5609 in ??
Aug 4 13:13:29 gw2-focal sh[63448]: #4 0x7f0ed6921103 in ??

In the syslog

The start of the crash (up to the base64 dump) is

ProblemType: Crash
Architecture: amd64
Date: Tue Aug 4 13:13:30 2020
DistroRelease: Ubuntu 20.04
ExecutablePath: /usr/sbin/dhcpd
ExecutableTimestamp: 1596530249
ProcCmdline: dhcpd -user dhcpd -group dhcpd -f -4 -pf /run/dhcp-server/dhcpd.pid -cf /etc/dhcp/dhcpd.conf
ProcEnviron: Error: [Errno 13] Permission denied: 'environ'
ProcMaps: Error: [Errno 13] Permission denied: 'maps'
ProcStatus:
 Name: dhcpd
 Umask: 0022
 State: D (disk sleep)
 Tgid: 63448
 Ngid: 0
 Pid: 63448
 PPid: 1
 TracerPid: 0
 Uid: 113 113 113 113
 Gid: 118 118 118 118
 FDSize: 128
 Groups:
 NStgid: 63448
 NSpid: 63448
 NSpgid: 63448
 NSsid: 63448
 VmPeak: 235996 kB
 VmSize: 170540 kB
 VmLck: 0 kB
 VmPin: 0 kB
 VmHWM: 12468 kB
 VmRSS: 12468 kB
 RssAnon: 6216 kB
 RssFile: 6252 kB
 RssShmem: 0 kB
 VmData: 30872 kB
 VmStk: 132 kB
 VmExe: 592 kB
 VmLib: 5424 kB
 VmPTE: 88 kB
 VmSwap: 0 kB
 HugetlbPages: 0 kB
 CoreDumping: 1
 THP_enabled: 1
 Threads: 4
 SigQ: 0/31897
 SigPnd: 0000000000000000
 ShdPnd: 0000000000000000
 SigBlk: 0000000000000000
 SigIgn: 0000000000001000
 SigCgt: 0000000180000000
 CapInh: 0000000000000000
 CapPrm: 0000000000000000
 CapEff: 0000000000000000
 CapBnd: 0000003fffffffff
 CapAmb: 0000000000000000
 NoNewPrivs: 0
 Seccomp: 0
 Speculation_Store_Bypass: thread vulnerable
 Cpus_allowed: f
 Cpus_allowed_list: 0-3
 Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0000
0000,00000000,00000000,00000001
 Mems_allowed_list: 0
 voluntary_ctxt_switches: 40
 nonvoluntary_ctxt_switches: 19
Signal: 6
Uname: Linux 5.4.0-42-generic x86_64
UserGroups: N/A

Jorge Niedbalski (niedbalski) wrote :

Hello Andrew,

The fix is on the libisc-export1105 library (check dependencies on: https://launchpad.net/~niedbalski/+archive/ubuntu/fix-1872118/+packages), just replacing
the dhcpd binary won't be enough.

If you can install isc-dhcp-server and its dependencies from the PPA and test, would be great.

Thanks for any feedback.

Andrew Welham (andreww-ubuntu) wrote :

I've done the apt install on one of the failover server (Its the one that mostly crashes)

Listing... Done
isc-dhcp-server/focal,now 4.4.1-2.1ubuntu6~ppa1 amd64 [installed]

still crashed

syslog shows
Aug 4 17:40:44 gw2-focal sh[74365]: ../../../../lib/isc/unix/socket.c:3361: INSIST(!sock->pending_send) failed, back trace
Aug 4 17:40:44 gw2-focal sh[74365]: #0 0x7f2302a7fa4a in ??
Aug 4 17:40:44 gw2-focal sh[74365]: #1 0x7f2302a7f980 in ??
Aug 4 17:40:44 gw2-focal sh[74365]: #2 0x7f2302abb7e1 in ??
Aug 4 17:40:44 gw2-focal sh[74365]: #3 0x7f2302862609 in ??
Aug 4 17:40:44 gw2-focal sh[74365]: #4 0x7f230299e103 in ??
Aug 4 17:40:44 gw2-focal systemd[1]: isc-dhcp-server.service: Main process exited, code=killed, status=6/ABRT
Aug 4 17:40:44 gw2-focal systemd[1]: isc-dhcp-server.service: Failed with result 'signal'.

text part of crash shows

ProblemType: Crash
Architecture: amd64
Date: Tue Aug 4 17:44:17 2020
DistroRelease: Ubuntu 20.04
ExecutablePath: /usr/sbin/dhcpd
ExecutableTimestamp: 1596511208
ProcCmdline: dhcpd -user dhcpd -group dhcpd -f -4 -pf /run/dhcp-server/dhcpd.pid -cf /etc/dhcp/dhcpd.conf
ProcEnviron: Error: [Errno 13] Permission denied: 'environ'
ProcMaps: Error: [Errno 13] Permission denied: 'maps'
ProcStatus:
 Name: dhcpd
 Umask: 0022
 State: D (disk sleep)
 Tgid: 74528
 Ngid: 0
 Pid: 74528
 PPid: 1
 TracerPid: 0
 Uid: 113 113 113 113
 Gid: 118 118 118 118
 FDSize: 128
 Groups:
 NStgid: 74528
 NSpid: 74528
 NSpgid: 74528
 NSsid: 74528
 VmPeak: 235996 kB
 VmSize: 170540 kB
 VmLck: 0 kB
 VmPin: 0 kB
 VmHWM: 12088 kB
 VmRSS: 11904 kB
 RssAnon: 5784 kB
 RssFile: 6120 kB
 RssShmem: 0 kB
 VmData: 30568 kB
 VmStk: 132 kB
 VmExe: 592 kB
 VmLib: 5424 kB
 VmPTE: 88 kB
 VmSwap: 0 kB
 HugetlbPages: 0 kB
 CoreDumping: 1
 THP_enabled: 1
 Threads: 4
 SigQ: 0/31897
 SigPnd: 0000000000000000
 ShdPnd: 0000000000000000
 SigBlk: 0000000000000000
 SigIgn: 0000000000001000
 SigCgt: 0000000180000000
 CapInh: 0000000000000000
 CapPrm: 0000000000000000
 CapEff: 0000000000000000
 CapBnd: 0000003fffffffff
 CapAmb: 0000000000000000
 NoNewPrivs: 0
 Seccomp: 0
 Speculation_Store_Bypass: thread vulnerable
 Cpus_allowed: f
 Cpus_allowed_list: 0-3
 Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0000
0000,00000000,00000000,00000001
 Mems_allowed_list: 0
 voluntary_ctxt_switches: 22
 nonvoluntary_ctxt_switches: 20
Signal: 6
Uname: Linux 5.4.0-42-generic x86_64
UserGroups: N/A
CoreDump: base64

Andrew Welham (andreww-ubuntu) wrote :

error in the above description should have been
I've done the apt install on one of SERVERS the failover server (Its the one that mostly crashes)

Listing... Done
isc-dhcp-server/focal,now 4.4.1-2.1ubuntu6~ppa1 amd64 [installed]

still crashed

syslog shows
Aug 4 17:40:44 gw2-focal sh[74365]: ../../../../lib/isc/unix/socket.c:3361: INSIST(!sock->pending_send) failed, back trace
Aug 4 17:40:44 gw2-focal sh[74365]: #0 0x7f2302a7fa4a in ??
Aug 4 17:40:44 gw2-focal sh[74365]: #1 0x7f2302a7f980 in ??
Aug 4 17:40:44 gw2-focal sh[74365]: #2 0x7f2302abb7e1 in ??
Aug 4 17:40:44 gw2-focal sh[74365]: #3 0x7f2302862609 in ??
Aug 4 17:40:44 gw2-focal sh[74365]: #4 0x7f230299e103 in ??
Aug 4 17:40:44 gw2-focal systemd[1]: isc-dhcp-server.service: Main process exited, code=killed, status=6/ABRT
Aug 4 17:40:44 gw2-focal systemd[1]: isc-dhcp-server.service: Failed with result 'signal'.

text part of crash shows

ProblemType: Crash
Architecture: amd64
Date: Tue Aug 4 17:44:17 2020
DistroRelease: Ubuntu 20.04
ExecutablePath: /usr/sbin/dhcpd
ExecutableTimestamp: 1596511208
ProcCmdline: dhcpd -user dhcpd -group dhcpd -f -4 -pf /run/dhcp-server/dhcpd.pid -cf /etc/dhcp/dhcpd.conf
ProcEnviron: Error: [Errno 13] Permission denied: 'environ'
ProcMaps: Error: [Errno 13] Permission denied: 'maps'
ProcStatus:
 Name: dhcpd
 Umask: 0022
 State: D (disk sleep)
 Tgid: 74528
 Ngid: 0
 Pid: 74528
 PPid: 1
 TracerPid: 0
 Uid: 113 113 113 113
 Gid: 118 118 118 118
 FDSize: 128
 Groups:
 NStgid: 74528
 NSpid: 74528
 NSpgid: 74528
 NSsid: 74528
 VmPeak: 235996 kB
 VmSize: 170540 kB
 VmLck: 0 kB
 VmPin: 0 kB
 VmHWM: 12088 kB
 VmRSS: 11904 kB
 RssAnon: 5784 kB
 RssFile: 6120 kB
 RssShmem: 0 kB
 VmData: 30568 kB
 VmStk: 132 kB
 VmExe: 592 kB
 VmLib: 5424 kB
 VmPTE: 88 kB
 VmSwap: 0 kB
 HugetlbPages: 0 kB
 CoreDumping: 1
 THP_enabled: 1
 Threads: 4
 SigQ: 0/31897
 SigPnd: 0000000000000000
 ShdPnd: 0000000000000000
 SigBlk: 0000000000000000
 SigIgn: 0000000000001000
 SigCgt: 0000000180000000
 CapInh: 0000000000000000
 CapPrm: 0000000000000000
 CapEff: 0000000000000000
 CapBnd: 0000003fffffffff
 CapAmb: 0000000000000000
 NoNewPrivs: 0
 Seccomp: 0
 Speculation_Store_Bypass: thread vulnerable
 Cpus_allowed: f
 Cpus_allowed_list: 0-3
 Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0000
0000,00000000,00000000,00000001
 Mems_allowed_list: 0
 voluntary_ctxt_switches: 22
 nonvoluntary_ctxt_switches: 20
Signal: 6
Uname: Linux 5.4.0-42-generic x86_64
UserGroups: N/A
CoreDump: base64

Andrew Welham (andreww-ubuntu) wrote :

also noticed this APPARMOR is not happy now
Aug 4 09:37:37 gw2-focal kernel: [257023.493017] audit: type=1400 audit(1596530257.080:51): apparmor="DENIED" operation="open" profile="/usr/sbin/dhcpd" name="/proc/63448/task/63463/comm" pid=63448 comm="dhcpd" requested_mask="wr" denied_mask="wr" fsuid=0 ouid=0
Aug 4 09:37:37 gw2-focal kernel: [257023.493726] audit: type=1400 audit(1596530257.084:52): apparmor="DENIED" operation="open" profile="/usr/sbin/dhcpd" name="/proc/63448/task/63464/comm" pid=63448 comm="dhcpd" requested_mask="wr" denied_mask="wr" fsuid=0 ouid=0
Aug 4 09:37:37 gw2-focal kernel: [257023.494441] audit: type=1400 audit(1596530257.084:53): apparmor="DENIED" operation="open" profile="/usr/sbin/dhcpd" name="/proc/63448/task/63465/comm" pid=63448 comm="dhcpd" requested_mask="wr" denied_mask="wr" fsuid=0 ouid=0
Aug 4 13:49:39 gw2-focal kernel: [272145.694002] audit: type=1400 audit(1596545379.184:54): apparmor="DENIED" operation="open" profile="/usr/sbin/dhcpd" name="/proc/67792/task/67804/comm" pid=67792 comm="dhcpd" requested_mask="wr" denied_mask="wr" fsuid=0 ouid=0
Aug 4 13:49:39 gw2-focal kernel: [272145.694558] audit: type=1400 audit(1596545379.184:55): apparmor="DENIED" operation="open" profile="/usr/sbin/dhcpd" name="/proc/67792/task/67805/comm" pid=67792 comm="dhcpd" requested_mask="wr" denied_mask="wr" fsuid=0 ouid=0
Aug 4 13:49:39 gw2-focal kernel: [272145.694837] audit: type=1400 audit(1596545379.184:56): apparmor="DENIED" operation="open" profile="/usr/sbin/dhcpd" name="/proc/67792/task/67806/comm" pid=67792 comm="dhcpd" requested_mask="wr" denied_mask="wr" fsuid=0 ouid=0
Aug 4 15:20:30 gw2-focal kernel: [277597.307707] audit: type=1400 audit(1596550830.760:57): apparmor="DENIED" operation="open" profile="/usr/sbin/dhcpd" name="/proc/69853/task/69865/comm" pid=69853 comm="dhcpd" requested_mask="wr" denied_mask="wr" fsuid=0 ouid=0
Aug 4 15:20:30 gw2-focal kernel: [277597.308533] audit: type=1400 audit(1596550830.760:58): apparmor="DENIED" operation="open" profile="/usr/sbin/dhcpd" name="/proc/69853/task/69867/comm" pid=69853 comm="dhcpd" requested_mask="wr" denied_mask="wr" fsuid=0 ouid=0
Aug 4 15:20:30 gw2-focal kernel: [277597.309043] audit: type=1400 audit(1596550830.760:59): apparmor="DENIED" operation="open" profile="/usr/sbin/dhcpd" name="/proc/69853/task/69868/comm" pid=69853 comm="dhcpd" requested_mask="wr" denied_mask="wr" fsuid=0 ouid=0
Aug 4 17:19:27 gw2-focal kernel: [284733.739418] audit: type=1400 audit(1596557967.150:60): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/sbin/dhcpd" pid=73881 comm="apparmor_parser"

Jorge Niedbalski (niedbalski) wrote :

Hello Andrew,

Correct me if I am wrong but it seems your system isn't running with libisc-export1105:amd64 1:9.11.16+dfsg-3~ppa1 (?)

I am running the following packages from the PPA, please note that libisc-export1105 is required (that's
where the fix is located).

root@dhcpd1:/home/ubuntu# dpkg -l | grep ppa1
ii isc-dhcp-server 4.4.1-2.1ubuntu6~ppa1 amd64 ISC DHCP server for automatic IP address assignment
ii libirs-export161 1:9.11.16+dfsg-3~ppa1 amd64 Exported IRS Shared Library
ii libisc-export1105:amd64 1:9.11.16+dfsg-3~ppa1 amd64 Exported ISC Shared Library
ii libisccfg-export163 1:9.11.16+dfsg-3~ppa1 amd64 Exported ISC CFG Shared Library

---

Sent update done message to failover-partner
failover peer failover-partner: peer moves from recover to recover-done
failover peer failover-partner: peer update completed.
failover peer failover-partner: I move from recover to recover-done
failover peer failover-partner: peer moves from recover-done to normal
failover peer failover-partner: I move from recover-done to normal
failover peer failover-partner: Both servers normal
balancing pool 55d0a88a4f10 12 total 221 free 221 backup 0 lts -110 max-own (+/-)22
balanced pool 55d0a88a4f10 12 total 221 free 221 backup 0 lts -110 max-misbal 33

---

balanced pool 55eb2fe58f40 12 total 221 free 111 backup 110 lts 0 max-misbal 33
Sending updates to failover-partner.
failover peer failover-partner: peer moves from recover-done to normal
failover peer failover-partner: Both servers normal

---
DHCPDISCOVER from 52:54:00:d1:eb:66 via ens4: load balance to peer failover-partner
DHCPREQUEST for 10.19.101.120 (10.19.101.233) from 52:54:00:d1:eb:66 via ens4: lease owned by peer

On failover:

peer failover-partner: disconnected
failover peer failover-partner: I move from normal to communications-interrupted
DHCPDISCOVER from 52:54:00:e8:14:0a via ens4
DHCPOFFER on 10.19.101.10 to 52:54:00:e8:14:0a (shapely-peccary) via ens4
DHCPREQUEST for 10.19.101.10 (10.19.101.127) from 52:54:00:e8:14:0a (shapely-peccary) via ens4
DHCPACK on 10.19.101.10 to 52:54:00:e8:14:0a (shapely-peccary) via ens4

I'll leave this running until I can reproduce the crash or assume that the fix
works.

Please let me know if you can reproduce with those packages.

Thanks,

Jorge Niedbalski

Andrew Welham (andreww-ubuntu) wrote :

isc-dhcp-server failed to start after that, a reboot fixed it and it appeared to stabilise. Then as a test i upgraded the primary server.
when the primary server came on line the secondary went off line and the stability of the environment went down fast.
Remove the patched version and installed the version from 18.04 to be able to stay online

Andrew Welham (andreww-ubuntu) wrote :
Download full text (11.0 KiB)

My config files are

PRIMARY SERVER CONFIG

default-lease-time 600;
max-lease-time 7200;

failover peer "failover-partner" {
        primary;
        address 192.168.0.2;
        port 519;
        peer address 192.168.0.3;
        peer port 521;
        max-response-delay 60;
        max-unacked-updates 10;
        mclt 3600;
        split 128;
        load balance max seconds 3;
    }

omapi-port 7911;
omapi-key omapi_key;

key omapi_key {
     algorithm hmac-md5;
     secret REMOVED;
}

subnet 192.168.0.0 netmask 255.255.255.0 {
 option routers 192.168.0.1;
 option domain-name-servers 192.168.0.2, 192.168.0.3;
 option domain-name "mydomain";
 option domain-search ".", "mydomain";
 option ntp-servers NTPHOST1, NTPHOST2;
 log-facility local7;
 include "/etc/dhcp/192.168.0-dhcp-static-leases.conf";
# include"/etc/dhcp/BANNED-MACs-dhcp-static-leases.conf";
 pool {
        failover peer "failover-partner";
        range 192.168.0.6 192.168.0.254;
        deny unknown-clients;
 }
}

subnet 192.168.1.0 netmask 255.255.255.0 {
 deny unknown-clients;
 option routers 192.168.1.1;
 option domain-name-servers 192.168.1.2, 192.168.1.3;
 option domain-name "mydomain";
 option domain-search ".", "mydomain";
 option ntp-servers NTPHOST1, NTPHOST2;
 log-facility local7;
 include "/etc/dhcp/192.168.1-dhcp-static-leases.conf";
# include"/etc/dhcp/BANNED-MACs-dhcp-static-leases.conf";
 pool {
         failover peer "failover-partner";
         range 192.168.1.6 192.168.1.254;
         deny unknown-clients;

 }
}

subnet 192.168.2.0 netmask 255.255.255.0 {
 deny unknown-clients;
 option routers 192.168.2.1;
 option domain-name-servers 192.168.2.2, 192.168.2.3;
 option domain-name "mydomain";
 option domain-search ".", "mydomain";
 option ntp-servers NTPHOST1, NTPHOST2;
 log-facility local7;
 include "/etc/dhcp/192.168.2-dhcp-static-leases.conf";
# include"/etc/dhcp/BANNED-MACs-dhcp-static-leases.conf";
 pool {
        failover peer "failover-partner";
        range 192.168.2.6 192.168.2.254;
        deny unknown-clients;

 }
}

subnet 192.168.3.0 netmask 255.255.255.0 {
 deny unknown-clients;
 option routers 192.168.3.1;
 option domain-name-servers 192.168.3.2, 192.168.3.3;
 option domain-name "mydomain";
 option domain-search ".", "mydomain";
 option ntp-servers NTPHOST1, NTPHOST2;
 log-facility local7;
 include "/etc/dhcp/192.168.3-dhcp-static-leases.conf";
# include"/etc/dhcp/BANNED-MACs-dhcp-static-leases.conf";
 pool {
        failover peer "failover-partner";
        range 192.168.3.6 192.168.3.254;
        deny unknown-clients;
 }
}

subnet 192.168.4.0 netmask 255.255.255.0 {
 deny unknown-clients;
 option routers 192.168.4.1;
 option domain-name-servers 192.168.4.2, 192.168.4.3;
 option domain-name "mydomain";
 option domain-search ".", "mydomain";
 option ntp-servers NTPHOST1, NTPHOST2;
 log-facility local7;
 include "/etc/dhcp/192.168.4-dhcp-static-leases.conf";
# include"/etc/dhcp/BANNED-MACs-dhcp-static-leases.conf";
 pool {
        failover peer "failover-partner";
        range 192.168.4.6 192.168.4.254;
        deny unknown-clients;
 }
}

subnet 192.168.5.0 netmask 255.255.255.0 {
 deny unknown-clients;
 option routers 192.168.5.1;
 o...

Jorge Niedbalski (niedbalski) wrote :

Andrew,

Thank you for your input.

** Do you have any logs or a crash report to take a look after you upgraded these systems?

On my test lab , I am counting for 3+ hours without a crash.

root@dhcpd1:/home/ubuntu# dpkg -l | grep ppa1
ii isc-dhcp-server 4.4.1-2.1ubuntu6~ppa1 amd64 ISC DHCP server for automatic IP address assignment
ii libirs-export161 1:9.11.16+dfsg-3~ppa1 amd64 Exported IRS Shared Library
ii libisc-export1105:amd64 1:9.11.16+dfsg-3~ppa1 amd64 Exported ISC Shared Library
ii libisccfg-export163 1:9.11.16+dfsg-3~ppa1 amd64 Exported ISC CFG Shared Library

---

DHCPACK on 10.19.101.120 to 52:54:00:d1:eb:66 (sleek-kodiak) via ens4
balancing pool 555643e55f40 12 total 221 free 111 backup 110 lts 0 max-own (+/-)22
balanced pool 555643e55f40 12 total 221 free 111 backup 110 lts 0 max-misbal 33
balancing pool 555643e55f40 12 total 221 free 111 backup 110 lts 0 max-own (+/-)22
balanced pool 555643e55f40 12 total 221 free 111 backup 110 lts 0 max-misbal 33

---

balancing pool 5595dff0df10 12 total 221 free 111 backup 110 lts 0 max-own (+/-)22
balanced pool 5595dff0df10 12 total 221 free 111 backup 110 lts 0 max-misbal 33
balancing pool 5595dff0df10 12 total 221 free 111 backup 110 lts 0 max-own (+/-)22
balanced pool 5595dff0df10 12 total 221 free 111 backup 110 lts 0 max-misbal 33

---

Richard Laager (rlaager) wrote :

I ran:
sudo apt install \
    isc-dhcp-server=4.4.1-2.1ubuntu6~ppa1 \
    libdns-export1109=1:9.11.16+dfsg-3~ppa1 \
    libirs-export161=1:9.11.16+dfsg-3~ppa1 \
    libisc-export1105=1:9.11.16+dfsg-3~ppa1 && \
        sudo systemctl restart isc-dhcp-server

The restart at the end was just for extra good measure, to make sure I was running on the new libraries.

I'm coming up on 3 hours running, which is a good sign.

Andrew Welham (andreww-ubuntu) wrote :

Last time in only did
sudo apt install \
    isc-dhcp-server=4.4.1-2.1ubuntu6~ppa1 \ sudo systemctl restart isc-dhcp-server

This time i ried as someone else had done.

sudo apt install \
    isc-dhcp-server=4.4.1-2.1ubuntu6~ppa1 \
    libdns-export1109=1:9.11.16+dfsg-3~ppa1 \
    libirs-export161=1:9.11.16+dfsg-3~ppa1 \
    libisc-export1105=1:9.11.16+dfsg-3~ppa1 && \

I've uploaded the full crash dump

Chris Newcomer (cnewcomer) wrote :

Running over a day with the packages from the PPA in place:
● isc-dhcp-server.service - ISC DHCP IPv4 server
     Loaded: loaded (/lib/systemd/system/isc-dhcp-server.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2020-08-04 16:23:19 UTC; 1 day 1h ago
       Docs: man:dhcpd(8)
   Main PID: 592 (dhcpd)
      Tasks: 4 (limit: 4620)
     Memory: 6.4M
     CGroup: /system.slice/isc-dhcp-server.service
             └─592 dhcpd -user dhcpd -group dhcpd -f -4 -pf /run/dhcp-server/dhcpd.pid -cf /etc/dhcp/dhcpd.conf

root@unique-macaw:/home/ubuntu# dpkg -l | grep ppa1
ii isc-dhcp-client 4.4.1-2.1ubuntu6~ppa1 amd64 DHCP client for automatically obtaining an IP address
ii isc-dhcp-common 4.4.1-2.1ubuntu6~ppa1 amd64 common manpages relevant to all of the isc-dhcp packages
ii isc-dhcp-server 4.4.1-2.1ubuntu6~ppa1 amd64 ISC DHCP server for automatic IP address assignment
ii libdns-export1109 1:9.11.16+dfsg-3~ppa1 amd64 Exported DNS Shared Library
ii libirs-export161 1:9.11.16+dfsg-3~ppa1 amd64 Exported IRS Shared Library
ii libisc-export1105:amd64 1:9.11.16+dfsg-3~ppa1 amd64 Exported ISC Shared Library
ii libisccfg-export163 1:9.11.16+dfsg-3~ppa1 amd64 Exported ISC CFG Shared Library

Jorge Niedbalski (niedbalski) wrote :

OK, I have no crashes to report for the last 24 hours with the PPA included here.

● isc-dhcp-server.service - ISC DHCP IPv4 server
     Loaded: loaded (/lib/systemd/system/isc-dhcp-server.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2020-08-04 14:58:11 UTC; 1 day 1h ago
       Docs: man:dhcpd(8)
   Main PID: 1202 (dhcpd)
      Tasks: 5 (limit: 5882)
     Memory: 6.3M
     CGroup: /system.slice/isc-dhcp-server.service
             └─592 dhcpd -user dhcpd -group dhcpd -f -4 -pf /run/dhcp-server/dhcpd.pid -cf /etc/dhcp/dhcpd.conf

root@dhcpd1:/home/ubuntu# dpkg -l | grep ppa1
ii isc-dhcp-server 4.4.1-2.1ubuntu6~ppa1 amd64 ISC DHCP server for automatic IP address assignment
ii libirs-export161 1:9.11.16+dfsg-3~ppa1 amd64 Exported IRS Shared Library
ii libisc-export1105:amd64 1:9.11.16+dfsg-3~ppa1 amd64 Exported ISC Shared Library
ii libisccfg-export163 1:9.11.16+dfsg-3~ppa1 amd64 Exported ISC CFG Shared Library

Andrew, what's the current version of libisc-export1105:amd64 installed in your system? can you provide a dpkg -l output and a systemctl status for the dhcpd service? Did you restarted/killed the dhcpd processes after upgrading that library?

Thanks in advance.

Richard Laager (rlaager) wrote :

Jorge, I have been running for 25 hours on the patched version with no crashes on either server.

Andrew Welham (andreww-ubuntu) wrote :

Yes i restarted isc-dhcp-server with systemctl restart isc-dhcp-server ( even tried a reboot once)

# dpkg -l | grep ppa1
ii isc-dhcp-server 4.4.1-2.1ubuntu6~ppa1 amd64 ISC DHCP server for automatic IP address assignment
ii libdns-export1109 1:9.11.16+dfsg-3~ppa1 amd64 Exported DNS Shared Library
ii libirs-export161 1:9.11.16+dfsg-3~ppa1 amd64 Exported IRS Shared Library
ii libisc-export1105:amd64 1:9.11.16+dfsg-3~ppa1 amd64 Exported ISC Shared Library
ii libisccfg-export163 1:9.11.16+dfsg-3~ppa1 amd64 Exported ISC CFG Shared Library

# systemctl status isc-dhcp-server
● isc-dhcp-server.service - ISC DHCP IPv4 server
     Loaded: loaded (/lib/systemd/system/isc-dhcp-server.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2020-08-06 07:26:09 BST; 2min 39s ago
       Docs: man:dhcpd(8)
   Main PID: 42301 (dhcpd)
      Tasks: 4 (limit: 9569)
     Memory: 7.6M
     CGroup: /system.slice/isc-dhcp-server.service
             └─42301 dhcpd -user dhcpd -group dhcpd -f -4 -pf /run/dhcp-server/dhcpd.pid -cf /etc/dhcp/dhcpd.conf

Aug 06 07:28:47 gw2-focal dhcpd[42301]: Remove host declaration test or remove 192.168.0.90
Aug 06 07:28:47 gw2-focal dhcpd[42301]: from the dynamic address pool for 192.168.0.0/24
Aug 06 07:28:47 gw2-focal dhcpd[42301]: DHCPREQUEST for 192.168.0.90 (192.168.0.2) from MACREMOVED via ens3: ignored (not authoritative).
Aug 06 07:28:47 gw2-focal dhcpd[42301]: DHCPREQUEST for 192.168.6.161 (192.168.6.3) from MACREMOVED (test) via ens13: ignored (not authoritative).
Aug 06 07:28:47 gw2-focal dhcpd[42301]: DHCPREQUEST for 192.168.6.161 (192.168.6.3) from MACREMOVED (test) via ens12: ignored (not authoritative).
Aug 06 07:28:47 gw2-focal dhcpd[42301]: DHCPREQUEST for 192.168.6.161 (192.168.6.3) from MACREMOVED (test) via ens11: ignored (not authoritative).
Aug 06 07:28:47 gw2-focal dhcpd[42301]: DHCPREQUEST for 192.168.6.161 (192.168.6.3) from MACREMOVED (test) via ens10: ignored (not authoritative).
Aug 06 07:28:47 gw2-focal dhcpd[42301]: DHCPREQUEST for 192.168.6.161 (192.168.6.3) from MACREMOVED (test) via ens9: ignored (not authoritative).
Aug 06 07:28:47 gw2-focal dhcpd[42301]: DHCPREQUEST for 192.168.6.161 (192.168.6.3) from MACREMOVED (test) via ens8: ignored (not authoritative).
Aug 06 07:28:47 gw2-focal dhcpd[42301]: DHCPREQUEST for 192.168.6.161 (192.168.6.3) from MACREMOVED (test) via ens3: ignored (not authoritative).

Andrew Welham (andreww-ubuntu) wrote :

Not sure if this helps i coped the coredump file to a test system install gdb
and got the following
# gdb /usr/sbin/dhcpd ./CoreDump
GNU gdb (Ubuntu 9.1-0ubuntu1) 9.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/sbin/dhcpd...
(No debugging symbols found in /usr/sbin/dhcpd)
[New LWP 63464]
[New LWP 63463]
[New LWP 63448]
[New LWP 63465]

warning: Unexpected size of section `.reg-xstate/63464' in core file.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `dhcpd -user dhcpd -group dhcpd -f -4 -pf /run/dhcp-server/dhcpd.pid -cf /etc/dh'.
Program terminated with signal SIGABRT, Aborted.

warning: Unexpected size of section `.reg-xstate/63464' in core file.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f0ed5c48700 (LWP 63464))]

Jorge Niedbalski (niedbalski) wrote :

Hello Andrew,

I just reviewed the core file that you provided. Thread 1 is the thread that panics
on assertion because sock.pending_send is already set. This is the condition I prevented
in the PPA, so *shouldn't* be hitting the frame 3

In my test systems I don't hit this condition, dispatch_send isn't called if pending_send
is set.

(gdb) thread 1
[Switching to thread 1 (Thread 0x7f39a41f5700 (LWP 18780))]
#1 0x00007f39a4dd1859 in __GI_abort () at abort.c:79
79 in abort.c
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007f39a4dd1859 in __GI_abort () at abort.c:79
#2 0x00007f39a4faf985 in isc_assertion_failed (file=<optimized out>, line=<optimized out>, type=<optimized out>, cond=<optimized out>) at ../../../lib/isc/assertions.c:52
#3 0x00007f39a4feb7e1 in dispatch_send (sock=0x7f39a4a03730) at ../../../../lib/isc/unix/socket.c:3380
#4 process_fd (writeable=<optimized out>, readable=<optimized out>, fd=0, manager=0x7f39a49fa010) at ../../../../lib/isc/unix/socket.c:4054
#5 process_fds (writefds=<optimized out>, readfds=0x16, maxfd=-1533038191, manager=0x7f39a49fa010) at ../../../../lib/isc/unix/socket.c:4211
#6 watcher (uap=0x7f39a49fa010) at ../../../../lib/isc/unix/socket.c:4397
[...]
(gdb) frame 3
#3 0x00007f39a4feb7e1 in dispatch_send (sock=0x7f39a4a03730) at ../../../../lib/isc/unix/socket.c:3380
3380 ../../../../lib/isc/unix/socket.c: No such file or directory.
(gdb) info locals
iev = 0x0
ev = <optimized out>
sender = 0x2
iev = <optimized out>
ev = <optimized out>
sender = <optimized out>

(gdb) p sock
$1 = (isc__socket_t *) 0x7f39a4a03730
(gdb) p sock.pending_send
$2 = 1

Can you check your library links, etc?

ubuntu@dhcpd1:~$ ldd /usr/sbin/dhcpd | grep export
 libirs-export.so.161 => /lib/x86_64-linux-gnu/libirs-export.so.161 (0x00007f5cb62e5000)
 libdns-export.so.1109 => /lib/x86_64-linux-gnu/libdns-export.so.1109 (0x00007f5cb60b0000)
 libisc-export.so.1105 => /lib/x86_64-linux-gnu/libisc-export.so.1105 (0x00007f5cb6039000)
 libisccfg-export.so.163 => /lib/x86_64-linux-gnu/libisccfg-export.so.163 (0x00007f5cb5df5000)
ubuntu@dhcpd1:~$ dpkg -S /lib/x86_64-linux-gnu/libisc-export.so.1105
libisc-export1105:amd64: /lib/x86_64-linux-gnu/libisc-export.so.1105
ubuntu@dhcpd1:~$ apt-cache policy libisc-export1105 | grep -i ppa
  Installed: 1:9.11.16+dfsg-3~ppa1
  Candidate: 1:9.11.16+dfsg-3~ppa1
 *** 1:9.11.16+dfsg-3~ppa1 500
        500 http://ppa.launchpad.net/niedbalski/1872188-dbg/ubuntu focal/main amd64 Packages

Andrew Welham (andreww-ubuntu) wrote :

Jorge, Here you go

andrew@gw2-focal:~$ ldd /usr/sbin/dhcpd | grep export
        libirs-export.so.161 => /lib/x86_64-linux-gnu/libirs-export.so.161 (0x00 007fa030e7b000)
        libdns-export.so.1109 => /lib/x86_64-linux-gnu/libdns-export.so.1109 (0x 00007fa030c46000)
        libisc-export.so.1105 => /lib/x86_64-linux-gnu/libisc-export.so.1105 (0x 00007fa030bcf000)
        libisccfg-export.so.163 => /lib/x86_64-linux-gnu/libisccfg-export.so.163 (0x00007fa03098b000)
andrew@gw2-focal:~$ ldd /usr/sbin/dhcpd | grep export
        libirs-export.so.161 => /lib/x86_64-linux-gnu/libirs-export.so.161 (0x00007fa9fbeaf000)
        libdns-export.so.1109 => /lib/x86_64-linux-gnu/libdns-export.so.1109 (0x00007fa9fbc7a000)
        libisc-export.so.1105 => /lib/x86_64-linux-gnu/libisc-export.so.1105 (0x00007fa9fbc03000)
        libisccfg-export.so.163 => /lib/x86_64-linux-gnu/libisccfg-export.so.163 (0x00007fa9fb9bf000)
andrew@gw2-focal:~$ dpkg -S /lib/x86_64-linux-gnu/libisc-export.so.1105
libisc-export1105:amd64: /lib/x86_64-linux-gnu/libisc-export.so.1105
andrew@gw2-focal:~$ apt-cache policy libisc-export1105 | grep -i ppa
  Installed: 1:9.11.16+dfsg-3~ppa1
  Candidate: 1:9.11.16+dfsg-3~ppa1
 *** 1:9.11.16+dfsg-3~ppa1 500
        500 http://ppa.launchpad.net/niedbalski/fix-1872118/ubuntu focal/main amd64 Packages
andrew@gw2-focal:~$ ls -als /lib/x86_64-linux-gnu/libirs-export.so.161

Andrew Welham (andreww-ubuntu) wrote :

I think on of the links was wrong, just remade it. and going to test again

Andrew Welham (andreww-ubuntu) wrote :

looks like it may be fixed , been running about 5 hours about 400k request and still alive. early signs are good, with the patch and my dodgy link removed. Will leave running over night

Richard Laager (rlaager) wrote :

Jorge, it sounds like ISC might think there is a more fundamental issue here:
https://gitlab.isc.org/isc-projects/dhcp/-/issues/121#note_152804

Jorge Niedbalski (niedbalski) wrote :

Hello @Andrew, @rlaager,

Any crashes to report before I propose this patch? my env is running this patch for close to 3 days without any failures.

Thanks,

Richard Laager (rlaager) wrote :

No crashes to report.

Andrew Welham (andreww-ubuntu) wrote :

Since i replaced the bad link , the patch appears to be working. been running for 14 hours and over 1.2 Million requests both servers running fine. Many thanks for all your effots.

Andrew Welham (andreww-ubuntu) wrote :

Pretty sure its working now ;) been running 24 hours and over 2 million requests. Thanks for all the effort Jorge

just to clarify, the change is in bind9-libs and fixes the issue also for my servers.

the change
+--- bind9-libs-9.11.16+dfsg.orig/lib/isc/unix/socket.c
++++ bind9-libs-9.11.16+dfsg/lib/isc/unix/socket.c
+@@ -4050,7 +4050,7 @@ check_write:
+ if (!SOCK_DEAD(sock)) {
+ if (sock->connecting)
+ dispatch_connect(sock);
+- else
++ else if (!sock->pending_send)
+ dispatch_send(sock);
+ }
+ unwatch_write = true;

doesn't change the ABI/API of the library, so no-change rebuilding isc-dhcp package against the fix is not useful.

Changed in bind9-libs (Ubuntu Focal):
status: New → In Progress
Changed in bind9-libs (Ubuntu Groovy):
status: New → In Progress
Changed in isc-dhcp (Ubuntu Focal):
status: New → In Progress
Changed in isc-dhcp (Ubuntu Groovy):
status: Confirmed → In Progress
Changed in bind9-libs (Ubuntu Focal):
assignee: nobody → Jorge Niedbalski (niedbalski)
Changed in bind9-libs (Ubuntu Groovy):
assignee: nobody → Jorge Niedbalski (niedbalski)
Changed in isc-dhcp (Ubuntu Focal):
assignee: nobody → Jorge Niedbalski (niedbalski)
Changed in isc-dhcp (Ubuntu Groovy):
assignee: nobody → Jorge Niedbalski (niedbalski)
summary: - DHCP Cluster crashes after a few hours
+ [SRU] DHCP Cluster crashes after a few hours
description: updated
description: updated
Jorge Niedbalski (niedbalski) wrote :
Jorge Niedbalski (niedbalski) wrote :
Jorge Niedbalski (niedbalski) wrote :

Uploaded debdiff(s) for groovy and focal. This will require a follow up rebuild change for isc-dhcp, once the library change lands.

Richard Laager (rlaager) wrote :

Jorge, I agree with Gianfranco Costamagna that a rebuild of isc-dhcp is NOT required. Why do you think it is?

Presumably BIND also uses these libraries? If so, it seems like the Test Case should involve making sure BIND still seems to work, and that BIND should be mentioned in the Regression Potential. My DHCP servers also run BIND for recursive DNS and that has been fine with the patch applied.

Andreas Hasenack (ahasenack) wrote :

bind9 in focal uses the 9.16.x versions of these libraries, packaged separately. It's just 9.11.x that was packaged as bind9-libs because of legacy applications like isc-dhcp that do not work with the 9.16 version of bind9.

The attachment "lp-1872118-groovy.debdiff" seems to be a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. If the attachment isn't a patch, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are member of the ~ubuntu-sponsors, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issue please contact him.]

tags: added: patch

I uploaded in groovy (sorry, I didn't find the debdiff and it was partially wrong) and focal for review.

Changed in isc-dhcp (Ubuntu Focal):
status: In Progress → Invalid
Changed in isc-dhcp (Ubuntu Groovy):
status: In Progress → Incomplete
status: Incomplete → Invalid

I'm marking isc-dhcp tasks as invalid!

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package bind9-libs - 1:9.11.19+dfsg-1ubuntu1

---------------
bind9-libs (1:9.11.19+dfsg-1ubuntu1) groovy; urgency=medium

  [ Jorge Niedbalski <email address hidden> ]
  * debian/patches/0010-fix-1872118.patch: Check if pending_send
    if set before calling dispatch_send. Fixes LP: #1872118.

 -- Gianfranco Costamagna <email address hidden> Tue, 11 Aug 2020 15:25:14 +0200

Changed in bind9-libs (Ubuntu Groovy):
status: In Progress → Fix Released
Richard Laager (rlaager) wrote :

Excellent. I'm available to test the -proposed update for focal whenever it is ready.

Changed in bind9-libs (Ubuntu Focal):
importance: Undecided → High

Hello Andrew, or anyone else affected,

Accepted bind9-libs into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/bind9-libs/1:9.11.16+dfsg-3~ubuntu1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in bind9-libs (Ubuntu Focal):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-focal
Andrew Welham (andreww-ubuntu) wrote :

Timo,
Just installed the updates I will let you know
I installed as follows

apt-get install isc-dhcp-server

apt-get install libdns-export1109/focal-proposed
apt-get install libirs-export161/focal-proposed
apt-get install libisc-export1105/focal-proposed

The installed version are
GW1
# dpkg -l | grep isc-dhcp-server
ii isc-dhcp-server 4.4.1-2.1ubuntu5 amd64 ISC DHCP server for automatic IP address assignment
# dpkg -l | grep libdns-export1109
ii libdns-export1109 1:9.11.16+dfsg-3~ubuntu1 amd64 Exported DNS Shared Library
# dpkg -l | grep libirs-export161
ii libirs-export161 1:9.11.16+dfsg-3~ubuntu1 amd64 Exported IRS Shared Library
# dpkg -l | grep libisc-export1105
ii libisc-export1105:amd64 1:9.11.16+dfsg-3~ubuntu1 amd64 Exported ISC Shared Library
#

GW2
# dpkg -l | grep isc-dhcp-server
ii isc-dhcp-server 4.4.1-2.1ubuntu5 amd64 ISC DHCP server for automatic IP address assignment
# dpkg -l | grep libdns-export1109
ii libdns-export1109 1:9.11.16+dfsg-3~ubuntu1 amd64 Exported DNS Shared Library
# dpkg -l | grep libirs-export161
ii libirs-export161 1:9.11.16+dfsg-3~build1 amd64 Exported IRS Shared Library
# dpkg -l | grep libisc-export1105
ii libisc-export1105:amd64 1:9.11.16+dfsg-3~build1 amd64 Exported ISC Shared Library

Can you please confirm the correct versions are installed

Andrew Welham (andreww-ubuntu) wrote :

not looking good,

Aug 15 20:14:55 gw2-focal sh[741]: ../../../../lib/isc/unix/socket.c:4359: fatal error: select() failed: Bad file descriptor
Aug 15 20:14:55 gw2-focal systemd[1]: isc-dhcp-server.service: Main process exited, code=killed, status=6/ABRT
Aug 15 20:14:55 gw2-focal systemd[1]: isc-dhcp-server.service: Failed with result 'signal'.

have the correct files been installed ?

Andrew Welham (andreww-ubuntu) wrote :
Download full text (3.9 KiB)

Just tried this
apt-get -y remove isc-dhcp-server
apt-get -y remove libdns-export1109/focal-proposed
apt-get -y remove libirs-export161/focal-proposed
apt-get -y remove libisc-export1105/focal-proposed
apt-get -y install isc-dhcp-server
apt-get -y install libdns1109/focal-proposed
apt-get -y install libirs161/focal-proposed
apt-get -y install libisc1105/focal-proposed

GW1
# dpkg -l | grep isc-dhcp-server
ii isc-dhcp-server 4.4.1-2.1ubuntu5 amd64 ISC DHCP server for automatic IP address assignment
# dpkg -l | grep libdns
ii libdns-export1100 1:9.11.3+dfsg-1ubuntu1.12 amd64 Exported DNS Shared Library
ii libdns-export1109 1:9.11.16+dfsg-3~build1 amd64 Exported DNS Shared Library
ii libdns1109:amd64 1:9.11.16+dfsg-3~ubuntu1 amd64 DNS Shared Library used by BIND
# dpkg -l | grep libirs
ii libirs-export160 1:9.11.3+dfsg-1ubuntu1.12 amd64 Exported IRS Shared Library
ii libirs-export161 1:9.11.16+dfsg-3~build1 amd64 Exported IRS Shared Library
ii libirs161:amd64 1:9.11.16+dfsg-3~ubuntu1 amd64 DNS Shared Library used by BIND
# dpkg -l | grep libisc
ii libisc-export1105:amd64 1:9.11.16+dfsg-3~build1 amd64 Exported ISC Shared Library
ii libisc-export169:amd64 1:9.11.3+dfsg-1ubuntu1.12 amd64 Exported ISC Shared Library
ii libisc1105:amd64 1:9.11.16+dfsg-3~ubuntu1 amd64 ISC Shared Library used by BIND
ii libisccc161:amd64 1:9.11.16+dfsg-3~ubuntu1 amd64 Command Channel Library used by BIND
ii libisccfg-export160 1:9.11.3+dfsg-1ubuntu1.12 amd64 Exported ISC CFG Shared Library
ii libisccfg-export163 1:9.11.16+dfsg-3~build1 amd64 Exported ISC CFG Shared Library
ii libisccfg163:amd64 1:9.11.16+dfsg-3~ubuntu1 amd64 Config File Handling Library used by BIND

GW2
# dpkg -l | grep isc-dhcp-server
ii isc-dhcp-server 4.4.1-2.1ubuntu5 amd64 ISC DHCP server for automatic IP address assignment
# dpkg -l | grep libdns
ii libdns-export1100 1:9.11.3+dfsg-1ubuntu1.12 amd64 Exported DNS Shared Library
ii libdns-export1109 1:9.11.16+dfsg-3~build1 amd64 Exported DNS Shared Library
ii libdns1109:amd64 1:9.11.16+dfsg-3~ubuntu1 amd64 DNS Shared Library used by BIND
# dpkg -l | grep libirs
ii libirs-export160 1:9.11.3+dfsg-1ubuntu1.12 amd64 Exported IRS Shared Library
ii libirs-export161 1:9.11.16+dfsg-3~build1 amd64 Exported IRS Shared Library
ii libirs161:amd64 1:9.11.16+dfsg-3~ubuntu1 amd64 DNS Shared Library used by BIND
# dpkg -l | grep libisc
ii libisc-export1105:amd64 1:9.11...

Read more...

Richard Laager (rlaager) wrote :

Andrew, 1:9.11.16+dfsg-3~build1 is wrong. The correct version is 1:9.11.16+dfsg-3~ubuntu1 (~ubuntu1 instead of ~build1).

Andrew Welham (andreww-ubuntu) wrote :

I managed to get it working last night, but the install was a bit of a mess. It did how ever run over night.

This morning i carried out the following
sudo apt install \
    isc-dhcp-server \
    libdns-export1109/focal-proposed \
    libirs-export161/focal-proposed \
    libisc-export1105/focal-proposed

The versions are below,and i will leave this to run as a test

GW1
# dpkg -l | grep libdns-export1109
ii libdns-export1109 1:9.11.16+dfsg-3~ubuntu1 amd64 Exported DNS Shared Library
# dpkg -l | grep libirs-export161
ii libirs-export161 1:9.11.16+dfsg-3~ubuntu1 amd64 Exported IRS Shared Library
# dpkg -l | grep libisc-export1105
ii libisc-export1105:amd64 1:9.11.16+dfsg-3~ubuntu1 amd64 Exported ISC Shared Library

GW2

# dpkg -l | grep libdns-export1109
ii libdns-export1109 1:9.11.16+dfsg-3~ubuntu1 amd64 Exported DNS Shared Library
# dpkg -l | grep libirs-export161
ii libirs-export161 1:9.11.16+dfsg-3~ubuntu1 amd64 Exported IRS Shared Library
# dpkg -l | grep libisc-export1105
ii libisc-export1105:amd64 1:9.11.16+dfsg-3~ubuntu1 amd64 Exported ISC Shared Library

Richard Laager (rlaager) wrote :

First I reverted isc-dhcp-server back to the original focal version, since I had an updated version from the PPA:
$ sudo apt install isc-dhcp-server=4.4.1-2.1ubuntu5 isc-dhcp-common=4.4.1-2.1ubuntu5

Then I install the update packages:
$ sudo apt update
$ sudo apt install libdns-export1109/focal-proposed libirs-export161/focal-proposed libisc-export1105/focal-proposed
$ dpkg --status libdns-export1109 libirs-export161 libisc-export1105 | grep Version
Version: 1:9.11.16+dfsg-3~ubuntu1
Version: 1:9.11.16+dfsg-3~ubuntu1
Version: 1:9.11.16+dfsg-3~ubuntu1

Then I restarted dhcpd:
sudo systemctl restart isc-dhcp-server

It has been running for four hours on both systems.

tags: added: verification-done verification-done-focal
removed: verification-needed verification-needed-focal
Andrew Welham (andreww-ubuntu) wrote :

22 hours no issues
# systemctl status isc-dhcp-server
● isc-dhcp-server.service - ISC DHCP IPv4 server
     Loaded: loaded (/lib/systemd/system/isc-dhcp-server.service; enabled; vendor preset: enabled)
     Active: active (running) since Sun 2020-08-16 09:43:12 BST; 22h ago
       Docs: man:dhcpd(8)
   Main PID: 730 (dhcpd)
      Tasks: 4 (limit: 9571)
     Memory: 10.9M
     CGroup: /system.slice/isc-dhcp-server.service
             └─730 dhcpd -user dhcpd -group dhcpd -f -4 -pf /run/dhcp-server/dhcpd.pid -cf /etc/dhcp/dhcpd.conf

no longer affects: isc-dhcp (Ubuntu)
no longer affects: isc-dhcp (Ubuntu Focal)
no longer affects: isc-dhcp (Ubuntu Groovy)
Changed in bind9-libs (Ubuntu Groovy):
importance: Undecided → High
Richard Laager (rlaager) wrote :

Likewise, it's been stable 24 hours here.

Also in my environment, after upgrading no errors so far.
Fix confirmed.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package bind9-libs - 1:9.11.16+dfsg-3~ubuntu1

---------------
bind9-libs (1:9.11.16+dfsg-3~ubuntu1) focal; urgency=medium

   * debian/patches/fix-1872118.patch: Check if pending_send
    if set before calling dispatch_send to avoid assertion.
    Fixes LP: #1872118.

 -- Jorge Niedbalski <email address hidden> Tue, 11 Aug 2020 15:43:31 -0400

Changed in bind9-libs (Ubuntu Focal):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for bind9-libs has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.