MAAS 2.9 named crashes due to a too-low open files limit

Bug #1901999 reported by Michał Ajduk
This bug affects 1 person
Affects        Status        Importance  Assigned to
MAAS           Fix Released  Medium      Alexsander de Souza
  2.9          Won't Fix     Medium      Unassigned
snapd          Triaged       Medium      Unassigned

Bug Description

# ENVIRONMENT
Ubuntu bionic
MAAS version (SNAP):
maas 2.9.0~beta7-9078-g.fb8df407f 10323 2.9/edge canonical* -

  MAAS was cleanly installed. KVM POD setup works.

  MAAS status:
bind9 NOT-RUNNING
dhcpd RUNNING pid 17097, uptime 15:38:50
dhcpd6 STOPPED Not started
http RUNNING pid 37832, uptime 15:58:18
ntp RUNNING pid 13840, uptime 15:39:47
proxy RUNNING pid 27269, uptime 16:01:44
rackd RUNNING pid 26523, uptime 16:01:49
regiond RUNNING pid 26524, uptime 16:01:49
syslog RUNNING pid 27272, uptime 16:01:44

# PROBLEM DESCRIPTION

The MAAS-managed named server crashes [2]. MAAS 2.9 comes with BIND 9.10, which contains some performance optimisations [1], including:
2. ISC_SOCKET_MAXSOCKETS changed from 4096 to 21000

"named -S $number" can set it higher but no lower

This is the maximum number of sockets named can use. Again see lib/isc/unix/socket.c in the source tree; see also the documentation of "-S" in the named(8) man pages.

Note that on bionic the default open files limit for this service is 4096 (hard) and 1024 (soft):
~# systemctl show snap.maas.supervisor.service | grep LimitNOFILE
LimitNOFILE=4096
LimitNOFILESoft=1024
~# grep 'open files' /proc/$(pgrep named)/limits
Max open files 1024 4096 files

This is because snap.maas.supervisor.service does not contain a LimitNOFILE directive, so the default limits apply.

Note that changing /etc/security/limits.conf does not affect this behaviour: those limits are applied by pam_limits to login sessions, not to systemd services.
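The grep above can be turned into a small check that compares named's open files limit with the 21000 sockets quoted from [1]. This is a sketch, not part of MAAS: the function name check_nofile is hypothetical, and it assumes the hard limit is the relevant ceiling, because named.log reports "max open files (4096)", which matches the hard limit shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with MAAS): compare the "Max open files"
# limits of a process with the socket count named expects.
# Assumption: the hard limit is what matters, since named.log reports
# "max open files (4096)", the hard limit seen in /proc/<pid>/limits.

REQUIRED_SOCKETS=21000   # ISC_SOCKET_MAXSOCKETS value quoted from [1]

# Usage: check_nofile LIMITS_FILE [REQUIRED]
# LIMITS_FILE is normally /proc/<pid>/limits; a saved copy works too.
check_nofile() {
    local limits_file=$1
    local required=${2:-$REQUIRED_SOCKETS}
    local soft="" hard=""
    # Fields of "Max open files  1024  4096  files": $4 = soft, $5 = hard.
    read -r soft hard < <(awk '/^Max open files/ {print $4, $5}' "$limits_file") || true
    if [ -z "$hard" ]; then
        echo "no 'Max open files' line in $limits_file" >&2
        return 2
    fi
    if [ "$hard" = "unlimited" ] || [ "$hard" -ge "$required" ]; then
        echo "OK soft=$soft hard=$hard"
        return 0
    fi
    echo "TOO LOW soft=$soft hard=$hard required=$required"
    return 1
}
```

On the affected host it would be called as check_nofile "/proc/$(pgrep -o -x named)/limits"; with the limits shown above it reports TOO LOW and returns 1.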

Impact:
named is configured for 21k sockets but is only allowed 4k open files. When it starts receiving queries for livepatch.canonical.com that are forwarded to a temporarily unreachable forwarder DNS server, the outstanding forwarder queries use up the 4096 allowed file descriptors. named then crashes: it is configured for 21k sockets and tries to allocate beyond the 4096 limit.

Details
After some time of correct operation, the MAAS-managed BIND crashes with the following messages in named.log:
28-Oct-2020 12:34:37.872 max open files (4096) is smaller than max sockets (21000)
...
28-Oct-2020 13:24:54.653 uv_export failed: permission denied
28-Oct-2020 13:24:54.653 uv_export failed: permission denied
28-Oct-2020 13:24:54.653 listening on IPv6 interface vnet8, fe80::fc54:ff:fe03:7f67%28#53
28-Oct-2020 13:24:54.673 udp.c:83: INSIST(csock->fd >= 0) failed, back trace
28-Oct-2020 13:24:54.673 #0 0x555e9f8c2e43 in ??
28-Oct-2020 13:24:54.673 #1 0x7fc90796eac0 in ??
28-Oct-2020 13:24:54.673 #2 0x7fc90798bf4d in ??
28-Oct-2020 13:24:54.673 #3 0x7fc907c5f82b in ??
28-Oct-2020 13:24:54.673 #4 0x7fc907c605d0 in ??
28-Oct-2020 13:24:54.673 #5 0x7fc907c60c1e in ??
28-Oct-2020 13:24:54.673 #6 0x555e9f8e0a6b in ??
28-Oct-2020 13:24:54.673 #7 0x555e9f8e406e in ??
28-Oct-2020 13:24:54.673 #8 0x7fc907995fe1 in ??
28-Oct-2020 13:24:54.673 #9 0x7fc90745e609 in ??
28-Oct-2020 13:24:54.673 #10 0x7fc90737f293 in ??
28-Oct-2020 13:24:54.673 exiting (due to assertion failure)

# WORKAROUND:
Run systemctl edit snap.maas.supervisor.service and add:
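[Service]
LimitNOFILE=65535

(A single LimitNOFILE value sets both the soft and the hard limit. LimitNOFILESoft is a property reported by systemctl show, not a unit file setting, so systemd ignores it in a drop-in.)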

Then reload systemd and restart the service:
systemctl daemon-reload
systemctl restart snap.maas.supervisor
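For scripted deployments, the interactive systemctl edit step can be done non-interactively. A minimal sketch, assuming the usual behaviour of systemctl edit (it saves the drop-in as override.conf in /etc/systemd/system/<unit>.d/). write_nofile_override is a hypothetical helper name, and the target directory is a parameter so the result can be inspected in a scratch directory before touching /etc.

```shell
#!/usr/bin/env bash
# Hypothetical helper: non-interactive equivalent of
# "systemctl edit snap.maas.supervisor.service" for the open files limit.
# Assumes systemctl edit's convention of <root>/<unit>.d/override.conf.

# Usage: write_nofile_override [ROOT] [LIMIT]
write_nofile_override() {
    local root=${1:-/etc/systemd/system}
    local limit=${2:-65535}
    local dir="$root/snap.maas.supervisor.service.d"
    mkdir -p "$dir"
    # A single LimitNOFILE value sets both the soft and the hard limit.
    printf '[Service]\nLimitNOFILE=%s\n' "$limit" > "$dir/override.conf"
    # Print the path of the file that was written.
    echo "$dir/override.conf"
}
```

Run as root with no arguments, it writes the drop-in under /etc/systemd/system; after systemctl daemon-reload and systemctl restart snap.maas.supervisor, grep 'open files' /proc/$(pgrep named)/limits should then report the new limit.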

References:
[1] https://kb.isc.org/docs/aa-01314
[2] https://discourse.maas.io/t/maas-2-9-named-crashes-due-to-too-low-allowed-files-limit/3528

Related branches

Alberto Donato (ack)
Changed in maas:
status: New → Triaged
milestone: none → 2.10-next
importance: Undecided → Medium
Paweł Stołowski (stolowski) wrote:

Hi,
I understand the workaround is not ideal, since the snap.maas.supervisor.service file is managed by snapd and will be regenerated the next time the maas snap is updated.

Do you plan to patch named in your snap for this issue (e.g. lower the maximum socket count), or do you expect snapd to manage this setting for snaps that need it? If the latter, it should probably be discussed on the forum (or the snapd IRC channel).

Changed in snapd:
status: New → Triaged
importance: Undecided → Medium
Changed in maas:
milestone: 3.0.0 → 3.0.1
Changed in maas:
status: Triaged → Fix Committed
assignee: nobody → Alexsander de Souza (alexsander-souza)
Changed in maas:
status: Fix Committed → In Progress
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
milestone: 3.0.1 → 3.2.0-beta1
Changed in maas:
status: Fix Committed → Fix Released