[2.3a1] named stuck on reload, DNS broken

Bug #1710278 reported by Mark Shuttleworth on 2017-08-11
26
This bug affects 2 people
Affects Status Importance Assigned to Milestone
BIND
Undecided
Unassigned
MAAS
Critical
Blake Rouse
2.2
Critical
Blake Rouse
bind9 (Ubuntu)
High
Unassigned

Bug Description

Am running an HA MAAS, but every few days named gets stuck on one of the region controllers.

systemd thinks the service is running, but it doe not respond to any commands or requests. Also, it doesn't respond to signals other than kill -9. service restarts hang, rndc hangs.

I have attached logs and a core dump of named.

Related branches

Mark Shuttleworth (sabdfl) wrote :
Mark Shuttleworth (sabdfl) wrote :
Mike Pontillo (mpontillo) wrote :
Download full text (4.6 KiB)

Assuming the debug symbols I grabbed[1] for my install of bind9 on Xenial match yours (I have bind9 version 1:9.10.3.dfsg.P4-8ubuntu1.7 installed per "apt-cache policy bind9"), I did the following to grab a traceback:

$ sudo apt-get install bind9-dbgsym libdns162-dbgsym libisc160-dbgsym
$ gdb /usr/sbin/named core
(gdb) set pagination off
(gdb) thread apply all bt
... [2] ...

Looking at the backtrace in [2], the interesting parts to me are threads 8, 11 and 20, which are possibly involved in a deadlock[3]. Looks like one of the threads is reloading the configuration (something we would expect MAAS to do), and the other is calling dns_resolver_shutdown() via view_flushanddetach().

[1]: https://wiki.ubuntu.com/Debug%20Symbol%20Packages
[2]: http://paste.ubuntu.com/25292729/
[3]:
Thread 8 (Thread 0x7f95226aa700 (LWP 3203)):
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f952a351efe in __GI___pthread_mutex_lock (mutex=mutex@entry=0x7f94d4014fe8) at ../nptl/pthread_mutex_lock.c:135
#2 0x00007f952b7a0794 in dns_view_weakdetach (viewp=viewp@entry=0x7f9504389780) at ../../../lib/dns/view.c:597
#3 0x00007f952b7993de in destroy (val=0x7f9504389750) at ../../../lib/dns/validator.c:3891
#4 0x00007f952b79927b in dns_validator_destroy (validatorp=validatorp@entry=0x7f9519462628) at ../../../lib/dns/validator.c:3915
#5 0x00007f952b76b9d1 in validated (task=<optimized out>, event=0x7f95194625d0) at ../../../lib/dns/resolver.c:4722
#6 0x00007f952a9a6360 in dispatch (manager=0x7f952be3b010) at ../../../lib/isc/task.c:1130
#7 run (uap=0x7f952be3b010) at ../../../lib/isc/task.c:1302
#8 0x00007f952a34f6ba in start_thread (arg=0x7f95226aa700) at pthread_create.c:333
#9 0x00007f9529a993dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 11 (Thread 0x7f9520ea7700 (LWP 3206)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007f952a9a516b in isc__task_beginexclusive (task0=<optimized out>) at ../../../lib/isc/task.c:1717
#2 0x0000557c34997dc1 in load_configuration (filename=<optimized out>, server=server@entry=0x7f952be44010, first_time=first_time@entry=isc_boolean_false) at ../../../bin/named/server.c:5651
#3 0x0000557c3499a826 in loadconfig (server=0x7f952be44010) at ../../../bin/named/server.c:7162
#4 0x0000557c3499ad48 in reload (server=0x7f952be44010) at ../../../bin/named/server.c:7183
#5 ns_server_reloadcommand (server=0x7f952be44010, args=args@entry=0x7f94fc120af0 "reload", text=text@entry=0x7f9520ea6590) at ../../../bin/named/server.c:7416
#6 0x0000557c34975db5 in ns_control_docommand (message=<optimized out>, text=text@entry=0x7f9520ea6590) at ../../../bin/named/control.c:102
#7 0x0000557c34978b97 in control_recvmessage (task=0x7f952be51010, event=<optimized out>) at ../../../bin/named/controlconf.c:458
#8 0x00007f952a9a6360 in dispatch (manager=0x7f952be3b010) at ../../../lib/isc/task.c:1130
#9 run (uap=0x7f952be3b010) at ../../../lib/isc/task.c:1302
#10 0x00007f952a34f6ba in start_thread (arg=0x7f9520ea7700) at pthread_create.c:333
#11 0x00007f9529a993dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

...

Read more...

Changed in maas:
status: New → Triaged
importance: Undecided → Critical
Mike Pontillo (mpontillo) wrote :

This is technically Invalid for MAAS unless there is something unsupported about how we're using BIND, but I'm marking it Triaged for now so we don't lose visibility (in case a fix in MAAS itself turns out to be required).

It would be nice if the service monitoring in MAAS detected this condition, but that feels like it should be handled in a separate bug.

tags: added: server-next
Andreas Hasenack (ahasenack) wrote :

named is being asked to reload its zones quite frequently, sometimes within the same second:
Aug 11 16:31:08 maas named[3174]: received control channel command 'reload'
Aug 11 16:31:17 maas named[3174]: received control channel command 'reload'
Aug 11 16:31:18 maas named[3174]: received control channel command 'reload'
Aug 11 16:31:22 maas named[3174]: received control channel command 'reload'
Aug 11 16:31:26 maas named[3174]: received control channel command 'reload'
Aug 11 16:31:29 maas named[3174]: received control channel command 'reload'
Aug 11 16:31:30 maas named[3174]: received control channel command 'reload'
(...)
Aug 11 17:07:35 maas named[3174]: received control channel command 'reload'
Aug 11 17:07:35 maas named[3174]: received control channel command 'reload'

Eventually it gets stuck:
Aug 11 17:15:16 maas named[3174]: received control channel command 'reload'
Aug 11 17:15:16 maas named[3174]: loading configuration from '/etc/bind/named.conf'
Aug 11 17:15:16 maas named[3174]: reading built-in trusted keys from file '/etc/bind/bind.keys'
<stuck>

An idea to try to reproduce this would be to issue such aggressive reloads on a multi-core machine.

Mike Pontillo (mpontillo) wrote :

Right; to attempt to reproduce the issue, I would aggressively reload (changing the zone files each time) while at the same time sending a large amount of queries to the server (for records in locally authoritative zones?).

Changed in bind9 (Ubuntu):
status: New → Triaged
importance: Undecided → High

In MAAS, we should:

 * throttle reloads (at least make sure a reload is complete before we
trigger the next one)
 * monitor the actual service from the perspective of rackd's (perhaps
have rackd's do a dig @region-controller for a name we send them
whenever they talk to the region controller)
 * log loudly, kill and restart when the service monitoring fails

Mark

Mike Pontillo (mpontillo) wrote :

I'm +1 on throttling reloads; I think that is the most obvious and critical work item for the MAAS team to address. I have filed that as bug #1710308.

I'm also +1 on better service monitoring using actual queries; I've filed that as bug #1710310. I think something equivalent to 'dig @127.0.0.1 <test-query>' on the region should be enough to detect a deadlock condition, but I like the idea of monitoring it from the rack's perspective as well (though that feels more like a non-fatal warning, because we don't want to restart bind in the event of random firewall hiccups).

Finally, I think your last bullet requires more discussion before we can work on it. MAAS currently uses sudoers rules specific to the init system to start and stop services like bind9; we do not currently have permission to 'kill -9' arbitrary processes. I'm concerned that if we go down that road, we would open up the possibility that MAAS could erroneously (or due to a malicious attack) believe that bind9 isn't working and repeatedly kill it without good cause, or be convinced to 'kill -9' an incorrect process.

In summary, I think the most urgent thing for MAAS to do is throttle reloads. That should greatly reduce the window of opportunity for the deadlock to occur. In parallel, this should be addressed upstream in bind9.

Mike Pontillo (mpontillo) wrote :

I attempted to reproduce the bind9 issue by doing the following (in two separate sessions):

# Queue 10,000 concurrent reloads (also tried removing the & to make it less parallel)
i=0; while [ $i -lt 10000 ]; do (/usr/sbin/rndc reload&); let i=$i+1; done

# Hammer the DNS server with queries
while [ 1 ]; do dig @127.0.0.1 <maas-hostname>; done

Everything works properly when I do this by itself. But if I have parallel reload requests running *and* I make manual changes to the DNS zones in /etc/bind/maas, I have observed bind9 behaving badly, including (eventually) what seemed to be the deadlock (but my bind9 was older, so my debug symbols didn't match).[1] Then I observed a similar state where after I updated the zone file, it was as if nothing changed (bind9 was returning old data, which didn't resolve itself until I did "service bind9 restart").

It's my impression that the problem is worse when I do reloads in parallel. So this is more evidence pointing to "we should ensure MAAS never tries to reload bind9 twice in parallel".

[1]:
First observed extreme sluggishness in resolving queries, which resolved itself after several seconds.
Then observed a crash (which the system subsequently recovered from): http://paste.ubuntu.com/25293751/
Then observed a deadlock with the same symptoms.

Mark Shuttleworth (sabdfl) wrote :

On 12/08/17 01:11, Mike Pontillo wrote:
> Finally, I think your last bullet requires more discussion before we can
> work on it. MAAS currently uses sudoers rules specific to the init
> system to start and stop services like bind9; we do not currently have
> permission to 'kill -9' arbitrary processes. I'm concerned that if we go
> down that road, we would open up the possibility that MAAS could
> erroneously (or due to a malicious attack) believe that bind9 isn't
> working and repeatedly kill it without good cause, or be convinced to
> 'kill -9' an incorrect process.

This bug causes named to be unresponsive to anything other than kill -9.

MAAS installed, configured, started, and validates named's behaviour.
Assume there is no operator. Since kill -9 is necessary on occasion, it
follows that MAAS must have and must use that ability.

I could see MAAS trying it a few times and then giving up with a big
alert to the operators. But I absolutely think MAAS should treat this as
a bug in named which should be logged and managed nicely but nonetheless
handled transparently to users.

Mark

Mark Shuttleworth (sabdfl) wrote :

To avoid reloads in parallel, I think we should:

 * verify the reload happened (perhaps checking zone serial?)
 * make sure we defer and subsequent reload at least 10 seconds

Mark

Andreas Hasenack (ahasenack) wrote :

Maybe dns dynamic updates could be used instead of zone reloads if just a
few IPs were added or removed.

On Aug 12, 2017 06:22, "Mark Shuttleworth" <email address hidden>
wrote:

> To avoid reloads in parallel, I think we should:
>
> * verify the reload happened (perhaps checking zone serial?)
> * make sure we defer and subsequent reload at least 10 seconds
>
> Mark
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1710278
>
> Title:
> [2.3a1] named stuck on reload, DNS broken
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/bind/+bug/1710278/+subscriptions
>

Mike Pontillo (mpontillo) wrote :

Mark, if you observe the deadlock again, can you run "systemctl stop bind9", wait a few minutes (at least 2, but maybe up to 5), and then check if bind9 successfully stops? It looks like systemd will (by default) resort to more aggressive methods to kill a service if it doesn't stop after ~90 seconds.

If the normal method of killing the bind9 service works, we can still avoid adding that scope and risk to MAAS. Rather, if we detect bind9 behaving badly, a stop/start cycle would also allow bind9 to properly shut down in most cases, and avoid any other bugs in BIND we might see as a side-effect of a "kill -9 <bind9-pid>" approach. (A human operator could troubleshoot those side effects, but it's more difficult for MAAS to anticipate, for example, why BIND might now fail to start up because of a lock file that was left on the filesystem when the 'kill -9' occurred.)

Mark Shuttleworth (sabdfl) wrote :

OK, will do, thanks :)

Changed in maas:
status: Triaged → In Progress
assignee: nobody → Blake Rouse (blake-rouse)
milestone: none → 2.3.0
Mark Shuttleworth (sabdfl) wrote :

OK, restarting via 'sudo service bind9 restart' does work in the end, it just takes a long time. The downside is that MAAS is not going to have an effective DNS for a few minutes, which is unacceptable.

Mike Pontillo (mpontillo) wrote :

Thanks. I wonder if we shouldn't fix this in Ubuntu by tweaking the systemd control files so that the timeout values are more acceptable for production use.

Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
milestone: 2.3.0 → 2.3.0alpha2
Changed in maas:
status: Fix Committed → Fix Released
Joshua Powers (powersj) on 2019-02-19
tags: removed: server-next
Mark Shuttleworth (sabdfl) wrote :

Hold on, I think this bug is still problematic for MAAS and Ubuntu.

Joshua Powers (powersj) wrote :

Hey Mark, was cleaning up bug tags; still consider this an issue.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers