[2.3a1] named stuck on reload, DNS broken

Bug #1710278 reported by Mark Shuttleworth on 2017-08-11
This bug affects 3 people
Affects / Importance / Assigned to:
BIND: Undecided, Unassigned
MAAS (status tracked in 2.7):
  2.2: Critical, Blake Rouse
  2.6: Critical, Blake Rouse
  2.7: Critical, Blake Rouse
bind9 (Ubuntu) (status tracked in Eoan):
  Bionic: Medium, Dan Streetman
  Disco: Medium, Dan Streetman
  Eoan: Medium, Dan Streetman

Bug Description

I am running an HA MAAS, but every few days named gets stuck on one of the region controllers.

systemd thinks the service is running, but it does not respond to any commands or requests. It also doesn't respond to signals other than kill -9; service restarts hang, and rndc hangs.

I have attached logs and a core dump of named.

Mike Pontillo (mpontillo) wrote :

Assuming the debug symbols I grabbed[1] for my install of bind9 on Xenial match yours (I have bind9 version 1:9.10.3.dfsg.P4-8ubuntu1.7 installed per "apt-cache policy bind9"), I did the following to grab a backtrace:

$ sudo apt-get install bind9-dbgsym libdns162-dbgsym libisc160-dbgsym
$ gdb /usr/sbin/named core
(gdb) set pagination off
(gdb) thread apply all bt
... [2] ...

Looking at the backtrace in [2], the interesting parts to me are threads 8, 11 and 20, which are possibly involved in a deadlock[3]. Looks like one of the threads is reloading the configuration (something we would expect MAAS to do), and the other is calling dns_resolver_shutdown() via view_flushanddetach().

[1]: https://wiki.ubuntu.com/Debug%20Symbol%20Packages
[2]: http://paste.ubuntu.com/25292729/
[3]:
Thread 8 (Thread 0x7f95226aa700 (LWP 3203)):
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f952a351efe in __GI___pthread_mutex_lock (mutex=mutex@entry=0x7f94d4014fe8) at ../nptl/pthread_mutex_lock.c:135
#2 0x00007f952b7a0794 in dns_view_weakdetach (viewp=viewp@entry=0x7f9504389780) at ../../../lib/dns/view.c:597
#3 0x00007f952b7993de in destroy (val=0x7f9504389750) at ../../../lib/dns/validator.c:3891
#4 0x00007f952b79927b in dns_validator_destroy (validatorp=validatorp@entry=0x7f9519462628) at ../../../lib/dns/validator.c:3915
#5 0x00007f952b76b9d1 in validated (task=<optimized out>, event=0x7f95194625d0) at ../../../lib/dns/resolver.c:4722
#6 0x00007f952a9a6360 in dispatch (manager=0x7f952be3b010) at ../../../lib/isc/task.c:1130
#7 run (uap=0x7f952be3b010) at ../../../lib/isc/task.c:1302
#8 0x00007f952a34f6ba in start_thread (arg=0x7f95226aa700) at pthread_create.c:333
#9 0x00007f9529a993dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 11 (Thread 0x7f9520ea7700 (LWP 3206)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007f952a9a516b in isc__task_beginexclusive (task0=<optimized out>) at ../../../lib/isc/task.c:1717
#2 0x0000557c34997dc1 in load_configuration (filename=<optimized out>, server=server@entry=0x7f952be44010, first_time=first_time@entry=isc_boolean_false) at ../../../bin/named/server.c:5651
#3 0x0000557c3499a826 in loadconfig (server=0x7f952be44010) at ../../../bin/named/server.c:7162
#4 0x0000557c3499ad48 in reload (server=0x7f952be44010) at ../../../bin/named/server.c:7183
#5 ns_server_reloadcommand (server=0x7f952be44010, args=args@entry=0x7f94fc120af0 "reload", text=text@entry=0x7f9520ea6590) at ../../../bin/named/server.c:7416
#6 0x0000557c34975db5 in ns_control_docommand (message=<optimized out>, text=text@entry=0x7f9520ea6590) at ../../../bin/named/control.c:102
#7 0x0000557c34978b97 in control_recvmessage (task=0x7f952be51010, event=<optimized out>) at ../../../bin/named/controlconf.c:458
#8 0x00007f952a9a6360 in dispatch (manager=0x7f952be3b010) at ../../../lib/isc/task.c:1130
#9 run (uap=0x7f952be3b010) at ../../../lib/isc/task.c:1302
#10 0x00007f952a34f6ba in start_thread (arg=0x7f9520ea7700) at pthread_create.c:333
#11 0x00007f9529a993dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

...


Changed in maas:
status: New → Triaged
importance: Undecided → Critical
Mike Pontillo (mpontillo) wrote :

This is technically Invalid for MAAS unless there is something unsupported about how we're using BIND, but I'm marking it Triaged for now so we don't lose visibility (in case a fix in MAAS itself turns out to be required).

It would be nice if the service monitoring in MAAS detected this condition, but that feels like it should be handled in a separate bug.

tags: added: server-next
Andreas Hasenack (ahasenack) wrote :

named is being asked to reload its zones quite frequently, sometimes within the same second:
Aug 11 16:31:08 maas named[3174]: received control channel command 'reload'
Aug 11 16:31:17 maas named[3174]: received control channel command 'reload'
Aug 11 16:31:18 maas named[3174]: received control channel command 'reload'
Aug 11 16:31:22 maas named[3174]: received control channel command 'reload'
Aug 11 16:31:26 maas named[3174]: received control channel command 'reload'
Aug 11 16:31:29 maas named[3174]: received control channel command 'reload'
Aug 11 16:31:30 maas named[3174]: received control channel command 'reload'
(...)
Aug 11 17:07:35 maas named[3174]: received control channel command 'reload'
Aug 11 17:07:35 maas named[3174]: received control channel command 'reload'

Eventually it gets stuck:
Aug 11 17:15:16 maas named[3174]: received control channel command 'reload'
Aug 11 17:15:16 maas named[3174]: loading configuration from '/etc/bind/named.conf'
Aug 11 17:15:16 maas named[3174]: reading built-in trusted keys from file '/etc/bind/bind.keys'
<stuck>

An idea to try to reproduce this would be to issue such aggressive reloads on a multi-core machine.

Mike Pontillo (mpontillo) wrote :

Right; to attempt to reproduce the issue, I would aggressively reload (changing the zone files each time) while at the same time sending a large amount of queries to the server (for records in locally authoritative zones?).

Changed in bind9 (Ubuntu):
status: New → Triaged
importance: Undecided → High

Mark Shuttleworth (sabdfl) wrote :

In MAAS, we should:

 * throttle reloads (at least make sure a reload is complete before we trigger the next one; see the sketch after this comment)
 * monitor the actual service from the perspective of the rackds (perhaps have rackds do a dig @region-controller for a name we send them whenever they talk to the region controller)
 * log loudly, kill and restart when the service monitoring fails

Mark
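
A minimal sketch of the throttling idea in the first bullet above, assuming MAAS-side Python that shells out to rndc; the helper name and timings are illustrative, not MAAS's actual implementation:

# Hypothetical sketch: serialize reloads and debounce them, so a new reload
# is never issued while a previous one is still in flight.
import subprocess
import threading
import time

_reload_lock = threading.Lock()   # only one reload at a time
_MIN_INTERVAL = 10.0              # seconds between reloads (placeholder value)
_last_reload = 0.0

def request_reload():
    global _last_reload
    with _reload_lock:                            # wait for any in-flight reload
        wait = _MIN_INTERVAL - (time.monotonic() - _last_reload)
        if wait > 0:
            time.sleep(wait)                      # debounce bursts of requests
        # rndc returns once named has acknowledged the command; the timeout
        # guards against named being wedged, as in this bug.
        subprocess.run(["rndc", "reload"], check=True, timeout=60)
        _last_reload = time.monotonic()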

Mike Pontillo (mpontillo) wrote :

I'm +1 on throttling reloads; I think that is the most obvious and critical work item for the MAAS team to address. I have filed that as bug #1710308.

I'm also +1 on better service monitoring using actual queries; I've filed that as bug #1710310. I think something equivalent to 'dig @127.0.0.1 <test-query>' on the region should be enough to detect a deadlock condition, but I like the idea of monitoring it from the rack's perspective as well (though that feels more like a non-fatal warning, because we don't want to restart bind in the event of random firewall hiccups).
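
A rough sketch of such a local check, assuming dig is available on the region controller; the test name and thresholds are placeholders:

# Hypothetical health check: a local query that times out is a strong hint
# that named is deadlocked rather than merely busy.
import subprocess

def named_is_responsive(test_name="example.maas.", server="127.0.0.1"):
    try:
        result = subprocess.run(
            ["dig", "@" + server, "+time=2", "+tries=1", test_name],
            capture_output=True, text=True, timeout=5,
        )
    except subprocess.TimeoutExpired:
        return False          # dig itself hung: named is almost certainly stuck
    # dig exits 0 even for NXDOMAIN; a non-zero exit usually means no response
    return result.returncode == 0

if __name__ == "__main__":
    print("named responsive:", named_is_responsive())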

Finally, I think your last bullet requires more discussion before we can work on it. MAAS currently uses sudoers rules specific to the init system to start and stop services like bind9; we do not currently have permission to 'kill -9' arbitrary processes. I'm concerned that if we go down that road, we would open up the possibility that MAAS could erroneously (or due to a malicious attack) believe that bind9 isn't working and repeatedly kill it without good cause, or be convinced to 'kill -9' an incorrect process.

In summary, I think the most urgent thing for MAAS to do is throttle reloads. That should greatly reduce the window of opportunity for the deadlock to occur. In parallel, this should be addressed upstream in bind9.

Mike Pontillo (mpontillo) wrote :

I attempted to reproduce the bind9 issue by doing the following (in two separate sessions):

# Queue 10,000 concurrent reloads (also tried removing the & to make it less parallel)
i=0; while [ $i -lt 10000 ]; do (/usr/sbin/rndc reload&); let i=$i+1; done

# Hammer the DNS server with queries
while [ 1 ]; do dig @127.0.0.1 <maas-hostname>; done

Everything works properly when I do this by itself. But if I have parallel reload requests running *and* I make manual changes to the DNS zones in /etc/bind/maas, I have observed bind9 behaving badly, including (eventually) what seemed to be the deadlock (but my bind9 was older, so my debug symbols didn't match).[1] Then I observed a similar state where after I updated the zone file, it was as if nothing changed (bind9 was returning old data, which didn't resolve itself until I did "service bind9 restart").

It's my impression that the problem is worse when I do reloads in parallel. So this is more evidence pointing to "we should ensure MAAS never tries to reload bind9 twice in parallel".

[1]:
First observed extreme sluggishness in resolving queries, which resolved itself after several seconds.
Then observed a crash (which the system subsequently recovered from): http://paste.ubuntu.com/25293751/
Then observed a deadlock with the same symptoms.

Mark Shuttleworth (sabdfl) wrote :

On 12/08/17 01:11, Mike Pontillo wrote:
> Finally, I think your last bullet requires more discussion before we can
> work on it. MAAS currently uses sudoers rules specific to the init
> system to start and stop services like bind9; we do not currently have
> permission to 'kill -9' arbitrary processes. I'm concerned that if we go
> down that road, we would open up the possibility that MAAS could
> erroneously (or due to a malicious attack) believe that bind9 isn't
> working and repeatedly kill it without good cause, or be convinced to
> 'kill -9' an incorrect process.

This bug causes named to be unresponsive to anything other than kill -9.

MAAS installs, configures, starts, and validates named's behaviour.
Assume there is no operator. Since kill -9 is necessary on occasion, it
follows that MAAS must have and must use that ability.

I could see MAAS trying it a few times and then giving up with a big
alert to the operators. But I absolutely think MAAS should treat this as
a bug in named which should be logged and managed nicely but nonetheless
handled transparently to users.

Mark

Mark Shuttleworth (sabdfl) wrote :

To avoid reloads in parallel, I think we should:

 * verify the reload happened (perhaps by checking the zone serial? see the sketch after this comment)
 * make sure we defer any subsequent reload by at least 10 seconds

Mark
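
A small sketch of the serial check, assuming MAAS knows the serial it just wrote into the zone file; the zone name and timings are placeholders:

# Hypothetical check that a reload took effect: compare the SOA serial that
# named is serving with the serial MAAS just wrote into the zone file.
import subprocess
import time
from typing import Optional

def served_serial(zone: str, server: str = "127.0.0.1") -> Optional[int]:
    out = subprocess.run(
        ["dig", "@" + server, "+short", zone, "SOA"],
        capture_output=True, text=True, timeout=5,
    ).stdout.split()
    # '+short' SOA output: mname rname serial refresh retry expire minimum
    return int(out[2]) if len(out) >= 3 else None

def wait_for_reload(zone: str, expected_serial: int, timeout: float = 30.0) -> bool:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if served_serial(zone) == expected_serial:
            return True       # reload verified; safe to allow the next one
        time.sleep(1)
    return False              # reload never took effect: log loudly / escalate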

Andreas Hasenack (ahasenack) wrote :

Maybe DNS dynamic updates could be used instead of zone reloads if just a
few IPs were added or removed.
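
For illustration only, a single-record dynamic update with dnspython might look like this (MAAS does not currently do this; the zone, key name, and secret are placeholders, and named would need an allow-update or update-policy entry for the key):

# Hypothetical dynamic update in place of a full reload, using dnspython.
import dns.query
import dns.tsigkeyring
import dns.update

# Placeholder TSIG key; in practice this is the shared secret named trusts.
keyring = dns.tsigkeyring.from_text({"maas-key": "MTIzNDU2Nzg5MGFiY2RlZg=="})

update = dns.update.Update("example.maas", keyring=keyring)
update.replace("node42", 300, "A", "10.0.0.42")      # add/replace one A record
response = dns.query.tcp(update, "127.0.0.1", timeout=5)
print(response.rcode())   # 0 (NOERROR) means the update was applied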

On Aug 12, 2017 06:22, "Mark Shuttleworth" <email address hidden>
wrote:

> To avoid reloads in parallel, I think we should:
>
> * verify the reload happened (perhaps checking zone serial?)
> * make sure we defer any subsequent reload by at least 10 seconds
>
> Mark

Mike Pontillo (mpontillo) wrote :

Mark, if you observe the deadlock again, can you run "systemctl stop bind9", wait a few minutes (at least 2, but maybe up to 5), and then check if bind9 successfully stops? It looks like systemd will (by default) resort to more aggressive methods to kill a service if it doesn't stop after ~90 seconds.

If the normal method of killing the bind9 service works, we can still avoid adding that scope and risk to MAAS. Rather, if we detect bind9 behaving badly, a stop/start cycle would also allow bind9 to properly shut down in most cases, and avoid any other bugs in BIND we might see as a side-effect of a "kill -9 <bind9-pid>" approach. (A human operator could troubleshoot those side effects, but it's more difficult for MAAS to anticipate, for example, why BIND might now fail to start up because of a lock file that was left on the filesystem when the 'kill -9' occurred.)

Mark Shuttleworth (sabdfl) wrote :

OK, will do, thanks :)

Changed in maas:
status: Triaged → In Progress
assignee: nobody → Blake Rouse (blake-rouse)
milestone: none → 2.3.0
Mark Shuttleworth (sabdfl) wrote :

OK, restarting via 'sudo service bind9 restart' does work in the end, it just takes a long time. The downside is that MAAS is not going to have an effective DNS for a few minutes, which is unacceptable.

Mike Pontillo (mpontillo) wrote :

Thanks. I wonder if we shouldn't fix this in Ubuntu by tweaking the systemd control files so that the timeout values are more acceptable for production use.
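
A hypothetical drop-in along those lines (the value is a placeholder, not a tested recommendation) would shorten the stop timeout so systemd escalates to SIGKILL well before its ~90-second default:

# /etc/systemd/system/bind9.service.d/override.conf  (hypothetical drop-in)
[Service]
# If named ignores SIGTERM on stop/restart, escalate to SIGKILL after 10 seconds
TimeoutStopSec=10

Running 'sudo systemctl daemon-reload' afterwards makes systemd pick up the override.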

Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
milestone: 2.3.0 → 2.3.0alpha2
Changed in maas:
status: Fix Committed → Fix Released
Joshua Powers (powersj) on 2019-02-19
tags: removed: server-next
Mark Shuttleworth (sabdfl) wrote :

Hold on, I think this bug is still problematic for MAAS and Ubuntu.

Joshua Powers (powersj) wrote :

Hey Mark, was cleaning up bug tags; still consider this an issue.

Sam Lee (sjwl) wrote :

Mark,

Do you have any updated repro steps?

I'm seeing this failure with MAAS v2.5.3. I suspect that when v2.5 moved the DNS logic from the region to the rack controller, some of the mitigation logic was lost, and thus this bug manifests more frequently.

When I compare our v2.5.3 install with our v2.4.2 install, the number of rndc reloads is vastly higher on v2.5.3.

[2.4.2]
journalctl -b -u bind9.service |grep received.control
Jun 22 00:22:05 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
Jun 22 00:22:08 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
Jun 22 00:22:54 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
Jun 24 16:27:06 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
Jun 25 13:53:34 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
Jun 25 13:53:41 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
Jun 25 13:54:51 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
Jun 25 13:55:22 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'

[2.5.3]
journalctl -b -u bind9.service |grep received.control
Jun 26 14:23:59 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:04 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:09 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:11 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:15 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:18 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:22 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:27 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:31 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:36 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:40 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:42 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'

I had to trim the 2.5.3 output because it was too long to fit in this comment, but as you can see, 2.5.3 is spamming reloads compared to 2.4.2. On 2.4.2 it may reload 4 times for the _entire day_, whereas 2.5.3 is doing hundreds if not thousands a day.

Mark Shuttleworth (sabdfl) wrote :

On 6/26/19 5:12 PM, Sam Lee wrote:
> When I compare our v2.5.3 install from our v2.4.2 install, the amount of
> rndc reloads is vastly more on v2.5.3.

Hi Sam, I don't see these issues any more, on 18.04 and 2.6. I see
reloads every few minutes on a stable MAAS (i.e. without a lot of
activity). Since 2.6 is brand new (2.6.0) you might want to hold off on
upgrading unless your cluster is for test purposes, but let us know if
you still see this on 2.6 when you get there.

Mark

Sam Lee (sjwl) wrote :

Hi Mark,

Still seeing it with 18.04 and 2.6. The sweet spot seems to be when MAAS is receiving lots of DNS requests while simultaneously doing DNS reloads (as you alluded to in this case).

I'm attempting to set up a simplified repro scenario, which basically will do this:

1) enlist 50+ new machines on an untagged subnet *with DNS left blank*, forcing nodes to send DNS queries to MAAS
2) leave each machine's PXE interface with an auto-assigned IP (so every deploy/release forces a DNS reload)
3) deploy and release (repeat until error)

will report back with findings.

Mark Shuttleworth (sabdfl) wrote :

OK, interesting. I really don't like the reloading strategy but am not
sure that BIND gives us many better options. Let us know what you find.

Mark

Sam Lee (sjwl) wrote :

OK - I was able to repro again, and this time with MAAS 2.6.

Here are the steps

PREP WORK
1) Have 50 machines in the Ready state, each with one interface enabled and configured as 'Auto assign' on the default VLAN PXE subnet (auto-assign so that every deploy/release causes MAAS to reload DNS)
2) Clear out any DNS entries in the PXE subnet (this forces nodes to send DNS queries to MAAS)
3) Settings-> Network Services -> DNS -> Upstream DNS -> enter valid upstream DNS IP
4) Settings-> Network Services -> DNS -> DNSSEC -> Automatic (for some reason this breaks Upstream DNS)
5) Verify that Upstream DNS is broken
a) Rescue Mode one machine
b) ssh to Rescue machine
c) dig www.google.com
d) (dig should timeout/fail)
e) MAAS->Settings-> Network Services -> DNS -> DNSSEC -> Disable
f) dig www.google.com
g) (dig should succeed)
h) MAAS->Settings-> Network Services -> DNS -> DNSSEC -> Automatic
i) Release Rescue machine

REPRO
1) run repro.py (attached, WARNING this code will use all machines available to MAAS)
2) wait up to 3 hours, checking if bind9 is hung by regularly running `sudo rndc status` on MAAS

Monitoring steps (optional):
(To see DNS query activity) in one ssh window to MAAS, run:
sudo tcpdump dst <your-rack-controller-ip> -i ens3 and dst port 53
(To see DNS reloads, and why) in another ssh window to MAAS, run:
sudo tail -f /var/log/maas/regiond.log | grep Reloaded -A 3

Sam Lee (sjwl) wrote :

repro.py attached

Sam Lee (sjwl) wrote :

repro.py attempts to trigger DNS queries during DNS Reloads.

It does so by first deploying all 50 machines.
Then, one by one (not all at once!), it releases a machine, waits, deploys the machine, and moves on to the next machine.

At some point one machine will be releasing (triggering reloads) while others are starting to deploy (issuing DNS queries). This is the sweet spot.

If one simply deploys all 50 machines simultaneously, then the DNS Reload would occur but without any DNS queries (because all machines have yet to PXE boot).
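
Since repro.py is only attached to the bug, here is a rough approximation of the loop it describes, driving the MAAS CLI from Python; the profile name and sleep times are placeholders, and the real script may differ:

# Hypothetical approximation of repro.py: deploy everything, then cycle
# machines one at a time so releases (reloads) overlap with deploys (queries).
import json
import subprocess
import time

PROFILE = "admin"   # placeholder MAAS CLI profile name

def maas(*args):
    out = subprocess.run(["maas", PROFILE, *args],
                         capture_output=True, text=True, check=True).stdout
    return json.loads(out) if out.strip() else None

machines = [m["system_id"] for m in maas("machines", "read")]

# Phase 1: deploy all machines.
for system_id in machines:
    maas("machine", "deploy", system_id)

time.sleep(1800)   # wait for deployments to settle (placeholder)

# Phase 2: release/redeploy one machine at a time, forever.
while True:
    for system_id in machines:
        maas("machine", "release", system_id)
        time.sleep(300)                  # releasing triggers a DNS reload
        maas("machine", "deploy", system_id)
        time.sleep(600)                  # deploying nodes issue DNS queries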

Sam Lee (sjwl) wrote :

I'm not sure why a "broken" Upstream DNS helps repro this bug, but I was not able to repro when the Upstream DNS was working.

Changed in bind9 (Ubuntu):
assignee: nobody → Blake Rouse (blake-rouse)
Dan Streetman (ddstreet) wrote :

This is a deadlock in bind9 code; thread A runs ns_client_endrequest->dns_view_detach->view_flushanddetach, which includes:

  LOCK(&view->lock);
  if (!RESSHUTDOWN(view))
    dns_resolver_shutdown(view->resolver);

inside dns_resolver_shutdown, for each resolver bucket, it does:

  for (i = 0; i < res->nbuckets; i++) {
    LOCK(&res->buckets[i].lock);

at this point, one of the bucket locks is already held (by thread B, below), so thread A is blocked: it holds view->lock while waiting for view->resolver->buckets[i].lock.

meanwhile, thread B runs dispatch->validated, and does:

  bucketnum = fctx->bucketnum;
  LOCK(&res->buckets[bucketnum].lock);

then while still holding that lock calls dns_validator_destroy->destroy->dns_view_weakdetach

which does:

  LOCK(&view->lock);

leaving thread A and thread B in a deadlock, with thread A waiting for the bucket.lock that thread B holds, and thread B waiting for the view->lock that thread A holds.
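
This is the classic AB-BA lock-ordering pattern; a tiny Python illustration (not bind9 code, with acquire timeouts so it reports the deadlock instead of hanging) shows the same interaction:

# Minimal illustration of the AB-BA ordering described above: thread A takes
# the view lock then a bucket lock; thread B takes the bucket lock then the
# view lock. Timeouts let the demo report the deadlock and exit.
import threading
import time

view_lock = threading.Lock()      # stands in for view->lock
bucket_lock = threading.Lock()    # stands in for res->buckets[i].lock

def thread_a():
    # view_flushanddetach: LOCK(&view->lock), then dns_resolver_shutdown:
    # LOCK(&res->buckets[i].lock)
    with view_lock:
        time.sleep(0.1)                           # widen the race window
        if not bucket_lock.acquire(timeout=2):
            print("thread A: stuck waiting for bucket lock (holds view lock)")
            return
        bucket_lock.release()

def thread_b():
    # validated: LOCK(&res->buckets[bucketnum].lock), then
    # dns_view_weakdetach: LOCK(&view->lock)
    with bucket_lock:
        time.sleep(0.1)
        if not view_lock.acquire(timeout=2):
            print("thread B: stuck waiting for view lock (holds bucket lock)")
            return
        view_lock.release()

a = threading.Thread(target=thread_a)
b = threading.Thread(target=thread_b)
a.start(); b.start()
a.join(); b.join()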

Dan Streetman (ddstreet) wrote :

This deadlock doesn't appear to be fixed in the latest upstream code.

Dan Streetman (ddstreet) wrote :

> Test build in ppa:

Eh, there is still a view->attributes field that needs lock protection; this isn't ready yet.

Dan Streetman (ddstreet) wrote :

OK, I updated the test build in the PPA with locking for view->attributes as well; this should fix this particular bind9 deadlock.

Changed in bind:
assignee: nobody → Dan Streetman (ddstreet)
assignee: Dan Streetman (ddstreet) → nobody
Dan Streetman (ddstreet) on 2019-07-18
Changed in bind9 (Ubuntu Eoan):
assignee: Blake Rouse (blake-rouse) → Dan Streetman (ddstreet)
Changed in bind9 (Ubuntu Disco):
assignee: nobody → Dan Streetman (ddstreet)
Changed in bind9 (Ubuntu Bionic):
assignee: nobody → Dan Streetman (ddstreet)
Changed in bind9 (Ubuntu Disco):
importance: Undecided → Medium
Changed in bind9 (Ubuntu Bionic):
importance: Undecided → Medium
Changed in bind9 (Ubuntu Eoan):
status: Triaged → In Progress
Changed in bind9 (Ubuntu Disco):
status: New → In Progress
Changed in bind9 (Ubuntu Bionic):
status: New → In Progress
Changed in bind9 (Ubuntu Eoan):
importance: High → Medium
Dan Streetman (ddstreet) wrote :

The first deadlock, from comment 27, is fixed/worked around.

Next deadlock is:

Thread A is the same as in comment 27: it holds view->lock and is waiting for the bucket lock in dns_resolver_shutdown.

now, thread B calls dispatch->authvalidated->nsecvalidate->create_fetch->dns_resolver_createfetch->dns_resolver_createfetch3 which does:

  LOCK(&res->buckets[bucketnum].lock);

then calls fctx_create->fcount_incr which does:

  LOCK(&fctx->res->lock);

where fctx->res is the view. So again, deadlock.

Dan Streetman (ddstreet) wrote :

Another test build ready.
https://launchpad.net/~ddstreet/+archive/ubuntu/lp1710278

Where will the next deadlock be? :)
