intermittent log errors in continuous loop

Bug #1891726 reported by Patricia Domingues
18
This bug affects 3 people
Affects Status Importance Assigned to Milestone
MAAS
Expired
Medium
Unassigned

Bug Description

seeing these critical failures intermittently:

`MAAS version: 2.8.1 (8567-g.c4825ca06)`

regiond.log:
```
2020-08-14 17:33:02 -: [critical] WSGI application error
...
          File "/snap/maas/7808/usr/lib/python3/dist-packages/twisted/python/failure.py", line 385, in raiseException
            raise self.value.with_traceback(self.tb)
        builtins.AttributeError: 'NoneType' object has no attribute 'writeHeaders'
```

rackd.log:
```
2020-08-14 17:34:34 provisioningserver.rpc.clusterservice: [critical] Failed to contact region. (While requesting RPC info at http://localhost:5240/MAAS).
...
          File "/snap/maas/7808/lib/python3.6/site-packages/provisioningserver/rpc/clusterservice.py", line 1534, in _serial_fetch_rpc_info
            response = yield self._fetch_rpc_info(url, orig_url)
        twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.defer.CancelledError: >]
```

Revision history for this message
Patricia Domingues (patriciasd) wrote :

attaching rackd.log and regiond.log

description: updated
Revision history for this message
Patricia Domingues (patriciasd) wrote :

not sure if it is related, but named was causing high CPU usage:
```
Tasks: 155 total, 3 running, 152 sleeping, 0 stopped, 0 zombie
%Cpu(s): 20.0 us, 0.0 sy, 0.0 ni, 80.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 64285.7 total, 61679.9 free, 1982.3 used, 623.4 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 62303.4 avail Mem

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 176167 root 20 0 2530252 232640 9708 S 127.8 0.4 1240:37 named
 175821 root 20 0 1041904 202132 22444 S 107.9 0.3 333:56.23 python3
 175819 root 20 0 1118808 210480 23112 S 107.6 0.3 339:09.09 python3
 175823 root 20 0 1112448 197032 22940 S 107.3 0.3 326:32.64 python3
 175822 root 20 0 1038156 198684 22680 S 107.0 0.3 329:51.00 python3
 241016 postgres 20 0 257300 52424 46020 R 4.6 0.1 0:02.60 postgres
 241086 postgres 20 0 257348 50936 44692 S 4.3 0.1 0:01.12 postgres
```

MAAS was not working, after restarting it, a few hours later, the log failures have returned.

Revision history for this message
Björn Tillenius (bjornt) wrote :

Is the bind9 debian package installed on that machine?

If so, please remove it, since that causes problems like that.

If not, is lxd installed and configured?

Changed in maas:
status: New → Incomplete
Revision history for this message
Patricia Domingues (patriciasd) wrote :

Yes! Bind9 is installed:
```
rc bind9 1:9.11.3+dfsg-1ubuntu1.12 amd64 Internet Domain Name Server
ii bind9-dnsutils 1:9.16.1-0ubuntu2.2 amd64 Clients provided with BIND 9
ii bind9-host 1:9.16.1-0ubuntu2.2 amd64 DNS Lookup Utility
ii bind9-libs:amd64 1:9.16.1-0ubuntu2.2 amd64 Shared Libraries used by BIND 9`
```
This MAAS was a deb-based 2.6.2 and than it was upgraded to snap 2.7.1 -> 2.8.1.
I'm deleting it and will update here if I see the same, many thanks.

Revision history for this message
dann frazier (dannf) wrote :

Thanks Patricia - from that output, it appears that some bind9 utilities are installed, but not the daemon. @bjornt: I assume you were concerned about the daemon?

dann frazier (dannf)
Changed in maas:
status: Incomplete → New
Revision history for this message
Alexandre Erwin Ittner (aittner) wrote :

Just a quick note here: there are two bind9 binaries running, one provided by the MAAS snap (running inside the container) and other from the host. This also happens on r815 and should not be causing any problem by itself.

However in the past we had a MAAS instance running on bates (on the host, not VMs or containers). That instance was long removed and that machine received several upgrades since then but some configuration remains on /etc/bind/ Not sure if bind9 (from host) can be using it and causing some conflict for a particular hostname that is only queried in some very specific moments.

ubuntu@bates:/etc/bind$ ps aux | grep named
bind 1838 0.0 2.4 6720356 1621412 ? Ssl Jul21 23:56 /usr/sbin/named -f -u bind
ubuntu 2049998 0.0 0.0 10760 2508 pts/23 S+ 09:07 0:00 grep --color=auto named
1000000 2796318 122 0.2 2530772 146496 ? Sl Aug18 1828:35 /snap/maas/7808/usr/sbin/named -c /var/snap/maas/7808/bind/named.conf -g

ubuntu@r815:~$ ps aux | grep named
bind 2349 0.2 0.3 5460076 228816 ? Ssl Jul01 156:13 /usr/sbin/named -f -u bind
1000000 1623520 73.1 0.3 2534152 233768 ? Sl Aug17 2448:33 /snap/maas/7808/usr/sbin/named -c /var/snap/maas/7808/bind/named.conf -g
ubuntu 1729275 0.0 0.0 13628 2648 pts/30 S+ 09:09 0:00 grep --color=auto named

Revision history for this message
Björn Tillenius (bjornt) wrote :

Ok, so that extra bind instance is definitely the cause of the problems. You can't install MAAS when you have bind running on the machine.

Since we're already monitoring the MAAS bind process and start it if it's not running, we should put in a check to see if another bind process is running. Then we could report that bind is broken, rather than trying to compete with the other process.

Bill will add a section in the install instructions about this as well.

Changed in maas:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Patricia Domingues (patriciasd) wrote :

This MAAS instance is a container hosted on bates machine, from where Alexandre commented:

```
ubuntu@bates:/etc/bind$ ps aux | grep named
bind 1838 0.0 2.4 6720356 1621412 ? Ssl Jul21 23:56 /usr/sbin/named -f -u bind
ubuntu 2049998 0.0 0.0 10760 2508 pts/23 S+ 09:07 0:00 grep --color=auto named
1000000 2796318 122 0.2 2530772 146496 ? Sl Aug18 1828:35 /snap/maas/7808/usr/sbin/named -c /var/snap/maas/7808/bind/named.conf -g
```

What do you mean by extra bind instance: the one in the Maas host? or the bind9 packages installed on MAAS instance (I've deleted them) ?

Revision history for this message
Patricia Domingues (patriciasd) wrote :

investigating this issue today, noticed that MAAS' DNS was pointing out to its host_IP:
```
   nameservers:
      addresses:
        - host_IP
        - ip_1
```

and as its host also has bind9 running, it should be causing the issue on MAAS.
I've removed `host_IP` and restarted MAAS.

Revision history for this message
Andrew Cloke (andrew-cloke) wrote :

Issue root caused to a manual misconfiguration of bind9 on MAAS LXD container. Marking "invalid".

Changed in maas:
status: Triaged → Invalid
Revision history for this message
dann frazier (dannf) wrote :

Wait, I'm confused. RE: Comments #6 & #7 - how is bind9 running on a LXD *host* causing issues with MAAS running in a LXD *container*? We previously verified that there is no other bind process running in the MAAS container.

And RE: comment #9 - why would it be a problem for a MAAS LXD container to point at a DNS server that happens to be running on the LXD host? I mean, if that DNS server is broken I can imagine other problems, but would this be a symptom?

I know that configuration changes have made this problem seem to go away, but I'd like to confirm that we've really identified the root cause, since that's the reason we're eating our own dog food :)

dann frazier (dannf)
Changed in maas:
status: Invalid → New
Revision history for this message
Alexandre Erwin Ittner (aittner) wrote :

Oh, the whole situation was a bit confusing, actually.

The first symptom we noticed was the DNS servers that the machines deployed by this MAAS were getting in their netplan config: 10.229.0.101 (bates' address, primary) and 10.229.0.1 (the correct one, secondary). So, the interference from bates actually happened but only because cloud-init was telling the machines to use it. Why? The MAAS container was configured (statically) to use these servers and MAAS passed this configuration for the machines it deployed.

Once we removed bates' address from the container DNS configuration the new machines only received the correct one.

Why we have a DNS server running on bates is still a mystery for me. It may be an old thing, from the time MAAS ran there, that can be safely removed or some service/system may be using it.

Revision history for this message
Bill Wear (billwear) wrote :

Added a note above the fold on the MAAS installation page:

If you have installed `bind9` or have it running, you will need to uninstall it before installing MAAS. You can check with `ps aux | grep named` to see if it's running. The `bind9` daemon interferes with MAAS operation and creates a number of unusual, hard-to-debug errors -- but don't worry, MAAS provides DNS and can work with existing DNS servers.

Revision history for this message
Andrew Cloke (andrew-cloke) wrote :

Thanks Bill. If you have MAAS running in a LXC container, does that advice apply to just the container, or to the combination of the container *and* the host.

Is there any information on how the DNS server on the host interferes with the MAAS-owned DNS server in the LXC container?

Many thanks!

Revision history for this message
dann frazier (dannf) wrote :

We are seeing this issue again, so the previous things that seemed to "fix it" have unfortunately not :(

One additional thing that we've determined is that the whole bind9 running on the host was *actually* just the bind9 process running in the container being *visible* on the host. That is, the only named process involved is the one MAAS manages:

Top from MAAS container:

Tasks: 141 total, 5 running, 136 sleeping, 0 stopped, 0 zombie
%Cpu(s): 6.4 us, 0.0 sy, 0.0 ni, 93.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 64285.7 total, 62462.0 free, 1374.5 used, 449.2 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 62911.2 avail Mem

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    543 root 20 0 2532852 145280 9660 S 126.6 0.2 61:42.61 named

Top from LXD host in which MAAS container resides:

Tasks: 1022 total, 1 running, 1021 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.6 us, 2.3 sy, 0.0 ni, 95.0 id, 0.0 wa, 0.0 hi, 0.2 si, 0.0 st
MiB Mem : 64285.7 total, 4431.3 free, 30569.9 used, 29284.4 buff/cache
MiB Swap: 65406.5 total, 65370.1 free, 36.4 used. 33818.6 avail Mem

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2144893 1000000 20 0 2532852 145280 9660 S 122.2 0.2 63:01.64 named

Revision history for this message
Patricia Domingues (patriciasd) wrote :

we do not have a second bind9 running on MAAS instance. I'm adding supervisor-run.log and named.log

Revision history for this message
Patricia Domingues (patriciasd) wrote :
Revision history for this message
Patricia Domingues (patriciasd) wrote :

I noticed this errors on the host system which MAAS is running in.
MAAS is a LXD container instance.

Revision history for this message
Sean Feole (sfeole) wrote :

While using a lab system for some kernel tests, the host suddenly dropped all dns. I was unable to resolv anything outside the lap. There was no logs to note from the machine under test, resolveconf was restarted however that did not appear to remedy the issue, nor was there any warnings or unusual messages in syslog

Revision history for this message
Alberto Donato (ack) wrote :

Hi, is this still happening? Would it be possible to have up to date regiond/rackd logs from the system, in that case?

Changed in maas:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for MAAS because there has been no activity for 60 days.]

Changed in maas:
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.