MAAS

intermittent log errors in continuous loop

Bug #1891726 reported by Patricia Domingues on 2020-08-15

This bug affects 3 people

Affects		Status	Importance	Assigned to	Milestone
	MAAS	Expired	Medium	Unassigned

Bug Description

seeing these critical failures intermittently:

`MAAS version: 2.8.1 (8567-g.c4825ca06)`

regiond.log:
```
2020-08-14 17:33:02 -: [critical] WSGI application error
...
          File "/snap/maas/7808/usr/lib/python3/dist-packages/twisted/python/failure.py", line 385, in raiseException
            raise self.value.with_traceback(self.tb)
        builtins.AttributeError: 'NoneType' object has no attribute 'writeHeaders'
```

rackd.log:
```
2020-08-14 17:34:34 provisioningserver.rpc.clusterservice: [critical] Failed to contact region. (While requesting RPC info at http://localhost:5240/MAAS).
...
          File "/snap/maas/7808/lib/python3.6/site-packages/provisioningserver/rpc/clusterservice.py", line 1534, in _serial_fetch_rpc_info
            response = yield self._fetch_rpc_info(url, orig_url)
        twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.defer.CancelledError: >]
```

See original description

Revision history for this message

Patricia Domingues (patriciasd) wrote on 2020-08-15:

region_rack.tar.gz Edit (1019.3 KiB, application/x-tar)

attaching rackd.log and regiond.log

description:

updated

Revision history for this message

Patricia Domingues (patriciasd) wrote on 2020-08-15:

not sure if it is related, but named was causing high CPU usage:
```
Tasks: 155 total, 3 running, 152 sleeping, 0 stopped, 0 zombie
%Cpu(s): 20.0 us, 0.0 sy, 0.0 ni, 80.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 64285.7 total, 61679.9 free, 1982.3 used, 623.4 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 62303.4 avail Mem

MAAS was not working, after restarting it, a few hours later, the log failures have returned.

not sure if it is related, but named was causing high CPU usage: 
```
Tasks: 155 total,   3 running, 152 sleeping,   0 stopped,   0 zombie
%Cpu(s): 20.0 us,  0.0 sy,  0.0 ni, 80.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  64285.7 total,  61679.9 free,   1982.3 used,    623.4 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  62303.4 avail Mem

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                        
 176167 root      20   0 2530252 232640   9708 S 127.8   0.4   1240:37 named                                                                                                                                          
 175821 root      20   0 1041904 202132  22444 S 107.9   0.3 333:56.23 python3                                                                                                                                        
 175819 root      20   0 1118808 210480  23112 S 107.6   0.3 339:09.09 python3                                                                                                                                        
 175823 root      20   0 1112448 197032  22940 S 107.3   0.3 326:32.64 python3                                                                                                                                        
 175822 root      20   0 1038156 198684  22680 S 107.0   0.3 329:51.00 python3                                                                                                                                        
 241016 postgres  20   0  257300  52424  46020 R   4.6   0.1   0:02.60 postgres                                                                                                                                       
 241086 postgres  20   0  257348  50936  44692 S   4.3   0.1   0:01.12 postgres 
```

MAAS was not working, after restarting it, a few hours later, the log failures have returned.

Revision history for this message

Björn Tillenius (bjornt) wrote on 2020-08-18:

Is the bind9 debian package installed on that machine?

If so, please remove it, since that causes problems like that.

If not, is lxd installed and configured?

Changed in maas:
status:	New → Incomplete

Revision history for this message

Patricia Domingues (patriciasd) wrote on 2020-08-18:

Yes! Bind9 is installed:
```
rc bind9 1:9.11.3+dfsg-1ubuntu1.12 amd64 Internet Domain Name Server
ii bind9-dnsutils 1:9.16.1-0ubuntu2.2 amd64 Clients provided with BIND 9
ii bind9-host 1:9.16.1-0ubuntu2.2 amd64 DNS Lookup Utility
ii bind9-libs:amd64 1:9.16.1-0ubuntu2.2 amd64 Shared Libraries used by BIND 9`
```
This MAAS was a deb-based 2.6.2 and than it was upgraded to snap 2.7.1 -> 2.8.1.
I'm deleting it and will update here if I see the same, many thanks.

Revision history for this message

dann frazier (dannf) wrote on 2020-08-18:

Thanks Patricia - from that output, it appears that some bind9 utilities are installed, but not the daemon. @bjornt: I assume you were concerned about the daemon?

dann frazier (dannf) on 2020-08-18

Changed in maas:
status:	Incomplete → New

Revision history for this message

Alexandre Erwin Ittner (aittner) wrote on 2020-08-19:

Just a quick note here: there are two bind9 binaries running, one provided by the MAAS snap (running inside the container) and other from the host. This also happens on r815 and should not be causing any problem by itself.

However in the past we had a MAAS instance running on bates (on the host, not VMs or containers). That instance was long removed and that machine received several upgrades since then but some configuration remains on /etc/bind/ Not sure if bind9 (from host) can be using it and causing some conflict for a particular hostname that is only queried in some very specific moments.

ubuntu@bates:/etc/bind$ ps aux | grep named
bind 1838 0.0 2.4 6720356 1621412 ? Ssl Jul21 23:56 /usr/sbin/named -f -u bind
ubuntu 2049998 0.0 0.0 10760 2508 pts/23 S+ 09:07 0:00 grep --color=auto named
1000000 2796318 122 0.2 2530772 146496 ? Sl Aug18 1828:35 /snap/maas/7808/usr/sbin/named -c /var/snap/maas/7808/bind/named.conf -g

ubuntu@r815:~$ ps aux | grep named
bind 2349 0.2 0.3 5460076 228816 ? Ssl Jul01 156:13 /usr/sbin/named -f -u bind
1000000 1623520 73.1 0.3 2534152 233768 ? Sl Aug17 2448:33 /snap/maas/7808/usr/sbin/named -c /var/snap/maas/7808/bind/named.conf -g
ubuntu 1729275 0.0 0.0 13628 2648 pts/30 S+ 09:09 0:00 grep --color=auto named

Revision history for this message

Björn Tillenius (bjornt) wrote on 2020-08-20:

Ok, so that extra bind instance is definitely the cause of the problems. You can't install MAAS when you have bind running on the machine.

Since we're already monitoring the MAAS bind process and start it if it's not running, we should put in a check to see if another bind process is running. Then we could report that bind is broken, rather than trying to compete with the other process.

Bill will add a section in the install instructions about this as well.

Changed in maas:
status:	New → Triaged
importance:	Undecided → Medium

Revision history for this message

Patricia Domingues (patriciasd) wrote on 2020-08-20:

This MAAS instance is a container hosted on bates machine, from where Alexandre commented:

```
ubuntu@bates:/etc/bind$ ps aux | grep named
bind 1838 0.0 2.4 6720356 1621412 ? Ssl Jul21 23:56 /usr/sbin/named -f -u bind
ubuntu 2049998 0.0 0.0 10760 2508 pts/23 S+ 09:07 0:00 grep --color=auto named
1000000 2796318 122 0.2 2530772 146496 ? Sl Aug18 1828:35 /snap/maas/7808/usr/sbin/named -c /var/snap/maas/7808/bind/named.conf -g
```

What do you mean by extra bind instance: the one in the Maas host? or the bind9 packages installed on MAAS instance (I've deleted them) ?

Revision history for this message

Patricia Domingues (patriciasd) wrote on 2020-08-20:

investigating this issue today, noticed that MAAS' DNS was pointing out to its host_IP:
```
   nameservers:
      addresses:
        - host_IP
        - ip_1
```

and as its host also has bind9 running, it should be causing the issue on MAAS.
I've removed `host_IP` and restarted MAAS.

Revision history for this message

Andrew Cloke (andrew-cloke) wrote on 2020-08-24:

#10

Issue root caused to a manual misconfiguration of bind9 on MAAS LXD container. Marking "invalid".

Changed in maas:
status:	Triaged → Invalid

Revision history for this message

dann frazier (dannf) wrote on 2020-08-24:

#11

Wait, I'm confused. RE: Comments #6 & #7 - how is bind9 running on a LXD *host* causing issues with MAAS running in a LXD *container*? We previously verified that there is no other bind process running in the MAAS container.

And RE: comment #9 - why would it be a problem for a MAAS LXD container to point at a DNS server that happens to be running on the LXD host? I mean, if that DNS server is broken I can imagine other problems, but would this be a symptom?

I know that configuration changes have made this problem seem to go away, but I'd like to confirm that we've really identified the root cause, since that's the reason we're eating our own dog food :)

dann frazier (dannf) on 2020-08-24

Changed in maas:
status:	Invalid → New

Revision history for this message

Alexandre Erwin Ittner (aittner) wrote on 2020-08-24:

#12

Oh, the whole situation was a bit confusing, actually.

The first symptom we noticed was the DNS servers that the machines deployed by this MAAS were getting in their netplan config: 10.229.0.101 (bates' address, primary) and 10.229.0.1 (the correct one, secondary). So, the interference from bates actually happened but only because cloud-init was telling the machines to use it. Why? The MAAS container was configured (statically) to use these servers and MAAS passed this configuration for the machines it deployed.

Once we removed bates' address from the container DNS configuration the new machines only received the correct one.

Why we have a DNS server running on bates is still a mystery for me. It may be an old thing, from the time MAAS ran there, that can be safely removed or some service/system may be using it.

Revision history for this message

Bill Wear (billwear) wrote on 2020-08-25:

#13

Added a note above the fold on the MAAS installation page:

If you have installed `bind9` or have it running, you will need to uninstall it before installing MAAS. You can check with `ps aux | grep named` to see if it's running. The `bind9` daemon interferes with MAAS operation and creates a number of unusual, hard-to-debug errors -- but don't worry, MAAS provides DNS and can work with existing DNS servers.

Revision history for this message

Andrew Cloke (andrew-cloke) wrote on 2020-08-25:

#14

Thanks Bill. If you have MAAS running in a LXC container, does that advice apply to just the container, or to the combination of the container *and* the host.

Is there any information on how the DNS server on the host interferes with the MAAS-owned DNS server in the LXC container?

Many thanks!

Revision history for this message

dann frazier (dannf) wrote on 2020-08-28:

#15

We are seeing this issue again, so the previous things that seemed to "fix it" have unfortunately not :(

One additional thing that we've determined is that the whole bind9 running on the host was *actually* just the bind9 process running in the container being *visible* on the host. That is, the only named process involved is the one MAAS manages:

Top from MAAS container:

Tasks: 141 total, 5 running, 136 sleeping, 0 stopped, 0 zombie
%Cpu(s): 6.4 us, 0.0 sy, 0.0 ni, 93.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 64285.7 total, 62462.0 free, 1374.5 used, 449.2 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 62911.2 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
543 root 20 0 2532852 145280 9660 S 126.6 0.2 61:42.61 named

Top from LXD host in which MAAS container resides:

Tasks: 1022 total, 1 running, 1021 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.6 us, 2.3 sy, 0.0 ni, 95.0 id, 0.0 wa, 0.0 hi, 0.2 si, 0.0 st
MiB Mem : 64285.7 total, 4431.3 free, 30569.9 used, 29284.4 buff/cache
MiB Swap: 65406.5 total, 65370.1 free, 36.4 used. 33818.6 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2144893 1000000 20 0 2532852 145280 9660 S 122.2 0.2 63:01.64 named

Revision history for this message

Patricia Domingues (patriciasd) wrote on 2020-08-28:

#16

supervisor-run.log Edit (3.1 KiB, text/plain)

we do not have a second bind9 running on MAAS instance. I'm adding supervisor-run.log and named.log

Revision history for this message

Patricia Domingues (patriciasd) wrote on 2020-08-28:

#17

named.log Edit (49.5 MiB, text/plain)

Revision history for this message

Patricia Domingues (patriciasd) wrote on 2020-08-28:

#18

syslog_MAAS_host Edit (2.3 KiB, text/html)

I noticed this errors on the host system which MAAS is running in.
MAAS is a LXD container instance.

Revision history for this message

Sean Feole (sfeole) wrote on 2020-08-28:

#19

While using a lab system for some kernel tests, the host suddenly dropped all dns. I was unable to resolv anything outside the lap. There was no logs to note from the machine under test, resolveconf was restarted however that did not appear to remedy the issue, nor was there any warnings or unusual messages in syslog

Revision history for this message

Alberto Donato (ack) wrote on 2021-05-21:

#20

Hi, is this still happening? Would it be possible to have up to date regiond/rackd logs from the system, in that case?

Changed in maas:
status:	New → Incomplete

Revision history for this message

Launchpad Janitor (janitor) wrote on 2021-07-21:

#21

[Expired for MAAS because there has been no activity for 60 days.]

Changed in maas:
status:	Incomplete → Expired

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Bug attachments

Add attachment

Remote bug watches

Bug watches keep track of this bug in other bug trackers.