dhcpd doesn't start until maas-rackd is restarted

Bug #1794882 reported by Jason Hobbs
64
This bug affects 20 people
Affects Status Importance Assigned to Milestone
MAAS
Fix Released
High
Björn Tillenius
2.7
Fix Released
High
Björn Tillenius

Bug Description

On a signal node install with 2.4.2-7034-g2f5deb8b8-0ubuntu1, sometimes i have to restart maas-rackd to get dhcpd to start.

Here is ps output of before and after restarting maas-rackd:

https://pastebin.canonical.com/p/MBJCBdPtmj/

Related branches

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Revision history for this message
Lee Trager (ltrager) wrote :

I think this may be fixed by LP:1789302 and LP:1794970. The next time this happens could you please post the output of

systemctl status maas-dhcpd
systemctl status maas-rackd

Changed in maas:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for MAAS because there has been no activity for 60 days.]

Changed in maas:
status: Incomplete → Expired
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We saw this again on 2.6.0.

Changed in maas:
status: Expired → New
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

rackd.log after the restat http://paste.ubuntu.com/p/8GY5233Jn5/

Revision history for this message
Marian Gasparovic (marosg) wrote :

I see this quite often recently in my virtual environment, where maas is running in VM. Juju pod stays in commissioning state, when I check maas I see it did not get IP. As Jason mentioned, restarting regiond helps. I will try to get outputs next time I see it.

Revision history for this message
Marian Gasparovic (marosg) wrote :
Download full text (4.2 KiB)

ubuntu@infra1:~$ systemctl status maas-dhcpd
● maas-dhcpd.service - MAAS instance of ISC DHCP server for IPv4
   Loaded: loaded (/lib/systemd/system/maas-dhcpd.service; enabled; vendor preset: enabled)
   Active: inactive (dead)
Condition: start condition failed at Mon 2019-08-19 14:32:36 UTC; 2h 15min ago
     Docs: man:dhcpd(8)

Aug 19 14:22:42 infra1 dhcpd[2064]: execute_statement argv[11] = 600
Aug 19 14:22:42 infra1 dhcpd[2064]: execute_statement argv[12] = --hostname
Aug 19 14:22:42 infra1 dhcpd[2064]: execute_statement argv[13] = node0
Aug 19 14:22:42 infra1 dhcpd[2064]: execute_statement argv[14] = --socket
Aug 19 14:22:42 infra1 dhcpd[2064]: execute_statement argv[15] = /var/lib/maas/dhcpd.sock
Aug 19 14:22:42 infra1 dhcpd[2064]: reuse_lease: lease age 13 (secs) under 25% threshold, reply with unaltered, existing lease for 192.168.210.108
Aug 19 14:22:42 infra1 dhcpd[2064]: DHCPREQUEST for 192.168.210.108 from 52:54:6a:3c:53:04 (node0) via broam
Aug 19 14:22:42 infra1 dhcpd[2064]: DHCPACK on 192.168.210.108 to 52:54:6a:3c:53:04 (node0) via broam
Aug 19 14:22:43 infra1 systemd[1]: Stopping MAAS instance of ISC DHCP server for IPv4...
Aug 19 14:22:43 infra1 systemd[1]: Stopped MAAS instance of ISC DHCP server for IPv4.
ubuntu@infra1:~$ systemctl status maas-rackd
● maas-rackd.service - MAAS Rack Controller
   Loaded: loaded (/lib/systemd/system/maas-rackd.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2019-08-19 14:34:15 UTC; 2h 13min ago
     Docs: https://maas.io/
 Main PID: 27794 (sh)
    Tasks: 20 (limit: 4915)
   CGroup: /system.slice/maas-rackd.service
           ├─27794 /bin/sh -c exec /usr/sbin/rackd 2>&1 | tee -a $LOGFILE
           ├─27795 /usr/bin/python3 /usr/sbin/rackd
           ├─27796 tee -a /var/log/maas/rackd.log
           ├─27913 /usr/bin/python3 /usr/lib/maas/maas-common observe-beacons ens3
           ├─27928 /usr/bin/python3 /usr/lib/maas/maas-common observe-beacons virbr0
           ├─28064 sudo -n /usr/lib/maas/beacon-monitor virbr0
           ├─28065 /usr/sbin/tcpdump --interface virbr0 --direction=in --no-promiscuous-mode --packet-buffered --immediate-mode --snapshot-length=16384 -n -w - (udp dst port 5240) or (vlan and udp dst port 5240)
           ├─28066 sudo -n /usr/lib/maas/beacon-monitor ens3
           ├─28067 /usr/sbin/tcpdump --interface ens3 --direction=in --no-promiscuous-mode --packet-buffered --immediate-mode --snapshot-length=16384 -n -w - (udp dst port 5240) or (vlan and udp dst port 5240)
           ├─28995 /usr/bin/python3 /usr/lib/maas/maas-common observe-arp ens3
           ├─28996 /usr/bin/python3 /usr/lib/maas/maas-common observe-arp virbr0
           ├─28997 /usr/bin/python3 /usr/lib/maas/maas-common observe-mdns
           ├─29113 sudo -n /usr/lib/maas/network-monitor ens3
           ├─29114 /usr/sbin/tcpdump --interface ens3 --no-promiscuous-mode --packet-buffered --immediate-mode --snapshot-length=64 -n -w - arp or (vlan and arp)
           ├─29117 sudo -n /usr/lib/maas/network-monitor virbr0
           ├─29118 /usr/sbin/tcpdump --interface virbr0 --no-promiscuous-mode --packet-buffered --immediate-mode --snapshot-length=64 -n -w - arp or (vlan and arp...

Read more...

Alberto Donato (ack)
Changed in maas:
status: New → Triaged
importance: Undecided → Medium
Changed in maas:
importance: Medium → High
assignee: nobody → Blake Rouse (blake-rouse)
milestone: none → 2.7.0alpha1
Revision history for this message
Trent Lloyd (lathiat) wrote :

Hit this again on upgrade from 2.6.0 to 2.6.1. /var/lib/maas/dhcpd.conf doesn't exist so dhcpd fails to start. Restarting rackd fixes it as previously mentioned.

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

Seen on 2.6.2

Changed in maas:
milestone: 2.7.0b1 → 2.7.0b2
Changed in maas:
assignee: Blake Rouse (blake-rouse) → Alberto Donato (ack)
Revision history for this message
Alberto Donato (ack) wrote :

As Bjorn mentioned on https://bugs.launchpad.net/maas/+bug/1817663/comments/11 it's pretty easy to reproduce the issue by restarting the regiond with rackd running.

It seems in some cases the region doesn't tell the rackd to enable the DHCP at startup, so it stays off (it is stopped as soon as the rack disconnects from the region)

Changed in maas:
milestone: 2.7.0b2 → 2.7.0b3
Changed in maas:
assignee: Alberto Donato (ack) → Lee Trager (ltrager)
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Just hit this with maas 2.6.2-7841-ga10625be3-0ubuntu1~18.04.1

Changed in maas:
milestone: 2.7.0rc1 → 2.7.0rc2
assignee: Lee Trager (ltrager) → Björn Tillenius (bjornt)
Revision history for this message
Björn Tillenius (bjornt) wrote :

I think I've found the problem. The listener dictionary in PostgresListenerService is being read and written by multiple threads.

While registerChannels() iterate over the keys in the dictionary in one thread, other treads are calling the register() method that inserts new keys into the dictionary.

I noticed that the logs had this error:

  Unable to connect to database: dictionary changed size during iteration

Adding more debug logging, I confirmed that the listener dictionary is indeed
updated while registerChannels() is executing.

I tried to put a lock that protects whole of registerChannels() and whole
of register(). That seems to have solved the problem. I've restarted the
regiond server 10 times, without seeing that error, and dhcpd gets started
every time.

Tomorrow I'll clean up the fix, since we might need to protect other sections
of the code as well.

Changed in maas:
milestone: 2.7.0rc2 → 2.7.0rc3
Changed in maas:
milestone: 2.7.0rc3 → none
status: Triaged → In Progress
Changed in maas:
milestone: none → next
status: In Progress → Fix Committed
Alberto Donato (ack)
Changed in maas:
status: Fix Committed → Fix Released
milestone: next → 2.8.0b1
Revision history for this message
Bryan Larsen (bryan-larsen) wrote :

Is this supposed to be fixed in 2.7.1 (8261-g.5143564e6)? My dhcpd has always been tricky to start, and I'm not sure of the cause. And since I'm using MAAS to provide dhcp to the whole office, this is a major issue.

I installed maas using snap, so the logs and workarounds here don't appear to be available to me. It used to be that a `snap restart maas` or a reboot sometimes fixed this, but that no longer seems to be the case. I don't know how to reach into a snap to restart the regiond. If I kill rackd it restarts but doesn't help. regiond doesn't restart if I kill it.

`snap logs maas`:

```
2020-05-26T19:00:56Z systemd[1]: Started Service for snap application maas.supervisor.
2020-05-26T19:01:01Z rsyslogd[6791]: [origin software="rsyslogd" swVersion="8.32.0" x-pid="6791" x-info="http://www.rsyslog.com"] start
```

Revision history for this message
Bryan Larsen (bryan-larsen) wrote :

In my controller summary, regiond, bind9, ntp, proxy and syslog are marked green, rackd/tftp, dhcpd, dhcpd6, dns, ntp are marked red and the rest don't have a mark.

Revision history for this message
Björn Tillenius (bjornt) wrote :

This is supposed to be fixed in 2.7.1, yes. It seems like we still have a similar issue, but since it's not happening as frequent as the original issue, I'd like to keep this bug closed, and focus the effort in the new bug:

  https://bugs.launchpad.net/maas/+bug/1900832

It might be that it's a different issue, we're not sure at the moment.

Bryan, your issue looks different, though. I think we need the logs from /var/snap/maas/common/log to debug what's going on. If you could file a bug and attach those, that would be great.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.