regiond stops listening on the API port 5240 until regiond is restarted (listening sockets are lost) after database failover

Bug #1818703 reported by Dmitrii Shcherbakov
This bug affects 2 people
Affects  Status   Importance  Assigned to     Milestone
MAAS     Triaged  High        Alberto Donato

Bug Description

Forked from https://bugs.launchpad.net/maas/+bug/1817484

In some database failover cases, regiond loses its listening sockets and never re-establishes them, even though the regiond processes keep running and still respond to database notifications (e.g. they update the bind9 configuration after DNS record changes).

This was verified twice on independent test beds after several DB failovers:
https://bugs.launchpad.net/maas/+bug/1817484/comments/21 (without debug: true)
https://bugs.launchpad.net/maas/+bug/1817484/comments/29 (with debug: true)

See maas-vhost2 logs:
https://private-fileshare.canonical.com/~dima/maas-dumps/2019-03-01-maas-vhost1-2-3-etc-var-log.tar.gz

# no listening sockets
ubuntu@maas-vhost2:~$ sudo ss -tlpna 'sport = 5240'
State Recv-Q Send-Q Local Address:Port Peer Address:Port

# however, the processes are running
ubuntu@maas-vhost2:~$ pgrep -af regiond
1002 /bin/sh -c exec /usr/sbin/regiond 2>&1 | tee -a $LOGFILE
1004 /usr/bin/python3 /usr/sbin/regiond
1005 tee -a /var/log/maas/regiond.log
1967 /usr/bin/python3 /usr/sbin/regiond
1969 /usr/bin/python3 /usr/sbin/regiond
1972 /usr/bin/python3 /usr/sbin/regiond
1973 /usr/bin/python3 /usr/sbin/regiond
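
The same "process is alive but nothing is listening" check can be scripted. The following is only a minimal sketch, not part of MAAS: the function name is_port_listening and its defaults are assumptions made for illustration, and it merely tests whether a TCP connection to the port is accepted.

```python
import socket


def is_port_listening(host="127.0.0.1", port=5240, timeout=2.0):
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper for illustration; 5240 is the regiond API port
    named in this report.
    """
    try:
        # create_connection raises ConnectionRefusedError (an OSError)
        # when no socket is listening on the port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    state = "listening" if is_port_listening() else "NOT listening"
    print("regiond API port 5240 is", state)
```

A False result while pgrep still shows regiond processes would match the state described in this report.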

The zone file on maas-vhost2 has the same records as the one on maas-vhost1, so regiond there does receive DNS updates; however, nothing is listening on the API port 5240:

ubuntu@maas-vhost2:~$ cat /etc/bind/maas/zone.test
; Zone file modified: 2019-02-28 23:26:43.245638.
$TTL 30
#...
@ 30 IN NS maas.
maas-region 0 IN A 10.100.1.2

ubuntu@maas-vhost1:~$ cat /etc/bind/maas/zone.test
; Zone file modified: 2019-02-28 23:26:32.619195.
$TTL 30
#...
@ 30 IN NS maas.
maas-region 0 IN A 10.100.1.2

Our resource agent uses http://localhost:5240/MAAS for DNS update API calls, so the "Connection refused" errors it reports show the relevant timestamps:

root@maas-vhost2:~# grep -B1 -A1 -a refused /var/log/pacemaker.log
Feb 28 23:26:30 [1401] maas-vhost2 lrmd: notice: operation_finished: res_maas_region_hostname_start_0:2891:stderr [ sock.connect(sa) ]
Feb 28 23:26:30 [1401] maas-vhost2 lrmd: notice: operation_finished: res_maas_region_hostname_start_0:2891:stderr [ ConnectionRefusedError: [Errno 111] Connection refused ]
Feb 28 23:26:30 [1401] maas-vhost2 lrmd: notice: operation_finished: res_maas_region_hostname_start_0:2891:stderr [ ]
--
Feb 28 23:26:30 [1401] maas-vhost2 lrmd: notice: operation_finished: res_maas_region_hostname_start_0:2891:stderr [ raise URLError(err) ]
Feb 28 23:26:30 [1401] maas-vhost2 lrmd: notice: operation_finished: res_maas_region_hostname_start_0:2891:stderr [ urllib.error.URLError: <urlopen error [Errno 111] Connection refused> ]
Feb 28 23:26:30 [1401] maas-vhost2 lrmd: info: log_finished: finished - rsc:res_maas_region_hostname action:start call_id:25 pid:2891 exit-code:1 exec-time:9651ms queue-time:0ms

The sockets may have been gone even before that, because "request to http://127.0.0.1:5240/MAAS/metadata/2012-03-01/ failed" messages appear earlier in regiond.log.
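
To tell a lost listening socket apart from an API that answers with an HTTP error, a probe can classify the urllib failure the same way the traceback above surfaces it. This is a hedged sketch only: probe_api and its return strings are hypothetical names, not the resource agent's actual code, and the URL is the one quoted in this report.

```python
import urllib.error
import urllib.request


def probe_api(url="http://localhost:5240/MAAS/", timeout=5.0):
    """Classify one HTTP GET against the region API (illustrative helper)."""
    # An empty ProxyHandler bypasses any proxy configured in the
    # environment, so the probe really reaches the local port.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return "http-%d" % resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, so the listening socket exists.
        return "http-%d" % exc.code
    except urllib.error.URLError as exc:
        # No listener: the case logged as "[Errno 111] Connection refused".
        if isinstance(exc.reason, ConnectionRefusedError):
            return "refused"
        return "error"
    except OSError:
        # For example, a timeout while reading the response.
        return "error"


if __name__ == "__main__":
    print(probe_api())
```

Logging the result with a timestamp at each failover would narrow down when the socket disappears relative to the database switch.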

Tags: cpe-onsite
Changed in maas:
importance: Undecided → Critical
status: New → Triaged
assignee: nobody → Blake Rouse (blake-rouse)
milestone: none → 2.5.3
importance: Critical → High
Dmitrii Shcherbakov (dmitriis) wrote :

Subscribed ~field-high for tracking.

description: updated
Changed in maas:
milestone: 2.5.3 → 2.5.4
summary: - regiond sockets lost after database failover
+ regiond stops listening on the API port 5240 (listening sockets are
+ lost) after database failover
Changed in maas:
assignee: Blake Rouse (blake-rouse) → Alberto Donato (ack)
summary: - regiond stops listening on the API port 5240 (listening sockets are
- lost) after database failover
+ regiond stops listening on the API port 5240 until regiond is restarted
+ (listening sockets are lost) after database failover
Changed in maas:
milestone: 2.5.4 → none
Björn Tillenius (bjornt) wrote :

This is most likely the same issue as bug 1794882.
