after reboot maas server shows "One cluster is not yet connected to the region"

Bug #1563483 reported by rory schramm
This bug affects 2 people
Affects: MAAS
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned

Bug Description

After rebooting a MAAS 1.9 server, the web UI shows that the cluster has become disconnected from the region.

/var/log/maas/clusterd.log:

2016-03-29 10:55:45-0700 [Uninitialized] Region not available: Connection was refused by other side: 111: Connection refused. (While requesting RPC info at http://10.7.53.10/MAAS/rpc/).

However, running curl http://10.7.53.10/MAAS/rpc/ returns:

{"eventloops": {"maas:pid=1666": [["10.7.52.10", 5253], ["10.7.53.10", 5253], ["192.168.200.245", 5253], ["127.0.0.1", 5253]], "maas:pid=917": [["10.7.52.10", 5250], ["127.0.0.1", 5250], ["192.168.200.245", 5250], ["10.7.53.10", 5250]], "maas:pid=1643": [["10.7.53.10", 5252], ["127.0.0.1", 5252], ["10.7.52.10", 5252], ["192.168.200.245", 5252]], "maas:pid=1639": [["127.0.0.1", 5251], ["10.7.53.10", 5251], ["10.7.52.10", 5251], ["192.168.200.245", 5251]]}}
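For anyone who wants to repeat this check, here is a minimal Python sketch that takes the JSON returned by the rpc/ URL and tries a plain TCP connection to each advertised endpoint. It is not part of MAAS; the function names list_endpoints and can_connect are hypothetical, and the only assumption about the response is the "eventloops" shape shown above.

```python
import json
import socket


def list_endpoints(rpc_info_text):
    """Return sorted (eventloop, host, port) tuples from the RPC info JSON."""
    info = json.loads(rpc_info_text)
    endpoints = []
    for name, addresses in info["eventloops"].items():
        for host, port in addresses:
            endpoints.append((name, host, int(port)))
    return sorted(endpoints)


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def report(rpc_info_text):
    """Print one line per advertised endpoint with its reachability."""
    for name, host, port in list_endpoints(rpc_info_text):
        state = "open" if can_connect(host, port) else "unreachable"
        print(f"{name} {host}:{port} {state}")
```

Saving the curl output to a file and passing its contents to report() would show whether the cluster daemon could actually reach any of the ports the region advertises, which is a separate question from whether the HTTP request succeeds.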

dpkg output:

ii maas 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS server all-in-one metapackage
ii maas-cli 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS command line API tool
ii maas-cluster-controller 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS server cluster controller
ii maas-common 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS server common files
ii maas-dhcp 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS DHCP server
ii maas-dns 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS DNS server
ii maas-proxy 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS Caching Proxy
ii maas-region-controller 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS server complete region controller
ii maas-region-controller-min 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS Server minimum region controller
ii python-django-maas 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS server Django web framework
ii python-maas-client 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS python API client
ii python-maas-provisioningserver 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS server provisioning libraries

Andres Rodriguez (andreserl) wrote :

Hi Rory,

Can you confirm that your cluster never reconnected to your region after, say, a few minutes?

Changed in maas:
status: New → Incomplete
rory schramm (roryschramm) wrote :

Yes, I can. Also, this is an all-in-one MAAS server: region and cluster controller on the same server.

Mark Brown (mstevenbrown) wrote :

What other info should Rory supply for this report?

Andres Rodriguez (andreserl) wrote :

So does that mean that the cluster never connects? What if you restart the maas-clusterd process manually? Does it ever connect again?

Also, has the machine changed IP addresses (that is, are the interfaces on the MAAS server configured statically)? Can you verify that /etc/maas/clusterd.conf is pointing to the right address?
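One way to make that verification repeatable is sketched below in Python. It assumes the config file is a flat list of "key: value" lines with a maas_url key; the helper names (read_flat_config, maas_host, url_matches_local_address) are hypothetical and not part of MAAS.

```python
from urllib.parse import urlsplit


def read_flat_config(text):
    """Parse 'key: value' lines; assumes a flat file with no nesting."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        # Split on the first colon only, so URLs in the value stay intact.
        key, _, value = line.partition(":")
        settings[key.strip()] = value.strip()
    return settings


def maas_host(config_text):
    """Return the hostname part of maas_url, or None if the key is absent."""
    url = read_flat_config(config_text).get("maas_url")
    return urlsplit(url).hostname if url else None


def url_matches_local_address(config_text, local_addresses):
    """True if maas_url points at one of the server's own addresses."""
    return maas_host(config_text) in set(local_addresses)
```

Feeding it the contents of /etc/maas/clusterd.conf together with the addresses configured on the server's interfaces gives a quick yes/no answer. It only compares literal addresses, so a maas_url that uses a DNS name would need to be resolved first.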

Changed in maas:
milestone: none → 1.9.2
rory schramm (roryschramm) wrote : RE: [Bug 1563483] Re: after reboot maas server shows "One cluster is not yet connected to the region"

The cluster doesn't reconnect. How do you restart the MAAS services?

The machine has never changed IP addresses or network interfaces.

The network interfaces/IP are configured statically.

root@maas:/home/ubuntu# cat /etc/maas/clusterd.conf
cluster_uuid: d8833b27-41cf-434e-b4eb-b8ae5874ee7b
maas_url: http://10.7.53.10/MAAS

root@maas:/home/ubuntu# cat /etc/maas/regiond.conf
database_host: localhost
database_name: maasdb
database_pass: mypass
database_user: maas
maas_url: http://10.7.53.10/MAAS

root@maas:/home/ubuntu# ifconfig
eth0 Link encap:Ethernet HWaddr 00:50:56:b7:5f:b7
          inet addr:10.7.53.10 Bcast:10.7.53.255 Mask:255.255.255.0
          inet6 addr: fe80::250:56ff:feb7:5fb7/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:26384500 errors:0 dropped:0 overruns:0 frame:0
          TX packets:959713 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1896103029 (1.8 GB) TX bytes:807797649 (807.7 MB)

eth1 Link encap:Ethernet HWaddr 00:50:56:b7:fc:eb
          inet addr:192.168.200.245 Bcast:192.168.200.255 Mask:255.255.255.0
          inet6 addr: fe80::250:56ff:feb7:fceb/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:683837 errors:0 dropped:0 overruns:0 frame:0
          TX packets:796510 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:460468328 (460.4 MB) TX bytes:210636212 (210.6 MB)

eth2 Link encap:Ethernet HWaddr 00:50:56:b7:8e:b5
          inet addr:10.7.52.10 Bcast:10.7.52.255 Mask:255.255.255.0
          inet6 addr: fe80::250:56ff:feb7:8eb5/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:6 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4112 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:344 (344.0 B) TX bytes:630013 (630.0 KB)

root@maas:/home/ubuntu# cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
auto eth0
iface eth0 inet static
        address 10.7.53.10
        netmask 255.255.255.0
        network 10.7.53.0
        broadcast 10.7.53.255
        gateway 10.7.53.254
        # dns-* options are implemented by the resolvconf package, if installed
        dns-nameservers 10.7.53.10
        dns-search scalestack.local

auto eth1
iface eth1 inet static
        address 192.168.200.245/24

auto eth2
iface eth2 inet static
        address 10.7.52.10/24

rory schramm (roryschramm) wrote :

Boot images aren't updating either.

==> /var/log/maas/maas.log <==
Apr 4 16:22:18 maas maas.bootsources: [INFO] Updated boot sources cache.
Apr 4 16:22:38 maas maas.bootsources: [INFO] Updated boot sources cache.
Apr 4 16:22:38 maas maas.bootresources: [INFO] Started importing of boot images from 1 source(s).
Apr 4 16:22:43 maas maas.bootresources: [INFO] Importing images from source: http://images.maas.io/ephemeral-v2/daily/
Apr 4 16:22:54 maas maas.bootresources: [INFO] Finished importing of boot images from 1 source(s).
Apr 4 16:22:54 maas maas.import-images: [INFO] Started importing boot images.
Apr 4 16:22:56 maas maas.import-images: [INFO] Writing boot image metadata and iSCSI targets.
Apr 4 16:22:56 maas maas.import-images: [INFO] Installing boot images snapshot /var/lib/maas/boot-resources/snapshot-20160404-232255

==> /var/log/maas/clusterd.log <==
2016-04-04 16:23:03-0700 [-] Unhandled error in Deferred:
2016-04-04 16:23:03-0700 [-] Unhandled Error
        Traceback (most recent call last):
          File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
            self.__bootstrap_inner()
          File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
            self.run()
          File "/usr/lib/python2.7/threading.py", line 763, in run
            self.__target(*self.__args, **self.__kwargs)
        --- <exception caught here> ---
          File "/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 191, in _worker
            result = context.call(ctx, function, *args, **kwargs)
          File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
            return func(*args,**kw)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/twisted.py", line 200, in wrapper
            return func(*args, **kwargs)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/rpc/boot_images.py", line 115, in _run_import
            boot_resources.import_images(sources)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/import_images/boot_resources.py", line 274, in import_images
            install_boot_loaders(snapshot_path, image_descriptions.get_image_arches())
          File "/usr/lib/python2.7/dist-packages/provisioningserver/import_images/boot_resources.py", line 106, in install_boot_loaders
            boot_method.install_bootloader(destination)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/boot/powerkvm.py", line 72, in install_bootloader
            'main', 'ppc64el')
          File "/usr/lib/python2.7/dist-packages/provisioningserver/boot/utils.py", line 160, in get_updates_package
            package, archive, component, architecture, release=release)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/boot/utils.py", line 136, in get_pa...

rory schramm (roryschramm) wrote :

Maas cluster was "reconnected" briefly yesterday. However, cluster is
back to showing disconnected.

rory schramm (roryschramm) wrote : gui broken now too

Maas is not working at all now.

The webgui is broken as well - If I try to go to http://server/MAAS I get an "Internal Server error" page.

Can we get this escalated? This is clearly not working...and not getting addressed...

Andres Rodriguez (andreserl) wrote :

Looking at regiond.log, it seems you've run out of disk space, which may be why you are seeing all of these issues. Can you clean up your machine to free some space and, after you do that, restart all daemons?

2016-04-05 06:41:04 [RegionServer,31,10.7.53.10] Cluster 'd8833b27-41cf-434e-b4eb-b8ae5874ee7b' authenticated.
2016-04-05 06:41:15 [-] Unhandled failure dispatching AMP command. This is probably a bug. Please ensure that this error is handled within application code or declared in the signature of the Register command. [maas:pid=1643:cmd=Register:ask=2]
 Traceback (most recent call last):
   File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
     self.__bootstrap_inner()
   File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
     self.run()
   File "/usr/lib/python2.7/threading.py", line 763, in run
     self.__target(*self.__args, **self.__kwargs)
   File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/twisted.py", line 791, in _worker
     return super(ThreadPool, self)._worker()
 --- <exception caught here> ---
   File "/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 191, in _worker
     result = context.call(ctx, function, *args, **kwargs)
   File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
     return self.currentContext().callWithContext(ctx, func, *args, **kw)
   File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
     return func(*args,**kw)
   File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/twisted.py", line 200, in wrapper
     return func(*args, **kwargs)
   File "/usr/lib/python2.7/dist-packages/maasserver/utils/orm.py", line 501, in call_within_transaction
     return func_outside_txn(*args, **kwargs)
   File "/usr/lib/python2.7/dist-packages/maasserver/utils/orm.py", line 328, in retrier
     return func(*args, **kwargs)
   File "/usr/lib/python2.7/dist-packages/django/db/transaction.py", line 339, in inner
     return func(*args, **kwargs)
   File "/usr/lib/python2.7/dist-packages/django/db/transaction.py", line 305, in __exit__
     connection.commit()
   File "/usr/lib/python2.7/dist-packages/django/db/backends/__init__.py", line 168, in commit
     self._commit()
   File "/usr/lib/python2.7/dist-packages/django/db/backends/__init__.py", line 136, in _commit
     return self.connection.commit()
   File "/usr/lib/python2.7/dist-packages/django/db/utils.py", line 99, in __exit__
     six.reraise(dj_exc_type, dj_exc_value, traceback)
   File "/usr/lib/python2.7/dist-packages/django/db/backends/__init__.py", line 136, in _commit
     return self.connection.commit()
 django.db.utils.OperationalError: could not access status of transaction 0
 DETAIL: Could not write to file "pg_notify/0014" at offset 90112: No space left on device.
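A quick way to check for this condition before restarting the daemons is sketched below in Python. It is only an illustration: the function names are hypothetical, the 10% threshold is arbitrary, and which paths matter depends on the installation (/var/lib/postgresql as the PostgreSQL data location is an assumption here, not something taken from the log).

```python
import shutil


def free_fraction(path):
    """Return the free share (0.0 to 1.0) of the filesystem that holds path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Pseudo-filesystems can report a zero size; treat them as full.
        return 0.0
    return usage.free / usage.total


def low_space_paths(paths, minimum_free=0.10):
    """Return the paths whose filesystem has less than minimum_free free."""
    return [p for p in paths if free_fraction(p) < minimum_free]
```

Calling low_space_paths(["/var/lib/postgresql", "/var/lib/maas"]) would list whichever of those locations sits on a nearly full filesystem. shutil.disk_usage raises an error for a path that does not exist, so the list should be adjusted to the machine being checked.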

rory schramm (roryschramm) wrote :

Hi Andres,

It looks like maas is trying to use 1TB of storage for the boot images. Is there a way to set a limit for the number of snapshots it's keeping?

Is there a recommended size for storage to use for the maas server?

Andres Rodriguez (andreserl) wrote :

Hi Rory,

That's strange: 1 TB of images. Which images are you importing into MAAS? I've never seen that much space used for images. That said, every time MAAS updates to a new image, it deletes the old one from the DB.

I think the issue you are experiencing is that PostgreSQL is not cleaning up large objects and is leaving stuff behind when it shouldn't. If you are using 1.9, can you try running:

sudo maas-region-admin db_vacuum_lobjects

That should vacuum the large objects (images) that were deleted but not cleaned up by PostgreSQL.

Blake Rouse (blake-rouse) wrote :

Rory,

Also provide the output of:

"ls -l /var/lib/maas/boot-resources"

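Since ls -l only lists the entries and does not show how much space each one holds, a rough Python sketch like the following could complement that output by totalling the size of each directory under boot-resources. The function names are hypothetical and this is not a MAAS tool. Files that are hard-linked into several snapshots are counted once per link, so the totals may overstate real disk usage.

```python
import os


def tree_size(root):
    """Sum the sizes of regular files under root, without following symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def snapshot_sizes(boot_resources):
    """Map each directory directly under boot_resources to its size in bytes."""
    sizes = {}
    for entry in sorted(os.listdir(boot_resources)):
        path = os.path.join(boot_resources, entry)
        if os.path.isdir(path) and not os.path.islink(path):
            sizes[entry] = tree_size(path)
    return sizes
```

Running snapshot_sizes("/var/lib/maas/boot-resources") as a user that can read the directory would show whether old snapshot directories are accumulating there, as opposed to the space being held inside the database.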
rory schramm (roryschramm) wrote :

Sorry for the late response.

As a test I did a fresh install by doing the following:

apt-get update
apt-get dist-upgrade
add-apt-repository ppa:maas/stable
apt-get install maas

I then created a maas admin user.
I rebooted the server.

After the server rebooted the cluster had become disconnected. I tried this 3 more times and rebooting the server after a fresh install always resulted in the cluster becoming disconnected.

root@maas:/home/ubuntu# apt-cache policy maas
maas:
  Installed: 1.9.1+bzr4543-0ubuntu1~trusty1
  Candidate: 1.9.1+bzr4543-0ubuntu1~trusty1
  Version table:
 *** 1.9.1+bzr4543-0ubuntu1~trusty1 0
        500 http://ppa.launchpad.net/maas/stable/ubuntu/ trusty/main amd64 Packages
        100 /var/lib/dpkg/status
     1.7.6+bzr3376-0ubuntu2~14.04.1 0
        500 http://us.archive.ubuntu.com/ubuntu/ trusty-updates/main amd64 Packages
     1.5.4+bzr2294-0ubuntu1.2 0
        500 http://security.ubuntu.com/ubuntu/ trusty-security/main amd64 Packages
     1.5+bzr2252-0ubuntu1 0
        500 http://us.archive.ubuntu.com/ubuntu/ trusty/main amd64 Packages

root@maas:/home/ubuntu# cat /etc/apt/sources.list.d/maas-stable-trusty.list
deb http://ppa.launchpad.net/maas/stable/ubuntu trusty main
# deb-src http://ppa.launchpad.net/maas/stable/ubuntu trusty main

root@maas:/home/ubuntu# cat /etc/*-rel*
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.4 LTS"
NAME="Ubuntu"
VERSION="14.04.4 LTS, Trusty Tahr"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 14.04.4 LTS"
VERSION_ID="14.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"

root@maas:/home/ubuntu# df -h
Filesystem Size Used Avail Use% Mounted on
udev 2.0G 4.0K 2.0G 1% /dev
tmpfs 396M 720K 395M 1% /run
/dev/dm-3 11G 1.9G 8.5G 18% /
none 4.0K 0 4.0K 0% /sys/fs/cgroup
none 5.0M 0 5.0M 0% /run/lock
none 2.0G 0 2.0G 0% /run/shm
none 100M 0 100M 0% /run/user
/dev/sda1 236M 69M 155M 31% /boot
/dev/mapper/maas--vg-var--log 9.1G 40M 8.6G 1% /var/log
/dev/mapper/maas--vg-var--lib 28G 311M 26G 2% /var/lib
/dev/mapper/maas--vg-home 1.8G 2.9M 1.7G 1% /home

rory schramm (roryschramm) wrote :

added tarball of /var/log

Changed in maas:
status: Incomplete → New
Changed in maas:
milestone: 1.9.2 → 1.9.3
Changed in maas:
milestone: 1.9.3 → 1.9.4
Changed in maas:
milestone: 1.9.4 → 1.9.5
Changed in maas:
status: New → Incomplete
gdupont (ger-dupont) wrote :

Hi all,
Quite new on MaaS, but I think I'm facing the same issue.

I recently rebooted my MAAS controller (all-in-one installation) after a kernel update (which required a reboot), and suddenly maas-rackd is rejected when it tries to connect to the maas-region.

Not sure what information I can provide. Few things:
[version]
gdupont@ubuntu-maas:~$ sudo apt-cache policy maas
maas:
  Installed: 2.2.0+bzr6054-0ubuntu1~16.04.1
  Candidate: 2.2.0+bzr6054-0ubuntu1~16.04.1
  Version table:
 *** 2.2.0+bzr6054-0ubuntu1~16.04.1 500
        500 http://ppa.launchpad.net/maas/stable/ubuntu xenial/main amd64 Packages
        500 http://ppa.launchpad.net/maas/stable/ubuntu xenial/main i386 Packages
        100 /var/lib/dpkg/status
     2.1.5+bzr5596-0ubuntu1~16.04.1 500
        500 http://fr.archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages
        500 http://fr.archive.ubuntu.com/ubuntu xenial-updates/main i386 Packages
     2.0.0~beta3+bzr4941-0ubuntu1 500
        500 http://fr.archive.ubuntu.com/ubuntu xenial/main amd64 Packages
        500 http://fr.archive.ubuntu.com/ubuntu xenial/main i386 Packages

[rackd log]
2017-06-26 11:01:11 provisioningserver.rpc.clusterservice: [info] Event-loop 'ubuntu-maas:pid=1517' authenticated.
2017-06-26 11:01:11 provisioningserver.rpc.clusterservice: [info] Event-loop 'ubuntu-maas:pid=1503' authenticated.
2017-06-26 11:01:11 provisioningserver.rpc.clusterservice: [info] Event-loop 'ubuntu-maas:pid=1498' authenticated.
2017-06-26 11:01:11 provisioningserver.rpc.clusterservice: [info] Event-loop 'ubuntu-maas:pid=1499' authenticated.
2017-06-26 11:01:11 provisioningserver.rpc.clusterservice: [info] Rack controller REJECTED by the region (via ubuntu-maas:pid=1499).

Andres Rodriguez (andreserl) wrote :

We believe that this is no longer an issue in the latest releases of MAAS. If you believe this is still an issue, please re-open this bug report and target it accordingly.

Changed in maas:
status: Incomplete → Won't Fix