There is a MAAS (2.1.3) installation on 2 bare-metal nodes in HA mode (no load balancing):
- hrzagt1-f1-sc-support-01.hrzagt1.sc.in.XYZ (ctwya7): region and rack controller
- hrzagt1-f1-sc-support-02.hrzagt1.sc.in.XYZ (ssdcba): region and rack controller
Attached are configuration files from both nodes ("conf" directory).
This setup initially worked (all services showed green status in the MAAS GUI, and I was able to enlist and commission nodes), but over time something broke and nodes started to fail commissioning. The following symptoms have been observed:
1) The following error messages were found in the log files of the primary region controller (ctwya7):
2017-04-10 07:27:13 maasserver.rpc.regionservice: [info] Rack controller 'ssdcba' disconnected.
2017-04-10 07:27:13 RegionServer,10,::ffff:10.24.122.3: [info] RegionServer connection lost (HOST:IPv6Address(TCP, '::ffff:10.24.122.2', 5250) PEER:IPv6Address(TCP, '::ffff:10.24.122.3', 39596))
2017-04-10 07:27:13 maasserver.models.signals.power: [critical] Failed to update power state of machine after state transition.
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1184, in gotResult
_inlineCallbacks(r, g, deferred)
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1174, in _inlineCallbacks
deferred.errback()
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 434, in errback
self._startRunCallbacks(fail)
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 501, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/lib/python3/dist-packages/maasserver/models/signals/power.py", line 52, in eb_error
Node.DoesNotExist, UnknownPowerType, PowerProblem)
File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 342, in trap
self.raiseException()
File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 368, in raiseException
raise self.value.with_traceback(self.tb)
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/lib/python3/dist-packages/maasserver/models/node.py", line 2111, in confirm_power_driver_operable
missing_packages = yield power_driver_check(client, power_type)
twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.
2017-04-10 07:27:13 maasserver.models.signals.power: [critical] Failed to update power state of machine after state transition.
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1184, in gotResult
_inlineCallbacks(r, g, deferred)
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1174, in _inlineCallbacks
deferred.errback()
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 434, in errback
self._startRunCallbacks(fail)
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 501, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/lib/python3/dist-packages/maasserver/models/signals/power.py", line 52, in eb_error
Node.DoesNotExist, UnknownPowerType, PowerProblem)
File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 342, in trap
self.raiseException()
File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 368, in raiseException
raise self.value.with_traceback(self.tb)
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/lib/python3/dist-packages/maasserver/models/node.py", line 2111, in confirm_power_driver_operable
missing_packages = yield power_driver_check(client, power_type)
twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.
2017-04-10 07:27:15 twisted.python.log: [info] ::ffff:10.24.110.127 - - [10/Apr/2017:07:27:15 +0000] "GET /MAAS/rpc/ HTTP/1.0" 200 2104 "-" "provisioningserver.rpc.clusterservice.ClusterClientService"
2017-04-10 07:27:17 twisted.python.log: [info] ::1 - - [10/Apr/2017:07:27:16 +0000] "GET /MAAS/ HTTP/1.1" 200 5301 "http://10.24.110.100/MAAS/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
2017-04-10 07:27:19 twisted.python.log: [info] ::1 - - [10/Apr/2017:07:27:18 +0000] "GET /MAAS/ HTTP/1.1" 200 5301 "http://10.24.110.100/MAAS/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
2017-04-10 07:27:23 maasserver: [error] Error while calling DescribePowerTypes: RPC connection timed out to rack controller 'hrzagt1-f2-sc-support-02' (ssdcba).
2) The secondary rack controller (ssdcba) reports in the MAAS GUI that it is missing a connection to 1 region controller ("ssdcba.png" file).
3) The following error messages were found in the log files of the secondary rack controller (ssdcba):
2017-04-10 07:43:05 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(TCP, '::ffff:10.24.122.3', 40474) PEER:IPv6Address(TCP, '::ffff:10.24.122.2', 5250))
2017-04-10 07:43:05 twisted.internet.defer: [critical] Unhandled error in Deferred:
2017-04-10 07:43:05 twisted.internet.defer: [critical]
Traceback (most recent call last):
Failure: twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.
2017-04-10 07:43:05 provisioningserver.rackdservices.ntp: [critical] Failed to update NTP configuration.
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 434, in errback
self._startRunCallbacks(fail)
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 501, in _startRunCallbacks
self._runCallbacks()
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1184, in gotResult
_inlineCallbacks(r, g, deferred)
--- <exception caught here> ---
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/lib/python3/dist-packages/provisioningserver/rackdservices/ntp.py", line 71, in _getConfiguration
GetTimeConfiguration, system_id=client.localIdent)
twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.
2017-04-10 07:43:05 provisioningserver.utils.services: [critical] Failed to update and/or record network interface configuration: Connection to the other side was lost in a non-clean fashion.
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1184, in gotResult
_inlineCallbacks(r, g, deferred)
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1174, in _inlineCallbacks
deferred.errback()
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 434, in errback
self._startRunCallbacks(fail)
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 501, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 342, in trap
self.raiseException()
File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 368, in raiseException
raise self.value.with_traceback(self.tb)
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/lib/python3/dist-packages/provisioningserver/utils/services.py", line 556, in _configureNetworkDiscovery
self.getDiscoveryState)
twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.
2017-04-10 07:43:06 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(TCP, '::ffff:10.24.123.3', 60238) PEER:IPv6Address(TCP, '::ffff:10.24.123.2', 5251))
2017-04-10 07:43:06 twisted.internet.defer: [critical] Unhandled error in Deferred:
2017-04-10 07:43:06 twisted.internet.defer: [critical]
Traceback (most recent call last):
Failure: twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.
2017-04-10 07:43:06 twisted.internet.defer: [critical] Unhandled error in Deferred:
2017-04-10 07:43:06 twisted.internet.defer: [critical]
Traceback (most recent call last):
Failure: twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.
2017-04-10 07:43:06 twisted.internet.defer: [critical] Unhandled error in Deferred:
2017-04-10 07:43:06 twisted.internet.defer: [critical]
Traceback (most recent call last):
Failure: twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.
2017-04-10 07:43:06 provisioningserver.rackdservices.service_monitor_service: [critical] Failed to monitor services and update region.
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 434, in errback
self._startRunCallbacks(fail)
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 501, in _startRunCallbacks
self._runCallbacks()
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1184, in gotResult
_inlineCallbacks(r, g, deferred)
--- <exception caught here> ---
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/lib/python3/dist-packages/provisioningserver/rackdservices/service_monitor_service.py", line 94, in _updateRegion
services = yield self._buildServices(services)
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/lib/python3/dist-packages/provisioningserver/rackdservices/service_monitor_service.py", line 106, in _buildServices
status, status_info = yield state.getStatusInfo(service)
twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.
2017-04-10 07:43:06 provisioningserver.rackdservices.dhcp_probe_service: [critical] Unable to probe for DHCP servers.
(UNABLE TO OBTAIN TRACEBACK FROM EVENT)
2017-04-10 07:43:06 twisted.internet.defer: [critical] Unhandled error in Deferred:
2017-04-10 07:43:06 twisted.internet.defer: [critical]
Traceback (most recent call last):
Failure: twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.
2017-04-10 07:43:06 twisted.internet.defer: [critical] Unhandled error in Deferred:
2017-04-10 07:43:06 twisted.internet.defer: [critical]
Traceback (most recent call last):
Failure: twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.
2017-04-10 07:43:06 twisted.internet.defer: [critical] Unhandled error in Deferred:
2017-04-10 07:43:06 twisted.internet.defer: [critical]
Traceback (most recent call last):
Failure: twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.
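For anyone going through the attached logs, the disconnect events quoted above can be pulled out with a small script. This is only a rough sketch: the regular expression is derived solely from the "connection lost" lines shown in this report (other MAAS versions may log a different format), and the function name find_lost_connections is hypothetical, not part of MAAS.

```python
import re
from collections import Counter

# Pattern based only on the "connection lost" lines quoted in this report
# (assumption: the attached logs use the same format throughout).
CONNECTION_LOST = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .*connection lost "
    r"\(HOST:IPv6Address\(TCP, '(?P<host>[^']+)', (?P<host_port>\d+)\) "
    r"PEER:IPv6Address\(TCP, '(?P<peer>[^']+)', (?P<peer_port>\d+)\)\)"
)


def find_lost_connections(lines):
    """Return (timestamp, host, host_port, peer, peer_port) for each matching line."""
    events = []
    for line in lines:
        match = CONNECTION_LOST.search(line)
        if match:
            events.append((
                match["ts"],
                match["host"].replace("::ffff:", ""),  # drop the IPv4-mapped prefix
                int(match["host_port"]),
                match["peer"].replace("::ffff:", ""),
                int(match["peer_port"]),
            ))
    return events


# Two lines copied from the rack controller log above, plus one unrelated line.
sample = [
    "2017-04-10 07:43:05 ClusterClient,client: [info] ClusterClient connection lost "
    "(HOST:IPv6Address(TCP, '::ffff:10.24.122.3', 40474) "
    "PEER:IPv6Address(TCP, '::ffff:10.24.122.2', 5250))",
    "2017-04-10 07:43:06 ClusterClient,client: [info] ClusterClient connection lost "
    "(HOST:IPv6Address(TCP, '::ffff:10.24.123.3', 60238) "
    "PEER:IPv6Address(TCP, '::ffff:10.24.123.2', 5251))",
    "2017-04-10 07:43:06 twisted.internet.defer: [critical] Unhandled error in Deferred:",
]

events = find_lost_connections(sample)

# Count how often each region endpoint (peer address and port) was lost.
per_peer = Counter((peer, peer_port) for _, _, _, peer, peer_port in events)
for (peer, peer_port), count in sorted(per_peer.items()):
    print(f"{peer}:{peer_port} lost {count} time(s)")
```

On the two sample lines this reports one lost connection for each of 10.24.122.2:5250 and 10.24.123.2:5251. Feeding it the full log files instead of the sample list should show whether the drops cluster on one network or one region endpoint.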
Additional information:
1. Contents of /var/log/maas/*
Attached ("logs" directory).
2. Output from dpkg -l '*maas*'|cat
ctwya7:
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                            Version                        Architecture Description
+++-===============================-==============================-============-=============================================
un  maas                            <none>                         <none>       (no description available)
ii  maas-cli                        2.1.3+bzr5573-0ubuntu1~16.04.1 all          MAAS client and command-line interface
un  maas-cluster-controller         <none>                         <none>       (no description available)
ii  maas-common                     2.1.3+bzr5573-0ubuntu1~16.04.1 all          MAAS server common files
ii  maas-dhcp                       2.1.3+bzr5573-0ubuntu1~16.04.1 all          MAAS DHCP server
ii  maas-dns                        2.1.3+bzr5573-0ubuntu1~16.04.1 all          MAAS DNS server
ii  maas-proxy                      2.1.3+bzr5573-0ubuntu1~16.04.1 all          MAAS Caching Proxy
ii  maas-rack-controller            2.1.3+bzr5573-0ubuntu1~16.04.1 all          Rack Controller for MAAS
ii  maas-region-api                 2.1.3+bzr5573-0ubuntu1~16.04.1 all          Region controller API service for MAAS
ii  maas-region-controller          2.1.3+bzr5573-0ubuntu1~16.04.1 all          Region Controller for MAAS
un  maas-region-controller-min      <none>                         <none>       (no description available)
un  python-django-maas              <none>                         <none>       (no description available)
un  python-maas-client              <none>                         <none>       (no description available)
un  python-maas-provisioningserver  <none>                         <none>       (no description available)
ii  python3-django-maas             2.1.3+bzr5573-0ubuntu1~16.04.1 all          MAAS server Django web framework (Python 3)
ii  python3-maas-client             2.1.3+bzr5573-0ubuntu1~16.04.1 all          MAAS python API client (Python 3)
ii  python3-maas-provisioningserver 2.1.3+bzr5573-0ubuntu1~16.04.1 all          MAAS server provisioning libraries (Python 3)
ssdcba:
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                            Version                        Architecture Description
+++-===============================-==============================-============-=============================================
un  maas                            <none>                         <none>       (no description available)
ii  maas-cli                        2.1.3+bzr5573-0ubuntu1~16.04.1 all          MAAS client and command-line interface
un  maas-cluster-controller         <none>                         <none>       (no description available)
ii  maas-common                     2.1.3+bzr5573-0ubuntu1~16.04.1 all          MAAS server common files
ii  maas-dhcp                       2.1.3+bzr5573-0ubuntu1~16.04.1 all          MAAS DHCP server
ii  maas-dns                        2.1.3+bzr5573-0ubuntu1~16.04.1 all          MAAS DNS server
ii  maas-proxy                      2.1.3+bzr5573-0ubuntu1~16.04.1 all          MAAS Caching Proxy
ii  maas-rack-controller            2.1.3+bzr5573-0ubuntu1~16.04.1 all          Rack Controller for MAAS
ii  maas-region-api                 2.1.3+bzr5573-0ubuntu1~16.04.1 all          Region controller API service for MAAS
un  maas-region-controller-min      <none>                         <none>       (no description available)
un  python-django-maas              <none>                         <none>       (no description available)
un  python-maas-client              <none>                         <none>       (no description available)
un  python-maas-provisioningserver  <none>                         <none>       (no description available)
ii  python3-django-maas             2.1.3+bzr5573-0ubuntu1~16.04.1 all          MAAS server Django web framework (Python 3)
ii  python3-maas-client             2.1.3+bzr5573-0ubuntu1~16.04.1 all          MAAS python API client (Python 3)
ii  python3-maas-provisioningserver 2.1.3+bzr5573-0ubuntu1~16.04.1 all          MAAS server provisioning libraries (Python 3)
3. Instructions on how we can re-create your bug and what the expected result should be.
How to re-create: https://docs.ubuntu.com/maas/2.1/en/manage-ha
Expected result: the secondary rack controller (ssdcba) does not get disconnected.
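As a basic sanity check for the expected result, plain TCP reachability of the region RPC endpoints can be probed from the rack controller. The sketch below is an assumption-laden helper, not a MAAS tool: the function names can_connect and probe are hypothetical, and the addresses and ports in REGION_ENDPOINTS are simply the PEER values from the rack controller log lines quoted in this report, so they may differ after a service restart or on another setup.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


# PEER endpoints copied from the rack controller log in this report
# (assumption: the region RPC listeners still use these addresses and ports).
REGION_ENDPOINTS = [("10.24.122.2", 5250), ("10.24.123.2", 5251)]


def probe(endpoints, timeout=3.0):
    """Map each (host, port) endpoint to whether it is currently reachable."""
    return {
        endpoint: can_connect(endpoint[0], endpoint[1], timeout=timeout)
        for endpoint in endpoints
    }
```

Calling probe(REGION_ENDPOINTS) on ssdcba returns a dictionary with True or False per endpoint. A successful TCP connect only shows that the port is reachable; it does not prove that the RPC session between the rack and region controllers is healthy.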