Rack controller disconnects when the regiond process can't find self in database.

Bug #1703686 reported by Andres Rodriguez
This bug affects 1 person
Affects  Status   Importance  Assigned to  Milestone
MAAS     Invalid  Critical    Unassigned
2.2      Invalid  Critical    Unassigned

Bug Description

It was reported that making certain changes to the MAAS model causes the rack controller to disconnect. The reporter's statement:

"just changing the fabrics the various VLANs and subnets belong to seems to set it off"

This results in the following log output:

2017-07-11 16:20:57 twisted.internet.protocol.Factory: [info] RegionServer connection established (HOST:IPv6Address(TCP, '::ffff:172.26.10.11', 5252) PEER:IPv6Address(TCP, '::ffff:172.26.10.11', 43936))
2017-07-11 16:20:57 twisted.internet.protocol.Factory: [info] RegionServer connection established (HOST:IPv6Address(TCP, '::ffff:172.26.17.11', 5251) PEER:IPv6Address(TCP, '::ffff:172.26.17.11', 46664))
2017-07-11 16:20:57 twisted.internet.protocol.Factory: [info] RegionServer connection established (HOST:IPv6Address(TCP, '::ffff:172.26.16.11', 5250) PEER:IPv6Address(TCP, '::ffff:172.26.16.11', 41964))
2017-07-11 16:20:57 maasserver.rpc.regionservice: [info] Rack controller authenticated from '::ffff:172.26.16.11:38184'.
2017-07-11 16:20:57 maasserver.rpc.regionservice: [info] Rack controller authenticated from '::ffff:172.26.10.11:43936'.
2017-07-11 16:20:57 maasserver.rpc.regionservice: [info] Rack controller authenticated from '::ffff:172.26.16.11:41964'.
2017-07-11 16:20:57 maasserver.rpc.regionservice: [info] Rack controller authenticated from '::ffff:172.26.17.11:46664'.

2017-07-11 16:22:45 regiond: [info] 172.26.11.11 GET /MAAS/rpc/ HTTP/1.0 --> 200 OK (referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService)
2017-07-11 16:22:45 twisted.internet.protocol.Factory: [info] RegionServer connection established (HOST:IPv6Address(TCP, '::ffff:172.26.16.11', 5253) PEER:IPv6Address(TCP, '::ffff:172.26.16.11', 38232))
2017-07-11 16:22:45 twisted.internet.protocol.Factory: [info] RegionServer connection established (HOST:IPv6Address(TCP, '::ffff:172.26.10.11', 5252) PEER:IPv6Address(TCP, '::ffff:172.26.10.11', 43982))
2017-07-11 16:22:45 twisted.internet.protocol.Factory: [info] RegionServer connection established (HOST:IPv6Address(TCP, '::ffff:172.26.17.11', 5251) PEER:IPv6Address(TCP, '::ffff:172.26.17.11', 46710))
2017-07-11 16:22:45 twisted.internet.protocol.Factory: [info] RegionServer connection established (HOST:IPv6Address(TCP, '::ffff:172.26.16.11', 5250) PEER:IPv6Address(TCP, '::ffff:172.26.16.11', 42006))
2017-07-11 16:23:16 maasserver.rpc.regionservice: [critical] Failed to register rack controller 'ns66rw' into the database. Connection will be dropped.

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1184, in gotResult
    _inlineCallbacks(r, g, deferred)
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
--- <exception caught here> ---
  File "/usr/lib/python3/dist-packages/maasserver/rpc/regionservice.py", line 613, in register
    process, rack_controller, self.host)
  File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 246, in inContext
    result = inContext.theWork()
  File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 262, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 81, in callWithContext
    return func(*args,**kw)
  File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 875, in callInContext
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 232, in wrapper
    result = func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/maasserver/utils/orm.py", line 686, in call_within_transaction
    return func_outside_txn(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/maasserver/utils/orm.py", line 504, in retrier
    return func(*args, **kwargs)
  File "/usr/lib/python3.5/contextlib.py", line 30, in inner
    return func(*args, **kwds)
  File "/usr/lib/python3/dist-packages/maasserver/rpc/regionservice.py", line 1289, in registerConnection
    process=process, address=host.host, port=host.port)
  File "/usr/lib/python3/dist-packages/django/db/models/manager.py", line 127, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/django/db/models/query.py", line 334, in get
    self.model._meta.object_name
maasserver.models.regioncontrollerprocessendpoint.DoesNotExist: RegionControllerProcessEndpoint matching query does not exist.
2017-07-11 16:23:16 maasserver.rpc.regionservice: [info] Rack controller 'ns66rw' disconnected.
2017-07-11 16:23:16 RegionServer,0,::ffff:172.26.16.11: [info] RegionServer connection lost (HOST:IPv6Address(TCP, '::ffff:172.26.16.11', 5253) PEER:IPv6Address(TCP, '::ffff:172.26.16.11', 37692))
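
The failure mode in the traceback above can be illustrated with a small stand-in. This is a minimal sketch, not MAAS or Django code: `EndpointTable`, `register_connection`, and the process name `regiond-0` are hypothetical, and the sketch only mimics a `get()` lookup that raises `DoesNotExist` when no endpoint row matches, after which the connection is dropped. It does not attempt to explain why the row is missing.

```python
# Simplified stand-in for the lookup that fails in the traceback.
# None of these classes are MAAS or Django code; they only mimic the shape
# of a `.get(...)` call raising DoesNotExist when no row matches.


class DoesNotExist(Exception):
    """Raised when a lookup matches no stored endpoint."""


class EndpointTable:
    """Hypothetical in-memory table of (process, address, port) endpoints."""

    def __init__(self):
        self._rows = set()

    def add(self, process, address, port):
        self._rows.add((process, address, port))

    def get(self, process, address, port):
        key = (process, address, port)
        if key not in self._rows:
            raise DoesNotExist(
                "RegionControllerProcessEndpoint matching query does not exist."
            )
        return key


def register_connection(table, process, address, port):
    """Return True if the connection is kept, False if it is dropped."""
    try:
        table.get(process, address, port)
    except DoesNotExist:
        # Mirrors the logged behaviour: registration fails and the
        # connection is dropped.
        return False
    return True


table = EndpointTable()
table.add("regiond-0", "172.26.16.11", 5250)  # illustrative values only

kept = register_connection(table, "regiond-0", "172.26.16.11", 5250)
dropped = register_connection(table, "regiond-0", "172.26.16.11", 5253)
```

In this sketch the second call fails only because no row was stored for port 5253. In the real report, the cause of the missing `RegionControllerProcessEndpoint` row was never established.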

Tags: registration
Changed in maas:
importance: Undecided → Critical
status: New → Triaged
milestone: none → 2.3.0
tags: added: registration
Blake Rouse (blake-rouse) wrote :

Can you provide more detail on the steps to reproduce this? Like the exact steps?

Changed in maas:
status: Triaged → Incomplete
summary: - Rack Controller disconnection
+ Rack controller disconnects when the regiond process can't find self in
+ database.
Andres Rodriguez (andreserl) wrote :

I believe the steps are:

1. Install MAAS.
2. Change the fabrics/VLANs from what was originally discovered.
3. The errors appear.

Lorenzo Cavassa (lorenzo-cavassa) wrote :

Yes, those are the steps we ran. The region and rack controllers are on the same server.

The server's ENI file is attached.

Blake Rouse (blake-rouse) wrote :

Did you change the fabrics/VLANs on the interfaces of a rack controller?

When you say it took a long time to come up, what took a long time to come up? Did the rack controller itself die, or did DHCP take a long time to come up?

I am still really confused about exactly what was done. I need a lot more detail.

Lorenzo Cavassa (lorenzo-cavassa) wrote :

Yes, the fabrics and the VLANs on the rack controller.

The rack controller takes a long time to reconnect reliably to the region controller. Until it does, the MAAS GUI shows all the rack controller services as down, and /var/log/maas/rackd.log shows the rack controller registering with the region controller and then being disconnected.

The rackd.log file is attached.
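
To gauge how often the region drops a rack controller this way, the region-side message quoted in the bug description can be searched for with a short script. This is a sketch that assumes the line format shown there; the function name `find_failed_registrations` and the group name `controller` are hypothetical.

```python
import re

# Matches the region log line quoted in the bug description:
# "<date> <time> maasserver.rpc.regionservice: [critical] Failed to register
#  rack controller '<name>' into the database. Connection will be dropped."
FAILED_REGISTRATION = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"maasserver\.rpc\.regionservice: \[critical\] "
    r"Failed to register rack controller '(?P<controller>[^']+)'"
)


def find_failed_registrations(lines):
    """Return (timestamp, controller) pairs for failed rack registrations."""
    failures = []
    for line in lines:
        match = FAILED_REGISTRATION.match(line)
        if match:
            failures.append((match.group("timestamp"), match.group("controller")))
    return failures


# Two lines copied from the log excerpt in the bug description.
sample = [
    "2017-07-11 16:23:16 maasserver.rpc.regionservice: [critical] Failed to "
    "register rack controller 'ns66rw' into the database. Connection will be dropped.",
    "2017-07-11 16:23:16 maasserver.rpc.regionservice: [info] Rack controller "
    "'ns66rw' disconnected.",
]
failures = find_failed_registrations(sample)
```

The function accepts any iterable of lines, so an open file object for the region log can be passed directly. The path of that log is not given in this report.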

Changed in maas:
milestone: 2.3.0 → 2.3.0beta2
Changed in maas:
milestone: 2.3.0beta2 → 2.3.0beta3
Changed in maas:
milestone: 2.3.0beta3 → 2.3.0beta4
Andres Rodriguez (andreserl) wrote :

I believe this has now been fixed in the latest versions. Marking as invalid.

Changed in maas:
status: Incomplete → Invalid
milestone: 2.3.0beta4 → none