MAAS

After a fresh install, cluster can't connect to region

Bug #1375594 reported by Andres Rodriguez on 2014-09-30

This bug affects 2 people

Affects   Status         Importance   Assigned to     Milestone
MAAS      Fix Released   Critical     Gavin Panella   MAAS 1.7.0

Bug Description

After a fresh install, the Cluster cannot connect to the region.

To get it connected, I restarted apache2, maas-cluster-register, and maas-cluster. After doing that, I see the following:

Sep 30 01:41:42 unleashed maas.cluster: [INFO] Starting cluster controller 63aa9a86-5265-4bec-86cf-a7645698cc1c.
Sep 30 01:41:42 unleashed maas.cluster: [INFO] Could not register with region controller: INTERNAL SERVER ERROR.
Sep 30 01:41:42 unleashed maas.lease_upload_service: [INFO] PeriodicLeaseUploadService starting.
Sep 30 01:41:43 unleashed maas.bootsources: [INFO] Updated boot sources cache.
Sep 30 01:41:47 unleashed maas.import-images: [INFO] Started importing boot images.
Sep 30 01:41:47 unleashed maas.power_monitor_service: [ERROR] This cluster (63aa9a86-5265-4bec-86cf-a7645698cc1c) is not recognised by the region.

A few seconds later:

Sep 30 01:41:47 unleashed maas.import-images: [WARNING] No resources found in Simplestreams repository u'http://192.168.1.37/MAAS/images-stream/streams/v1/index.json'. Is it correctly configured?
Sep 30 01:41:47 unleashed maas.import-images: [WARNING] Finished importing boot images, no boot images available.
Sep 30 01:42:44 unleashed maas: [WARNING] Failed to create Network when adding/editing cluster interface maas-tun0 with error [{u'netmask': [u'This netmask leaves no room for IP addresses.']}]. This is OK if it already exists.
Sep 30 01:42:45 unleashed maas: [WARNING] Failed to create Network when adding/editing cluster interface maas-tun1 with error [{u'netmask': [u'This netmask leaves no room for IP addresses.']}]. This is OK if it already exists.
Sep 30 01:42:46 unleashed maas: [WARNING] Failed to create Network when adding/editing cluster interface maas-tun2 with error [{u'netmask': [u'This netmask leaves no room for IP addresses.']}]. This is OK if it already exists.
Sep 30 01:42:48 unleashed maas.api: [INFO] New cluster controller registered: maas
Sep 30 01:43:25 unleashed maas.tftp: [WARNING] No boot images have been imported yet.

And the cluster is connected.

Related branches

lp:~allenap/maas/rpc-embrace-uncertainty

Merged into lp:~maas-committers/maas/trunk at revision 3150

Raphaël Badin (community): Approve on 2014-10-01

lp:~allenap/maas/rpc-resilience-when-advertising

Merged into lp:~maas-committers/maas/trunk at revision 3161

Graham Binns (community): Approve on 2014-10-01

Andres Rodriguez (andreserl) on 2014-09-30

Changed in maas:
importance:	Undecided → Critical
assignee:	nobody → Gavin Panella (allenap)

Gavin Panella (allenap) on 2014-09-30

Changed in maas:
assignee:	Gavin Panella (allenap) → nobody
milestone:	none → 1.7.0

Raphaël Badin (rvb) wrote on 2014-09-30:

Could not register with region controller: INTERNAL SERVER ERROR.

This indicates there is a problem registering the cluster; we need the django log to see what the problem is.

Raphaël Badin (rvb) wrote on 2014-09-30:

I reproduced the problem by installing 1.7.0~beta4+bzr3130-0ubuntu1~trusty1. Here is /var/log/maas/maas-django.log: http://paste.ubuntu.com/8465155/

Gavin Panella (allenap) wrote on 2014-10-01:

I think the problem is:

- Apache runs two processes.

- Both processes start an event-loop.

- RegionAdvertisingService starts in both event-loops.

- In one of them it fails to start. That is the "current transaction is
aborted" message. I'm not sure why that's happening though.

- One process is now not running RegionAdvertisingService. The RPC info
view (m.views.rpc.info) thus declares that there are *no* event-loops
running.

- Apache, I assume, round-robins between WSGI processes.

- When the cluster asks for RPC info from the good process, it gets
endpoint info for that process. It establishes a connection to the
event-loop in the good process.

- When it asks the broken process, it gets no endpoints. The cluster
then drops all connections to all event-loops.

- Because of the delays that the cluster uses, the logs will show that
RPC connections drop every 30s. Then, 2s later, they will come back
up.

- For users of the web UI or the API, it will appear as if the cluster
is connected only about half the time.

Ideas for fixing this:

- Make RegionAdvertisingService more resilient to failures when starting.

- Fix whatever it is that makes RegionAdvertisingService fail to start.

- Return a "don't know" response from the RPC info view when the
advertising service is not running, instead of a definitive "no
connections". On the cluster, skip past "don't know" responses.

I'll probably implement all of the above, for belt-n-braces.
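
A minimal sketch of the flapping described above and of the third idea (a "don't know" response that the cluster skips). This is not the code from the merged branches. Every name here, the "eventloops" key, the use of None for "don't know", and the address and port are assumptions made up for illustration, and the strict alternation between processes mirrors the round-robin guess above.

```python
from itertools import cycle, islice


def info_from_good_process():
    # Hypothetical RPC info response: the advertising service is running,
    # so this process reports its own event-loop endpoint (made-up values).
    return {"eventloops": {"good-process": [("region.example", 5250)]}}


def info_from_broken_process_old():
    # Old behaviour as described above: the advertising service is not
    # running, so the view definitively reports *no* event-loops.
    return {"eventloops": {}}


def info_from_broken_process_new():
    # Proposed behaviour: answer "don't know" (modelled here as None)
    # instead of a definitive "no connections".
    return {"eventloops": None}


def update_connections(current, info):
    """Return the cluster's connection table after one RPC info poll."""
    eventloops = info["eventloops"]
    if eventloops is None:
        # "Don't know": skip this response and keep existing connections.
        return current
    # Definitive answer: connect to what is listed, drop everything else.
    return dict(eventloops)


def simulate(views, polls):
    """Poll the views in strict rotation; record whether we are connected."""
    connections = {}
    history = []
    for view in islice(cycle(views), polls):
        connections = update_connections(connections, view())
        history.append(bool(connections))
    return history


old = simulate([info_from_good_process, info_from_broken_process_old], 4)
new = simulate([info_from_good_process, info_from_broken_process_new], 4)
print(old)
print(new)
```

With the old response the simulated cluster is connected on every other poll, which matches the "about half the time" symptom. With the "don't know" response it stays connected once the good process has answered.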

Changed in maas:
status:	New → In Progress
assignee:	nobody → Gavin Panella (allenap)

Gavin Panella (allenap) wrote on 2014-10-01:

I still don't know what the reason for the database issue is.

Changed in maas:
status:	In Progress → Fix Committed

Julian Edwards (julian-edwards) on 2014-11-19

Changed in maas:
status:	Fix Committed → Fix Released
