After a fresh install, cluster can't connect to region

Bug #1375594 reported by Andres Rodriguez
This bug affects 2 people
Affects: MAAS
Status: Fix Released
Importance: Critical
Assigned to: Gavin Panella

Bug Description

After a fresh install, the Cluster cannot connect to the region.

To get it connected, I restarted apache2, maas-cluster-register, and maas-cluster. After doing that, I see the following:

Sep 30 01:41:42 unleashed maas.cluster: [INFO] Starting cluster controller 63aa9a86-5265-4bec-86cf-a7645698cc1c.
Sep 30 01:41:42 unleashed maas.cluster: [INFO] Could not register with region controller: INTERNAL SERVER ERROR.
Sep 30 01:41:42 unleashed maas.lease_upload_service: [INFO] PeriodicLeaseUploadService starting.
Sep 30 01:41:43 unleashed maas.bootsources: [INFO] Updated boot sources cache.
Sep 30 01:41:47 unleashed maas.import-images: [INFO] Started importing boot images.
Sep 30 01:41:47 unleashed maas.power_monitor_service: [ERROR] This cluster (63aa9a86-5265-4bec-86cf-a7645698cc1c) is not recognised by the region.

A few seconds later:

Sep 30 01:41:47 unleashed maas.import-images: [WARNING] No resources found in Simplestreams repository u'http://192.168.1.37/MAAS/images-stream/streams/v1/index.json'. Is it correctly configured?
Sep 30 01:41:47 unleashed maas.import-images: [WARNING] Finished importing boot images, no boot images available.
Sep 30 01:42:44 unleashed maas: [WARNING] Failed to create Network when adding/editing cluster interface maas-tun0 with error [{u'netmask': [u'This netmask leaves no room for IP addresses.']}]. This is OK if it already exists.
Sep 30 01:42:45 unleashed maas: [WARNING] Failed to create Network when adding/editing cluster interface maas-tun1 with error [{u'netmask': [u'This netmask leaves no room for IP addresses.']}]. This is OK if it already exists.
Sep 30 01:42:46 unleashed maas: [WARNING] Failed to create Network when adding/editing cluster interface maas-tun2 with error [{u'netmask': [u'This netmask leaves no room for IP addresses.']}]. This is OK if it already exists.
Sep 30 01:42:48 unleashed maas.api: [INFO] New cluster controller registered: maas
Sep 30 01:43:25 unleashed maas.tftp: [WARNING] No boot images have been imported yet.

And the cluster is connected.

Changed in maas:
importance: Undecided → Critical
assignee: nobody → Gavin Panella (allenap)
Gavin Panella (allenap)
Changed in maas:
assignee: Gavin Panella (allenap) → nobody
milestone: none → 1.7.0
Raphaël Badin (rvb) wrote :

Could not register with region controller: INTERNAL SERVER ERROR.

This indicates there is a problem registering the cluster; we need the django log to see what the problem is.

Raphaël Badin (rvb) wrote :

I reproduced the problem by installing 1.7.0~beta4+bzr3130-0ubuntu1~trusty1. Here is /var/log/maas/maas-django.log: http://paste.ubuntu.com/8465155/

Gavin Panella (allenap) wrote :

I think the problem is:

- Apache runs two processes.

- Both processes start an event-loop.

- RegionAdvertisingService starts in both event-loops.

- In one of them it fails to start. That is the "current transaction is
  aborted" message. I'm not sure why that's happening though.

- One process is now not running RegionAdvertisingService. The RPC info
  view (m.views.rpc.info) thus declares that there are *no* event-loops
  running.

- Apache, I assume, round-robins between WSGI processes.

- When the cluster asks for RPC info from the good process, it gets
  endpoint info for that process. It establishes a connection to the
  event-loop in the good process.

- When it asks the broken process, it gets no endpoints. The cluster
  then drops all connections to all event-loops.

- Because of the delays that the cluster uses, the logs will show that
  RPC connections drop every 30s. Then, 2s later, they will come back
  up.

- For users of the web UI or the API, it will appear as if the cluster
  is connected only about half the time.
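
The sequence above can be illustrated with a small simulation. This is only a sketch of the suspected failure mode, not MAAS code: the class names, the "eventloops" key, and the address and port are hypothetical stand-ins, and strict round-robin between WSGI processes is the same assumption made in the analysis.

```python
from itertools import cycle


class WSGIProcess:
    """Hypothetical stand-in for one Apache WSGI process hosting the region."""

    def __init__(self, name, advertising):
        self.name = name
        # True if RegionAdvertisingService started in this process.
        self.advertising = advertising

    def rpc_info(self):
        # With no advertising service, the view reports *no* event-loops,
        # which the cluster cannot tell apart from "nothing is running".
        if self.advertising:
            # Address and port are placeholders, not real MAAS values.
            return {"eventloops": {self.name: [("192.0.2.1", 5250)]}}
        return {"eventloops": {}}


class Cluster:
    """Hypothetical cluster that trusts every RPC info response."""

    def __init__(self):
        self.connections = set()

    def update(self, info):
        # Keep connections only to advertised event-loops; drop the rest.
        self.connections = set(info["eventloops"])


def simulate(processes, polls):
    """Poll RPC info `polls` times, alternating between processes."""
    cluster = Cluster()
    balancer = cycle(processes)  # assumed round-robin dispatch
    history = []
    for _ in range(polls):
        cluster.update(next(balancer).rpc_info())
        history.append(bool(cluster.connections))
    return history


good = WSGIProcess("good", advertising=True)
broken = WSGIProcess("broken", advertising=False)
print(simulate([good, broken], 6))
```

Under these assumptions the cluster alternates between connected and disconnected on each poll, which matches the "connected only about half the time" symptom.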

Ideas for fixing this:

- Make RegionAdvertisingService more resilient to failures starting.

- Fix whatever it is that makes RegionAdvertisingService fail to start.

- Return a "don't know" response from the RPC info view when the
  advertising service is not running, instead of a definitive "no
  connections". On the cluster, skip past "don't know" responses.
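
A rough sketch of the third idea, assuming the "don't know" answer is encoded as None under a hypothetical "eventloops" key; the function names and response shape are illustrative only and are not taken from the MAAS source.

```python
def rpc_info(advertising_running, endpoints):
    """Hypothetical RPC info view with a three-state answer."""
    if not advertising_running:
        # "Don't know": this process cannot say which event-loops exist.
        return {"eventloops": None}
    # Definitive answer, possibly empty ("no connections").
    return {"eventloops": dict(endpoints)}


def apply_info(current, info):
    """Return the set of event-loops the cluster should stay connected to."""
    eventloops = info.get("eventloops")
    if eventloops is None:
        # Skip past "don't know" responses and keep existing connections.
        return set(current)
    return set(eventloops)


connections = {"good"}
# A broken process answers "don't know": nothing is dropped.
connections = apply_info(connections, rpc_info(False, {}))
print(connections)
# A healthy process that really has no event-loops: connections are dropped.
connections = apply_info(connections, rpc_info(True, {}))
print(connections)
```

The point of the design is that only a process that is actually running the advertising service can tell the cluster to disconnect.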

I'll probably implement all of the above, for belt-n-braces.

Changed in maas:
status: New → In Progress
assignee: nobody → Gavin Panella (allenap)
Gavin Panella (allenap) wrote :

I still don't know what the reason for the database issue is.

Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released