MAAS 1.7b7 probe-and-enlist-hardware causes cluster to stop working, it reregisters after about 10 minutes

Bug #1384778 reported by Jeff Lane on 2014-10-23
18
This bug affects 2 people
Affects Status Importance Assigned to Milestone
MAAS
Critical
Blake Rouse
1.7
Critical
Blake Rouse

Bug Description

Trying to recreate this bug enlisting SeaMicro SM15K on MAAS 1.7b7. We are trying to enlist the 64 nodes on a SM15K chassis into a MAAS 1.7b7 cluster. The following is what happens. Initially it seemed that the cluster completely died, but this time I let it sit while I was gathering bits from logs and noticed that after about 10 minutes, the cluster recovered and re-registered itself. BUT for 10 minutes or so, it was unusable and non-responsive.

Also during this time, the MAAS Dashboard was very, VERY sluggish, I presume because django was doing something to query data from the cluster controller and having to wait on that.

ubuntu@utsa-maas:/var/log/maas$ maas demo node-group probe-and-enlist-hardware -d $uuid model=seamicro15k username=admin password=seamicro mac=00:21:53:12:ad:00 power_control=restapi2
200 OK

   Content-Length: 0
     Content-Type: text/html; charset=utf-8
             Date: Thu, 23 Oct 2014 15:23:36 GMT
           Server: Apache/2.4.7 (Ubuntu)
           Status: 200
             Vary: Authorization,Cookie
  X-Frame-Options: SAMEORIGIN

After running the above, if I go to the MAAS UI and click on a different tab, for example, the clusters tab if you're already on the nodes tab, the UI stops responding.

in maas-django.log the only things I see are these ERROR entries every hour:
ERROR 2014-10-23 01:09:51,386 twisted {}
ERROR 2014-10-23 02:09:51,805 twisted {}
ERROR 2014-10-23 03:09:51,278 twisted {}
ERROR 2014-10-23 04:09:51,400 twisted {}
ERROR 2014-10-23 05:09:51,588 twisted {}
ERROR 2014-10-23 06:09:51,310 twisted {}
ERROR 2014-10-23 07:09:51,406 twisted {}
ERROR 2014-10-23 08:09:51,387 twisted {}
ERROR 2014-10-23 09:09:51,368 twisted {}
ERROR 2014-10-23 10:09:51,444 twisted {}

in maas.log, I see this:
ubuntu@utsa-maas:/var/log/maas$ date
Thu Oct 23 10:31:29 CDT 2014
ubuntu@utsa-maas:/var/log/maas$ tail maas.log
Oct 23 09:29:50 utsa-maas maas.import-images: [INFO] Finished importing boot images, the region does not have any new images.
Oct 23 09:49:50 utsa-maas maas.import-images: [INFO] Started importing boot images.
Oct 23 09:49:50 utsa-maas maas.import-images: [INFO] Finished importing boot images, the region does not have any new images.
Oct 23 10:09:49 utsa-maas maas.bootresources: [INFO] Started importing of boot resources.
Oct 23 10:09:50 utsa-maas maas.import-images: [INFO] Started importing boot images.
Oct 23 10:09:50 utsa-maas maas.import-images: [INFO] Finished importing boot images, the region does not have any new images.
Oct 23 10:09:51 utsa-maas maas.bootresources: [INFO] Finished importing of boot resources.
Oct 23 10:09:51 utsa-maas maas.import-images: [INFO] Started importing boot images.
Oct 23 10:09:51 utsa-maas maas.import-images: [INFO] Finished importing boot images, the region does not have any new images.
Oct 23 10:23:36 utsa-maas maas.drivers.seamicro: [INFO] Probing for seamicro15k nodes with arguments ip=192.168.1.98, username=admin, password=seamicro, power_control=restapi2

Note that those maas.import-iages messages appear every 20 minutes on the 9 until I try the SeaMicro import. At that point, the maas.import-images messages stop happening, even though there should have been another set at 10:29 (see the date output above before I tail the log).

in pserv.log this is what I find. Note that there are NO entries for today, 2014-10-23 until a few minutes after I ran the probe-and-enlist-hardware command:
2014-10-22 18:13:40-0500 [Uninitialized] Event-loop utsa-maas:pid=73139 (10.241.1.193:40705): Connection was refused by other side: 111: Connection re
fused.
2014-10-22 18:13:40-0500 [Uninitialized] Event-loop utsa-maas:pid=73139 (10.241.1.193:40705): Connection was refused by other side: 111: Connection re
fused.
2014-10-23 10:32:20-0500 [ClusterClient,client] Unhandled Error
        Traceback (most recent call last):
          File "/usr/lib/python2.7/dist-packages/twisted/protocols/amp.py", line 889, in _commandReceived
            deferred = self.dispatchCommand(box)
          File "/usr/lib/python2.7/dist-packages/twisted/protocols/amp.py", line 946, in dispatchCommand
            return maybeDeferred(responder, box)
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 139, in maybeDeferred
            result = f(*args, **kw)
          File "/usr/lib/python2.7/dist-packages/twisted/protocols/amp.py", line 1035, in doit
            return maybeDeferred(aCallable, **kw).addCallback(
        --- <exception caught here> ---
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 139, in maybeDeferred
            result = f(*args, **kw)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/rpc/clusterservice.py", line 360, in add_seamicro15k
            power_control=power_control)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/drivers/hardware/seamicro.py", line 279, in probe_seamicro15k_and_enlist
            servers = find_seamicro15k_servers(ip, username, password, power_control)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/drivers/hardware/seamicro.py", line 266, in find_seamicro15k_servers
            servers = get_seamicro15k_servers(version, ip, username, password)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/drivers/hardware/seamicro.py", line 242, in get_seamicro15k_servers
            for server in api.servers.list():
          File "/usr/lib/python2.7/dist-packages/seamicroclient/v2/servers.py", line 77, in list
            return self._list("/servers")
          File "/usr/lib/python2.7/dist-packages/seamicroclient/base.py", line 53, in _list
            _resp, body = self.api.client.get(url)
          File "/usr/lib/python2.7/dist-packages/seamicroclient/client.py", line 173, in get
            return self._cs_request(url, 'GET', **kwargs)
          File "/usr/lib/python2.7/dist-packages/seamicroclient/client.py", line 158, in _cs_request
            method,**kwargs)
          File "/usr/lib/python2.7/dist-packages/seamicroclient/client.py", line 140, in _time_request
            resp, body = self.request(url, method, **kwargs)
          File "/usr/lib/python2.7/dist-packages/seamicroclient/client.py", line 114, in request
            **kwargs)
          File "/usr/lib/python2.7/dist-packages/requests/api.py", line 44, in request
            return session.request(method=method, url=url, **kwargs)
          File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 455, in request
            resp = self.send(prep, **send_kwargs)
          File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 558, in send
            r = adapter.send(request, **kwargs)
          File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 378, in send
            raise ConnectionError(e)
        requests.exceptions.ConnectionError: HTTPConnectionPool(host='192.168.1.98', port=80): Max retries exceeded with url: /v2.0/servers?username=admin&password=seamicro?username=admin&password=seamicro?username=admin&password=seamicro?username=admin&password=seamicro (Caused by <class 'socket.error'>: [Errno 110] Connection timed out)

2014-10-23 10:32:20-0500 [ClusterClient,client] ClusterClient connection lost (HOST:IPv4Address(TCP, '127.0.0.1', 39950) PEER:IPv4Address(TCP, u'127.0.0.1', 53349))
2014-10-23 10:32:20-0500 [ClusterClient,client] ClusterClient connection lost (HOST:IPv4Address(TCP, '127.0.0.1', 39950) PEER:IPv4Address(TCP, u'127.0.0.1', 53349))
2014-10-23 10:32:20-0500 [ClusterClient,client] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.

2014-10-23 10:32:20-0500 [ClusterClient,client] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.

2014-10-23 10:32:20-0500 [ClusterClient,client] Logged OOPS id OOPS-5ceee5f9016c8211b16f7f3d0d2971e6: ConnectionError: HTTPConnectionPool(host='192.168.1.98', port=80): Max retries exceeded with url: /v2.0/servers?username=admin&password=seamicro?username=admin&password=seamicro?username=admin&password=seamicro?username=admin&password=seamicro (Caused by <class 'socket.error'>: [Errno 110] Connection timed out)
2014-10-23 10:32:20-0500 [ClusterClient,client] Logged OOPS id OOPS-039edc75bbdcf4324aab4656d3e4959a: ConnectionDone: Connection was closed cleanly.
2014-10-23 10:32:20-0500 [ClusterClient,client] Logged OOPS id OOPS-da0710da830b08f4777a03ace32371f5: ConnectionDone: Connection was closed cleanly.
2014-10-23 10:32:20-0500 [Uninitialized] ClusterClient connection established (HOST:IPv4Address(TCP, '127.0.0.1', 51622) PEER:IPv4Address(TCP, u'127.0.0.1', 53349))
2014-10-23 10:32:20-0500 [Uninitialized] ClusterClient connection established (HOST:IPv4Address(TCP, '127.0.0.1', 51622) PEER:IPv4Address(TCP, u'127.0.0.1', 53349))
2014-10-23 10:32:20-0500 [ClusterClient,client] Event-loop 'utsa-maas:pid=73967' authenticated.
2014-10-23 10:32:20-0500 [ClusterClient,client] Event-loop 'utsa-maas:pid=73967' authenticated.
2014-10-23 10:32:20-0500 [ClusterClient,client] Cluster 'd98d9d2f-20cd-4f78-b081-e92f47dc002a' registered (via utsa-maas:pid=73967).
2014-10-23 10:32:20-0500 [ClusterClient,client] Cluster 'd98d9d2f-20cd-4f78-b081-e92f47dc002a' registered (via utsa-maas:pid=73967).

It does seem that the cluster DID recover finally, though it was unresponsive for about 10 minutes. Note the last to pserv entries. the cluster was unregistered or unresponsive for about 10 minutes, then it re-registered and became active again.

Related branches

Jeff Lane (bladernr) wrote :

Attached are the logs from the maas server

Changed in maas:
importance: Undecided → Critical
milestone: none → 1.7.0
Changed in maas:
status: New → Confirmed
Blake Rouse (blake-rouse) wrote :

Okay the issue, is the architecture that is passed back to CreateNode RPC call.

That call needs to handle the case, where a sub-architecture is not given. It should make the subarchitecture generic in that case.

Changed in maas:
status: Confirmed → Triaged
assignee: nobody → Blake Rouse (blake-rouse)
milestone: 1.7.0 → next
Changed in maas:
status: Triaged → In Progress
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
milestone: next → none
Changed in maas:
status: Fix Committed → Fix Released

Hello Jeff, or anyone else affected,

Accepted maas into utopic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/maas/1.7.5+bzr3369-0ubuntu1~14.10.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-needed
Andres Rodriguez (andreserl) wrote :

This issue has been verified to work both on upgrade and fresh install, and has been QA'd. Marking verification-done.

tags: added: verification-done
removed: verification-needed
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Bug attachments