MAAS 1.7b7 probe-and-enlist-hardware causes cluster to stop working, it reregisters after about 10 minutes
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
MAAS | Fix Released | Critical | Blake Rouse |
1.7 | Fix Released | Critical | Blake Rouse |
Bug Description
Trying to recreate this bug enlisting SeaMicro SM15K on MAAS 1.7b7. We are trying to enlist the 64 nodes on an SM15K chassis into a MAAS 1.7b7 cluster. The following is what happens. Initially it seemed that the cluster had completely died, but this time I let it sit while I was gathering bits from logs and noticed that after about 10 minutes, the cluster recovered and re-registered itself. BUT for 10 minutes or so, it was unusable and non-responsive.
Also during this time, the MAAS Dashboard was very, VERY sluggish, I presume because django was doing something to query data from the cluster controller and having to wait on that.
ubuntu@
200 OK
Content-Length: 0
Content-Type: text/html; charset=utf-8
Date: Thu, 23 Oct 2014 15:23:36 GMT
Server: Apache/2.4.7 (Ubuntu)
Status: 200
Vary: Authorization,
X-Frame-Options: SAMEORIGIN
After running the above, if I go to the MAAS UI and click on a different tab, for example, the clusters tab if you're already on the nodes tab, the UI stops responding.
In maas-django.log, the only things I see are these ERROR entries every hour:
ERROR 2014-10-23 01:09:51,386 twisted {}
ERROR 2014-10-23 02:09:51,805 twisted {}
ERROR 2014-10-23 03:09:51,278 twisted {}
ERROR 2014-10-23 04:09:51,400 twisted {}
ERROR 2014-10-23 05:09:51,588 twisted {}
ERROR 2014-10-23 06:09:51,310 twisted {}
ERROR 2014-10-23 07:09:51,406 twisted {}
ERROR 2014-10-23 08:09:51,387 twisted {}
ERROR 2014-10-23 09:09:51,368 twisted {}
ERROR 2014-10-23 10:09:51,444 twisted {}
In maas.log, I see this:
ubuntu@
Thu Oct 23 10:31:29 CDT 2014
ubuntu@
Oct 23 09:29:50 utsa-maas maas.import-images: [INFO] Finished importing boot images, the region does not have any new images.
Oct 23 09:49:50 utsa-maas maas.import-images: [INFO] Started importing boot images.
Oct 23 09:49:50 utsa-maas maas.import-images: [INFO] Finished importing boot images, the region does not have any new images.
Oct 23 10:09:49 utsa-maas maas.bootresources: [INFO] Started importing of boot resources.
Oct 23 10:09:50 utsa-maas maas.import-images: [INFO] Started importing boot images.
Oct 23 10:09:50 utsa-maas maas.import-images: [INFO] Finished importing boot images, the region does not have any new images.
Oct 23 10:09:51 utsa-maas maas.bootresources: [INFO] Finished importing of boot resources.
Oct 23 10:09:51 utsa-maas maas.import-images: [INFO] Started importing boot images.
Oct 23 10:09:51 utsa-maas maas.import-images: [INFO] Finished importing boot images, the region does not have any new images.
Oct 23 10:23:36 utsa-maas maas.drivers.
Note that those maas.import-images messages appear every 20 minutes (at :09, :29 and :49 past the hour) until I try the SeaMicro import. At that point, the maas.import-images messages stop, even though there should have been another set at 10:29 (see the date output above, taken just before I tailed the log).
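The "another set was due at 10:29" reasoning can be checked mechanically. The following is a minimal sketch, not part of MAAS: `overdue_by` and `SAMPLE` are hypothetical names, the sample lines are copied from the maas.log excerpt above, and it assumes the syslog timestamps and the `date` output are in the same local timezone.

```python
from datetime import datetime, timedelta

# Lines copied from the maas.log excerpt in this report.
SAMPLE = [
    "Oct 23 10:09:50 utsa-maas maas.import-images: [INFO] Finished importing boot images, the region does not have any new images.",
    "Oct 23 10:09:51 utsa-maas maas.bootresources: [INFO] Finished importing of boot resources.",
    "Oct 23 10:09:51 utsa-maas maas.import-images: [INFO] Finished importing boot images, the region does not have any new images.",
]


def overdue_by(lines, tag, interval, now, year=2014):
    """Return how long the next `tag` message is overdue at time `now`.

    A negative result means the next message is not due yet.
    Returns None if no line contains `tag`.
    """
    stamps = [
        # Syslog timestamps omit the year; the first 15 characters
        # look like "Oct 23 10:09:51".
        datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        for line in lines
        if tag in line
    ]
    if not stamps:
        return None
    return now - (max(stamps) + interval)


if __name__ == "__main__":
    # Time taken from the `date` output shown before the log tail.
    now = datetime(2014, 10, 23, 10, 31, 29)
    print(overdue_by(SAMPLE, "maas.import-images", timedelta(minutes=20), now))
```

With the values from this report, the last maas.import-images line is at 10:09:51, so the next one was due at 10:29:51 and was already overdue when `date` was run at 10:31:29.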
In pserv.log, this is what I find. Note that there are NO entries for today, 2014-10-23 until a few minutes after I ran the probe-and-
2014-10-22 18:13:40-0500 [Uninitialized] Event-loop utsa-maas:pid=73139 (10.241.
fused.
2014-10-22 18:13:40-0500 [Uninitialized] Event-loop utsa-maas:pid=73139 (10.241.
fused.
2014-10-23 10:32:20-0500 [ClusterClient,
Traceback (most recent call last):
File "/usr/lib/
File "/usr/lib/
return maybeDeferred(
File "/usr/lib/
result = f(*args, **kw)
File "/usr/lib/
return maybeDeferred(
--- <exception caught here> ---
File "/usr/lib/
result = f(*args, **kw)
File "/usr/lib/
File "/usr/lib/
servers = find_seamicro15
File "/usr/lib/
servers = get_seamicro15k
File "/usr/lib/
for server in api.servers.list():
File "/usr/lib/
return self._list(
File "/usr/lib/
_resp, body = self.api.
File "/usr/lib/
return self._cs_
File "/usr/lib/
File "/usr/lib/
resp, body = self.request(url, method, **kwargs)
File "/usr/lib/
File "/usr/lib/
return session.
File "/usr/lib/
resp = self.send(prep, **send_kwargs)
File "/usr/lib/
r = adapter.
File "/usr/lib/
raise ConnectionError(e)
2014-10-23 10:32:20-0500 [ClusterClient,
2014-10-23 10:32:20-0500 [ClusterClient,
2014-10-23 10:32:20-0500 [ClusterClient,
Traceback (most recent call last):
Failure: twisted.
2014-10-23 10:32:20-0500 [ClusterClient,
Traceback (most recent call last):
Failure: twisted.
2014-10-23 10:32:20-0500 [ClusterClient,
2014-10-23 10:32:20-0500 [ClusterClient,
2014-10-23 10:32:20-0500 [ClusterClient,
2014-10-23 10:32:20-0500 [Uninitialized] ClusterClient connection established (HOST:IPv4Addre
2014-10-23 10:32:20-0500 [Uninitialized] ClusterClient connection established (HOST:IPv4Addre
2014-10-23 10:32:20-0500 [ClusterClient,
2014-10-23 10:32:20-0500 [ClusterClient,
2014-10-23 10:32:20-0500 [ClusterClient,
2014-10-23 10:32:20-0500 [ClusterClient,
It does seem that the cluster DID finally recover, though it was unresponsive for about 10 minutes. Note the last two pserv entries: the cluster was unregistered or unresponsive for about 10 minutes, then it re-registered and became active again.
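The pserv traceback passes through maybeDeferred straight into the SeaMicro probe and a blocking HTTP request. That is consistent with the probe running synchronously inside the cluster's event loop until the connection failed, although the logs here do not confirm it. The sketch below illustrates that general failure mode only. It uses the standard library's asyncio rather than Twisted (the Twisted analogue of the offloading step is `twisted.internet.threads.deferToThread`), `slow_probe` and `max_heartbeat_gap` are hypothetical names, and it is not the fix from the related branches.

```python
import asyncio
import time


def slow_probe(seconds):
    """Stand-in for a blocking network call (hypothetical, not MAAS code)."""
    time.sleep(seconds)


async def max_heartbeat_gap(offload, probe_seconds=0.3, beat=0.02):
    """Run a heartbeat next to one probe; return the longest gap between beats."""
    beats = [time.monotonic()]

    async def heartbeat():
        while True:
            await asyncio.sleep(beat)
            beats.append(time.monotonic())

    task = asyncio.create_task(heartbeat())
    await asyncio.sleep(beat * 3)  # let a few beats through first
    if offload:
        # Run the blocking call in a worker thread; the loop keeps servicing
        # the heartbeat while it waits.
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, slow_probe, probe_seconds)
    else:
        # Run the blocking call on the loop thread; nothing else can run
        # until it returns.
        slow_probe(probe_seconds)
    await asyncio.sleep(beat * 3)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return max(b - a for a, b in zip(beats, beats[1:]))


if __name__ == "__main__":
    print("blocking on the loop:", asyncio.run(max_heartbeat_gap(offload=False)))
    print("offloaded to a thread:", asyncio.run(max_heartbeat_gap(offload=True)))
```

When the probe blocks the loop, the longest heartbeat gap is at least as long as the probe itself. That matches the symptom described here, where periodic work such as the maas.import-images runs stopped until the probe failed.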
Related branches
- Andres Rodriguez (community): Approve
- Diff: 44 lines (+24/-0), 2 files modified:
  src/maasserver/rpc/nodes.py (+6/-0)
  src/maasserver/rpc/tests/test_nodes.py (+18/-0)
- Christian Reis (community): Approve
- Julian Edwards (community): Approve
- Diff: 44 lines (+24/-0), 2 files modified:
  src/maasserver/rpc/nodes.py (+6/-0)
  src/maasserver/rpc/tests/test_nodes.py (+18/-0)
- Andres Rodriguez (community): Approve
- Diff: 21 lines (+12/-1), 1 file modified:
  debian/changelog (+12/-1)
Changed in maas:
importance: Undecided → Critical
milestone: none → 1.7.0
Changed in maas:
status: New → Confirmed
Changed in maas:
status: Triaged → In Progress
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
milestone: next → none
Changed in maas:
status: Fix Committed → Fix Released
Attached are the logs from the maas server