Cluster gets disconnected and causes set power-type: KeyError: u'sm15k' exception and node never moves to failed state

Bug #1469742 reported by Larry Michel
This bug affects 1 person
Affects: MAAS
Status: Won't Fix
Importance: High
Assigned to: Unassigned

Bug Description

We are seeing occasional failures to power on sm15k servers: the server remains in the allocated state and does not change to deploying. regiond.log shows this error:

2015-06-28 14:51:07 [maasserver] ERROR: ################################ Exception: u'sm15k' ################################
2015-06-28 14:51:07 [maasserver] ERROR: Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/django/core/handlers/base.py", line 112, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python2.7/dist-packages/maasserver/api/support.py", line 52, in __call__
    response = upcall(request, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/django/views/decorators/vary.py", line 19, in inner_func
    response = func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/piston/resource.py", line 167, in __call__
    result = self.error_handler(e, request, meth, em_format)
  File "/usr/lib/python2.7/dist-packages/piston/resource.py", line 165, in __call__
    result = meth(request, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/maasserver/api/support.py", line 200, in dispatch
    return function(self, request, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/maasserver/api/nodes.py", line 411, in start
    form = Form(instance=node)
  File "/usr/lib/python2.7/dist-packages/maasserver/forms.py", line 766, in __init__
    self.set_up_power_type(data, instance)
  File "/usr/lib/python2.7/dist-packages/maasserver/forms.py", line 830, in set_up_power_type
    power_type]
KeyError: u'sm15k'

2015-06-28 14:51:07 [-] 127.0.0.1 - - [28/Jun/2015:14:51:07 +0000] "POST /MAAS/api/1.0/nodes/node-a2ed9f80-c4cd-11e3-8102-00163efc5068/?op=start HTTP/1.1" 500 8 "-" "Go 1.1 package http"

Tags: oil
Raphaël Badin (rvb) wrote :

It seems 'sm15k' is not among the power types supported by the cluster. Is this reproducible? (You're saying 'occasional failures'.) Can you check whether there is any indication that the cluster was disconnected at the time this happens?
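
To illustrate the failure mode described above, here is a minimal sketch, assuming the region builds its table of power types from whatever clusters are currently connected. All names (`get_power_types`, `lookup_power_parameters`, the dictionary layout) are hypothetical and are not taken from the MAAS source. With no connected cluster the table is empty, so a plain dictionary lookup raises `KeyError: 'sm15k'`; the sketch turns that into a more descriptive error.

```python
# Hypothetical sketch: these names are illustrative, not MAAS's actual code.


def get_power_types(connected_clusters):
    """Merge the power types reported by each connected cluster."""
    merged = {}
    for cluster in connected_clusters:
        merged.update(cluster["power_types"])
    return merged


def lookup_power_parameters(power_type, connected_clusters):
    """Look up a power type, raising a descriptive error if it is missing."""
    power_types = get_power_types(connected_clusters)
    try:
        return power_types[power_type]
    except KeyError:
        # A bare KeyError hides the likely cause (no cluster connected).
        raise LookupError(
            "power type %r is unknown; %d cluster(s) connected"
            % (power_type, len(connected_clusters))
        ) from None


# Assumed shape of a connected cluster's report.
clusters = [
    {"name": "OIL Cluster", "power_types": {"sm15k": {"fields": ["system_id"]}}}
]
```

Calling `lookup_power_parameters("sm15k", clusters)` returns the parameters, while `lookup_power_parameters("sm15k", [])` raises `LookupError` and reports that no cluster is connected.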

Changed in maas:
importance: Undecided → Critical
status: New → Incomplete
Larry Michel (lmic) wrote :

Yes. I see these errors:

ubuntu@maas-trusty-back-may22:~$ grep -i "cluster" /var/log/maas/regiond.log|grep ERROR
2015-06-28 14:51:20 [maasserver] ERROR: Unable to get RPC connection for cluster 'OIL Cluster' (037c960b-5b9f-4701-8366-eeda2c09d14e)
2015-06-28 14:51:26 [maasserver] ERROR: Unable to get RPC connection for cluster 'OIL Cluster' (037c960b-5b9f-4701-8366-eeda2c09d14e)
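
The same check can be scripted. The following is a rough sketch, assuming the log line layout shown in this report (`YYYY-MM-DD HH:MM:SS [maasserver] ERROR: ...`); the function name `correlate` and the 60-second window are arbitrary choices for illustration. It pairs each logged exception with the RPC connection errors that occur close to it in time.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
# 2015-06-28 14:51:20 [maasserver] ERROR: Unable to get RPC connection ...
LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[maasserver\] ERROR: (.*)$"
)


def correlate(lines, window=timedelta(seconds=60)):
    """Return (exception_time, [nearby_rpc_error_times]) pairs."""
    rpc_errors = []
    exceptions = []
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue  # traceback frames, access-log lines, blanks
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        message = match.group(2)
        if message.startswith("Unable to get RPC connection"):
            rpc_errors.append(stamp)
        elif "Exception:" in message:
            exceptions.append(stamp)
    return [
        (exc, [rpc for rpc in rpc_errors if abs(rpc - exc) <= window])
        for exc in exceptions
    ]
```

To use it, pass an open file, for example `correlate(open("/var/log/maas/regiond.log"))`; an exception with a non-empty list next to it occurred within the window of a cluster disconnection.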

Changed in maas:
status: Incomplete → New
summary: - Failure to start sm15k servers because of exception to set power-type:
- KeyError: u'sm15k'
+ Cluster gets disconnected and causes to set power-type: KeyError:
+ u'sm15k'
Changed in maas:
milestone: none → 1.9.0
Raphaël Badin (rvb) wrote : Re: Cluster gets disconnected and causes to set power-type: KeyError: u'sm15k'

Cluster log shows that bug 1446813 is happening again.

Raphaël Badin (rvb) wrote :

Hard to debug without a usable cluster log file…

Larry Michel (lmic)
summary: - Cluster gets disconnected and causes to set power-type: KeyError:
- u'sm15k'
+ Cluster gets disconnected and causes set power-type: KeyError: u'sm15k'
+ exception and node never moves to failed state
Larry Michel (lmic) wrote :

I tried to simulate a failure to power on by using incorrect credentials, to see the behavior on the juju client side. The node immediately moved to the Failed Deployment state, which allowed the deployment to continue and eventually time out.

With the KeyError failure to power on sm15k, the server appears stuck in an intermediate state and remains in the allocated state. This breaks juju_deployer, which waits for the juju client to return an entry for a machine that's started; however, jujuclient does not understand the allocated state, and this causes juju_deployer to wait forever. The workaround would be for juju_deployer to abort after timing out.
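
A minimal sketch of that workaround, assuming a caller-supplied `get_state` function that returns the machine's current state as a string. `wait_for_started`, its parameters, and the state strings are hypothetical and are not juju_deployer's or jujuclient's API.

```python
import time


def wait_for_started(get_state, timeout=600.0, interval=5.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll get_state() until it returns "started".

    Raises TimeoutError once `timeout` seconds have passed, so a machine
    stuck in another state (for example "allocated") cannot block forever.
    """
    deadline = clock() + timeout
    while True:
        state = get_state()
        if state == "started":
            return state
        if clock() >= deadline:
            raise TimeoutError(
                "machine still %r after %.0f s" % (state, timeout)
            )
        sleep(interval)
```

The `clock` and `sleep` parameters exist only so the loop can be exercised in tests without real waiting.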

I think that the cluster should be able to move servers in this state to the failed deployment state upon reconnecting.
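
One way to picture that suggestion is a reconciliation pass run when the cluster reconnects. This is only a sketch under stated assumptions: the node dictionaries, the `start_requested_at` field, the five-minute grace period, and `fail_stuck_nodes` itself are all hypothetical and do not describe how MAAS stores or transitions node state.

```python
from datetime import datetime, timedelta


def fail_stuck_nodes(nodes, now, grace=timedelta(minutes=5)):
    """Mark nodes whose start was requested but never progressed as failed.

    Returns the system_ids of the nodes that were changed.
    """
    failed = []
    for node in nodes:
        requested = node.get("start_requested_at")
        stuck = (
            node["status"] == "allocated"
            and requested is not None
            and now - requested > grace
        )
        if stuck:
            node["status"] = "failed deployment"
            failed.append(node["system_id"])
    return failed
```

Nodes that are allocated but were never asked to start are left alone, so only machines caught by the failed power-on request are moved.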

Changed in maas:
status: New → Incomplete
status: Incomplete → Confirmed
Gavin Panella (allenap)
Changed in maas:
status: Confirmed → Triaged
Blake Rouse (blake-rouse) wrote :

Is this still an error that you are seeing in OIL?

no longer affects: maas/1.8
Changed in maas:
status: Triaged → Incomplete
Larry Michel (lmic) wrote :

Yes, we are still seeing it. This is the last occurrence:

i/1.0/nodegroups/8afb6953-b2d0-4cd9-9f58-4ed6b57e04ef/boot-images/ HTTP/1.1" 503 - "-" "Go 1.1 package http"
2015-10-04 07:53:25 [maasserver] ERROR: Unable to get RPC connection for cluster 'OIL Cluster' (8afb6953-b2d0-4cd9-9f58-4ed6b57e04ef)
2015-10-04 07:53:25 [-] 127.0.0.1 - - [04/Oct/2015:07:53:24 +0000] "GET /MAAS/metadata/latest/by-id/node-c2a78510-4b5a-11e5-a2c2-00163e362f6f/?op=get_preseed HTTP/1.1" 503 93 "-" "Cloud-Init/0.7.5"
2015-10-04 07:53:26 [maasserver] ERROR: ################################ Exception: u'sm15k' ################################
2015-10-04 07:53:26 [maasserver] ERROR: Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/django/core/handlers/base.py", line 112, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python2.7/dist-packages/maasserver/api/support.py", line 52, in __call__
    response = upcall(request, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/django/views/decorators/vary.py", line 19, in inner_func
    response = func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/piston/resource.py", line 167, in __call__
    result = self.error_handler(e, request, meth, em_format)
  File "/usr/lib/python2.7/dist-packages/piston/resource.py", line 165, in __call__
    result = meth(request, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/maasserver/api/support.py", line 200, in dispatch
    return function(self, request, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/maasserver/api/nodes.py", line 412, in start
    form = Form(instance=node)
  File "/usr/lib/python2.7/dist-packages/maasserver/forms.py", line 767, in __init__
    self.set_up_power_type(data, instance)
  File "/usr/lib/python2.7/dist-packages/maasserver/forms.py", line 831, in set_up_power_type
    power_type]
KeyError: u'sm15k'

2015-10-04 07:53:26 [-] 127.0.0.1 - - [04/Oct/2015:07:53:25 +0000] "POST /MAAS/api/1.0/nodes/node-d8ed36b2-4b5a-11e5-a0fd-00163e362f6f/?op=start HTTP/1.1" 500 8 "-" "Go 1.1 package http"

Changed in maas:
status: Incomplete → New
Changed in maas:
status: New → Confirmed
Gavin Panella (allenap)
Changed in maas:
status: Confirmed → Triaged
Christian Reis (kiko)
Changed in maas:
importance: Critical → High
milestone: 1.9.0 → 1.9.1
Changed in maas:
milestone: 1.9.1 → 1.9.2
Changed in maas:
milestone: 1.9.2 → 1.9.3
Changed in maas:
milestone: 1.9.3 → 1.9.4
Changed in maas:
milestone: 1.9.4 → 1.9.5
Andres Rodriguez (andreserl) wrote :

We believe this is no longer an issue in the latest releases of MAAS. Please upgrade to the latest version of MAAS, and if you believe this issue is still present, please re-open this bug report or file a new one.

Changed in maas:
status: Triaged → Won't Fix