500 error reported to juju when starting node - "another action is already in progress for that node"

Bug #1453954 reported by Jason Hobbs
Affects: MAAS
Status: Fix Released
Importance: Critical
Assigned to: Blake Rouse
Milestone: 1.8.0

Bug Description

I've seen this at least with 1.8 beta4 and 1.8 beta5:

2015-05-09 04:38:38 [HTTPChannel,7,127.0.0.1] 500 Error - /MAAS/api/1.0/nodes/node-9db5ca7e-c4cd-11e3-824b-00163efc5068/
        Traceback (most recent call last):
          File "/usr/lib/python2.7/dist-packages/twisted/web/wsgi.py", line 315, in run
            appIterator = self.application(self.environ, self.startResponse)
          File "/usr/lib/python2.7/dist-packages/django/core/handlers/wsgi.py", line 206, in __call__
            response = self.get_response(request)
          File "/usr/lib/python2.7/dist-packages/maasserver/utils/views.py", line 213, in get_response
            response = get_response(request)
          File "/usr/lib/python2.7/dist-packages/maasserver/utils/views.py", line 202, in get_response
            request, get_resolver(None), sys.exc_info())
        --- <exception caught here> ---
          File "/usr/lib/python2.7/dist-packages/maasserver/utils/views.py", line 187, in get_response
            return django_get_response(request)
          File "/usr/lib/python2.7/dist-packages/maasserver/utils/orm.py", line 331, in __exit__
            self.fire()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/twisted.py", line 151, in wrapper
            return func(*args, **kwargs)
          File "/usr/lib/python2.7/dist-packages/maasserver/utils/async.py", line 180, in fire
            self._fire_in_reactor(hook).wait()
          File "/usr/lib/python2.7/dist-packages/crochet/_eventloop.py", line 219, in wait
            result.raiseException()
          File "<string>", line 2, in raiseException

        provisioningserver.rpc.exceptions.PowerActionAlreadyInProgress: Unable to change power state to 'on' for node hayward-29: another action is already in progress for that node.

2015-05-09 04:38:38 [-] 127.0.0.1 - - [09/May/2015:04:38:37 +0000] "POST /MAAS/api/1.0/nodes/node-9db5ca7e-c4cd-11e3-824b-00163efc5068/?op=start HTTP/1.1" 500 332 "-" "Go 1.1 package http"

This ends up messing with the node's status in juju:
   '5':
    agent-state-info: "gomaasapi: got error back from server: 500 INTERNAL SERVER\
      \ ERROR (<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"\
      http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\"\
      \ xml:lang=\"en\" lang=\"en\">\n <head>\n <title>Error: Internal server\
      \ error</title>\n </head>\n <body>\n <h2>\n Internal server error.\n\
      \ </h2>\n </body>\n</html>\n)"
    containers:
      5/lxc/0:
        instance-id: pending
        series: precise
    instance-id: pending
    series: precise

Tags: oil


Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Some more log context around that failure - there was a release operation on that node issued 5 seconds before the failing start operation:

http://paste.ubuntu.com/11084923/

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Here's some more from the maas.log:

http://paste.ubuntu.com/11084974/

This shows the node had completed the release operation and was powered off and ready at the time it was allocated and started.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Full maas.log and regiond.log - there is no cluster.log for some reason. There is a hole in our cluster logs from May 5 to May 9. It looks like clusterd just stopped logging.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Node event log events:

Node changed status - From 'Allocated' to 'Deploying' Sat, 09 May. 2015 04:38:36
Node changed status - From 'Ready' to 'Allocated' (to oil-slave-1) Sat, 09 May. 2015 04:38:36
Powering node off Sat, 09 May. 2015 04:38:33
Node changed status - From 'Deployed' to 'Ready' Sat, 09 May. 2015 04:38:33
Node powered off Sat, 09 May. 2015 04:35:31

Changed in maas:
importance: Undecided → Critical
milestone: none → 1.8.0
status: New → Triaged
Revision history for this message
Blake Rouse (blake-rouse) wrote :

I think I know why this is occurring. For a node to transition from Releasing to Ready, update_power_state on the Node must be called with power_state = 'off'. That call is what triggers the transition. There are two ways this gets called on the region, and both come from the cluster:

1. When a power action is performed and the power transition completes, the cluster makes the RPC call to update the power_state in the region.
2. The power monitor service makes the RPC call when it queries the node.

I think the combination of the two is what makes this issue occur: the power monitor will query a node even while a power action is being performed. This is the breakdown of what happens:

1. Release action is performed.
2. Node is placed in Releasing state.
3. Power off action is sent to the cluster.
4. Cluster performs the power off and waits a moment to check the result.
5. While the cluster is waiting to check the power state in the power action, the power monitor service performs the same check.
6. Power monitor makes the RPC call to update the node.
7. Node is transitioned to Ready.
8. The node is told to power on, since it's in the Ready state, but the power-off action has not finished.

The fix:

Have the power monitor service skip any node that currently has a power action in progress. This can easily be done, as we know the system_id of the node in both the power action and the power monitor service.
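
A minimal sketch of that guard (hypothetical names only; this is not the actual MAAS power monitor code): power actions register the node's system_id while they run, and the monitor pass skips any node found in that registry, so it cannot report a stale power state mid-action.

# Hypothetical sketch of the proposed guard -- not the real MAAS code.
# A power action registers the node's system_id while it is in flight;
# the power monitor skips any node found in that registry.

power_actions_in_progress = set()


def run_power_action(system_id, change_power_state):
    """Perform a power action, marking the node busy for its duration."""
    power_actions_in_progress.add(system_id)
    try:
        change_power_state(system_id)
    finally:
        power_actions_in_progress.discard(system_id)


def power_monitor_pass(system_ids, query_power_state, report_power_state):
    """One monitor sweep: skip nodes that have a power action in flight."""
    for system_id in system_ids:
        if system_id in power_actions_in_progress:
            # Reporting 'off' here is exactly the racy update that moved
            # the node from Releasing to Ready too early.
            continue
        report_power_state(system_id, query_power_state(system_id))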

Changed in maas:
status: Triaged → Fix Committed
Revision history for this message
Ashley Lai (alai) wrote :

I saw the same error message when I ran destroy-environment:

$ juju destroy-environment -y ci-oil-slave6

ERROR failed to destroy environment "ci-oil-slave6"

If the environment is unusable, then you may run

    juju destroy-environment --force

to forcefully destroy the environment. Upon doing so, review
your environment provider console for any resources that need
to be cleaned up. Using force will also by-pass destroy-envrionment block.

ERROR destroying environment: cannot release nodes: gomaasapi: got error back from server: 500 INTERNAL SERVER ERROR (<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Error: Internal server error</title>
  </head>
  <body>
    <h2>
      Internal server error.
    </h2>
  </body>
</html>
)

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Can you confirm the MAAS release?

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

1.8beta6

Revision history for this message
Larry Michel (lmic) wrote :

This looks like a recreate with 1.8 beta6:

  '1':
    agent-state-info: "gomaasapi: got error back from server: 500 INTERNAL SERVER\
      \ ERROR (<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"\
      http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\"\
      \ xml:lang=\"en\" lang=\"en\">\n <head>\n <title>Error: Internal server\
      \ error</title>\n </head>\n <body>\n <h2>\n Internal server error.\n\
      \ </h2>\n </body>\n</html>\n)"
    containers:
      1/lxc/0:
        instance-id: pending
        series: trusty
      1/lxc/1:
        instance-id: pending
        series: trusty
    instance-id: pending
    series: trusty

2015-05-18 02:41:47,112 [DEBUG] deployer.import: Adding units...
2015-05-18 02:41:48,516 [WARNING] deployer.import: Config specifies num units for subordinate: ceilometer-agent
2015-05-18 02:41:48,517 [DEBUG] deployer.import: Service 'cinder' does not need any more units added.
...
2015-05-18 03:02:20,984 [DEBUG] deployer.env: Delta unit: ceilometer/0 change:started
2015-05-18 03:58:44,149 [DEBUG] deployer.env: Connecting to environment...
2015-05-18 03:58:45,296 [DEBUG] deployer.env: Connected to environment
2015-05-18 03:58:45,297 [ERROR] deployer.import: Reached deployment timeout.. exiting
2015-05-18 03:58:45,299 [ERROR] oil_ci.deploy.oil_deployer: Deployment failed:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/oil_ci/deploy/oil_deployer.py", line 140, in deploy
    debug=True, verbose=True)
  File "/usr/lib/python2.7/dist-packages/oil_ci/juju/juju_deployer.py", line 85, in run_deployer
    importer.Importer(env, deploy, options=opts).run()
  File "/usr/lib/python2.7/dist-packages/deployer/action/importer.py", line 214, in run
    self.check_timeout()
  File "/usr/lib/python2.7/dist-packages/deployer/action/importer.py", line 165, in check_timeout
    raise ErrorExit()
ErrorExit
2015-05-18 03:58:45,301 [INFO] oil_ci.deploy.oil_deployer: Saving deployment artifacts to: ./artifacts
+ rc=1
+ echo 'Deployment returned: 1'
Deployment returned: 1
+ [[ 1 == 0 ]]

Changed in maas:
status: Fix Committed → New
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Larry,

Can you confirm when this error was most recently seen? (with appropriate logs)

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Larry,

Also, without MAAS logs we can't be sure what's going on!

Thanks

Revision history for this message
Andres Rodriguez (andreserl) wrote :

The times are completely different:

2015-05-18 02:39:34 [-] 127.0.0.1 - - [18/May/2015:02:39:34 +0000] "POST /MAAS/api/1.0/nodes/node-9db5ca7e-c4cd-11e3-824b-00163efc5068/?op=start HTTP/1.1" 500 332 "-" "Go 1.1 package http"
2015-05-18 02:39:36 [HTTPChannel,773,127.0.0.1] 500 Error - /MAAS/api/1.0/nodes/node-9db5ca7e-c4cd-11e3-824b-00163efc5068/
        Traceback (most recent call last):
          File "/usr/lib/python2.7/dist-packages/twisted/web/wsgi.py", line 315, in run
            appIterator = self.application(self.environ, self.startResponse)
          File "/usr/lib/python2.7/dist-packages/django/core/handlers/wsgi.py", line 206, in __call__
            response = self.get_response(request)
          File "/usr/lib/python2.7/dist-packages/maasserver/utils/views.py", line 213, in get_response
            response = get_response(request)
          File "/usr/lib/python2.7/dist-packages/maasserver/utils/views.py", line 202, in get_response
            request, get_resolver(None), sys.exc_info())
        --- <exception caught here> ---
          File "/usr/lib/python2.7/dist-packages/maasserver/utils/views.py", line 187, in get_response
            return django_get_response(request)
          File "/usr/lib/python2.7/dist-packages/maasserver/utils/orm.py", line 331, in __exit__
            self.fire()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/twisted.py", line 154, in wrapper
            return func(*args, **kwargs)
          File "/usr/lib/python2.7/dist-packages/maasserver/utils/async.py", line 180, in fire
            self._fire_in_reactor(hook).wait()
          File "/usr/lib/python2.7/dist-packages/crochet/_eventloop.py", line 219, in wait
            result.raiseException()
          File "<string>", line 2, in raiseException

        provisioningserver.rpc.exceptions.PowerActionAlreadyInProgress: Unable to change power state to 'on' for node hayward-29: another action is already in progress for that node.

Revision history for this message
Larry Michel (lmic) wrote :

Here's one that matches.

2015-05-20 20:01:19,219 [INFO] oil_ci.juju.client: Boostrapping new environment
Bootstrapping environment "ci-oil-slave7"
Starting new instance for initial state server
Launching instance
WARNING no architecture was specified, acquiring an arbitrary node
WARNING no architecture was specified, acquiring an arbitrary node
WARNING no architecture was specified, acquiring an arbitrary node
WARNING no architecture was specified, acquiring an arbitrary node
Bootstrap failed, destroying environment
ERROR failed to bootstrap environment: cannot start bootstrap instance: gomaasapi: got error back from server: 500 INTERNAL SERVER ERROR (<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Error: Internal server error</title>
  </head>
  <body>
    <h2>
      Internal server error.
    </h2>
  </body>
</html>
)
2015-05-20 20:01:45,706 [ERROR] oil_ci.juju.client: Calling "juju bootstrap" failed!

From regiond.log:

2015-05-20 20:01:40 [HTTPChannel,750,127.0.0.1] 500 Error - /MAAS/api/1.0/nodes/node-9df8a42a-c4cd-11e3-824b-00163efc5068/
        Traceback (most recent call last):
          File "/usr/lib/python2.7/dist-packages/twisted/web/wsgi.py", line 315, in run
            appIterator = self.application(self.environ, self.startResponse)
          File "/usr/lib/python2.7/dist-packages/django/core/handlers/wsgi.py", line 206, in __call__
            response = self.get_response(request)
          File "/usr/lib/python2.7/dist-packages/maasserver/utils/views.py", line 213, in get_response
            response = get_response(request)
          File "/usr/lib/python2.7/dist-packages/maasserver/utils/views.py", line 202, in get_response
            request, get_resolver(None), sys.exc_info())
        --- <exception caught here> ---
          File "/usr/lib/python2.7/dist-packages/maasserver/utils/views.py", line 187, in get_response
            return django_get_response(request)
          File "/usr/lib/python2.7/dist-packages/maasserver/utils/orm.py", line 331, in __exit__
            self.fire()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/twisted.py", line 154, in wrapper
            return func(*args, **kwargs)
          File "/usr/lib/python2.7/dist-packages/maasserver/utils/async.py", line 180, in fire
            self._fire_in_reactor(hook).wait()
          File "/usr/lib/python2.7/dist-packages/crochet/_eventloop.py", line 219, in wait
            result.raiseException()
          File "<string>", line 2, in raiseException

        provisioningserver.rpc.exceptions.PowerActionAlreadyInProgress: Unable to change power state to 'on' for node hayward-43: another action is already in progress for that node.

Revision history for this message
Larry Michel (lmic) wrote :

Andres,

For the mismatch in the timestamps, I see that it's a difference of 2 minutes 11 seconds.

2015-05-18 02:41:47,112 [DEBUG] deployer.import: Adding units... VS
-----------------------------------------------------------------------------------
2015-05-18 02:39:36 [HTTPChannel,773,127.0.0.1] 500 Error - /MAAS/api/1.0/nodes/node-9db5ca7e-c4cd-11e3-824b-00163efc5068/
...
-----------------------------------------------------------------------------------

For a more recent build where this happens on a non-bootstrap node, the time difference is 1 minute 16 seconds:

2015-05-20 13:10:07,620 [DEBUG] deployer.import: Adding units... VS
-----------------------------------------------------------------------------------
2015-05-20 13:08:51 [HTTPChannel,2243,127.0.0.1] 500 Error - /MAAS/api/1.0/nodes/node-9e9e2eb8-c4cd-11e3-8102-00163efc5068/
        Traceback (most recent call last):
          File "/usr/lib/python2.7/dist-packages/twisted/web/wsgi.py", line 315, in run
            appIterator = self.application(self.environ, self.startResponse)
          File "/usr/lib/python2.7/dist-packages/django/core/handlers/wsgi.py", line 206, in __call__
            response = self.get_response(request)
          File "/usr/lib/python2.7/dist-packages/maasserver/utils/views.py", line 213, in get_response
            response = get_response(request)
          File "/usr/lib/python2.7/dist-packages/maasserver/utils/views.py", line 202, in get_response
            request, get_resolver(None), sys.exc_info())
        --- <exception caught here> ---
          File "/usr/lib/python2.7/dist-packages/maasserver/utils/views.py", line 187, in get_response
            return django_get_response(request)
          File "/usr/lib/python2.7/dist-packages/maasserver/utils/orm.py", line 331, in __exit__
            self.fire()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/twisted.py", line 154, in wrapper
            return func(*args, **kwargs)
          File "/usr/lib/python2.7/dist-packages/maasserver/utils/async.py", line 180, in fire
            self._fire_in_reactor(hook).wait()
          File "/usr/lib/python2.7/dist-packages/crochet/_eventloop.py", line 219, in wait
            result.raiseException()
          File "<string>", line 2, in raiseException

        provisioningserver.rpc.exceptions.PowerActionAlreadyInProgress: Unable to change power state to 'on' for node hayward-28: another action is already in progress for that node.
-----------------------------------------------------------------------------------

Perhaps the "Adding units..." line getting logged to console.txt by the deployer is somewhat delayed from the time the deployer contacts MAAS to deploy the nodes?

If I can find the build that corresponds to the data in the description, I'll check the logs to see whether there's a comparable difference between the "Adding units..." timestamp and the exception timestamp in the MAAS log.

Revision history for this message
Blake Rouse (blake-rouse) wrote :

Okay, I can easily repeat this issue without juju. I have now done it twice.

Take a node that has failed deployment and is powered off, and perform the "release" action. It will go immediately to "Ready"; then, if you immediately do "deploy", it will fail.

Node failed to be deployed, because of the following error: Unable to change power state to 'on' for node embarrassed-iron: another action is already in progress for that node.

The issue is that the release action sends a power-off to the node even though the node is already off. It should not perform this action when the node is off. That power-off, followed by the quick deploy action, is what causes this issue.
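
A hedged sketch of that second fix (illustrative names, not the actual MAAS release path): only send the power-off when the node is not already off, so no power action is left in flight for an immediate deploy's power-on to collide with.

# Illustrative sketch under assumed names -- not the real MAAS code.

def release_node(node, send_power_off):
    """Release a node, issuing a power-off only if it is still on."""
    if node.power_state == "off":
        # The node is already off: sending a power-off here would leave a
        # power action in progress for a quick follow-up deploy to hit.
        node.status = "Ready"
        return
    node.status = "Releasing"
    send_power_off(node.system_id)  # node becomes Ready once 'off' is confirmed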

Changed in maas:
status: New → In Progress
assignee: nobody → Blake Rouse (blake-rouse)
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released