race condition when changing node states

Bug #1259910 reported by aeva black
Affects: Ironic
Status: Fix Released
Importance: High
Assigned to: Max Lobur

Bug Description

It is possible for Ironic's API service to receive two requests to change the same node's state at the same time. If both pass the initial check in the API layer, e.g.:

  if rpc_node.target_power_state is not None:
    raise ...
  pecan.request.rpcapi.change_node_power_state(...)
  return ...

then both API requests will return status 202 to the client, and both will send an RPC request to the ConductorManager, but only one will successfully acquire a lock on the resource and run. If this happens, the other request will fail in such a way that the user is not informed.
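The check-then-act window described above can be reproduced with a toy sketch (plain Python threads stand in for the two API requests; none of these names are Ironic's):

```python
import threading

# Toy stand-in for the node object; not Ironic code.
class Node:
    def __init__(self):
        self.target_power_state = None

node = Node()
barrier = threading.Barrier(2)  # both "requests" finish the check before either acts
accepted = []                   # requests that got a 202

def handle_request(name):
    passed = node.target_power_state is None  # the API-layer check
    barrier.wait()                            # the other request checks too
    if passed:                                # both saw None, so both proceed
        node.target_power_state = 'power on'
        accepted.append(name)

threads = [threading.Thread(target=handle_request, args=(n,))
           for n in ('req1', 'req2')]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(accepted))  # 2: both requests passed the guard and returned 202
```

Only one of the two can actually win the conductor lock later, which is exactly the silent failure described above.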

aeva black (tenbrae)
description: updated
Changed in ironic:
status: New → Triaged
importance: Undecided → Medium
status: Triaged → Confirmed
Revision history for this message
Max Lobur (max-lobur) wrote :

So far I see two ways to fix it:
1. We could have a separate request to acquire the lock. That way we can prevent the race, and the client will know whether its request will actually be started on the conductor. But I don't like distributed locking, e.g. one call to take the lock, then a pause (AMQP lags?), then a second call to perform the job and release the lock.
2. The second way is more reasonable to me: make change_node_power_state synchronized, but since it's long-running, do the main conductor job in a separate thread. In other words, we acquire the lock, start a thread for the job, and return to the client either an indication that the lock was acquired and the job has started, or an exception. But then we would still need to invent a way to safely release the lock at the end of the job, because it will already be outside the "with" block (we won't be able to use a context manager at all).

Revision history for this message
Lucas Alvares Gomes (lucasagomes) wrote :

Right, I want to propose another way to solve it, in a more RESTful way.

By triggering the state change we would be creating a new "state change task", and only one "state change task" is allowed to run at a time per node.

So a client that wants to change the power state of node A would do:

POST {'target': 'power on'} /v1/nodes/<node A uuid>/states/power
Return Code: 202 (Accepted)
Location Header: /v1/nodes/<node A uuid>/states/power/<task uuid>

And then, to track the status or cancel the task, the client could GET/DELETE the task's unique URI: /v1/nodes/<uuid>/states/power/<task uuid>

How it's going to solve the race condition:

Let's say two requests come in at exactly the same time. The method that actually creates the new task resource is synchronous: it uses an RPC call that takes an exclusive lock and then registers the task uuid in the DB. Since it's synchronous, one of them will go through and the other will fail. The one that failed will then return: HTTP 409 (Conflict) "Node A is already being provisioned by task <task uuid>."

ps*: Doing it this way will need a small change on the client as well (to use POST instead of PUT to trigger the state change)

Thoughts?

Revision history for this message
Max Lobur (max-lobur) wrote :

Looks reasonable. I'd like to clarify the following:

1. "only one "state change task" is allowed to run at the same time per node" Isn't it something that "shared=False" currently does? And what do you think about other tasks, f.e. do_node_deploy discussed recently. Do we need to similar singletone task for this one?

2. You said the method that creates the new task resource is synchronous. Is that something that already exists, or are you going to implement it in the RPC interface?

3. If it's synchronous, then I assume it won't do the actual job, just register a task, get the lock, and that's all. Am I right? If so, how are we going to start the job? Will it be some conductor periodic task that picks up all tasks registered in the DB and starts them, or will it be another (async) RPC call to start the task?

Revision history for this message
Lucas Alvares Gomes (lucasagomes) wrote :

1. "only one "state change task" is allowed to run at the same time per node" Isn't it something that "shared=False" currently does? And what do you think about other tasks, f.e. do_node_deploy discussed recently. Do we need to similar singletone task for this one?

R: It's more about the API perspective. Yes, internally "shared=False" does it, but it's not clear to a user looking at the API. In this case we have /v1/nodes/<uuid>/states/power working like a factory, so users might think that POSTing multiple times will create multiple tasks; but for things like changing the power state that's not possible, so we should prevent it and make it clear in the documentation. I see the provision state ( do_node_deploy() ) having the same behaviour.

2. You said the method that creates the new task resource is synchronous. Is that something that already exists, or are you going to implement it in the RPC interface?

R: It doesn't; it would be very simple... Throwing out one suggestion here:

For the power state, right now we have two related fields in our DB: "target_power_state" and "power_state". We could add another called "power_state_task", which would hold the UUID of the task performing the state change, and would also work like a lock/flag to prevent multiple tasks from being created simultaneously.

3. If it's synchronous, then I assume it won't do the actual job, just register a task, get the lock, and that's all. Am I right? If so, how are we going to start the job? Will it be some conductor periodic task that picks up all tasks registered in the DB and starts them, or will it be another (async) RPC call to start the task?

R: Right, the function that handles the POST in our API will call both methods, e.g.:

class PowerController:
    def post(...):
        try:
            task_uuid = rpcapi.get_power_lock()
        except Exception:
            raise Exception("Power state of Node A is already being changed")

        ... check if target is a valid state ...

        rpcapi.change_node_power_state(pecan.request.context, rpc_node, target)

Makes sense?

Revision history for this message
Max Lobur (max-lobur) wrote :

About terminology: when you say task, you don't mean a task acquired inside the conductor by

with task_manager.acquire(context, node_id, shared=False) as task:

right? It's a different thing, a REST API task, which is simply the lock in the DB. If so, then I assume we don't need to call the rpcapi to get it; we may reserve it directly from the API.

2. R: "power_state_task" for changing power state and then "deploy_task" for deployments? Maybe to have some generic "task" for all such purposes? Because there is also could be mixed race codition - the first client requests change_node_power_state and the second requests undeploy.

That way we would have two node locks in the DB, "reservation" and "task". So could we join them into one? E.g., currently we put CONF.host into the node's "reservation" field, but what if we put the task_uuid there instead:

def post(...):
    try:
        # reserve_node() generates a uuid, tries to put it into
        # node.reservation, and then returns it.
        reservation_key = reserve_node(node_id)
    except Exception:
        raise Exception("Node A already reserved for another task")

    ... check if target is a valid state ...

    # Pass the reservation key on to the conductor.
    rpcapi.change_node_power_state(pecan.request.context, rpc_node, target, reservation_key)

Then in conductor:

    with task_manager.acquire(context, node_id, reservation_key=reservation_key, shared=False) as task:
        # The task_manager opens the node using the received reservation_key
        # (acquire compares the node.reservation uuid with the given key),
        # does the job, and unlocks the node after it's done.

reservation_key will be an optional parameter of task_manager.acquire; if not supplied, it will be generated on the fly.

Revision history for this message
Lucas Alvares Gomes (lucasagomes) wrote :

Terminology: yes, sorry for the confusion. By task I don't mean the task acquired inside the conductor; what I mean is a more generic thing, "the work that needs to be done in order to power on the node", and the lock/flag is part of that process.

Yeah, it's a good idea to simplify the locks. As you mentioned, the reservation field right now has the hostname of the conductor holding the lock on the node. IMHO that's good information to have, but if we change it to a UUID we will lose that info.

Revision history for this message
Max Lobur (max-lobur) wrote :

Ok, I see. Then let's use two fields :) I'll try to make a patch considering all the points mentioned here.

Changed in ironic:
assignee: nobody → Max Lobur (max-lobur)
Revision history for this message
Max Lobur (max-lobur) wrote :

Currently the Tasks approach seems a little heavy to me; we would need to maintain two locks in the DB all the time. All the places where we want to break the "reservation" lock would have to be updated to break the new "task" lock too.

The alternative solution is a pre-reservation scheme. Single-call flow example:
1. CLI_1 sends a request to the Ironic API.
2. The Ironic API generates a new uuid and puts it into the "pre_reservation_key" field in the DB, provided both "reservation" and "pre_reservation_key" are empty. Otherwise NodePreLocked is raised.
3. The Ironic API makes an async RPC call and passes the generated uuid along with the other params. This will be the key to open the pre-reservation.
4. The conductor receives the RPC request and calls .acquire(context, node_id, pre_reservation_key, shared=False) as task.
5. acquire compares the given "pre_reservation_key" value with the DB; if they match, it a) sets the "reservation" DB field to the hostname and b) removes the "pre_reservation_key" value from the DB. If they don't match, an exception is thrown.
6. When the task is done, "reservation" is cleared.

Double-call flow example:
1. Points 1 and 2 from the previous flow have happened.
2. CLI_2 sends a request to the Ironic API.
3. The API generates a "pre_reservation_key" uuid and tries to put it into the DB, but since the field is not empty, the NodePreLocked exception is sent back to the client.

I think "pre_reservation_key" should be uuid instead of simple true/false, to surely know that reservation was performed by the RPC which have been made during the particular API call. Though currently I don't see why the true/false won't work (all subsequent pre-reservations will fail immediately and the concurrent RPC won't happen), I think we should stick with uuid key.

Any thoughts?

Revision history for this message
Lucas Alvares Gomes (lucasagomes) wrote :

Hi Max,

Indeed, it's a lighter approach. I like it. Maybe we should go and try to make a proof-of-concept patch for it.

...

Similar to the approach you presented, I was thinking: what if we just used the target_power_state field as our flag? Similar to your pre_reservation_key.

Instead of setting target_power_state in an async way, we could make the API do two RPC calls:

First, a sync call that would check whether target_power_state is already set and do some validations (validate the driver_info, whether the requested state is valid, etc...). If everything is good, it would then set target_power_state to the state the client requested.

Second, it would then do an async call to actually trigger the power on/off of the node.

E.g:

1 - CLI sends a request to change the power state of the node.
2 - API makes a sync call that sets target_power_state.
3 - API makes an async call to change the power state.

Double call flow example:
1 - Points 1 and 2 from the previous flow have happened.
2 - API tries to make the sync call to set target_power_state, but since target_power_state is already set (or being set, in which case the node would be locked), a NodeLocked exception is raised.

*I will try to make a proof-of-concept patch for the approach using target_power_state as a flag.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/65327

Changed in ironic:
assignee: Max Lobur (max-lobur) → Lucas Alvares Gomes (lucasagomes)
status: Confirmed → In Progress
Revision history for this message
Max Lobur (max-lobur) wrote :

Based on the latest discussion in patch set 4 of https://review.openstack.org/65327:
This is an example of how we can take care of acquiring/releasing the lock manually in different threads: http://stackoverflow.com/a/6815029 (releasing the lock in a thread-end callback). This releases the lock whether the thread succeeds or throws an exception.
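A sketch of that callback pattern (illustrative names, with concurrent.futures standing in for whatever threading primitive the conductor ends up using):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

lock = threading.Lock()
executor = ThreadPoolExecutor(max_workers=1)

def change_power_state():
    pass  # long-running conductor job; may raise

def release_lock(future):
    # Runs when the job finishes, whether it succeeded or raised;
    # future.exception() could be logged here on failure.
    lock.release()

if lock.acquire(blocking=False):  # sync part: take the lock or bail out
    future = executor.submit(change_power_state)
    future.add_done_callback(release_lock)
else:
    raise RuntimeError('node is locked')

executor.shutdown(wait=True)
assert not lock.locked()  # the callback released the lock
```

Note this still does nothing for a job that hangs forever: the callback never fires and the lock is never released.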

Lucas mentioned there's also a hanging case. It seems our current code doesn't handle hangs; we will need to research how to extend the callback approach to cover this.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/66368

Changed in ironic:
assignee: Lucas Alvares Gomes (lucasagomes) → Max Lobur (max-lobur)
Revision history for this message
Lucas Alvares Gomes (lucasagomes) wrote :

Right, this bug can be a bit annoying. I was testing the deployment, and it's the second time I got a 202 while the request actually failed, because the node was already locked by another process, a periodic task.

So I think we should try to fix this problem ASAP. Acquiring the lock in a sync way and returning an error if the node is already locked is much better feedback than failing under the hood.

LOGS:

2014-01-17 11:33:01.160 3729 DEBUG ironic.conductor.manager [req-ce1842ce-a00c-4144-88ca-cd728101ef28 admin admin] RPC do_node_tear_down called for node 053342aa-0113-4f5f-a022-4eab0ffbc0bf. do_node_tear_down /opt/stack/ironic/ironic/conductor/manager.py:304
2014-01-17 11:33:01.163 3729 DEBUG ironic.openstack.common.lockutils [-] Got semaphore "node_resource" lock /opt/stack/ironic/ironic/openstack/common/lockutils.py:170
2014-01-17 11:33:01.163 3729 DEBUG ironic.openstack.common.lockutils [-] Got semaphore / lock "release" inner /opt/stack/ironic/ironic/openstack/common/lockutils.py:250
2014-01-17 11:33:01.164 3729 DEBUG ironic.openstack.common.lockutils [-] Semaphore / lock released "release" inner /opt/stack/ironic/ironic/openstack/common/lockutils.py:254
2014-01-17 11:33:01.176 3729 ERROR ironic.openstack.common.rpc.amqp [req-ce1842ce-a00c-4144-88ca-cd728101ef28 admin admin] Exception during message handling
2014-01-17 11:33:01.176 3729 TRACE ironic.openstack.common.rpc.amqp Traceback (most recent call last):
2014-01-17 11:33:01.176 3729 TRACE ironic.openstack.common.rpc.amqp File "/opt/stack/ironic/ironic/openstack/common/rpc/amqp.py", line 434, in _process_data
2014-01-17 11:33:01.176 3729 TRACE ironic.openstack.common.rpc.amqp **args)
2014-01-17 11:33:01.176 3729 TRACE ironic.openstack.common.rpc.amqp File "/opt/stack/ironic/ironic/openstack/common/rpc/dispatcher.py", line 172, in dispatch
2014-01-17 11:33:01.176 3729 TRACE ironic.openstack.common.rpc.amqp result = getattr(proxyobj, method)(ctxt, **kwargs)
2014-01-17 11:33:01.176 3729 TRACE ironic.openstack.common.rpc.amqp File "/opt/stack/ironic/ironic/conductor/manager.py", line 306, in do_node_tear_down
2014-01-17 11:33:01.176 3729 TRACE ironic.openstack.common.rpc.amqp with task_manager.acquire(context, node_id, shared=False) as task:
2014-01-17 11:33:01.176 3729 TRACE ironic.openstack.common.rpc.amqp File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
2014-01-17 11:33:01.176 3729 TRACE ironic.openstack.common.rpc.amqp return self.gen.next()
2014-01-17 11:33:01.176 3729 TRACE ironic.openstack.common.rpc.amqp File "/opt/stack/ironic/ironic/conductor/task_manager.py", line 123, in acquire
2014-01-17 11:33:01.176 3729 TRACE ironic.openstack.common.rpc.amqp t.dbapi.reserve_nodes(CONF.host, node_ids)
2014-01-17 11:33:01.176 3729 TRACE ironic.openstack.common.rpc.amqp File "/opt/stack/ironic/ironic/objects/__init__.py", line 28, in wrapper
2014-01-17 11:33:01.176 3729 TRACE ironic.openstack.common.rpc.amqp result = fn(*args, **kwargs)
2014-01-17 11:33:01.176 3729 TRACE ironic.openstack.common.rpc.amqp File "/opt/stack/ironic/ironic/db/sqlalchemy/api.py", line 254, in reserve_nodes
2014-01-17 11:33:01.176 3729 TRACE i...


Revision history for this message
Max Lobur (max-lobur) wrote :

Hi Lucas,

How about a temporary quick fix until we settle on a final approach? Just check the reservation in the API:

ironic/api/controllers/v1/node.py

def power(self, node_uuid, target):
    rpc_node = objects.Node.get_by_uuid(pecan.request.context, node_uuid)
    if rpc_node.reservation:
        msg = "Node %s is locked by another process" % node_uuid
        raise wsme.exc.ClientSideError(msg, status_code=409)  # Conflict
    .....

This requires changing only one file, so we should be able to quickly pass all the reviews and merge it.

aeva black (tenbrae)
Changed in ironic:
milestone: none → icehouse-3
importance: Medium → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/69135

Revision history for this message
aeva black (tenbrae) wrote :

Robert and I just hashed through this for a couple of hours in an etherpad:
 https://etherpad.openstack.org/p/action-started

tl;dr - he's right that there is a DoS potential in spawning an uncontrolled number of threads with futures. But the other proposed solutions also have significant cons, so we haven't come up with a clear solution yet.
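One way to bound that, sketched here with invented names: a semaphore-guarded pool that rejects new work once it is full, instead of spawning an uncontrolled number of threads.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class BoundedExecutor:
    """Reject new submissions once `limit` tasks are in flight."""

    def __init__(self, limit):
        self._sem = threading.BoundedSemaphore(limit)
        self._pool = ThreadPoolExecutor(max_workers=limit)

    def submit(self, fn, *args):
        # Fail fast instead of queueing unbounded work.
        if not self._sem.acquire(blocking=False):
            raise RuntimeError('background task pool is full')
        future = self._pool.submit(fn, *args)
        future.add_done_callback(lambda f: self._sem.release())
        return future

    def shutdown(self):
        self._pool.shutdown(wait=True)

ex = BoundedExecutor(limit=2)
gate = threading.Event()

ex.submit(gate.wait)      # two jobs occupy the pool
ex.submit(gate.wait)
try:
    ex.submit(gate.wait)  # third submission is rejected immediately
except RuntimeError as e:
    print(e)              # background task pool is full

gate.set()                # let the running jobs finish
ex.shutdown()
```

The API can turn that rejection into a 409/503 to the client rather than failing silently later.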

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/69135
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=c4f2f26edf5cd1f5880f86853fcc611281605e63
Submitter: Jenkins
Branch: master

commit c4f2f26edf5cd1f5880f86853fcc611281605e63
Author: Max Lobur <email address hidden>
Date: Wed Jan 22 10:13:39 2014 -0500

    Fix race condition when changing node states

    To fix the race condition we're adding a mechanism for background task
    execution in the conductor. The conductor will get a synchronized AMQP
    call, reserve the lock, start the background task and return an empty
    response to the API. In case the lock cannot be acquired or the
    background task pool is full, an exception is sent back to the API and
    the task is not started.

    Also the patch adds the ability to control resource locks manually.
    This feature is used to release the lock at the end of a background
    task.

    Change-Id: I4095de2d82058ea5e052531698e67a0947424435
    Closes-Bug: #1259910

Changed in ironic:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in ironic:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in ironic:
milestone: icehouse-3 → 2014.1