Several concurrent scheduling requests for CPU pinning may fail due to racy host_state handling

Bug #1438238 reported by Nikola Đipanov on 2015-03-30
This bug affects 2 people
Affects: OpenStack Compute (nova), Importance: Medium, Assigned to: Nikola Đipanov
Affects: Kilo, Importance: Medium, Assigned to: Nikola Đipanov

Bug Description

The issue happens when multiple scheduling attempts that request CPU pinning are done in parallel.

2015-03-25T14:18:00.222 controller-0 nova-scheduler err Exception during message handling: Cannot pin/unpin cpus [4] from the following pinned set [3, 4, 5, 6, 7, 8, 9]

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher incoming.message))

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher return self._do_dispatch(endpoint, method, ctxt, args)

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher result = getattr(endpoint, method)(ctxt, **new_args)

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/server.py", line 139, in inner

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher return func(*args, **kwargs)

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib64/python2.7/site-packages/nova/scheduler/manager.py", line 86, in select_destinations

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib64/python2.7/site-packages/nova/scheduler/filter_scheduler.py", line 80, in select_destinations

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib64/python2.7/site-packages/nova/scheduler/filter_scheduler.py", line 241, in _schedule

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib64/python2.7/site-packages/nova/scheduler/host_manager.py", line 266, in consume_from_instance

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib64/python2.7/site-packages/nova/virt/hardware.py", line 1472, in get_host_numa_usage_from_instance

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib64/python2.7/site-packages/nova/virt/hardware.py", line 1344, in numa_usage_from_instances

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib64/python2.7/site-packages/nova/objects/numa.py", line 91, in pin_cpus

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher CPUPinningInvalid: Cannot pin/unpin cpus [4] from the following pinned set [3, 4, 5, 6, 7, 8, 9]

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher
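The exception originates in a pinning validation in nova/objects/numa.py that rejects pinning a CPU that is already in the cell's pinned set. A minimal sketch of that check (a simplified stand-in, not the actual nova code):

```python
class CPUPinningInvalid(Exception):
    pass


class NUMACell(object):
    """Simplified stand-in for nova's per-cell NUMA object."""

    def __init__(self, cpuset):
        self.cpuset = set(cpuset)
        self.pinned_cpus = set()

    def pin_cpus(self, cpus):
        cpus = set(cpus)
        # Pinning an already-pinned CPU is invalid; this is the check
        # that fires in the traceback above when two scheduling
        # attempts consume the same CPU.
        overlap = cpus & self.pinned_cpus
        if overlap:
            raise CPUPinningInvalid(
                "Cannot pin/unpin cpus %s from the following pinned set %s"
                % (sorted(overlap), sorted(self.pinned_cpus)))
        self.pinned_cpus |= cpus


cell = NUMACell(range(10))
cell.pin_cpus([3, 4, 5, 6, 7, 8, 9])   # first request pins its CPUs
try:
    cell.pin_cpus([4])                 # second request picked CPU 4 again
except CPUPinningInvalid as e:
    print(e)  # matches the message in the log above
```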

What is likely happening is:

* nova-scheduler is handling several RPC calls to select_destinations at the same time, in multiple greenthreads

* greenthread 1 runs the NUMATopologyFilter and selects a cpu on a particular compute node, updating host_state.instance_numa_topology

* greenthread 1 then blocks for some reason

* greenthread 2 runs the NUMATopologyFilter and selects the same cpu on the same compute node, updating host_state.instance_numa_topology. This also seems like an issue if a different cpu was selected, as it would be overwriting the instance_numa_topology selected by greenthread 1.

* greenthread 2 then blocks for some reason

* greenthread 1 gets scheduled and calls consume_from_instance, which consumes the numa resources based on what is in host_state.instance_numa_topology

* greenthread 1 completes the scheduling operation

* greenthread 2 gets scheduled and calls consume_from_instance, which consumes the numa resources based on what is in host_state.instance_numa_topology - since the resources were already consumed by greenthread 1, we get the exception above
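The interleaving described above can be made deterministic in a small sketch. Plain Python generators stand in for eventlet greenthreads, with the explicit yield marking the point where a greenthread blocks; all names here are illustrative, not nova code:

```python
# Shared per-host state, mimicking the scheduler's host_state that is
# shared among concurrent greenthreads.
host_state = {"instance_numa_topology": None, "pinned": set()}


def schedule(want_cpu):
    # Filter phase: pick a CPU and stash it on the shared host_state.
    host_state["instance_numa_topology"] = {"cpu": want_cpu}
    yield  # the greenthread blocks (e.g. on I/O); another one runs
    # Consume phase: consume whatever is in host_state *now*.
    cpu = host_state["instance_numa_topology"]["cpu"]
    if cpu in host_state["pinned"]:
        raise RuntimeError("Cannot pin cpu %d: already pinned" % cpu)
    host_state["pinned"].add(cpu)


g1, g2 = schedule(4), schedule(4)
next(g1)  # g1 runs the filter, then blocks
next(g2)  # g2 runs the filter against the same host, then blocks
for g in (g1, g2):
    try:
        next(g)  # resume: consume_from_instance
        print("scheduled ok")
    except StopIteration:
        print("scheduled ok")
    except RuntimeError as e:
        print(e)  # g2 hits the double-consume failure
```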

Sean Dague (sdague) wrote :

Can we get a reproduce test for it?

Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
Nikola Đipanov (ndipanov) wrote :

This exception, AFAICT, happens because the host state is kept in the
host_state_map attribute of the scheduler driver, which is shared among
concurrent greenthreads. Sadly, the reason for the switch can be any IO
that filters attempt to do (IMHO they should not do anything like DB
lookups at all, as it completely invalidates the premise of the current
design, but that is another story).

This raciness is true for other resources like CPU and RAM as well, in
the sense that host_state.free_ram_mb may have a different value at the
time the filter is run and at the time consume_from_instance is run for
any particular instance.

Solving this properly is non-trivial with the current design, it seems
to me. I think a lot of the code in the scheduler was written with the
premise that scheduling decisions should be atomic (and they would be,
if it weren't for the IO in the filters, which is bad). Making the
scheduler race-free by means of mutexes is not something we want to do
lightly, because at that point we might as well just use DB transactions
for consistency.

With all of the above, I think we should just ignore the exception so
as not to cause an unnecessary scheduling failure; the pinning is
re-calculated on the final compute host (and a re-schedule is triggered
upon failure to claim), so data consistency in the scheduler is really
only best effort.

Nikola Đipanov (ndipanov) wrote :

@Sean - as with any race bug involving eventlet, a reproducer is not that straightforward: it would require greenthreads handling specific requests in the scheduler to switch at exactly the right time. From simple code inspection, however, it is clear that the scheduler was designed to be "racy" - this is a design choice AFAICT.

Fix proposed to branch: master
Review: https://review.openstack.org/169245

Changed in nova:
assignee: nobody → Nikola Đipanov (ndipanov)
status: Confirmed → In Progress
Changed in nova:
milestone: none → kilo-rc1
tags: added: kilo-rc-potential
John Garbutt (johngarbutt) wrote :

Discussed: we don't want to block the release on this one, rather just merge it if we can; moving it to the potential tag only.

Changed in nova:
milestone: kilo-rc1 → none

Reviewed: https://review.openstack.org/169245
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d6b3156a6c89ddff9b149452df34c4b32c50b6c3
Submitter: Jenkins
Branch: master

commit d6b3156a6c89ddff9b149452df34c4b32c50b6c3
Author: Nikola Dipanov <email address hidden>
Date: Tue Apr 7 20:53:32 2015 +0100

    scheduler: re-calculate NUMA on consume_from_instance

    This patch narrows down the race window between the filter running and
    the consumption of resources from the instance after the host has been
    chosen.

    It does so by re-calculating the fitted NUMA topology just before consuming it
    from the chosen host. Thus we avoid any locking, but also make sure that
    the host_state is kept as up to date as possible for concurrent
    requests, as there is no opportunity for switching threads inside a
    consume_from_instance.

    Several things worth noting:
      * Scheduler being lock free (and thus racy) does not really affect
      resources other than PCI and NUMA topology this badly - this is due
      to the complexity of said resources. In order for scheduler decisions
      to not be based on basically guessing, in the case of those two we
      will likely need to introduce either locking or special heuristics.

      * There is a lot of repeated code between the 'consume_from_instance'
      method and the actual filters. This situation should really be fixed but
      is out of scope for this bug fix (which is about preventing valid
      requests failing because of races in the scheduler).

    Change-Id: If0c7ad20506c9dddf4dec1eb64c9d6dd4fb75633
    Closes-bug: #1438238
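The approach the patch takes can be sketched in miniature as follows. This is an illustrative model only (HostState, fit_instance and consume_from_instance here are simplified stand-ins for the real nova code); the key point is that the re-fit and the consumption happen back to back, with no IO and hence no greenthread switch between them:

```python
class HostState(object):
    """Toy model of a host's schedulable state: just a set of free CPUs."""

    def __init__(self, cpus):
        self.free_cpus = set(cpus)


def fit_instance(host_state, ncpus):
    # Hypothetical helper: re-fit the instance against what is free
    # *right now*, rather than trusting a pick made during the filter
    # run (which may be stale after a greenthread switch).
    if len(host_state.free_cpus) < ncpus:
        return None  # no longer fits; the caller can move on
    return set(sorted(host_state.free_cpus)[:ncpus])


def consume_from_instance(host_state, ncpus):
    # Re-calculate just before consuming.  There is no yield point
    # between the re-fit and the consumption, so the state seen here
    # cannot be invalidated by a concurrent request mid-way.
    fitted = fit_instance(host_state, ncpus)
    if fitted is not None:
        host_state.free_cpus -= fitted
    return fitted


host = HostState(range(4))
print(consume_from_instance(host, 3))  # first request gets 3 CPUs
print(consume_from_instance(host, 3))  # None: re-fit sees only 1 free CPU
```

Note this narrows the race window rather than eliminating it: two requests can still both pass the filter, but the second one now gets a clean "does not fit" instead of an exception, and the final claim on the compute host remains the authoritative check.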

Changed in nova:
status: In Progress → Fix Committed
tags: added: kilo-backport-potential
Nikola Đipanov (ndipanov) wrote :

It would be really great to have this plus https://bugs.launchpad.net/nova/+bug/1444021 in for the Kilo release, as without it one of the features implemented in Kilo (http://specs.openstack.org/openstack/nova-specs/specs/kilo/implemented/input-output-based-numa-scheduling.html) is completely broken.

Reviewed: https://review.openstack.org/175787
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=880a356e40d327c0af4ce94b5a08fe0cd6fcab5d
Submitter: Jenkins
Branch: stable/kilo

commit 880a356e40d327c0af4ce94b5a08fe0cd6fcab5d
Author: Nikola Dipanov <email address hidden>
Date: Tue Apr 7 20:53:32 2015 +0100

    scheduler: re-calculate NUMA on consume_from_instance

    This patch narrows down the race window between the filter running and
    the consumption of resources from the instance after the host has been
    chosen.

    It does so by re-calculating the fitted NUMA topology just before consuming it
    from the chosen host. Thus we avoid any locking, but also make sure that
    the host_state is kept as up to date as possible for concurrent
    requests, as there is no opportunity for switching threads inside a
    consume_from_instance.

    Several things worth noting:
      * Scheduler being lock free (and thus racy) does not really affect
      resources other than PCI and NUMA topology this badly - this is due
      to the complexity of said resources. In order for scheduler decisions
      to not be based on basically guessing, in the case of those two we
      will likely need to introduce either locking or special heuristics.

      * There is a lot of repeated code between the 'consume_from_instance'
      method and the actual filters. This situation should really be fixed but
      is out of scope for this bug fix (which is about preventing valid
      requests failing because of races in the scheduler).

    Change-Id: If0c7ad20506c9dddf4dec1eb64c9d6dd4fb75633
    Closes-bug: #1438238
    (cherry picked from commit d6b3156a6c89ddff9b149452df34c4b32c50b6c3)

Thierry Carrez (ttx) on 2015-04-23
tags: removed: kilo-backport-potential kilo-rc-potential

Reviewed: https://review.openstack.org/179284
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5228d4e418734164ffa5ccd91d2865d9cc659c00
Submitter: Jenkins
Branch: master

commit 906ab9d6522b3559b4ad36d40dec3af20397f223
Author: He Jie Xu <email address hidden>
Date: Thu Apr 16 07:09:34 2015 +0800

    Update rpc version aliases for kilo

    Update all of the rpc client API classes to include a version alias
    for the latest version implemented in Kilo. This alias is needed when
    doing rolling upgrades from Kilo to Liberty. With this in place, you can
    ensure all services only send messages that both Kilo and Liberty will
    understand.

    Closes-Bug: #1444745

    Conflicts:
     nova/conductor/rpcapi.py

    NOTE(alex_xu): The conflict is due to there are some logs already added
    into the master.

    Change-Id: I2952aec9aae747639aa519af55fb5fa25b8f3ab4
    (cherry picked from commit 78a8b5802ca148dcf37c5651f75f2126d261266e)

commit f191a2147a21c7e50926b288768a96900cf4c629
Author: Hans Lindgren <email address hidden>
Date: Fri Apr 24 13:10:39 2015 +0200

    Add security group calls missing from latest compute rpc api version bump

    The recent compute rpc api version bump missed out on the security group
    related calls that are part of the api.

    One possible reason is that both compute and security group client side
    rpc api:s share a single target, which is of little value and only cause
    mistakes like this.

    This change eliminates future problems like this by combining them into
    one to get a 1:1 relationship between client and server api:s.

    Change-Id: I9207592a87fab862c04d210450cbac47af6a3fd7
    Closes-Bug: #1448075
    (cherry picked from commit bebd00b117c68097203adc2e56e972d74254fc59)

commit a2872a9262985bd0ee2c6df4f7593947e0516406
Author: Dan Smith <email address hidden>
Date: Wed Apr 22 09:02:03 2015 -0700

    Fix migrate_flavor_data() to catch instances with no instance_extra rows

    The way the query was being performed previously, we would not see any
    instances that didn't have a row in instance_extra. This could happen if
    an instance hasn't been touched for several releases, or if the data
    set is old.

    The fix is a simple change to use outerjoin instead of join. This patch
    includes a test that ensures that instances with no instance_extra rows
    are included in the migration. If we query an instance without such a
    row, we create it before doing a save on the instance.

    Closes-Bug: #1447132
    Change-Id: I2620a8a4338f5c493350f26cdba3e41f3cb28de7
    (cherry picked from commit 92714accc49e85579f406de10ef8b3b510277037)

commit e3a7b83834d1ae2064094e9613df75e3b07d77cd
Author: OpenStack Proposal Bot <email address hidden>
Date: Thu Apr 23 02:18:41 2015 +0000

    Updated from global requirements

    Change-Id: I5d4acd36329fe2dccb5772fed3ec55b442597150

commit 8c9b5e620eef3233677b64cd234ed2551e6aa182
Author: Divya <email address hidden>
Date: Tue Apr 21 08:26:29 2015 +0200

    Control create/delete flavor api permissions using policy.json

    The permissions of ...

Thierry Carrez (ttx) on 2015-06-24
Changed in nova:
milestone: none → liberty-1
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2015-10-15
Changed in nova:
milestone: liberty-1 → 12.0.0
chao (chao-wang) wrote :

I would just like to know if there is any workaround that can help with this case once you hit it?
