Instances (bare metal) queue for 30-60 seconds when managing a large amount of Ironic nodes

Bug #1864122 reported by Jason Anderson
Affects:      OpenStack Compute (nova)
Status:       Fix Released
Importance:   Medium
Assigned to:  Jason Anderson
Milestone:    (none)

Bug Description

Description
===========
We have two deployments, one with ~150 bare metal nodes, and another with ~300. These are each managed by one nova-compute process running the Ironic driver. After upgrading from the Ocata release, we noticed that instance launches would be stuck in the spawning state for a long time, in some cases from 30 minutes up to an hour.

After investigation, the root cause appeared to be contention between the update_resources periodic task and the instance claim step. A single "compute_resources" semaphore is used to control every access within the resource_tracker. In our case, the update_resources job, which runs every minute by default, was constantly queuing up accesses to this semaphore, because each hypervisor is updated independently, in series. Each Ironic node was therefore processed while holding the semaphore for its entire update (which took about 2-5 seconds in practice). Multiply this by 150 and our update task was running constantly. Because an instance claim also needs this semaphore, instances were getting stuck in the "Build" state, after scheduling, for tens of minutes on average. There seemed to be some probabilistic effect here, which I hypothesize is related to the locking mechanism not using a "fair" (first-come, first-served) lock by default.
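
For illustration, the difference fair locking makes can be sketched with a small Python example (this is not Nova code; it assumes oslo.concurrency >= 3.29.0, where the fair argument to lockutils.synchronized was added, and the lock name, node count, and sleep times are invented for the example):

    import threading
    import time

    from oslo_concurrency import lockutils

    # With fair=True, waiters acquire the lock in first-come, first-served
    # order. With the default fair=False, any waiter may win when the lock is
    # released, so a single claim can repeatedly lose to the update loop.
    @lockutils.synchronized('compute_resources', fair=True)
    def update_node(node):
        time.sleep(0.1)  # stand-in for the 2-5 second per-node resource update

    @lockutils.synchronized('compute_resources', fair=True)
    def claim_instance(name):
        print('%s got the lock' % name)

    def periodic_update():
        for node in range(20):  # stand-in for ~150 Ironic nodes
            update_node(node)

    updater = threading.Thread(target=periodic_update)
    updater.start()
    time.sleep(0.05)              # the claim arrives while updates are queued
    claim_instance('instance-1')  # with fair=True it only waits for the update in flight
    updater.join()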

Steps to reproduce
==================
I suspect this is only visible on deployments of >100 Ironic nodes or so (and they have to be managed by a single nova-compute-ironic service). Due to the non-deterministic nature of the lock, the behavior is sporadic, but launching an instance is enough to observe it.

Expected result
===============
Instance proceeds to networking phase of creation after <60 seconds.

Actual result
=============
Instance stuck in BUILD state for 30-60 minutes before proceeding to networking phase.

Environment
===========
1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/
   Nova 20.0.1

2. Which hypervisor did you use?
   Ironic

3. Which storage type did you use?
   N/A

4. Which networking type did you use?
   Neutron/OVS

Logs & Configs
==============

Links
=====
First report, on openstack-discuss: http://lists.openstack.org/pipermail/openstack-discuss/2019-May/006192.html

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/709832

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/711528

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Jason Anderson (<email address hidden>) on branch: master
Review: https://review.opendev.org/709832
Reason: Abandoning in favor of Ia5e521e0f0c7a78b5ace5de9f343e84d872553f9

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/712674

melanie witt (melwitt)
tags: added: ironic resource-tracker
Revision history for this message
melanie witt (melwitt) wrote : Re: Instances (bare metal) queue for long time when managing a large amount of Ironic nodes

For completeness, adding a comment from Jason on his initial patch [1]:

"I tested this out and noticed two things:

1. The performance of this resource update task has gotten way better in Train (and potentially Stein). I had originally filed the bug in Rocky and hadn't tested the original bug condition thoroughly again versus my fix. I had just tested that the fix didn't regress any other behavior. So this isn't as big of a deal as it used to be! I only saw the instances getting stuck for maybe 30 seconds to a minute.

2. Fair locks did improve things and from my tests the instance claim was able to obtain the lock more or less as it arrived."

Given Jason's findings in newer versions, this bug report will be used for the fair locking improvement in his second patch:

https://review.opendev.org/712674

[1] https://review.opendev.org/#/c/709832/1

melanie witt (melwitt)
summary: - Instances (bare metal) queue for long time when managing a large amount
- of Ironic nodes
+ Instances (bare metal) queue for 30-60 seconds when managing a large
+ amount of Ironic nodes
Changed in nova:
importance: Undecided → Medium
status: New → In Progress
melanie witt (melwitt)
Changed in nova:
assignee: nobody → Jason Anderson (jasonandersonatuchicago)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/711528
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1ed9f9dac59c36cdda54a9852a1f93939b3ebbc3
Submitter: Zuul
Branch: master

commit 1ed9f9dac59c36cdda54a9852a1f93939b3ebbc3
Author: Jason Anderson <email address hidden>
Date: Thu Feb 27 10:37:34 2020 -0600

    Use fair locks in resource tracker

    When the resource tracker has to lock a compute host for updates or
    inspection, it uses a single semaphore. In most cases, this is fine, as
    a compute process is only tracking one hypervisor. However, in Ironic, it's
    possible for one compute process to track many hypervisors. In this
    case, wait queues for instance claims can get "stuck" briefly behind
    longer processing loops such as the update_resources periodic job. The
    reason this is possible is because the oslo.lockutils synchronized
    library does not use fair locks by default. When a lock is released, one
    of the threads waiting for the lock is randomly allowed to take the lock
    next. A fair lock ensures that the thread that next requested the lock
    will be allowed to take it.

    This should ensure that instance claim requests do not have a chance of
    losing the lock contest, which should ensure that instance build
    requests do not queue unnecessarily behind long-running tasks.

    This includes bumping the oslo.concurrency dependency; fair locks were
    added in 3.29.0 (I37577becff4978bf643c65fa9bc2d78d342ea35a).

    Change-Id: Ia5e521e0f0c7a78b5ace5de9f343e84d872553f9
    Related-Bug: #1864122
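
To make the shape of that change concrete, here is a simplified sketch of the pattern the commit message describes; the actual diff is in the linked review, and the class and method signatures below are approximations rather than the exact Nova code:

    from oslo_concurrency import lockutils

    # Assumes oslo.concurrency >= 3.29.0, where the fair argument was added.
    COMPUTE_RESOURCE_SEMAPHORE = 'compute_resources'

    class ResourceTracker(object):
        # Previously the decorator was used without fair=True, so a waiting
        # instance claim could repeatedly lose the wakeup race to the
        # per-node update loop.
        @lockutils.synchronized(COMPUTE_RESOURCE_SEMAPHORE, fair=True)
        def instance_claim(self, context, instance, nodename):
            """Claim resources for an instance; now served in arrival order."""

        @lockutils.synchronized(COMPUTE_RESOURCE_SEMAPHORE, fair=True)
        def _update_available_resource(self, context, nodename):
            """Per-node periodic update; holds the lock for the whole update."""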

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/712674
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=eefe3ec2eec0838070e8a7c678f9978c64353154
Submitter: Zuul
Branch: master

commit eefe3ec2eec0838070e8a7c678f9978c64353154
Author: Balazs Gibizer <email address hidden>
Date: Thu Mar 12 13:03:10 2020 +0100

    Ensures that COMPUTE_RESOURCE_SEMAPHORE usage is fair

    This patch poisons the synchronized decorator in the unit test
    to prevent adding new synchronized methods without the fair=True
    flag.

    Change-Id: I739025dacbcaa0f7adbe612c064f979bf6390880
    Related-Bug: #1864122
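
The "poisoning" technique can be sketched generically as follows; this is not the actual Nova test code, only an illustration, and it assumes the wrapper is installed before the module under test is imported (decorators run at import time):

    from oslo_concurrency import lockutils

    COMPUTE_RESOURCE_SEMAPHORE = 'compute_resources'
    _real_synchronized = lockutils.synchronized

    def poisoned_synchronized(name, *args, **kwargs):
        # Fail fast if a synchronized method is added on the compute
        # resources semaphore without requesting a fair lock.
        if name == COMPUTE_RESOURCE_SEMAPHORE and not kwargs.get('fair', False):
            raise AssertionError('%s must be locked with fair=True' % name)
        return _real_synchronized(name, *args, **kwargs)

    # Unfair usage is rejected at decoration time...
    try:
        @poisoned_synchronized(COMPUTE_RESOURCE_SEMAPHORE)
        def _bad_claim():
            pass
    except AssertionError as exc:
        print('caught: %s' % exc)

    # ...while fair usage passes through to the real decorator.
    @poisoned_synchronized(COMPUTE_RESOURCE_SEMAPHORE, fair=True)
    def _good_claim():
        pass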

melanie witt (melwitt)
Changed in nova:
status: In Progress → Fix Released
Revision history for this message
melanie witt (melwitt) wrote :

I had a brain lapse thinking this could be backported. It cannot be, because it requires a bump of a lower constraint. I'm abandoning the proposed backport patches accordingly.

no longer affects: nova/stein
no longer affects: nova/train