placement server needs to retry allocations, server-side

Bug #1719933 reported by Chris Dent on 2017-09-27
This bug affects 1 person
Affects                    Status         Importance   Assigned to   Milestone
OpenStack Compute (nova)   Fix Released   Medium       Chris Dent
Queens                     Fix Released   Medium       Chris Dent

Bug Description

A long time ago a TODO was left in placement: https://github.com/openstack/nova/blob/faede889d3620f8ff0131a7a4c6b9c1bc844cd06/nova/objects/resource_provider.py#L1837-L1839

We need to implement that TODO; this is a note to self.

This is related to what may be a different bug: when heavily loaded with many single requests, the placement server is unexpectedly returning 409s about generation conflicts. Discussion of that led to remembering that this TODO needs to be fixed. Fixing the TODO needs to happen regardless of that problem.
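
For context, the 409s come from placement's optimistic concurrency scheme: each resource provider carries a generation, and an allocation write only succeeds if the generation has not moved since the writer read it. A minimal sketch of that check, using simplified, hypothetical table and exception names rather than the real placement schema:

    import sqlalchemy as sa

    class ConcurrentUpdateDetected(Exception):
        """Another writer changed the provider between our read and write."""

    # Hypothetical, simplified table; the real schema lives in placement.
    RP_TABLE = sa.table('resource_providers',
                        sa.column('id'), sa.column('generation'))

    def increment_generation(conn, rp_id, expected_generation):
        # Compare-and-swap on the generation column: the UPDATE matches
        # zero rows if a concurrent writer already bumped the generation.
        result = conn.execute(
            RP_TABLE.update()
            .where(RP_TABLE.c.id == rp_id)
            .where(RP_TABLE.c.generation == expected_generation)
            .values(generation=expected_generation + 1))
        if result.rowcount != 1:
            # A concurrent allocation write won the race; placement
            # turns this into an HTTP 409 for the client.
            raise ConcurrentUpdateDetected()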

Matt Riedemann (mriedem) wrote :

FYI, I've been trying to recreate this here to get the logs from a failure:

https://review.openstack.org/#/c/507918/

Jay Pipes (jaypipes) wrote :

Gonna take this one as I'm cleaning up the resource_provider.py module in a patch series right now...

Changed in nova:
assignee: nobody → Jay Pipes (jaypipes)
Changed in nova:
status: Triaged → Invalid

Jay Pipes (jaypipes) wrote :

I'll remove the no longer relevant comment in a patch shortly.

Fix proposed to branch: master
Review: https://review.openstack.org/509034

Changed in nova:
status: Invalid → In Progress
Chris Dent (cdent) wrote :

Isn't that retry decorator only retrying on DB API errors, notably deadlock? When a generation conflict happens, ConcurrentUpdateDetected is raised, which is not a DB API error.
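
To illustrate the distinction (a sketch, not the patch under review): oslo.db's retry decorator re-runs a function on database-level failures such as deadlocks, while the generation check is application-level and raises an exception the decorator does not treat as retryable:

    from oslo_db import api as oslo_db_api

    # Retried on oslo_db.exception.DBDeadlock raised by the DB layer...
    @oslo_db_api.wrap_db_retry(max_retries=5, retry_on_deadlock=True)
    def set_allocations(context, allocs):
        # ...but ConcurrentUpdateDetected comes from placement's own
        # generation comparison, is not a DB API error, and so
        # propagates on the first attempt instead of being retried.
        pass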

Jay Pipes (jaypipes) wrote :

I see, yes, Chris, that's a good point.

Change abandoned by Jay Pipes (<email address hidden>) on branch: master
Review: https://review.openstack.org/509034

Chris Dent (cdent) on 2017-11-28
Changed in nova:
status: In Progress → Triaged
Ed Leafe (ed-leafe) wrote :

We're not sure if this is a bug, or if it is, what the proper solution should be. We have a discussion about our use of generations to ensure integrity scheduled for the Dublin PTG, and this issue will be part of that.

Jay Pipes (jaypipes) on 2018-04-20
Changed in nova:
assignee: Jay Pipes (jaypipes) → nobody
Chris Dent (cdent) wrote :

It turns out that allocation write contention on the resource provider generation is a relatively significant issue when launching many VMs onto one or a very small number of compute nodes, as might happen in a clustered environment like the vmwareapi virt driver.

The retry handling at https://github.com/openstack/nova/blob/d687e7d29b37b3cdc9e1bc429dec3a01be298f80/nova/scheduler/client/report.py#L103-L123 is insufficient when something like 1200 VMs are being created, because there is always another VM being created concurrently for the same compute node.

Server-side retries will help because there will be less latency, but they won't fully fix it, as the architecture expects and assumes at least a bit of horizontal diversity in resource providers.
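
A simplified sketch of why client-side retries alone lose under this load, assuming a hypothetical put_allocations() that returns False on a 409: every retry has to re-read the provider generation over HTTP and then race all the other concurrent claimers again:

    import random
    import time

    def claim_with_retries(put_allocations, max_retries=3):
        # put_allocations() is a hypothetical claim call that returns
        # False when placement answers with a 409 generation conflict.
        for attempt in range(max_retries + 1):
            if put_allocations():
                return True
            # Back off briefly; the next attempt re-reads the provider
            # generation and races every other concurrent claimer again.
            time.sleep(random.uniform(0, 0.1) * (attempt + 1))
        # With ~1200 concurrent builds against one provider, some caller
        # always ends up here and the boot request fails.
        return False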

Fix proposed to branch: master
Review: https://review.openstack.org/586048

Changed in nova:
assignee: nobody → Chris Dent (cdent)
status: Triaged → In Progress
Changed in nova:
assignee: Chris Dent (cdent) → Digambar (digambarpatil15)
Changed in nova:
assignee: Digambar (digambarpatil15) → Chris Dent (cdent)
Chris Dent (cdent) wrote :

Diga, my apologies for getting to this before you. I had to rush to work on this because we encountered an issue internally at work that would be solved by fixing this bug, and it was critical, so I rushed something out at https://review.openstack.org/#/c/586048/

If you could review that, it would be great. I tried to find you on IRC to discuss it, but could not.

Chris Dent (cdent) on 2018-07-30
tags: added: queens-backport-potential
removed: note-to-self
Digambar (digambarpatil15) wrote :

Sure, Chris. Sorry, I was not feeling well, so I was on leave for the last two days.

Reviewed: https://review.openstack.org/586048
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=72e4c4c8d7fb146862b899337626485dad10f15b
Submitter: Zuul
Branch: master

commit 72e4c4c8d7fb146862b899337626485dad10f15b
Author: Chris Dent <email address hidden>
Date: Thu Jul 26 12:00:14 2018 +0100

    [placement] Retry allocation writes server side

    This change adds a fast retry loop around
    AllocationList._set_allocations if a resource provider generation
    conflict happens. It turns out that under high concurrency of allocation
    claims being made on the same resource provider, conflicts can be quite
    common and client-side retries are insufficient.

    Because conflicts on both the consumer generation and the resource
    provider generation raised the same exception, there was no way to
    distinguish between the two, so a child of ConcurrentUpdateDetected
    has been created as ResourceProviderConcurrentUpdateDetected. In the
    future this will allow us to send different error codes to the client
    as well, but that change is not done here.

    When the conflict is detected, all the resource providers in the
    AllocationList are reloaded and the list objects refreshed.

    Logging is provided to indicate:

    * at debug that a retry is going to happen
    * at warning that all the retries failed and the client is going to
      see the conflict

    The tests for this are a bit funky: some mocks are used to cause the
    conflicts, then the real actions run after a couple of iterations.

    Change-Id: Id614d609fc8f3ed2d2ff29a2b52143f53b3b1b9a
    Closes-Bug: #1719933
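
The shape of the retry loop the commit message describes is roughly the following; a minimal sketch under assumed names and retry count, not the actual nova code:

    import logging

    LOG = logging.getLogger(__name__)
    RP_CONFLICT_RETRY_COUNT = 10  # assumed value

    class ResourceProviderConcurrentUpdateDetected(Exception):
        """Local stand-in for the new child exception named above."""

    def replace_all_with_retries(alloc_list, refresh_provider):
        # refresh_provider(context, uuid) is a hypothetical helper that
        # reloads one provider so the next attempt sees its current
        # generation.
        for _ in range(RP_CONFLICT_RETRY_COUNT):
            try:
                alloc_list._set_allocations(alloc_list._context,
                                            alloc_list.objects)
                return
            except ResourceProviderConcurrentUpdateDetected:
                # Debug log that a retry is going to happen.
                LOG.debug('Retrying allocation write on resource '
                          'provider generation conflict.')
                for alloc in alloc_list.objects:
                    alloc.resource_provider = refresh_provider(
                        alloc_list._context, alloc.resource_provider.uuid)
        # All retries failed: warn, and let the client see the conflict.
        LOG.warning('Allocation write failed after %d retries due to '
                    'resource provider generation conflicts.',
                    RP_CONFLICT_RETRY_COUNT)
        raise ResourceProviderConcurrentUpdateDetected()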

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/588569
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=66a47b7623035cff38a12fdace9a325bbfdd9a14
Submitter: Zuul
Branch: stable/queens

commit 66a47b7623035cff38a12fdace9a325bbfdd9a14
Author: Chris Dent <email address hidden>
Date: Thu Jul 26 12:00:14 2018 +0100

    [placement] Retry allocation writes server side

    This change adds a fast retry loop around
    AllocationList._set_allocations if a resource provider generation
    conflict happens. It turns out that under high concurrency of allocation
    claims being made on the same resource provider, conflicts can be quite
    common and client-side retries are insufficient.

    Because conflicts on both the consumer generation and the resource
    provider generation raised the same exception, there was no way to
    distinguish between the two, so a child of ConcurrentUpdateDetected
    has been created as ResourceProviderConcurrentUpdateDetected. In the
    future this will allow us to send different error codes to the client
    as well, but that change is not done here.

    When the conflict is detected, all the resource providers in the
    AllocationList are reloaded and the list objects refreshed.

    Logging is provided to indicate:

    * at debug that a retry is going to happen
    * at warning that all the retries failed and the client is going to
      see the conflict

    The tests for this are a bit funky: some mocks are used to cause the
    conflicts, then the real actions run after a couple of iterations.

    This was backported from 72e4c4c8d7fb146862b899337626485dad10f15b with
    conflicts because exceptions, tests and object files were moved to
    new locations with Rocky and the AllocationList.create_all method was
    renamed to the more accurate replace_all. Prior to Rocky there were
    no Consumer objects and fewer helper methods in functional tests,
    so the test is adjusted accordingly.

    Change-Id: Id614d609fc8f3ed2d2ff29a2b52143f53b3b1b9a
    Closes-Bug: #1719933
    (cherry picked from commit 72e4c4c8d7fb146862b899337626485dad10f15b)

This issue was fixed in the openstack/nova 18.0.0.0rc1 release candidate.

This issue was fixed in the openstack/nova 17.0.6 release.

Change abandoned by Chris Dent (<email address hidden>) on branch: stable/pike
Review: https://review.openstack.org/590745

Change abandoned by Chris Dent (<email address hidden>) on branch: stable/ocata
Review: https://review.openstack.org/591042
