AllocationUpdateFailed_Remote: Failed to update allocations for consumer. Error: another process changed the consumer after the report client read the consumer state during the claim
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | High | Matt Riedemann |
Queens | New | Undecided | Unassigned |
Rocky | Fix Released | Undecided | Unassigned |
Bug Description
* Job: neutron-
* Failed test: tempest.
* Example of failure: http://
Details: (ServersNegativ
tags: | added: gate-failure |
Brian Haley (brian-haley) wrote : | #1 |
Changed in neutron: | |
status: | New → Incomplete |
Brian Haley (brian-haley) wrote : | #2 |
Also, the above search didn't show much, but this one shows 4 failures in the past few days:
Hongbin Lu (hongbin.lu) wrote : | #3 |
Matt Riedemann (mriedem) wrote : | #4 |
Looks like the failure is here?
The test is failing to unshelve the server because unshelve fails during scheduling due to that placement failure. I saw this the other day as well; not sure we have a bug for it yet.
Changed in nova: | |
status: | New → Triaged |
summary:
- iptables_hybrid tests - tempest.api.compute.servers.test_servers_negative.ServersNegativeTestJSON.test_shelve_shelved_server - failed
+ AllocationUpdateFailed_Remote: Failed to update allocations for consumer. Error: another process changed the consumer after the report client read the consumer state during the claim
no longer affects: | neutron |
Matt Riedemann (mriedem) wrote : | #5 |
[compute service log excerpt from Nov 06 19:48:37; line contents truncated]
tags: | added: placement scheduler |
Matt Riedemann (mriedem) wrote : | #6 |
Looking at logstash, this really started happening around Nov 4:
Matt Riedemann (mriedem) wrote : | #7 |
Looking at the logs:
We'd see these messages if we were retrying:
But I don't see those, so likely something changed in the placement error response message which makes our retry code no longer work.
Matt Riedemann (mriedem) wrote : | #8 |
What's also strange is that the scheduler logs say we're doing a doubled-up allocation on the same host:
Nov 06 19:48:36.969356 ubuntu-
But I'm not sure why because this is an unshelve operation of a shelved offloaded server, which shouldn't have any other allocations.
Matt Riedemann (mriedem) wrote : | #9 |
My guess is something regressed in the claim_resources logic introduced in this change:
Matt Riedemann (mriedem) wrote : | #10 |
Well, https:/
Nov 06 19:48:37.013780 ubuntu-
From the placement logs, I see three different times that allocations are PUT for the consumer:
Nov 06 19:47:57.568237 ubuntu-
Nov 06 19:47:57.617423 ubuntu-
Nov 06 19:47:57.617981 ubuntu-
The first one has to be when initially scheduling the instance.
Then we should shelve offload the instance and delete its allocations, but I'm not seeing a DELETE allocations request for consumer 6665f00a-
I then see a second PUT allocations request:
Nov 06 19:48:36.931411 ubuntu-
Matt Riedemann (mriedem) wrote : | #11 |
It looks like the compute manager doesn't remove the allocations for a shelved instance until after updating the status to SHELVED_OFFLOADED, so it's possible we're hitting a race where tempest sees the instance is shelved and immediately unshelves it before we've removed the allocations during shelve... otherwise I can't explain how we go down this path in the scheduler during the unshelve:
Since for a shelved offloaded server there should not be any existing allocations.
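To make the suspected mechanism concrete, here is a toy sketch of why the scheduler takes the "double-up" (move) path: claim_resources treats any existing allocations for the consumer as an in-progress move. The function and variable names below are illustrative, not the real nova scheduler report client code.

```python
# Hypothetical sketch: how leftover allocations send claim_resources
# down the move/double-up path instead of the new-consumer path.

def pick_claim_path(existing_allocations):
    # For a shelved-offloaded server there should be no allocations,
    # so the expected path is "new"; seeing leftovers from the
    # not-yet-finished shelve offload sends us down "move" instead.
    return "move" if existing_allocations else "new"

# Leftover allocations the shelve offload has not deleted yet
# (resource provider UUID and amounts are made up for illustration).
leftover = {"rp-uuid": {"VCPU": 1, "MEMORY_MB": 512}}
```

With an empty dict this returns "new"; with the leftover allocations it returns "move", which matches the doubled-up allocation message seen in the scheduler logs.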
Matt Riedemann (mriedem) wrote : | #12 |
This is where the compute logs that it removed the allocations after shelve offload:
Nov 06 19:48:36.982861 ubuntu-
Which lines up with the 2nd PUT in the placement logs:
Nov 06 19:48:36.931411 ubuntu-
Matt Riedemann (mriedem) wrote : | #13 |
There is a note in the compute code when removing allocations which says if we PUT {} for allocations that placement will delete the consumer record:
https:/
But I'm not seeing that actually anywhere in the placement handler code here:
In fact, I see it ensure a consumer exists, but it never deletes one.
Matt Riedemann (mriedem) wrote : | #14 |
Oh I guess consumers with no allocations should be deleted here:
Matt Riedemann (mriedem) wrote : | #15 |
The SQL in here:
might be broken in the same way ensure_consumer was broken:
If the consumer had allocations against >1 resource class, it might not query properly.
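The concern above can be sketched in plain Python as a stand-in for the SQL: a correct "delete consumers with no allocations" query must select only consumers with zero allocation rows, and must not be confused by a consumer holding allocations against more than one resource class. The data below is made up for illustration.

```python
# Pure-Python model of the "find consumers with no allocations" query.
# (consumer_id, resource_class) pairs stand in for allocation rows.
allocations = [
    ("consumer-a", "VCPU"),
    ("consumer-a", "MEMORY_MB"),  # >1 resource class for one consumer
    ("consumer-b", "VCPU"),
]
consumers = {"consumer-a", "consumer-b", "consumer-c"}

# A consumer is an orphan only if no allocation row references it.
referenced = {consumer for consumer, _rc in allocations}
orphans = consumers - referenced
```

Only consumer-c, which has no allocation rows at all, should be selected for deletion; consumer-a's two resource classes must not make it look deletable.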
Changed in nova: | |
importance: | Undecided → High |
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master) | #16 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #17 |
Related fix proposed to branch: master
Review: https:/
Matt Riedemann (mriedem) wrote : | #18 |
More discussion notes:
(2:58:16 PM) dansmith: mriedem: so you found a test that confirmed the behavior of that thing?
(2:58:23 PM) dansmith: mriedem: that deletes the consumer?
(3:02:49 PM) mriedem: DeleteConsumerI
(3:02:54 PM) mriedem: and it looks like a correct test to me,
(3:03:05 PM) mriedem: creates 2 consumers each with 2 allocations on different resource classes,
(3:03:13 PM) mriedem: clears the allocations for one of them and asserts the consumer is gone
(3:03:33 PM) mriedem: i think we're just hitting a race with the shelve offloaded status change before we cleanup the allocations
(3:03:43 PM) mriedem: but i've posted a couple of patches to add debug logs to help determine if that's the case
(3:03:55 PM) mriedem: https:/
(3:04:59 PM) dansmith: okay I'm not sure how we could race and see no allocations but a consumer and get that generation conflict
(3:05:17 PM) dansmith: it'd be one thing if we thought the consumer was there and then disappeared out from under us
(3:17:15 PM) mriedem: during unshelve the scheduler does see allocations
(3:17:35 PM) mriedem: and it thinks we're doing a move
(3:18:00 PM) dansmith: okay I thought you pasted a line showing that there was only one allocation going back to placement
(3:18:11 PM) mriedem: there are 3 PUTs for allocations
(3:18:15 PM) mriedem: 1. create the server - initial
(3:18:27 PM) mriedem: 2. shelve offload - wipe the allocations to {} - which should delete the consumer
(3:18:37 PM) mriedem: 3. unshelve - scheduler claims resources with the wrong consumer generation
(3:18:49 PM) mriedem: and when 3 happens, the scheduler gets allocations for hte consumer and they are there,
(3:18:51 PM) dansmith: ...right
(3:18:59 PM) mriedem: so it uses the consumer generation (1) from those allocations
(3:19:07 PM) mriedem: then i think what happens is,
(3:19:09 PM) dansmith: oh, so it passes generation=1 instead of generation=0, meaning new consumer?
(3:19:15 PM) mriedem: placement recreates the consumer which will have generation null
(3:19:19 PM) mriedem: yes
(3:19:26 PM) dansmith: okay I see
(3:19:50 PM) dansmith: I thought you were seeing consumer generation was null or zero or whatever in the third put, but still getting a conflict
(3:19:53 PM) dansmith: but that makes sense now
(3:20:06 PM) mriedem: Nov 06 19:48:37.013780 ubuntu-
(3:20:15 PM) mriedem: consumer generation conflict - expected null but got 1
(3:20:24 PM) mriedem: yup - so new consumer but we're passing a generation of 1
(3:20:28 PM) mriedem: from the old, now dele...
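The sequence worked out in the IRC discussion above can be modeled with a toy placement class (class and method names are illustrative, not the real placement handler code): the scheduler GETs allocations and sees generation 1, the shelve offload then deletes the consumer record, and the scheduler's PUT with the stale generation conflicts because a brand-new consumer expects a generation of None.

```python
# Toy model of the consumer-generation conflict (not real placement code).

class FakePlacement:
    def __init__(self):
        # consumer id -> generation; the instance's consumer exists at gen 1
        self.consumers = {"6665f00a": 1}

    def get_allocations(self, consumer):
        # The scheduler reads the consumer generation from this response.
        return {"consumer_generation": self.consumers.get(consumer)}

    def delete_consumer(self, consumer):
        # PUT {} for allocations also removes the consumer record.
        self.consumers.pop(consumer, None)

    def put_allocations(self, consumer, consumer_generation):
        # For a consumer placement does not know about, the expected
        # generation is None; any other value is a 409 conflict.
        expected = self.consumers.get(consumer)
        if consumer_generation != expected:
            return 409
        self.consumers[consumer] = 0 if expected is None else expected + 1
        return 204

p = FakePlacement()
gen = p.get_allocations("6665f00a")["consumer_generation"]  # scheduler sees 1
p.delete_consumer("6665f00a")           # shelve offload deletes the consumer
result = p.put_allocations("6665f00a", gen)  # stale generation -> conflict
```

The final PUT fails with 409 even though the scheduler did nothing wrong with the generation it read; the consumer was deleted out from under it between the GET and the PUT.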
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master) | #19 |
Fix proposed to branch: master
Review: https:/
Changed in nova: | |
assignee: | nobody → Matt Riedemann (mriedem) |
status: | Triaged → In Progress |
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master) | #20 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 82b4e3ff7e1f217
Author: Matt Riedemann <email address hidden>
Date: Fri Nov 9 13:48:07 2018 -0500
Add debug logs when doubling-up allocations during scheduling
During claim_resources in the scheduler, if the consumer (instance)
has existing allocations, the scheduler thinks we're doing something
like a resize to same host or evacuation, but it would be useful
to know what the original allocations were when doing that, so this adds
logging of the original allocations that take us down the double-up
path.
Change-Id: Ibfb0e97840141a
Related-Bug: #1798688
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master) | #21 |
Fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master) | #22 |
Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #23 |
Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master) | #24 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 6369f39244558b1
Author: Matt Riedemann <email address hidden>
Date: Fri Dec 7 17:27:16 2018 -0500
Remove allocations before setting vm_status to SHELVED_OFFLOADED
Tempest is intermittently failing a test which does the
following:
1. Create a server.
2. Shelve offload it.
3. Unshelve it.
Tempest waits for the server status to be SHELVED_OFFLOADED
before unshelving the server, which goes through the
scheduler to pick a compute node and claim resources on it.
When shelve offloading a server, the resource allocations
for the instance and compute node it was on are cleared, which
will also delete the internal consumer record in the placement
service.
The race is that the allocations are removed during shelve
offload *after* the server status changes to SHELVED_OFFLOADED.
This leaves a window where unshelve is going through the
scheduler and gets the existing allocations for the instance,
which are non-empty and have a consumer generation. The
claim_resources method in the scheduler then uses that
consumer generation when PUTing the allocations. That PUT
fails because in between the GET and PUT of the allocations,
placement has deleted the internal consumer record. When
PUTing the new allocations with a non-null consumer generation,
placement returns a 409 conflict error because for a new
consumer it expects the "consumer_
None.
This change handles the race by simply making sure the allocations
are deleted (along with the related consumer record in placement)
*before* the instance.vm_status is changed.
Change-Id: I2a6ccaff904c1f
Closes-Bug: #1798688
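The ordering change in the commit message above can be sketched as a check for the race window: the window exists only if tempest can observe SHELVED_OFFLOADED while allocations still exist. Step names are hypothetical labels, not actual nova compute manager code.

```python
# Sketch of the race window under the old vs. fixed step ordering.

def race_window(ordering):
    # The window exists if the status becomes visible while the
    # instance's allocations are still present in placement.
    allocations_present = True
    window = False
    for step in ordering:
        if step == "vm_status=SHELVED_OFFLOADED" and allocations_present:
            window = True
        elif step == "allocations_deleted":
            allocations_present = False
    return window

# Buggy ordering: status flips first, then allocations are removed.
old = ["vm_status=SHELVED_OFFLOADED", "allocations_deleted"]
# Fixed ordering: allocations (and the consumer record) are gone
# before tempest can see SHELVED_OFFLOADED and trigger unshelve.
fixed = ["allocations_deleted", "vm_status=SHELVED_OFFLOADED"]
```

Only the old ordering leaves a window in which an immediate unshelve can read soon-to-be-deleted allocations and their consumer generation.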
Changed in nova: | |
status: | In Progress → Fix Released |
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.0.0rc1 | #25 |
This issue was fixed in the openstack/nova 19.0.0.0rc1 release candidate.
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky) | #26 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/rocky
commit 1121a59edb48fe1
Author: Matt Riedemann <email address hidden>
Date: Fri Dec 7 17:27:16 2018 -0500
Remove allocations before setting vm_status to SHELVED_OFFLOADED
Tempest is intermittently failing a test which does the
following:
1. Create a server.
2. Shelve offload it.
3. Unshelve it.
Tempest waits for the server status to be SHELVED_OFFLOADED
before unshelving the server, which goes through the
scheduler to pick a compute node and claim resources on it.
When shelve offloading a server, the resource allocations
for the instance and compute node it was on are cleared, which
will also delete the internal consumer record in the placement
service.
The race is that the allocations are removed during shelve
offload *after* the server status changes to SHELVED_OFFLOADED.
This leaves a window where unshelve is going through the
scheduler and gets the existing allocations for the instance,
which are non-empty and have a consumer generation. The
claim_resources method in the scheduler then uses that
consumer generation when PUTing the allocations. That PUT
fails because in between the GET and PUT of the allocations,
placement has deleted the internal consumer record. When
PUTing the new allocations with a non-null consumer generation,
placement returns a 409 conflict error because for a new
consumer it expects the "consumer_
None.
This change handles the race by simply making sure the allocations
are deleted (along with the related consumer record in placement)
*before* the instance.vm_status is changed.
Change-Id: I2a6ccaff904c1f
Closes-Bug: #1798688
(cherry picked from commit 6369f39244558b1
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova rocky-eol | #27 |
This issue was fixed in the openstack/nova rocky-eol release.
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/queens) | #28 |
Change abandoned by "Elod Illes <email address hidden>" on branch: stable/queens
Review: https:/
Reason: This branch transitioned to End of Life for this project, open patches needs to be closed to be able to delete the branch.
From looking at the log it's unclear this is a neutron issue:
Captured traceback:
Traceback (most recent call last):
  File "tempest/api/compute/servers/test_servers_negative.py", line 47, in tearDown
    self.server_check_teardown()
  File "tempest/api/compute/base.py", line 201, in server_check_teardown
    cls.server_id, 'ACTIVE')
  File "tempest/common/waiters.py", line 96, in wait_for_server_status
    raise lib_exc.TimeoutException(message)
tempest.lib.exceptions.TimeoutException: Request timed out
Details: (ServersNegativeTestJSON: tearDown) Server 7e7cf40f-0ab7-4f22-91ce-6f4e22a54ac2 failed to reach ACTIVE status and task state "None" within the required time (196 s). Current status: SHELVED_OFFLOADED. Current task state: None.
I didn't see any tracebacks in the neutron logs that would indicate there was a failure.