AllocationUpdateFailed_Remote: Failed to update allocations for consumer. Error: another process changed the consumer after the report client read the consumer state during the claim
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | High | Matt Riedemann |
Queens | New | Undecided | Unassigned |
Rocky | Fix Released | Undecided | Unassigned |
Bug Description
* Job: neutron-
* Failed test: tempest.
* Example of failure: http://
Details: (ServersNegativ
tags: | added: gate-failure |
Brian Haley (brian-haley) wrote : | #1 |
Changed in neutron: | |
status: | New → Incomplete |
Brian Haley (brian-haley) wrote : | #2 |
Also, the above search didn't show much, but this one shows 4 failures in the past few days:
Hongbin Lu (hongbin.lu) wrote : | #3 |
Matt Riedemann (mriedem) wrote : | #4 |
Looks like the failure is here?
The test is failing to unshelve the server because unshelve fails during scheduling due to that placement failure. I saw this the other day as well; not sure we have a bug for it yet.
Changed in nova: | |
status: | New → Triaged |
summary:
- iptables_hybrid tests - tempest.api.compute.servers.test_servers_negative.ServersNegativeTestJSON.test_shelve_shelved_server - failed
+ AllocationUpdateFailed_Remote: Failed to update allocations for consumer. Error: another process changed the consumer after the report client read the consumer state during the claim
no longer affects: | neutron |
Matt Riedemann (mriedem) wrote : | #5 |
[compute service log excerpt from Nov 06 19:48:37; line contents truncated]
tags: | added: placement scheduler |
Matt Riedemann (mriedem) wrote : | #6 |
Looking at logstash, this really started happening around Nov 4:
Matt Riedemann (mriedem) wrote : | #7 |
Looking at the logs:
We'd see these messages if we were retrying:
But I don't see those, so likely something changed in the placement error response message which makes our retry code no longer work.
Matt Riedemann (mriedem) wrote : | #8 |
What's also strange is that the scheduler logs say we're doing a doubled-up allocation on the same host:
Nov 06 19:48:36.969356 ubuntu-
But I'm not sure why because this is an unshelve operation of a shelved offloaded server, which shouldn't have any other allocations.
Matt Riedemann (mriedem) wrote : | #9 |
My guess is something regressed in the claim_resources logic introduced in this change:
Matt Riedemann (mriedem) wrote : | #10 |
Well, https:/
Nov 06 19:48:37.013780 ubuntu-
From the placement logs, I see three different times that allocations are PUT for the consumer:
Nov 06 19:47:57.568237 ubuntu-
Nov 06 19:47:57.617423 ubuntu-
Nov 06 19:47:57.617981 ubuntu-
The first one has to be when initially scheduling the instance.
Then we should shelve offload the instance and delete its allocations, but I'm not seeing a DELETE allocations request for consumer 6665f00a-
I then see a second PUT allocations request:
Nov 06 19:48:36.931411 ubuntu-
Matt Riedemann (mriedem) wrote : | #11 |
It looks like the compute manager doesn't remove the allocations for a shelved instance until after updating the status to SHELVED_OFFLOADED, so it's possible we're hitting a race where tempest sees the instance is shelved and immediately unshelves it before we've removed the allocations during shelve... otherwise I can't explain how we go down this path in the scheduler during the unshelve:
Since for a shelved offloaded server there should not be any existing allocations.
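To make the suspected mechanism concrete, here is a toy sketch of why the scheduler takes the "double-up" (move) path: claim_resources treats any existing allocations for the consumer as an in-progress move. The function and variable names below are illustrative, not the real nova scheduler report client code.

```python
# Hypothetical sketch: how leftover allocations send claim_resources
# down the move/double-up path instead of the new-consumer path.

def pick_claim_path(existing_allocations):
    # For a shelved-offloaded server there should be no allocations,
    # so the expected path is "new"; seeing leftovers from the
    # not-yet-finished shelve offload sends us down "move" instead.
    return "move" if existing_allocations else "new"

# Leftover allocations the shelve offload has not deleted yet
# (resource provider UUID and amounts are made up for illustration).
leftover = {"rp-uuid": {"VCPU": 1, "MEMORY_MB": 512}}
```

With an empty dict this returns "new"; with the leftover allocations it returns "move", which matches the doubled-up allocation message seen in the scheduler logs.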
Matt Riedemann (mriedem) wrote : | #12 |
This is where the compute logs that it removed the allocations after shelve offload:
Nov 06 19:48:36.982861 ubuntu-
Which lines up with the 2nd PUT in the placement logs:
Nov 06 19:48:36.931411 ubuntu-
Matt Riedemann (mriedem) wrote : | #13 |
There is a note in the compute code when removing allocations which says if we PUT {} for allocations that placement will delete the consumer record:
https:/
But I'm not seeing that actually anywhere in the placement handler code here:
In fact, I see it ensure a consumer exists, but it never deletes one.
Matt Riedemann (mriedem) wrote : | #14 |
Oh I guess consumers with no allocations should be deleted here:
Matt Riedemann (mriedem) wrote : | #15 |
The SQL in here:
might be broken in the same way ensure_consumer was broken:
If the consumer had allocations against >1 resource class, it might not query properly.
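The concern above can be sketched in plain Python as a stand-in for the SQL: a correct "delete consumers with no allocations" query must select only consumers with zero allocation rows, and must not be confused by a consumer holding allocations against more than one resource class. The data below is made up for illustration.

```python
# Pure-Python model of the "find consumers with no allocations" query.
# (consumer_id, resource_class) pairs stand in for allocation rows.
allocations = [
    ("consumer-a", "VCPU"),
    ("consumer-a", "MEMORY_MB"),  # >1 resource class for one consumer
    ("consumer-b", "VCPU"),
]
consumers = {"consumer-a", "consumer-b", "consumer-c"}

# A consumer is an orphan only if no allocation row references it.
referenced = {consumer for consumer, _rc in allocations}
orphans = consumers - referenced
```

Only consumer-c, which has no allocation rows at all, should be selected for deletion; consumer-a's two resource classes must not make it look deletable.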
Changed in nova: | |
importance: | Undecided → High |
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master) | #16 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #17 |
Related fix proposed to branch: master
Review: https:/
Matt Riedemann (mriedem) wrote : | #18 |
More discussion notes:
(2:58:16 PM) dansmith: mriedem: so you found a test that confirmed the behavior of that thing?
(2:58:23 PM) dansmith: mriedem: that deletes the consumer?
(3:02:49 PM) mriedem: DeleteConsumerI
(3:02:54 PM) mriedem: and it looks like a correct test to me,
(3:03:05 PM) mriedem: creates 2 consumers each with 2 allocations on different resource classes,
(3:03:13 PM) mriedem: clears the allocations for one of them and asserts the consumer is gone
(3:03:33 PM) mriedem: i think we're just hitting a race with the shelve offloaded status change before we cleanup the allocations
(3:03:43 PM) mriedem: but i've posted a couple of patches to add debug logs to help determine if that's the case
(3:03:55 PM) mriedem: https:/
(3:04:59 PM) dansmith: okay I'm not sure how we could race and see no allocations but a consumer and get that generation conflict
(3:05:17 PM) dansmith: it'd be one thing if we thought the consumer was there and then disappeared out from under us
(3:17:15 PM) mriedem: during unshelve the scheduler does see allocations
(3:17:35 PM) mriedem: and it thinks we're doing a move
(3:18:00 PM) dansmith: okay I thought you pasted a line showing that there was only one allocation going back to placement
(3:18:11 PM) mriedem: there are 3 PUTs for allocations
(3:18:15 PM) mriedem: 1. create the server - initial
(3:18:27 PM) mriedem: 2. shelve offload - wipe the allocations to {} - which should delete the consumer
(3:18:37 PM) mriedem: 3. unshelve - scheduler claims resources with the wrong consumer generation
(3:18:49 PM) mriedem: and when 3 happens, the scheduler gets allocations for hte consumer and they are there,
(3:18:51 PM) dansmith: ...right
(3:18:59 PM) mriedem: so it uses the consumer generation (1) from those allocations
(3:19:07 PM) mriedem: then i think what happens is,
(3:19:09 PM) dansmith: oh, so it passes generation=1 instead of generation=0, meaning new consumer?
(3:19:15 PM) mriedem: placement recreates the consumer which will have generation null
(3:19:19 PM) mriedem: yes
(3:19:26 PM) dansmith: okay I see
(3:19:50 PM) dansmith: I thought you were seeing consumer generation was null or zero or whatever in the third put, but still getting a conflict
(3:19:53 PM) dansmith: but that makes sense now
(3:20:06 PM) mriedem: Nov 06 19:48:37.013780 ubuntu-
(3:20:15 PM) mriedem: consumer generation conflict - expected null but got 1
(3:20:24 PM) mriedem: yup - so new consumer but we're passing a generation of 1
(3:20:28 PM) mriedem: from the old, now dele...
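The sequence worked out in the IRC discussion above can be modeled with a toy placement class (class and method names are illustrative, not the real placement handler code): the scheduler GETs allocations and sees generation 1, the shelve offload then deletes the consumer record, and the scheduler's PUT with the stale generation conflicts because a brand-new consumer expects a generation of None.

```python
# Toy model of the consumer-generation conflict (not real placement code).

class FakePlacement:
    def __init__(self):
        # consumer id -> generation; the instance's consumer exists at gen 1
        self.consumers = {"6665f00a": 1}

    def get_allocations(self, consumer):
        # The scheduler reads the consumer generation from this response.
        return {"consumer_generation": self.consumers.get(consumer)}

    def delete_consumer(self, consumer):
        # PUT {} for allocations also removes the consumer record.
        self.consumers.pop(consumer, None)

    def put_allocations(self, consumer, consumer_generation):
        # For a consumer placement does not know about, the expected
        # generation is None; any other value is a 409 conflict.
        expected = self.consumers.get(consumer)
        if consumer_generation != expected:
            return 409
        self.consumers[consumer] = 0 if expected is None else expected + 1
        return 204

p = FakePlacement()
gen = p.get_allocations("6665f00a")["consumer_generation"]  # scheduler sees 1
p.delete_consumer("6665f00a")           # shelve offload deletes the consumer
result = p.put_allocations("6665f00a", gen)  # stale generation -> conflict
```

The final PUT fails with 409 even though the scheduler did nothing wrong with the generation it read; the consumer was deleted out from under it between the GET and the PUT.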
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master) | #19 |
Fix proposed to branch: master
Review: https:/
Changed in nova: | |
assignee: | nobody → Matt Riedemann (mriedem) |
status: | Triaged → In Progress |
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master) | #20 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 82b4e3ff7e1f217
Author: Matt Riedemann <email address hidden>
Date: Fri Nov 9 13:48:07 2018 -0500
Add debug logs when doubling-up allocations during scheduling
During claim_resources in the scheduler, if the consumer (instance)
has existing allocations, the scheduler thinks we're doing something
like a resize to same host or evacuation, but it would be useful
to know what the original allocations were when doing that, so this adds
logging of the original allocations that take us down the double-up
path.
Change-Id: Ibfb0e97840141a
Related-Bug: #1798688
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master) | #21 |
Fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master) | #22 |
Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #23 |
Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master) | #24 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 6369f39244558b1
Author: Matt Riedemann <email address hidden>
Date: Fri Dec 7 17:27:16 2018 -0500
Remove allocations before setting vm_status to SHELVED_OFFLOADED
Tempest is intermittently failing a test which does the
following:
1. Create a server.
2. Shelve offload it.
3. Unshelve it.
Tempest waits for the server status to be SHELVED_OFFLOADED
before unshelving the server, which goes through the
scheduler to pick a compute node and claim resources on it.
When shelve offloading a server, the resource allocations
for the instance and compute node it was on are cleared, which
will also delete the internal consumer record in the placement
service.
The race is that the allocations are removed during shelve
offload *after* the server status changes to SHELVED_OFFLOADED.
This leaves a window where unshelve is going through the
scheduler and gets the existing allocations for the instance,
which are non-empty and have a consumer generation. The
claim_resources method in the scheduler then uses that
consumer generation when PUTing the allocations. That PUT
fails because in between the GET and PUT of the allocations,
placement has deleted the internal consumer record. When
PUTing the new allocations with a non-null consumer generation,
placement returns a 409 conflict error because for a new
consumer it expects the "consumer_
None.
This change handles the race by simply making sure the allocations
are deleted (along with the related consumer record in placement)
*before* the instance.vm_status is changed.
Change-Id: I2a6ccaff904c1f
Closes-Bug: #1798688
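The ordering change in the commit message above can be sketched as a check for the race window: the window exists only if tempest can observe SHELVED_OFFLOADED while allocations still exist. Step names are hypothetical labels, not actual nova compute manager code.

```python
# Sketch of the race window under the old vs. fixed step ordering.

def race_window(ordering):
    # The window exists if the status becomes visible while the
    # instance's allocations are still present in placement.
    allocations_present = True
    window = False
    for step in ordering:
        if step == "vm_status=SHELVED_OFFLOADED" and allocations_present:
            window = True
        elif step == "allocations_deleted":
            allocations_present = False
    return window

# Buggy ordering: status flips first, then allocations are removed.
old = ["vm_status=SHELVED_OFFLOADED", "allocations_deleted"]
# Fixed ordering: allocations (and the consumer record) are gone
# before tempest can see SHELVED_OFFLOADED and trigger unshelve.
fixed = ["allocations_deleted", "vm_status=SHELVED_OFFLOADED"]
```

Only the old ordering leaves a window in which an immediate unshelve can read soon-to-be-deleted allocations and their consumer generation.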
Changed in nova: | |
status: | In Progress → Fix Released |
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.0.0rc1 | #25 |
This issue was fixed in the openstack/nova 19.0.0.0rc1 release candidate.
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky) | #26 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/rocky
commit 1121a59edb48fe1
Author: Matt Riedemann <email address hidden>
Date: Fri Dec 7 17:27:16 2018 -0500
Remove allocations before setting vm_status to SHELVED_OFFLOADED
Tempest is intermittently failing a test which does the
following:
1. Create a server.
2. Shelve offload it.
3. Unshelve it.
Tempest waits for the server status to be SHELVED_OFFLOADED
before unshelving the server, which goes through the
scheduler to pick a compute node and claim resources on it.
When shelve offloading a server, the resource allocations
for the instance and compute node it was on are cleared, which
will also delete the internal consumer record in the placement
service.
The race is that the allocations are removed during shelve
offload *after* the server status changes to SHELVED_OFFLOADED.
This leaves a window where unshelve is going through the
scheduler and gets the existing allocations for the instance,
which are non-empty and have a consumer generation. The
claim_resources method in the scheduler then uses that
consumer generation when PUTing the allocations. That PUT
fails because in between the GET and PUT of the allocations,
placement has deleted the internal consumer record. When
PUTing the new allocations with a non-null consumer generation,
placement returns a 409 conflict error because for a new
consumer it expects the "consumer_
None.
This change handles the race by simply making sure the allocations
are deleted (along with the related consumer record in placement)
*before* the instance.vm_status is changed.
Change-Id: I2a6ccaff904c1f
Closes-Bug: #1798688
(cherry picked from commit 6369f39244558b1
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova rocky-eol | #27 |
This issue was fixed in the openstack/nova rocky-eol release.
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/queens) | #28 |
Change abandoned by "Elod Illes <email address hidden>" on branch: stable/queens
Review: https:/
Reason: This branch transitioned to End of Life for this project, open patches needs to be closed to be able to delete the branch.
From looking at the log it's unclear this is a neutron issue:
Captured traceback:
Traceback (most recent call last):
  File "tempest/api/compute/servers/test_servers_negative.py", line 47, in tearDown
    self.server_check_teardown()
  File "tempest/api/compute/base.py", line 201, in server_check_teardown
    cls.server_id, 'ACTIVE')
  File "tempest/common/waiters.py", line 96, in wait_for_server_status
    raise lib_exc.TimeoutException(message)
tempest.lib.exceptions.TimeoutException: Request timed out
Details: (ServersNegativeTestJSON: tearDown) Server 7e7cf40f-0ab7-4f22-91ce-6f4e22a54ac2 failed to reach ACTIVE status and task state "None" within the required time (196 s). Current status: SHELVED_OFFLOADED. Current task state: None.
I didn't see any tracebacks in the neutron logs that would indicate there was a failure.