resource was allocated repeatedly in some cases

Bug #1790569 reported by fupingxie
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Milestone: —

Bug Description

Resources allocated during a migration cannot be cleaned up.
1. Migrate an instance VM-1 from node-0 to node-1.
2. The scheduler receives the message and claims resources for VM-1 on node-1.
3. However, if RabbitMQ fails (for example because of pressure overload), the conductor never receives a response and the migration ends up in the error state.
4. In the end, the resources allocated for VM-1 on node-1 are left behind and cannot be cleaned up (a way to check for the leftover allocation is sketched below).
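
One way to confirm the leak is to inspect the allocations for the instance's consumer UUID with the osc-placement CLI; the UUIDs below are placeholders:

    # Show all allocations currently held by the instance (consumer) UUID;
    # after the failed migration, a claim against the node-1 resource
    # provider may still be listed here.
    openstack resource provider allocation show <VM-1-uuid>

    # Show usage on the destination node's resource provider to see the
    # orphaned claim from the other side.
    openstack resource provider usage show <node-1-rp-uuid>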

fupingxie (fpxie)
Changed in nova:
assignee: nobody → fupingxie (fpxie)
Revision history for this message
fupingxie (fpxie) wrote :

2018-08-27 08:50:38.084 17745 ERROR oslo.messaging._drivers.impl_rabbit [req-7f47bb6e-0dae-4fb2-b8c1-594002df9260 4f2101a461a6470a867278c965b0cde4 2dd72e94503e4222b80ef789ed9b7f99 - default default] Failed to publish message to topic 'nova': 'NoneType' object has no attribute '__getitem__'
2018-08-27 08:50:38.085 17745 ERROR oslo.messaging._drivers.impl_rabbit [req-7f47bb6e-0dae-4fb2-b8c1-594002df9260 4f2101a461a6470a867278c965b0cde4 2dd72e94503e4222b80ef789ed9b7f99 - default default] Unable to connect to AMQP server on 10.127.3.64:5672 after None tries: 'NoneType' object has no attribute '__getitem__'
2018-08-27 08:50:38.086 17745 ERROR oslo_messaging.rpc.server [req-7f47bb6e-0dae-4fb2-b8c1-594002df9260 4f2101a461a6470a867278c965b0cde4 2dd72e94503e4222b80ef789ed9b7f99 - default default] Exception during message handling: MessageDeliveryFailure: Unable to connect to AMQP server on 10.127.3.64:5672 after None tries: 'NoneType' object has no attribute '__getitem__'
2018-08-27 08:50:38.086 17745 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2018-08-27 08:50:38.086 17745 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 163, in _process_incoming
2018-08-27 08:50:38.086 17745 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2018-08-27 08:50:38.086 17745 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 220, in dispatch
2018-08-27 08:50:38.086 17745 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2018-08-27 08:50:38.086 17745 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 190, in _do_dispatch
2018-08-27 08:50:38.086 17745 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2018-08-27 08:50:38.086 17745 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 1200, in schedule_and_build_instances
2018-08-27 08:50:38.086 17745 ERROR oslo_messaging.rpc.server limits=host['limits'])
2018-08-27 08:50:38.086 17745 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/compute/rpcapi.py", line 1129, in build_and_run_instance
2018-08-27 08:50:38.086 17745 ERROR oslo_messaging.rpc.server limits=limits)
2018-08-27 08:50:38.086 17745 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 152, in cast
2018-08-27 08:50:38.086 17745 ERROR oslo_messaging.rpc.server self.transport._send(self.target, msg_ctxt, msg, retry=self.retry)
2018-08-27 08:50:38.086 17745 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 123, in _send
2018-08-27 08:50:38.086 17745 ERROR oslo_messaging.rpc.server timeout=timeout, retry=retry)
2018-08-27 08:50:38.086 17745 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 578, in send
2018-08-27 08:50:38.086 17745 ERROR oslo_messaging.rpc.server retry=retry)
2018-08-27 08:50:38.086 17745...


Changed in nova:
status: New → In Progress
Revision history for this message
Matt Riedemann (mriedem) wrote :

Can you provide more details on the scenario that's failing? Is this a build reschedule scenario where the first selected host fails to build the server, so it's rescheduled to a second host and is built successfully there, but the allocations are not cleaned up from the first node for some reason?

Revision history for this message
Matt Riedemann (mriedem) wrote :

From the trace, it looks like conductor was unable to even cast to compute to build the server on the first selected host? But we don't do reschedules/retries within conductor's schedule_and_build_instances, unless you have your own patches that do that. Anyway, we need a clearer description of the actual problem.

Changed in nova:
status: In Progress → Incomplete
Changed in nova:
status: Incomplete → In Progress
fupingxie (fpxie)
description: updated
Revision history for this message
fupingxie (fpxie) wrote :

@Matt Riedemann, I'm sorry, the log in comment #1 is not accurate.

I have hit this problem only once; the steps were:
1. Migrate an instance VM-1 from node-0 to node-1.
2. The scheduler receives the message and claims resources for VM-1 on node-1.
3. However, if RabbitMQ fails (for example because of pressure overload), the conductor never receives a response and the migration ends up in the error state.
4. In the end, the resources allocated for VM-1 on node-1 are left behind and cannot be cleaned up.

description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/602219

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by fupingxie (<email address hidden>) on branch: master
Review: https://review.openstack.org/582899
Reason: refer to https://review.openstack.org/#/c/602219/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by fupingxie (<email address hidden>) on branch: master
Review: https://review.openstack.org/602219

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by fupingxie (<email address hidden>) on branch: master
Review: https://review.opendev.org/582899

Revision history for this message
Matt Riedemann (mriedem) wrote :

Which release are you using? Since Queens, the conductor should move the allocations for the source node held by the instance to the migration record, and then the scheduler claims resources on the dest node using the instance consumer record. If the scheduler fails to reply to the call from the conductor, the conductor task should clean up the allocations in its rollback routine:

https://github.com/openstack/nova/blob/stable/queens/nova/conductor/tasks/migrate.py#L318

https://github.com/openstack/nova/blob/stable/queens/nova/conductor/tasks/live_migrate.py#L129

Are you using Queens or newer and not seeing that happen? Or is the RPC cast from the conductor to the compute methods failing somehow so that we're not rolling back properly?
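
For illustration only, here is a rough, simplified sketch of that swap-and-rollback pattern. It is not the actual nova code; the placement.get_allocations / put_allocations / delete_allocations helpers are hypothetical stand-ins for the scheduler report client methods:

    # Hypothetical, simplified sketch of the allocation swap described above;
    # not the real nova implementation.
    class MigrationAllocationSwap(object):
        """Move the source-node allocations from the instance to the
        migration record before scheduling, and swap them back on failure."""

        def __init__(self, placement, instance_uuid, migration_uuid):
            self.placement = placement
            self.instance_uuid = instance_uuid
            self.migration_uuid = migration_uuid
            self._swapped = False

        def execute(self):
            # The migration record becomes the consumer of the source-node
            # resources, so the instance consumer is free to claim the dest.
            allocs = self.placement.get_allocations(self.instance_uuid)
            self.placement.put_allocations(self.migration_uuid, allocs)
            self.placement.delete_allocations(self.instance_uuid)
            self._swapped = True

        def rollback(self):
            # On any failure (including an RPC error while talking to the
            # scheduler), give the source-node allocations back to the
            # instance consumer (replacing any dest-node claim made under it)
            # and drop the migration record's allocations.
            if not self._swapped:
                return
            allocs = self.placement.get_allocations(self.migration_uuid)
            self.placement.put_allocations(self.instance_uuid, allocs)
            self.placement.delete_allocations(self.migration_uuid)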

tags: added: migrate resource-tracker
Revision history for this message
Matt Riedemann (mriedem) wrote :

We would probably need a set of logs for the migration failure described above. There are probably places in the conductor code where we're not handling those types of errors, but we'd need more specific details with a log traceback.

Changed in nova:
status: In Progress → Incomplete
assignee: fupingxie (fpxie) → nobody
Revision history for this message
Matt Riedemann (mriedem) wrote :

All open patches have been abandoned, so I'm marking this incomplete and removing the assignee. I don't doubt that something like this could leak allocations, but we'd need a more specific way to reproduce it. As a workaround, the operator can use the osc-placement CLI to clean up the unwanted allocations, or delete all allocations for the instance and then run the 'nova-manage placement heal_allocations' command to recreate the correct allocations for the instance in placement.
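
Roughly, that workaround would look like this (the UUID is a placeholder; double-check that the allocations really are stale before deleting them):

    # Inspect the allocations held by the instance's consumer UUID.
    openstack resource provider allocation show <instance-uuid>

    # Delete all allocations for that consumer if they are stale.
    openstack resource provider allocation delete <instance-uuid>

    # Recreate the correct allocations for the instance from nova's view.
    nova-manage placement heal_allocations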

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired