Unable to delete instance because cyborg.get_client() failed

Bug #1873387 reported by Brin Zhang
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Invalid
Importance: High
Assigned to: Wenping Song
Milestone:

Bug Description

When Cyborg is not deployed on our cloud platform, or the Cyborg service has been removed for some reason, we should still be able to delete instances correctly.

Today, if the Cyborg service was never deployed, or was deployed and later removed, deleting an instance fails in two cases: when nova cleans up an instance with 'accel:device_profile' in its flavor after a failed server build, and when a user deletes an older instance with 'accel:device_profile' in its flavor. In both cases the delete is aborted because we do not handle the exception at [1].

[1] https://opendev.org/openstack/nova/src/branch/master/nova/compute/utils.py#L1559
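
A minimal sketch of the kind of handling this report asks for, written as standalone Python rather than nova code. Everything here is an assumption made for illustration: `EndpointNotFound` is a local stand-in for the keystoneauth1 exception seen in the traceback, and `cleanup_accelerators`, `FakeCyborgClient`, and `missing_cyborg` are hypothetical names that do not exist in nova.

```python
import logging

LOG = logging.getLogger(__name__)


class EndpointNotFound(Exception):
    """Local stand-in for the keystoneauth1 exception in the traceback."""


class FakeCyborgClient:
    """Hypothetical client that only records which instances were cleaned."""

    def __init__(self):
        self.deleted = []

    def delete_arqs(self, instance_uuid):
        self.deleted.append(instance_uuid)


def missing_cyborg():
    """Simulate a deployment with no accelerator endpoint in the catalog."""
    raise EndpointNotFound("endpoint for accelerator service not found")


def cleanup_accelerators(instance_uuid, extra_specs, get_client):
    """Release accelerator resources if possible.

    Returns True when cleanup ran, False when it was skipped, so the
    caller can continue deleting the instance either way.
    """
    if "accel:device_profile" not in extra_specs:
        return False  # flavor never requested an accelerator
    try:
        client = get_client()
    except EndpointNotFound as exc:
        # Assumption: a missing service is logged instead of aborting delete.
        LOG.warning("Skipping accelerator cleanup for %s: %s",
                    instance_uuid, exc)
        return False
    client.delete_arqs(instance_uuid)
    return True
```

With `missing_cyborg` as the client factory, the call returns False and the caller would go on to delete the instance instead of raising.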

Partial traceback:

2020-04-17 03:26:58.868 7 ERROR nova.compute.manager [instance: d378d6a4-2cf5-4076-8bc0-82544f25f34f] File "/var/lib/kolla/venv/lib/python2.7/site-packages/keystoneauth1/adapter.py", line 328, in get
2020-04-17 03:26:58.868 7 ERROR nova.compute.manager [instance: d378d6a4-2cf5-4076-8bc0-82544f25f34f] return self.request(url, 'GET', **kwargs)
2020-04-17 03:26:58.868 7 ERROR nova.compute.manager [instance: d378d6a4-2cf5-4076-8bc0-82544f25f34f] File "/var/lib/kolla/venv/lib/python2.7/site-packages/keystoneauth1/adapter.py", line 213, in request
2020-04-17 03:26:58.868 7 ERROR nova.compute.manager [instance: d378d6a4-2cf5-4076-8bc0-82544f25f34f] return self.session.request(url, method, **kwargs)
2020-04-17 03:26:58.868 7 ERROR nova.compute.manager [instance: d378d6a4-2cf5-4076-8bc0-82544f25f34f] File "/var/lib/kolla/venv/lib/python2.7/site-packages/keystoneauth1/session.py", line 706, in request
2020-04-17 03:26:58.868 7 ERROR nova.compute.manager [instance: d378d6a4-2cf5-4076-8bc0-82544f25f34f] **endpoint_filter)
2020-04-17 03:26:58.868 7 ERROR nova.compute.manager [instance: d378d6a4-2cf5-4076-8bc0-82544f25f34f] File "/var/lib/kolla/venv/lib/python2.7/site-packages/keystoneauth1/session.py", line 1113, in get_endpoint
2020-04-17 03:26:58.868 7 ERROR nova.compute.manager [instance: d378d6a4-2cf5-4076-8bc0-82544f25f34f] return auth.get_endpoint(self, **kwargs)
2020-04-17 03:26:58.868 7 ERROR nova.compute.manager [instance: d378d6a4-2cf5-4076-8bc0-82544f25f34f] File "build/bdist.linux-x86_64/egg/nova/context.py", line 79, in get_endpoint
2020-04-17 03:26:58.868 7 ERROR nova.compute.manager [instance: d378d6a4-2cf5-4076-8bc0-82544f25f34f] File "/var/lib/kolla/venv/lib/python2.7/site-packages/keystoneauth1/access/service_catalog.py", line 400, in url_for
2020-04-17 03:26:58.868 7 ERROR nova.compute.manager [instance: d378d6a4-2cf5-4076-8bc0-82544f25f34f] endpoint_id=endpoint_id).url
2020-04-17 03:26:58.868 7 ERROR nova.compute.manager [instance: d378d6a4-2cf5-4076-8bc0-82544f25f34f] File "/var/lib/kolla/venv/lib/python2.7/site-packages/keystoneauth1/access/service_catalog.py", line 462, in endpoint_data_for
2020-04-17 03:26:58.868 7 ERROR nova.compute.manager [instance: d378d6a4-2cf5-4076-8bc0-82544f25f34f] raise exceptions.EndpointNotFound(msg)
2020-04-17 03:26:58.868 7 ERROR nova.compute.manager [instance: d378d6a4-2cf5-4076-8bc0-82544f25f34f] EndpointNotFound: ['internal', 'public'] endpoint for accelerator service in RegionTwo region not found
2020-04-17 03:26:58.868 7 ERROR nova.compute.manager [instance: d378d6a4-2cf5-4076-8bc0-82544f25f34f]

Brin Zhang (zhangbailin)
Changed in nova:
importance: Undecided → Medium
assignee: nobody → Brin Zhang (zhangbailin)
tags: added: cyborg
Changed in nova:
status: New → Confirmed
tags: added: ussuri-rc-potential
Changed in nova:
status: Confirmed → Triaged
importance: Medium → High
Wenping Song (wenping1)
Changed in nova:
assignee: Brin Zhang (zhangbailin) → Wenping Song (wenping1)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/720670

Changed in nova:
status: Triaged → In Progress
Balazs Gibizer (balazs-gibizer) wrote :

As discussed in the patch [1], deploying Cyborg and then removing it from the deployment without first removing all of its users is not a valid scenario. The deployer should clean up all Cyborg users before removing Cyborg from the deployment. Nova today also fails in a similar situation when the neutron or cinder service is missing.
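
As a rough illustration of that cleanup order, a deployer could first list the instances that still depend on Cyborg. The sketch below assumes each instance is available as a plain dict with its flavor extra specs embedded; that layout and the name `instances_using_cyborg` are hypothetical and not part of any nova API.

```python
def instances_using_cyborg(instances):
    """Return the IDs of instances whose flavor requests a device profile.

    Assumption: each instance is a dict shaped like
    {"id": ..., "flavor": {"extra_specs": {...}}}.
    """
    return [
        inst["id"]
        for inst in instances
        if "accel:device_profile"
        in inst.get("flavor", {}).get("extra_specs", {})
    ]


# Hypothetical sample data: one accelerator user, one plain instance.
sample = [
    {"id": "vm-1",
     "flavor": {"extra_specs": {"accel:device_profile": "dp1"}}},
    {"id": "vm-2", "flavor": {"extra_specs": {}}},
]
remaining = instances_using_cyborg(sample)
```

Only once such a list is empty would removing Cyborg match the order described above.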

Also, creating an instance with accel:device_profile in the flavor without Cyborg deployed fails early enough that it does not leave a nova instance in ERROR state to be deleted.

I'm marking this bug as invalid.

[1] https://review.opendev.org/720670

Changed in nova:
status: In Progress → Invalid
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Wenping Song (<email address hidden>) on branch: master
Review: https://review.opendev.org/720670
