_bury_in_cell0 could not handle instance duplicate exception

Bug #1857306 reported by Yang Youseok on 2019-12-23
This bug affects 1 person
Affects                    Importance   Assigned to
OpenStack Compute (nova)   Low          Unassigned
Ocata                      Undecided    Unassigned
Pike                       Undecided    Unassigned
Queens                     Undecided    Unassigned
Rocky                      Undecided    Unassigned
Stein                      Undecided    Unassigned
Train                      Undecided    Unassigned

Bug Description

For stable/stein

If the scheduler raises NoValidHost, the conductor should create the Instance object in cell0. However, I found that if an additional exception (InstanceExists) is raised in that function, the conductor does not catch it, which leaves the instance stuck in the 'scheduling' state.

How to reproduce

1. osapi_compute_unique_server_name_scope = True
2. Cellv2
3. Create an instance whose hostname was already used by a previously built instance, with no valid host available.

Result

1. The conductor does not create the instance DB record in cell0 from _bury_in_cell0.

What to expect

1. The instance is created in cell0 (nova_cell0) in the error state.

Thanks

Matt Riedemann (mriedem) wrote :

osapi_compute_unique_server_name_scope = True is not a valid choice for that option:

https://docs.openstack.org/nova/stein/configuration/config.html#DEFAULT.osapi_compute_unique_server_name_scope

Valid Values: ‘’, project, global

What is the actual value you had set?

I'm not saying this is an invalid bug, and it's likely easy to recreate (I don't think this option is set in many deployments), I'm just wondering how you have things configured specifically to recreate it with a functional test in-tree so we can be assured of the fix.

tags: added: cells regression
Matt Riedemann (mriedem) wrote :

I'm also wondering what the behavior was with this prior to cells v2. I suppose the API would create the instance in the nova DB and check osapi_compute_unique_server_name_scope which would result in an error response to the user. With cells v2 the instance creation was moved to conductor after scheduling and that code isn't handling the InstanceExists error from the DB API, nor is the API/build request code checking for osapi_compute_unique_server_name_scope like the DB API.
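
To make the failure mode described above concrete, here is a minimal, self-contained sketch. It is not nova code: `FakeCellDB`, `bury_in_cell0` and the `build_requests` dict are hypothetical stand-ins, and the assumption that the duplicate name already sits in cell0 is only for illustration. It shows how an uncaught duplicate-name error while burying a server in cell0 leaves the build request in 'scheduling'.

```python
class InstanceExists(Exception):
    """Stand-in for the duplicate-name error raised by the DB API."""


class FakeCellDB:
    """Toy per-cell database that enforces unique server names (assumption)."""

    def __init__(self):
        self.instances = {}  # uuid -> record

    def instance_create(self, uuid, name, vm_state):
        if any(rec["name"] == name for rec in self.instances.values()):
            raise InstanceExists(name)
        self.instances[uuid] = {"name": name, "vm_state": vm_state}


def bury_in_cell0(cell0, build_requests, uuid, name):
    """Record a failed build in cell0, then drop its build request."""
    cell0.instance_create(uuid, name, vm_state="error")  # may raise
    del build_requests[uuid]


cell0 = FakeCellDB()
build_requests = {}

# The first server named "web" fails scheduling and is buried normally.
build_requests["uuid-1"] = {"name": "web", "vm_state": "scheduling"}
bury_in_cell0(cell0, build_requests, "uuid-1", "web")

# A second server with the same name also fails scheduling. The duplicate
# check fires inside bury_in_cell0 and nothing there handles it.
build_requests["uuid-2"] = {"name": "web", "vm_state": "scheduling"}
try:
    bury_in_cell0(cell0, build_requests, "uuid-2", "web")
    stuck = False
except InstanceExists:
    stuck = True  # the build request was never cleaned up
```

In this toy model the second request stays in `build_requests` with `vm_state` 'scheduling' and never reaches cell0, which mirrors the reported symptom.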

Matt Riedemann (mriedem) wrote :

It also appears that we have a separate bug where the PUT /servers/{server_id} API does not handle InstanceExists, which would result in a 500 response. That looks latent and separate from the cells v2 changes in the Mitaka/Newton timeframe.

Matt Riedemann (mriedem) wrote :

Looks like the regression was introduced in Ocata: https://review.opendev.org/#/c/319379/

Changed in nova:
status: New → Triaged
importance: Undecided → Low
Matt Riedemann (mriedem) wrote :

Also note that I don't think osapi_compute_unique_server_name_scope=global works anymore with the multiple cell support introduced in Pike. The option is checked in the DB API, which is per-cell, so you could have multiple servers with the same name created in multiple cells.
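
A small sketch of why a per-cell check cannot enforce a global scope. This is an illustration only: `CellDB` is a hypothetical stand-in for one cell's database, not the nova DB API.

```python
class InstanceExists(Exception):
    """Stand-in for the duplicate-name error raised by the DB API."""


class CellDB:
    """Toy cell database: the uniqueness check only sees its own rows."""

    def __init__(self, name):
        self.name = name
        self.server_names = set()

    def instance_create(self, server_name):
        if server_name in self.server_names:
            raise InstanceExists(server_name)
        self.server_names.add(server_name)


cell1 = CellDB("cell1")
cell2 = CellDB("cell2")

cell1.instance_create("web")
cell2.instance_create("web")  # accepted: cell2 knows nothing about cell1

try:
    cell1.instance_create("web")
    same_cell_rejected = False
except InstanceExists:
    same_cell_rejected = True  # duplicates are caught only within one cell
```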

Yang Youseok (ileixe) wrote :

@Matt.

We actually used 'global' for the config. I also realized the option does not seem to work where the conductor runs the code, so I filed this bug report.

Matt Riedemann (mriedem) wrote :

Yup, I've posted a patch that recreates the various parts of the bug:

https://review.opendev.org/700456

Matt Riedemann (mriedem) wrote :

The check for osapi_compute_unique_server_name_scope is likely going to have to move from the per-cell DB API code to the REST API service layer and, if set, will iterate the cells looking for a duplicate and return a 409 response if one is found. That could be racy though, so conductor will likely still need to handle the InstanceExists error, but I'm not sure what to do with it. I guess one option is just not creating the instance in the cell0/non-cell0 DB and also deleting the build_requests record for that instance, but that means going from having a response with a server id to not having one, which might be a bit weird for the user experience. Alternatively, the DB API instance_create method is going to have to handle creating a server with a duplicate name but putting it into ERROR status immediately.
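
The first idea above, an API-layer check that iterates the cells and answers 409 on a duplicate, could look roughly like the sketch below. Everything here is hypothetical: `Conflict`, `check_unique_server_name` and the `cells` mapping are illustrative names, not nova APIs, and the project/global scope distinction is ignored for brevity.

```python
class Conflict(Exception):
    """Stand-in for an HTTP 409 Conflict raised by the REST API layer."""

    status_code = 409


def check_unique_server_name(cells, name):
    """Scan every cell for the name; raise Conflict on the first match."""
    for cell_name, server_names in cells.items():
        if name in server_names:
            raise Conflict(f"server name {name!r} already used in {cell_name}")


# Hypothetical view of the deployment: cell name -> server names in that cell.
cells = {"cell0": {"broken"}, "cell1": {"web"}, "cell2": {"db"}}

check_unique_server_name(cells, "cache")  # no duplicate, request proceeds

try:
    check_unique_server_name(cells, "web")
    status = 202
except Conflict as exc:
    status = exc.status_code  # the API would return this to the user
```

As noted in the comment above, a check like this is racy: two concurrent requests can both pass it, so the conductor would still need to handle InstanceExists.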

Matt Riedemann (mriedem) wrote :

I suggest working with Dan Smith on what is best to do in each case.
