_bury_in_cell0 could not handle instance duplicate exception

Bug #1857306 reported by Yang Youseok
This bug affects 2 people
Affects                    Status       Importance  Assigned to    Milestone
OpenStack Compute (nova)   In Progress  Low         Rajesh Tailor
Ocata                      New          Undecided   Unassigned
Pike                       New          Undecided   Unassigned
Queens                     New          Undecided   Unassigned
Rocky                      New          Undecided   Unassigned
Stein                      New          Undecided   Unassigned
Train                      New          Undecided   Unassigned

Bug Description

For stable/stein

If the scheduler raises NoValidHost, the conductor should create the Instance object in cell0. But I found that if an additional exception (InstanceExists) is raised inside that function, the conductor does not catch it, which leaves the instance stuck in the 'scheduling' state.

How to reproduce

1. osapi_compute_unique_server_name_scope = True
2. Use cells v2.
3. Create an instance whose hostname was already used by a previously built instance, with no valid host available.

Result

1. The conductor does not create the instance DB record in cell0 from _bury_in_cell0.

What to expect

1. The instance is created in cell0 (nova_cell0) in the error state.

Thanks
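
The failure mode described above can be sketched with a toy model. This is not nova's code: FakeCell0, bury_in_cell0 and reproduce are hypothetical stand-ins, and the sketch assumes the duplicate name already exists in the cell0 database and that the build request is only removed after the instance is created successfully.

```python
class InstanceExists(Exception):
    """Stand-in for the duplicate-name error raised by the DB layer."""


class FakeCell0:
    """Toy cell0 database that rejects duplicate server names."""

    def __init__(self, existing_names=()):
        self.instances = {}
        self.names = set(existing_names)

    def instance_create(self, uuid, name):
        if name in self.names:
            raise InstanceExists(name)
        self.names.add(name)
        self.instances[uuid] = {"name": name, "vm_state": "error"}


def bury_in_cell0(cell0, build_requests, uuid, name):
    """Simplified burial: create the instance, then drop the build request."""
    cell0.instance_create(uuid, name)  # may raise InstanceExists
    del build_requests[uuid]           # skipped when the line above raises


def reproduce():
    cell0 = FakeCell0(existing_names={"web-1"})
    build_requests = {"uuid-2": {"name": "web-1", "task_state": "scheduling"}}
    try:
        bury_in_cell0(cell0, build_requests, "uuid-2", "web-1")
    except InstanceExists:
        pass  # nothing in this model cleans up after the failure
    return cell0.instances, build_requests
```

In this model the second server never reaches cell0 and its build request is left behind in 'scheduling', which matches the symptom reported here.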

Revision history for this message
Matt Riedemann (mriedem) wrote :

osapi_compute_unique_server_name_scope = True is not a valid choice for that option:

https://docs.openstack.org/nova/stein/configuration/config.html#DEFAULT.osapi_compute_unique_server_name_scope

Valid Values: ‘’, project, global

What is the actual value you had set?

I'm not saying this is an invalid bug, and it's likely easy to recreate (I don't think this option is set in many deployments). I'm just wondering how you have things configured, specifically so we can recreate it with a functional test in-tree and be assured of the fix.
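
To illustrate the valid choices listed above, here is a minimal sketch that reads the option from an INI-style snippet and rejects anything outside the documented set. It uses Python's standard configparser rather than oslo.config (which nova actually uses), and read_scope and VALID_SCOPES are hypothetical names chosen for this example.

```python
import configparser

# Documented choices for the option: empty string, "project", "global".
VALID_SCOPES = ("", "project", "global")
OPTION = "osapi_compute_unique_server_name_scope"


def read_scope(conf_text):
    """Return the configured scope, or raise ValueError for an invalid choice."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    value = parser.get("DEFAULT", OPTION, fallback="")
    if value not in VALID_SCOPES:
        raise ValueError(f"{OPTION}={value!r} is not one of {VALID_SCOPES}")
    return value
```

With this check, a snippet setting the option to global is accepted, while True is rejected as an invalid choice.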

tags: added: cells regression
Revision history for this message
Matt Riedemann (mriedem) wrote :

I'm also wondering what the behavior was with this prior to cells v2. I suppose the API would create the instance in the nova DB and check osapi_compute_unique_server_name_scope which would result in an error response to the user. With cells v2 the instance creation was moved to conductor after scheduling and that code isn't handling the InstanceExists error from the DB API, nor is the API/build request code checking for osapi_compute_unique_server_name_scope like the DB API.

Revision history for this message
Matt Riedemann (mriedem) wrote :

It also appears that we have a separate bug where the PUT /servers/{server_id} API does not handle InstanceExists, which would result in a 500 response. That looks latent and separate from the cells v2 changes in the Mitaka/Newton timeframe.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Looks like the regression was introduced in Ocata: https://review.opendev.org/#/c/319379/

Changed in nova:
status: New → Triaged
importance: Undecided → Low
Revision history for this message
Matt Riedemann (mriedem) wrote :

Also note that I don't think osapi_compute_unique_server_name_scope=global works anymore with the multiple cell support introduced in Pike. The option is checked in the DB API, which is per-cell, so you could have multiple servers with the same name created in multiple cells.
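
A small sketch of why a per-cell check cannot enforce a global scope. CellDB and create_in are hypothetical stand-ins for a cell database and its create call, not nova APIs; each cell only knows about the names stored in its own database.

```python
class InstanceExists(Exception):
    """Stand-in for the duplicate-name error raised by the DB layer."""


class CellDB:
    """Toy per-cell database with its own, isolated uniqueness check."""

    def __init__(self, name):
        self.name = name
        self.server_names = set()

    def instance_create(self, server_name):
        # The check only sees rows in this cell's own database.
        if server_name in self.server_names:
            raise InstanceExists(server_name)
        self.server_names.add(server_name)


def create_in(cell, server_name):
    """Return True if the server was created, False on a duplicate name."""
    try:
        cell.instance_create(server_name)
        return True
    except InstanceExists:
        return False
```

Creating a server named "db" in cell1 and then again in cell2 succeeds both times in this model; only a second "db" in the same cell is rejected.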

Revision history for this message
Yang Youseok (ileixe) wrote :

@Matt.

We actually used 'global' for the config. I also realized that the config does not seem to work where the conductor runs this code, so I filed this bug report.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/700456

Revision history for this message
Matt Riedemann (mriedem) wrote :

Yup, I've posted a patch that recreates the various parts of the bug:

https://review.opendev.org/700456

Revision history for this message
Matt Riedemann (mriedem) wrote :

The check for osapi_compute_unique_server_name_scope is likely going to have to move from the per-cell DB API code to the REST API service layer, which, if the option is set, would iterate the cells looking for a duplicate and return a 409 response if one is found. That could be racy though, so conductor will likely still need to handle the InstanceExists error, but I'm not sure what to do with it. I guess one option is just not creating the instance in the cell0/non-cell0 DB and also deleting the build_requests record for that instance, but that means the user goes from having a response with a server id to not having a server at all, which might be a bit weird for the user experience. Alternatively, the DB API instance_create method is going to have to handle creating a server with a duplicate name but putting it into ERROR status immediately.
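
A rough sketch of that idea, under stated assumptions: cells are modeled as a dict of cell name to a set of server names, Conflict stands in for a 409 response, and api_create and conductor_build are hypothetical helpers. The conductor fallback shown is the first option mentioned above (skip the create and delete the build request); it is not a fix that has been agreed on.

```python
class Conflict(Exception):
    """Maps to an HTTP 409 response in this sketch."""


class InstanceExists(Exception):
    """Stand-in for the duplicate-name error raised by the DB layer."""


def api_create(cells, build_requests, uuid, name):
    # API-layer check: look for the name in every cell before accepting.
    if any(name in names for names in cells.values()):
        raise Conflict(name)
    build_requests[uuid] = name


def db_instance_create(cells, cell, name):
    # Per-cell check that still guards the actual insert.
    if name in cells[cell]:
        raise InstanceExists(name)
    cells[cell].add(name)


def conductor_build(cells, build_requests, uuid, cell):
    """Create the instance; on a lost race, drop the build request instead."""
    name = build_requests[uuid]
    try:
        db_instance_create(cells, cell, name)
        created = True
    except InstanceExists:
        created = False  # another request with the same name won the race
    build_requests.pop(uuid, None)
    return created
```

Two requests for the same name can both pass api_create before either instance exists, so the second conductor_build call is where the duplicate is finally caught. That window is the race mentioned above.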

Revision history for this message
Matt Riedemann (mriedem) wrote :

I suggest working with Dan Smith on what is best to do in each case.

Amit Uniyal (auniyal)
Changed in nova:
assignee: nobody → Amit Uniyal (auniyal)
Amit Uniyal (auniyal)
Changed in nova:
assignee: Amit Uniyal (auniyal) → nobody
Rajesh Tailor (ratailor)
Changed in nova:
assignee: nobody → Rajesh Tailor (ratailor)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/860938

Changed in nova:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/923395
