Default setting for "list_records_by_skipping_down_cells" causes unexpected results.

Bug #1996758 reported by Arun Mani
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Incomplete
Importance: Undecided
Assigned to: Arun Mani
Milestone: (none)

Bug Description

Problem:

When a GET servers request with all_tenants is sent to the compute API, a cell times out and no response is received from it.

Root Cause:

The default OpenStack behaviour with cells is that any cell that does not respond is skipped, and the API still returns a 200 success response. In the logs we see "Cell %s is not responding and hence is being omitted from the results". This behaviour causes an empty list of resources to be sent back to the caller. Any caller using this API then assumes there are no resources in the cell and proceeds.

Workaround:

The workaround was to change "list_records_by_skipping_down_cells" from its default to False. With this setting, when any cell does not return results the API returns a 500 error, which indicates a problem to the caller. The caller is alerted correctly and can handle the failure in the right way.
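
For reference, the change described above would look roughly like this in nova.conf. This is a sketch that assumes the option sits in the [api] group, as it does in upstream nova, and that the nova-api service is restarted afterwards:

```ini
[api]
# Fail the list request instead of silently omitting cells that do not respond.
list_records_by_skipping_down_cells = False
```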

PS: This is observed with the Wallaby release of OpenStack.

Arun Mani (arun-mani)
Changed in nova:
assignee: nobody → Arun Mani (arun-mani)
Arun Mani (arun-mani) wrote:

Details from the nova-api log:

2022-10-20 03:25:08.944 2274055 WARNING nova.compute.multi_cell_list [req-186660d8-293e-4524-b6de-7662e4700e42 f3126e1aa6e79429606bcbaf5ee60b97f479cbfb7a0fc8b69960402e8112a2c0 5ee7a74d72ad4b1c9aa30de3a3b2bf5e - 59147e9e7ea64a9a8ba59ccf0542931a 59147e9e7ea64a9a8ba59ccf0542931a] Cell 5a761e7e-a5ba-46af-a312-a9bc725991de is not responding and hence is being omitted from the results
2022-10-20 03:25:10.061 2274055 INFO nova.osapi_compute.wsgi.server [req-186660d8-293e-4524-b6de-7662e4700e42 f3126e1aa6e79429606bcbaf5ee60b97f479cbfb7a0fc8b69960402e8112a2c0 5ee7a74d72ad4b1c9aa30de3a3b2bf5e - 59147e9e7ea64a9a8ba59ccf0542931a 59147e9e7ea64a9a8ba59ccf0542931a] 192.168.60.141,127.0.0.1 "GET /v2.1/5ee7a74d72ad4b1c9aa30de3a3b2bf5e/servers?all_tenants=True HTTP/1.1" status: 200 len: 363696 time: 114.1267443

Dan Smith (danms) wrote:

Can you please report:

1. Your nova version
2. The microversion you're requesting
3. The actual request you're making

The default value exists very specifically to avoid reporting 500 for a cloud where only a portion of it is unreachable. If you have four cells in four datacenters and one becomes unreachable for a period of time, it would be undesirable to report 500 for everything.

Also, as of microversion 2.69, you should be getting partial results for down cells per this doc:

https://docs.openstack.org/api-guide/compute/down_cells.html

If that is *not* happening, then that's a legit bug and we should figure out why. However, the default for list_records_by_skipping_down_cells is what it is for a reason:

https://bugs.launchpad.net/nova/+bug/1726301
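
A caller on microversion 2.69 or later could check a server list for down-cell partial results along these lines. This is only a sketch: the helper name split_down_cell_records is hypothetical, and it assumes partial records carry status "UNKNOWN" and lack fields such as "name", which should be verified against the down-cells API guide for your deployment.

```python
from typing import Any, Dict, Iterable, List, Tuple

Server = Dict[str, Any]


def split_down_cell_records(
    servers: Iterable[Server],
) -> Tuple[List[Server], List[Server]]:
    """Separate complete server records from down-cell partial ones.

    Assumption (not taken from nova's code): a partial record has
    status "UNKNOWN" and no "name" key.
    """
    complete: List[Server] = []
    partial: List[Server] = []
    for server in servers:
        if server.get("status") == "UNKNOWN" and "name" not in server:
            partial.append(server)
        else:
            complete.append(server)
    return complete, partial


# Hypothetical sample data shaped like a GET /servers/detail response body.
sample = [
    {"id": "a", "name": "vm-1", "status": "ACTIVE"},
    {"id": "b", "status": "UNKNOWN"},
]
complete, partial = split_down_cell_records(sample)
if partial:
    print(f"{len(partial)} server(s) are in a cell that did not respond")
```

With a check like this, a caller can treat the list as incomplete rather than assuming the unreachable cell holds no resources.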

Arun Mani (arun-mani)
description: updated
Changed in nova:
status: New → Incomplete