openstack server list returns an empty list

Bug #1669473 reported by Mehdi Abaakouk
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
devstack
Fix Released
High
Matt Riedemann

Bug Description

Hi,

The telemetry gate is broken since 2 days.

"openstack server list" returns an empty list while one instance is present.

It looks working here according the len of the response:
http://logs.openstack.org/56/439156/2/check/gate-telemetry-dsvm-integration-gnocchi-ubuntu-xenial/d4a6c69/logs/screen-n-api.txt.gz#_2017-03-02_09_16_25_647

Then heat delete one of the two instances

Then, we list the instances and got an empty list:

http://logs.openstack.org/56/439156/2/check/gate-telemetry-dsvm-integration-gnocchi-ubuntu-xenial/d4a6c69/logs/screen-n-api.txt.gz#_2017-03-02_09_16_43_150

Cheers,

Revision history for this message
Matt Riedemann (mriedem) wrote :

This is pretty odd. The compute API has a cache of the cell mappings and at this point we have cell0:

http://logs.openstack.org/56/439156/2/check/gate-telemetry-dsvm-integration-gnocchi-ubuntu-xenial/d4a6c69/logs/screen-n-api.txt.gz#_2017-03-02_09_07_05_528

And at this point we have both cell0 and cell1:

http://logs.openstack.org/56/439156/2/check/gate-telemetry-dsvm-integration-gnocchi-ubuntu-xenial/d4a6c69/logs/screen-n-api.txt.gz#_2017-03-02_09_13_54_642

But given this code:

https://github.com/openstack/nova/blob/68e808623e42442482e92f0d61a6667876f29e19/nova/compute/api.py#L2524

I don't see how that happens unless you restart nova-api, which I'm not seeing happen in these logs.

We see at this point that we get some instances back:

http://logs.openstack.org/56/439156/2/check/gate-telemetry-dsvm-integration-gnocchi-ubuntu-xenial/d4a6c69/logs/screen-n-api.txt.gz#_2017-03-02_09_16_25_647

And on that first one we see we process cell0 and cell1:

2017-03-02 09:16:25.115 11780 DEBUG nova.compute.api [req-877073fc-1f61-4cc4-95b6-f29c6fe40ea0 admin admin] Skipping already-collected cell0 list _get_instances_by_filters_all_cells /opt/stack/new/nova/nova/compute/api.py:2538
2017-03-02 09:16:25.115 11780 DEBUG nova.compute.api [req-877073fc-1f61-4cc4-95b6-f29c6fe40ea0 admin admin] Listing 1000 instances in cell None _get_instances_by_filters_all_cells /opt/stack/new/nova/nova/compute/api.py:2541

But at this point we don't:

http://logs.openstack.org/56/439156/2/check/gate-telemetry-dsvm-integration-gnocchi-ubuntu-xenial/d4a6c69/logs/screen-n-api.txt.gz#_2017-03-02_09_16_43_150

And we only process cell0:

2017-03-02 09:16:43.121 11781 DEBUG nova.compute.api [req-8fcbfa32-2138-4e98-a1a7-b6dd4d6355af admin admin] Skipping already-collected cell0 list _get_instances_by_filters_all_cells /opt/stack/new/nova/nova/compute/api.py:2538

And the instance would be in cell1 but for whatever reason we don't process that one, it's like it's not in the CELLS cache anymore but I don't see how that could happen w/o restarting nova-api.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/440644

Revision history for this message
Matt Riedemann (mriedem) wrote :

According to logstash this started around 2/28:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22AssertionError%3A%20json%20path%20%24.servers%5B0%5D.status%20cannot%20match%20%7Bu'servers'%3A%20%5B%5D%7D%5C%22%20AND%20tags%3A%5C%22console%5C%22&from=7d

From looking at devstack, we start nova-api here:

http://logs.openstack.org/56/439156/2/check/gate-telemetry-dsvm-integration-gnocchi-ubuntu-xenial/d4a6c69/logs/devstacklog.txt.gz#_2017-03-02_08_59_34_428

That's the only time we start nova-api, we don't restart it at all so I don't see how the CELLS cache in nova-api is changing, unless there is a race to update it since we don't have a lock on it.

We are running with 2 nova-api workers though, so maybe the CELLS cache in one worker has 1 cell (cell0) and the other worker has 2 cells, so sometimes things work and sometimes they don't depending on which worker we hit.

Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
Matt Riedemann (mriedem) wrote :

Ceilometer is querying nova-api after n-api is running but before simple_cell_setup creates cell1:

http://logs.openstack.org/56/439156/2/check/gate-telemetry-dsvm-integration-gnocchi-ubuntu-xenial/d4a6c69/logs/screen-n-api.txt.gz#_2017-03-02_09_07_05_559

So that's going to cache cell0 in the CELLS cache in the compute API on that worker and later requests to list instances in that worker are not going to get the instance because it doesn't have cell1 in the cache.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/440652

Revision history for this message
Matt Riedemann (mriedem) wrote :

gnocchi test patch which depends on the proposed nova workaround:

https://review.openstack.org/#/c/440657/

Revision history for this message
Matt Riedemann (mriedem) wrote :

And a proposed workaround to devstack: https://review.openstack.org/#/c/440651/

Revision history for this message
Matt Riedemann (mriedem) wrote :

It also looks like devstack change https://review.openstack.org/#/c/427392/ triggered this because it moved the creation of the cell0 mapping earlier in the process. So before that change, when ceilometer would list instances we didn't have any cells mapped so nothing would get cached in the compute API's CELLS list. But after that change, we'd create cell0 early, then when ceilometer lists instances it caches cell0 and then we create cell1 later and don't update the cache.

Matt Riedemann (mriedem)
Changed in nova:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Matt Riedemann (mriedem)
Changed in devstack:
assignee: nobody → Mehdi Abaakouk (sileht)
status: New → In Progress
Revision history for this message
Matt Riedemann (mriedem) wrote :

After talking about this a bit in nova IRC, we also realized that devstack should probably be updated to create cell1 after we map cell0, so those are created before nova-api is started. And then change simple_cell_setup to discover_hosts at the end, which has to be after nova-compute is started to discover the compute node.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to devstack (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/440739

Revision history for this message
Matt Riedemann (mriedem) wrote :

This is the actual fix in devstack: https://review.openstack.org/#/c/440739/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to devstack (master)

Reviewed: https://review.openstack.org/440739
Committed: https://git.openstack.org/cgit/openstack-dev/devstack/commit/?id=f15224c740b880842e8d34e9a6c2ad08ba34448f
Submitter: Jenkins
Branch: master

commit f15224c740b880842e8d34e9a6c2ad08ba34448f
Author: Matt Riedemann <email address hidden>
Date: Thu Mar 2 12:45:47 2017 -0500

    Create cell1 cell before n-api starts

    Change ac5fdb4c4090efd682cc5c55aa30ec433da29fc7 introduced
    a problem for gnocchi CI because the deployments steps
    are now:

    1. create cell0
    2. start nova-api (with multiple workers)
    3. install ceilometer via extras
    4. ceilometer calls nova-api to list servers; at this point
       nova-api getes the list of cells and caches them, which
       will just be cell0
    5. create cell1 via simple_cell_setup which also discovers
       the n-cpu node so we can schedule instances
    6. gnocchi tests create and list instances and at this point it hits
       an n-api worker that only has cell0 cached so it does not
       find some test servers it created and fails.

    The cell0 and cell1 cells should be created in the nova_api db
    before starting n-api so that when we first list instances, we
    store both cells in the cache that's in n-api. This deployment
    order is also how the nova docs describe rolling out cells v2
    but the way we were doing this devstack wasn't following that,
    or accounting for when devstack plugins are loaded via extras.

    This change creates the main cell1 cell earlier in the setup
    before n-api is started, and then changes to just run
    discover_hosts at the end after n-cpu is running (which is what
    simple_cell_setup and map_cell_and_hosts would do implicitly).

    Change-Id: I38eab6707340253a10159a169ae61d34784c2d28
    Related-Bug: #1669473

Matt Riedemann (mriedem)
Changed in nova:
status: In Progress → Won't Fix
Changed in devstack:
status: In Progress → Fix Released
assignee: Mehdi Abaakouk (sileht) → Matt Riedemann (mriedem)
importance: Undecided → High
no longer affects: nova
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on devstack (master)

Change abandoned by Mehdi Abaakouk (sileht) (<email address hidden>) on branch: master
Review: https://review.openstack.org/440651

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers