gate-grenade-dsvm-neutron-multinode-ubuntu-xenial fails with "No host-to-cell mapping found for selected host"

Bug #1708039 reported by Matt Riedemann
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
devstack
Fix Released
Medium
Sean Dague
grenade
Invalid
Undecided
Unassigned

Bug Description

http://logs.openstack.org/54/489754/3/check/gate-grenade-dsvm-neutron-multinode-ubuntu-xenial/7e19ce3/logs/old/screen-n-cond.txt.gz#_2017-08-01_21_57_19_559

2017-08-01 21:57:19.559 13639 ERROR nova.conductor.manager [req-139dd26f-7dd6-48e9-98e1-6bdc2db116af tempest-AttachInterfacesTestJSON-206807782 tempest-AttachInterfacesTestJSON-206807782] No host-to-cell mapping found for selected host ubuntu-xenial-2-node-osic-cloud1-disk-10201108-763019. Setup is incomplete.

This was found in the nova-conductor logs on the Ocata side of Grenade running against a patch in Pike.

Revision history for this message
Matt Riedemann (mriedem) wrote :

This is likely a race between the time that nova-compute starts and when devstack-gate runs tools/discover_hosts.sh in devstack.

Revision history for this message
Matt Riedemann (mriedem) wrote :

One compute node record is created here:

http://logs.openstack.org/54/489754/3/check/gate-grenade-dsvm-neutron-multinode-ubuntu-xenial/7e19ce3/logs/old/screen-n-cpu.txt.gz#_2017-08-01_21_51_12_886

2017-08-01 21:51:12.886 13572 INFO nova.compute.resource_tracker [req-21d845c5-9cc7-4467-be2c-64377f554fea - -] Compute_service record created for ubuntu-xenial-2-node-osic-cloud1-disk-10201108:ubuntu-xenial-2-node-osic-cloud1-disk-10201108

That's 6 minutes before the failures in the conductor logs.

The subnode compute node record is created here:

http://logs.openstack.org/54/489754/3/check/gate-grenade-dsvm-neutron-multinode-ubuntu-xenial/7e19ce3/logs/subnode-2/old/screen-n-cpu.txt.gz#_2017-08-01_21_56_50_543

2017-08-01 21:56:50.543 24035 INFO nova.compute.resource_tracker [req-663ad965-aa47-4bb2-82be-ecc385a23b09 - -] Compute_service record created for ubuntu-xenial-2-node-osic-cloud1-disk-10201108-763019:ubuntu-xenial-2-node-osic-cloud1-disk-10201108-763019

That's about 30 seconds before the failure, and that's the node that is not mapped: ubuntu-xenial-2-node-osic-cloud1-disk-10201108-763019

The weird thing is I don't even see the discover_hosts script being run, but it should happen from this part of the devstack-gate run:

https://github.com/openstack-infra/devstack-gate/blob/94602b3cd50f870df1bac738f596e09c6483dd74/devstack-vm-gate.sh#L707

Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
Matt Riedemann (mriedem) wrote :

I'm not sure why discover_hosts isn't being run from devstack-gate, but grenade will run 'nova-manage cell_v2 simple_cell_setup' which will create the cell0 and cell1 mappings, sync the db for cell0, and discover compute hosts:

http://logs.openstack.org/54/489754/3/check/gate-grenade-dsvm-neutron-multinode-ubuntu-xenial/7e19ce3/logs/grenade.sh.txt.gz#_2017-08-01_21_52_05_564

2017-08-01 21:52:05.564 | + lib/nova:create_cell:956 : nova-manage cell_v2 simple_cell_setup --transport-url rabbit://stackrabbit:secretrabbit@10.16.232.107:5672/

That runs 4 minutes before the subnode compute node record is created, so that's why it's not mapped to cell1 and anything trying to use it fails.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Oh here is the discover_hosts call:

http://logs.openstack.org/54/489754/3/check/gate-grenade-dsvm-neutron-multinode-ubuntu-xenial/7e19ce3/logs/grenade.sh.txt.gz#_2017-08-01_21_56_47_845

2017-08-01 21:56:47.845 | + ./post-stack.sh:main:8 : /opt/stack/old/devstack/tools/discover_hosts.sh
2017-08-01 21:56:50.142 | /usr/local/lib/python2.7/dist-packages/pymysql/cursors.py:166: Warning: (3090, u"Changing sql mode 'NO_AUTO_CREATE_USER' is deprecated. It will be removed in a future release.")
2017-08-01 21:56:50.142 | result = self._query(query)
2017-08-01 21:56:50.268 | Found 2 cell mappings.
2017-08-01 21:56:50.268 | Skipping cell0 since it does not contain hosts.
2017-08-01 21:56:50.268 | Getting compute nodes from cell: b2890c36-f40c-4272-9fa9-3da7d8ec9b0f
2017-08-01 21:56:50.268 | 2017-08-01 21:56:50.268 19869 DEBUG oslo_concurrency.lockutils [req-22c8b9d6-d020-457c-9c18-868bc468a067 - -] Lock "b2890c36-f40c-4272-9fa9-3da7d8ec9b0f" acquired by "nova.context.get_or_set_cached_cell_and_set_connections" :: waited 0.000s inner /usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:270
2017-08-01 21:56:50.269 | 2017-08-01 21:56:50.269 19869 DEBUG oslo_concurrency.lockutils [req-22c8b9d6-d020-457c-9c18-868bc468a067 - -] Lock "b2890c36-f40c-4272-9fa9-3da7d8ec9b0f" released by "nova.context.get_or_set_cached_cell_and_set_connections" :: held 0.001s inner /usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:282
2017-08-01 21:56:50.357 | Found 1 computes in cell: b2890c36-f40c-4272-9fa9-3da7d8ec9b0f
2017-08-01 21:56:50.357 | Checking host mapping for compute host 'ubuntu-xenial-2-node-osic-cloud1-disk-10201108': 2365307a-c87d-4545-aff9-c8762a087954

So that discovers ubuntu-xenial-2-node-osic-cloud1-disk-10201108 which is the first compute node, but it doesn't discover the subnode compute node, which is created at 2017-08-01 21:56:50.543 - which is 3 seconds after discover_hosts runs.

Revision history for this message
Matt Riedemann (mriedem) wrote :

https://review.openstack.org/#/c/488381/ is a related fix in devstack.

Revision history for this message
Matt Riedemann (mriedem) wrote :

This is probably just latent behavior and we get lucky with the compute node registrations sometimes. I'm just really surprised with the hit count on this that it hasn't gotten more attention:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22No%20host-to-cell%20mapping%20found%20for%20selected%20host%5C%22%20AND%20tags%3A%5C%22screen-n-cond.txt%5C%22%20AND%20voting%3A1&from=7d

198 hits in voting jobs in 7 days, check and gate, all failures.

Matt Riedemann (mriedem)
Changed in devstack:
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → Sean Dague (sdague)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to devstack (master)

Reviewed: https://review.openstack.org/488381
Committed: https://git.openstack.org/cgit/openstack-dev/devstack/commit/?id=c2fe916fc7c6c00cdfa0085e198eaf2ad4d915d1
Submitter: Jenkins
Branch: master

commit c2fe916fc7c6c00cdfa0085e198eaf2ad4d915d1
Author: Sean Dague <email address hidden>
Date: Fri Jul 28 11:29:18 2017 +0000

    Wait for compute service to check in

    With cell v2, on initial bring up, discover hosts can't run unless all
    the compute nodes have checked in. The documentation says that you
    should run ``nova service-list --binary nova-compute`` and see all
    your hosts before running discover hosts. This isn't really viable in
    a multinode devstack because of how things are brought up in parts.

    We can however know that stack.sh will not complete before the compute
    node is up by waiting for the compute node to check in before
    completing. This happens quite late in the stack.sh run, so shouldn't
    add any extra time in most runs.

    Cells v1 and Xenserver don't use real hostnames in the service table
    (they encode complex data that is hostname like to provide more
    topology information than just hostnames). They are exempted from this
    check.

    Related-Bug: #1708039

    Change-Id: I32eb59b9d6c225a3e93992be3a3b9f4b251d7189

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to devstack (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/501371

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to devstack (stable/pike)

Reviewed: https://review.openstack.org/501371
Committed: https://git.openstack.org/cgit/openstack-dev/devstack/commit/?id=cfe00779818b270cd008047b05a415fec8c8eccd
Submitter: Jenkins
Branch: stable/pike

commit cfe00779818b270cd008047b05a415fec8c8eccd
Author: Sean Dague <email address hidden>
Date: Fri Jul 28 11:29:18 2017 +0000

    Wait for compute service to check in

    With cell v2, on initial bring up, discover hosts can't run unless all
    the compute nodes have checked in. The documentation says that you
    should run ``nova service-list --binary nova-compute`` and see all
    your hosts before running discover hosts. This isn't really viable in
    a multinode devstack because of how things are brought up in parts.

    We can however know that stack.sh will not complete before the compute
    node is up by waiting for the compute node to check in before
    completing. This happens quite late in the stack.sh run, so shouldn't
    add any extra time in most runs.

    Cells v1 and Xenserver don't use real hostnames in the service table
    (they encode complex data that is hostname like to provide more
    topology information than just hostnames). They are exempted from this
    check.

    Related-Bug: #1708039

    Change-Id: I32eb59b9d6c225a3e93992be3a3b9f4b251d7189
    (cherry picked from commit c2fe916fc7c6c00cdfa0085e198eaf2ad4d915d1)

tags: added: in-stable-pike
Matt Riedemann (mriedem)
Changed in devstack:
status: In Progress → Fix Released
Changed in grenade:
status: New → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to devstack (stable/ocata)

Reviewed: https://review.openstack.org/490635
Committed: https://git.openstack.org/cgit/openstack-dev/devstack/commit/?id=2c7f377b157ea174e6a645cc8ac9e95a45f33727
Submitter: Jenkins
Branch: stable/ocata

commit 2c7f377b157ea174e6a645cc8ac9e95a45f33727
Author: Sean Dague <email address hidden>
Date: Fri Jul 28 11:29:18 2017 +0000

    Wait for compute service to check in

    With cell v2, on initial bring up, discover hosts can't run unless all
    the compute nodes have checked in. The documentation says that you
    should run ``nova service-list --binary nova-compute`` and see all
    your hosts before running discover hosts. This isn't really viable in
    a multinode devstack because of how things are brought up in parts.

    We can however know that stack.sh will not complete before the compute
    node is up by waiting for the compute node to check in before
    completing. This happens quite late in the stack.sh run, so shouldn't
    add any extra time in most runs.

    Cells v1 and Xenserver don't use real hostnames in the service table
    (they encode complex data that is hostname like to provide more
    topology information than just hostnames). They are exempted from this
    check.

    Related-Bug: #1708039

    Change-Id: I32eb59b9d6c225a3e93992be3a3b9f4b251d7189
    (cherry picked from commit c2fe916fc7c6c00cdfa0085e198eaf2ad4d915d1)

tags: added: in-stable-ocata
Revision history for this message
Matt Riedemann (mriedem) wrote :

Sounds like this is cropping up again in grenade jobs on stable/pike where the old side (ocata) is timeing out because libvirt takes more than a minute to startup, likely due to some spectre mitigations in the libvirt-bin package in the ocata UCA:

https://launchpad.net/~ubuntu-cloud-archive/+archive/ubuntu/ocata-staging/+sourcepub/8774271/+listing-archive-extra

So we likely need to bump the timeout in devstack on both stable/pike and stable/ocata since they both use the Ocata UCA.

The libvirt-bin package in the Pike UCA doesn't seem to have the same issue, or fix:

https://launchpad.net/~ubuntu-cloud-archive/+archive/ubuntu/pike-staging/+sourcepub/8806001/+listing-archive-extra

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.