mixed OS job fails tempest - nova 'host is not mapped to any cell'

Bug #1981459 reported by Marios Andreou
Affects: tripleo | Status: Fix Released | Importance: Undecided | Assigned to: Unassigned

Bug Description

At [1] the new mixed-OS job is failing various compute-related tempest tests with a trace like:

 tempest.exceptions.BuildErrorException: Server d8f91945-90a8-4b98-b1b9-006b395725f3 failed to build and is in ERROR status
 Details: Fault: {'code': 400, 'created': '2022-07-11T17:49:01Z', 'message': "Host 'node-0002799831.localdomain' is not mapped to any cell"}. Server boot request ID: req-bed0cc1c-1c77-4ad3-95a7-7811b6a3dbce.

It looks like there was an issue with the nova cell_v2 host discovery that should have been run by [2]. In the deployment logs [3] it was skipped, and we see the warning:

 2022-07-11 17:44:43 | 2022-07-11 17:44:43.955766 | fa163e91-64ab-607a-41d4-0000000000f5 | SKIPPED | discover via nova_manager? | undercloud
 2022-07-11 17:44:43 | 2022-07-11 17:44:43.956936 | fa163e91-64ab-607a-41d4-0000000000f5 | TIMING | discover via nova_manager? | undercloud | 0:10:50.129360 | 0.02s
 2022-07-11 17:44:43 | 2022-07-11 17:44:43.968889 | fa163e91-64ab-607a-41d4-0000000000f6 | TASK | discover via nova_api?
 2022-07-11 17:44:43 | 2022-07-11 17:44:43.990011 | fa163e91-64ab-607a-41d4-0000000000f6 | SKIPPED | discover via nova_api? | undercloud
 2022-07-11 17:44:43 | 2022-07-11 17:44:43.990888 | fa163e91-64ab-607a-41d4-0000000000f6 | TIMING | discover via nova_api? | undercloud | 0:10:50.163311 | 0.02s
 2022-07-11 17:44:44 | 2022-07-11 17:44:44.004691 | fa163e91-64ab-607a-41d4-0000000000f7 | TASK | Warn if no discovery host available
 2022-07-11 17:44:44 | 2022-07-11 17:44:44.040426 | fa163e91-64ab-607a-41d4-0000000000f7 | IGNORED | Warn if no discovery host available | undercloud | error={"changed": false, "msg": "No hosts available to run nova cell_v2 host discovery."}

The generated ansible inventory [4] is missing the 'nova_manager' and 'nova_api' groups, which [2] needs in order to run the cell_v2 discovery container. This is because the job uses two deployments: the first deploys the controller with centos9/wallaby containers, and a second deployment then adds the compute with centos8/wallaby containers. The second (compute) deployment uses the inventory at [4], which has no nova_api or nova_manager groups; those are in the "controller inventory" (an inventory is generated per deployment).
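
For illustration, the wiring in [2] follows roughly the pattern sketched below: each discovery task is delegated to a host from the nova_manager or nova_api inventory group, so an inventory without those groups skips both tasks and falls through to the warning. The task names match the deploy log above, but the bodies and conditions are a simplified guess, not the actual tripleo-heat-templates code.

    # Simplified sketch of the discovery wiring in [2]; illustrative only.
    - name: discover via nova_manager?
      become: true
      delegate_to: "{{ groups['nova_manager'] | first }}"
      command: podman exec nova_manager nova-manage cell_v2 discover_hosts --by-service
      when: groups['nova_manager'] | default([]) | length > 0

    - name: discover via nova_api?
      become: true
      delegate_to: "{{ groups['nova_api'] | first }}"
      command: podman exec nova_api nova-manage cell_v2 discover_hosts --by-service
      when:
        - groups['nova_manager'] | default([]) | length == 0
        - groups['nova_api'] | default([]) | length > 0

    - name: Warn if no discovery host available
      fail:
        msg: "No hosts available to run nova cell_v2 host discovery."
      ignore_errors: true
      when:
        - groups['nova_manager'] | default([]) | length == 0
        - groups['nova_api'] | default([]) | length == 0

With the inventory at [4], both group conditions are false, which matches the SKIPPED/IGNORED sequence in the log above.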

[1] https://logserver.rdoproject.org/58/43558/12/check/tripleo-ci-centos-8-9-mixed-os/5627895/logs/undercloud/var/log/tempest/stestr_results.html.gz
[2] https://opendev.org/openstack/tripleo-heat-templates/src/commit/48deb4cbb53d187454c7de82e7125900e93926d1/deployment/nova/nova-compute-common-container-puppet.yaml#L59-L78
[3] https://logserver.rdoproject.org/58/43558/12/check/tripleo-ci-centos-8-9-mixed-os/5627895/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz
[4] https://logserver.rdoproject.org/58/43558/12/check/tripleo-ci-centos-8-9-mixed-os/5627895/logs/undercloud/home/zuul/config-download/overcloud/tripleo-ansible-inventory.yaml.txt.gz

Revision history for this message
Marios Andreou (marios-b) wrote:

Running the discovery manually works:

        [zuul@node-woo ~]$ sudo podman exec nova_api nova-manage cell_v2 discover_hosts --by-service --verbose
        Found 2 cell mappings.
        Skipping cell0 since it does not contain hosts.
        Getting computes from cell 'default': 0b9d7406-1643-47a8-a1cd-woo
        Creating host mapping for service node-woo.localdomain
        Found 1 unmapped computes in cell: 0b9d7406-1643-47a8-a1cd-woo

I am adding this into the deploy play here: https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/841764/28/playbooks/multinode-overcloud-mixed-os-deploy-compute.yml#62
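
For reference, the task being added has roughly the following shape (a rough sketch only; the real change is in the linked review and may differ in detail):

    # Rough sketch of the workaround task added to the mixed-OS deploy
    # play; see the linked tripleo-quickstart-extras review for the
    # actual change.
    - name: Run nova cell_v2 host discovery after the compute deployment
      become: true
      command: podman exec nova_api nova-manage cell_v2 discover_hosts --by-service --verbose
      register: cell_v2_discovery
      changed_when: "'Creating host mapping' in cell_v2_discovery.stdout"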

Let's see if tempest gets further (I still have my doubts about neutron/networking and https://bugs.launchpad.net/tripleo/+bug/1981322).

Revision history for this message
Takashi Kajinami (kajinamit) wrote (last edit):

In the past we executed cell discovery on compute nodes, but we migrated it to controller nodes when we removed the database options from nova.conf on compute nodes, based on a suggestion and request from the Nova folks to "remove any credential which is not necessary".

To overcome this we need to either somehow trigger the command on the controller nodes, which are managed by the central stack, or add a flag to work around the error and let users run the command manually after the deployment.
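
To make the first option concrete, a purely hypothetical sketch (no such task exists in tripleo-heat-templates today) might look like:

    # Hypothetical sketch only. Assumes the compute deployment can resolve
    # the central (controller) stack's nova_api inventory group.
    - name: Trigger cell_v2 discovery on a controller from the central stack
      become: true
      delegate_to: "{{ groups['nova_api'] | first }}"
      command: podman exec nova_api nova-manage cell_v2 discover_hosts --by-service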

Revision history for this message
Marios Andreou (marios-b) wrote:

Ack, thanks for checking, Takashi.

As I wrote in comment #1, I'm trying to do that with https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/841764/28/playbooks/multinode-overcloud-mixed-os-deploy-compute.yml#62 (run after the compute deployment is complete).

Revision history for this message
Marios Andreou (marios-b) wrote:

Marking this one as resolved: running the discovery as commented above has worked OK.

https://logserver.rdoproject.org/58/43558/13/check/tripleo-ci-centos-8-9-multinode-mixed-os/e621742/job-output.txt

(from the debug task output)

2022-07-14 13:08:37.220794 | primary | "Found 2 cell mappings.",
2022-07-14 13:08:37.220806 | primary | "Skipping cell0 since it does not contain hosts.",
2022-07-14 13:08:37.220813 | primary | "Getting computes from cell 'default': 3975c1a1-c913-496c-ac30-2b548d21056e",
2022-07-14 13:08:37.220821 | primary | "Creating host mapping for service node-0002809711.localdomain",
2022-07-14 13:08:37.220829 | primary | "Found 1 unmapped computes in cell: 3975c1a1-c913-496c-ac30-2b548d21056e",
2022-07-14 13:08:37.220836 | primary | "+-----------+--------------------------------------+-----------------------------+",
2022-07-14 13:08:37.220846 | primary | "| Cell Name | Cell UUID | Hostname |",
2022-07-14 13:08:37.220853 | primary | "+-----------+--------------------------------------+-----------------------------+",
2022-07-14 13:08:37.220860 | primary | "| default | 3975c1a1-c913-496c-ac30-2b548d21056e | node-0002809711.localdomain |",

Changed in tripleo:
status: Triaged → Fix Released