Nova cells deployment broken

Bug #1915727 reported by Doug Szumski
This bug affects 2 people
Affects                  Status    Importance   Assigned to   Milestone
kolla-ansible            Triaged   Medium       Unassigned
kolla-ansible/train      New       Medium       Unassigned
kolla-ansible/ussuri     New       Medium       Unassigned
kolla-ansible/victoria   New       Medium       Unassigned
kolla-ansible/wallaby    Triaged   Medium       Unassigned

Bug Description

Seen in stable/ussuri:

When deploying an environment with multiple cells, some tasks fail to run correctly. For example, this task should run only once, but instead runs against both a top-level controller and a cell controller, which then both delegate to the same host, causing a failure.

```
 TASK [nova-cell : Get a list of existing cells] ***************************************************************************************************************
 fatal: [os-ctrl01 -> 10.188.1.50]: FAILED! => {"changed": false, "failed_when_result": true, "msg": "Container exited with non-zero return code 143", "rc": 143, "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
 ok: [os-cellctrl01 -> 10.188.1.50]
```
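
To illustrate the shape of the problem, here is a minimal sketch (a hypothetical playbook with made-up group names, not the actual kolla-ansible task): when a delegated task is not guarded to a single host, every host matched by the play runs its own copy of the task against the same delegate target.

```
# Hypothetical playbook, only to illustrate the failure shape: both a
# top-level controller and a cell controller match the play, and each one
# delegates the task to the same conductor host, so the command runs there
# more than once instead of exactly once.
- hosts: top-level-controllers:cell-controllers
  gather_facts: false
  tasks:
    - name: Get a list of existing cells
      command: /bin/true  # stand-in for the real container command
      delegate_to: "{{ groups['cell-controllers'][0] }}"
```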

This appears to be a regression caused by refactoring in:

https://review.opendev.org/c/openstack/kolla-ansible/+/715474/

If that patch is reverted, then the cells deployment proceeds as normal.

Doug Szumski (dszumski)
description: updated
Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Weird, because I have managed to run the cells CI job successfully pretty recently (even with ProxySQL shards! lol) in https://review.opendev.org/c/openstack/kolla-ansible/+/770621.

Perhaps it's, as you briefly mentioned, the effect of task delegation which can go wrong depending on your inventory.
Could you share the relevant parts of yours?
It likely differs from CI.

Revision history for this message
Doug Szumski (dszumski) wrote :

Yes, you're right. So what I basically have is an inventory modelled on this:

https://docs.openstack.org/kayobe/latest/configuration/reference/nova-cells.html

In the Kolla Ansible docs this is the 'dedicated cell controller topology' [1]. I could start by proposing a patch to replicate this bug in CI, and then we could take it from there. It looks like CI is testing a shared topology without the dedicated cell controllers and therefore doesn't pick this issue up.

A snippet from my inventory looks like this:

```
$ cat /etc/kolla/inventory/overcloud/group_vars/cell0001
nova_cell_name: cell0001
nova_cell_ironic_cell_name: cell0001
nova_cell_novncproxy_group: cell0001-vnc
nova_cell_conductor_group: cell0001-control
nova_cell_compute_group: cell0001-compute
nova_cell_compute_ironic_group: cell0001-control

[control:children]
top-level-controllers

[cell-control:children]
cell-controllers

# Top level top-level-controllers group.
[top-level-controllers]
os-ctrl01
os-ctrl02
os-ctrl03

[cell-controllers]
os-cellctrl01
os-cellctrl02
os-cellctrl03

[cell0001]
os-cellctrl01

[cell0001-control]
os-cellctrl01

[cell0001-compute]
os-compute0001
os-compute0002 ..

[cell0001-vnc]
os-cellctrl01
```

[1] https://docs.openstack.org/kolla-ansible/latest/reference/compute/nova-cells-guide.html#dedicated-cell-controller-topology

Mark Goddard (mgoddard)
Changed in kolla-ansible:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Mark Goddard (mgoddard) wrote :

The task in the description is imported from two places, but only in discover_computes.yml is it delegated. In https://review.opendev.org/c/openstack/kolla-ansible/+/715474/, the following condition was removed from that task:

  when: inventory_hostname == groups[nova_conductor.group][0] | default(None)

At the same time, the following was added to the import of get_cell_settings.yml:

  delegate_to: "{{ groups[nova_cell_conductor_group][0] }}"

Finally, in deploy.yml, the include of discover_computes.yml was changed from:

  when: inventory_hostname in groups[nova_cell_conductor_group]

to:

    when:
    # Run discovery when one or more compute hosts are in the Ansible batch,
    # and there is a cell conductor in the inventory to delegate to.
    - all_computes_in_batch | length > 0
    - inventory_hostname == all_computes_in_batch[0]
    - groups[nova_cell_conductor_group] | length > 0

So the task is supposed to be executed on one of the compute nodes, and delegated to one of the cell conductors.
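
Putting those three changes together, the intended flow after the refactor looks roughly like this (a sketch reconstructed from the snippets above, not verbatim kolla-ansible code):

```
# deploy.yml (sketch): include the discovery tasks on the first compute host
# in the batch, only when there is a cell conductor to delegate to.
- import_tasks: discover_computes.yml
  when:
    - all_computes_in_batch | length > 0
    - inventory_hostname == all_computes_in_batch[0]
    - groups[nova_cell_conductor_group] | length > 0

# discover_computes.yml (sketch): the cell listing is delegated to the first
# cell conductor, and the task itself no longer carries a host guard.
- import_tasks: get_cell_settings.yml
  delegate_to: "{{ groups[nova_cell_conductor_group][0] }}"
```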

However, in the logs in the description, the task has been executed on what appear to be a controller and a cell controller.

Revision history for this message
Mark Goddard (mgoddard) wrote :

Some context for the variables:

    # List of virtualised compute hypervisors in this Ansible play batch.
    virt_computes_in_batch: >-
      {{ groups[nova_cell_compute_group] |
         intersect(ansible_play_batch) |
         list }}
    # List of ironic compute hosts in this Ansible play batch.
    ironic_computes_in_batch: >-
      {{ (groups[nova_cell_compute_ironic_group] |
          intersect(ansible_play_batch) |
          list)
         if nova_cell_services['nova-compute-ironic'].enabled | bool else [] }}
    all_computes_in_batch: "{{ virt_computes_in_batch + ironic_computes_in_batch }}"

Perhaps the issue is those ironic computes, which typically run on controllers.
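
If that is the cause, a quick way to check would be a debug task dropped into the nova-cell role (a hedged sketch, assuming the three variables above are in scope): it prints, per host, the computed lists and whether that host would pass the inventory_hostname == all_computes_in_batch[0] guard.

```
# Hypothetical debug task (not part of kolla-ansible): print, for every host,
# the computed compute lists and whether that host would trigger discovery.
# Assumes it runs inside the nova-cell role, where the variables above are
# defined.
- name: Show which hosts would run cell discovery
  debug:
    msg:
      virt_computes_in_batch: "{{ virt_computes_in_batch }}"
      ironic_computes_in_batch: "{{ ironic_computes_in_batch }}"
      all_computes_in_batch: "{{ all_computes_in_batch }}"
      would_run_discovery: >-
        {{ all_computes_in_batch | length > 0 and
           inventory_hostname == all_computes_in_batch[0] }}
```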
