Nova cells deployment broken

Bug #1915727 reported by Doug Szumski
This bug affects 2 people
Affects                  Status    Importance   Assigned to   Milestone
kolla-ansible            Triaged   Medium       Unassigned
kolla-ansible/train      New       Medium       Unassigned
kolla-ansible/ussuri     New       Medium       Unassigned
kolla-ansible/victoria   New       Medium       Unassigned
kolla-ansible/wallaby    Triaged   Medium       Unassigned

Bug Description

Seen in stable/ussuri:

When deploying an environment with multiple cells, some tasks fail to run correctly. For example, this task should run only once, but instead runs against both a top-level controller and a cell controller, which then both delegate to the same host, causing a failure.

```
 TASK [nova-cell : Get a list of existing cells] ***************************************************************************************************************
 fatal: [os-ctrl01 -> 10.188.1.50]: FAILED! => {"changed": false, "failed_when_result": true, "msg": "Container exited with non-zero return code 143", "rc": 143, "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
 ok: [os-cellctrl01 -> 10.188.1.50]
```
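
To illustrate the shape of the problem, here is a minimal sketch (a hypothetical playbook with made-up group names, not the actual kolla-ansible task): when a delegated task is not guarded to a single host, every host matched by the play runs its own copy of the task against the same delegate target.

```
# Hypothetical playbook, only to illustrate the failure shape: both a
# top-level controller and a cell controller match the play, and each one
# delegates the task to the same conductor host, so the command runs there
# more than once instead of exactly once.
- hosts: top-level-controllers:cell-controllers
  gather_facts: false
  tasks:
    - name: Get a list of existing cells
      command: /bin/true  # stand-in for the real container command
      delegate_to: "{{ groups['cell-controllers'][0] }}"
```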

This appears to be a regression caused by refactoring in:

https://review.opendev.org/c/openstack/kolla-ansible/+/715474/

If that patch is reverted, then the cells deployment proceeds as normal.

Doug Szumski (dszumski)
description: updated
Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Weird, because I have managed to run the cells CI job successfully pretty recently (even with ProxySQL shards! lol) in https://review.opendev.org/c/openstack/kolla-ansible/+/770621.

Perhaps it's, as you briefly mentioned, the effect of task delegation which can go wrong depending on your inventory.
Could you share the relevant parts of yours?
It likely differs from CI.

Revision history for this message
Doug Szumski (dszumski) wrote :

Yes, you're right. So what I basically have is an inventory modelled on this:

https://docs.openstack.org/kayobe/latest/configuration/reference/nova-cells.html

In the Kolla Ansible docs this is the 'dedicated cell controller topology' [1]. I could start by proposing a patch to replicate this bug in CI, and then we could take it from there. It looks like CI is testing a shared topology without the dedicated cell controllers and therefore doesn't pick this issue up.

A snippet from my inventory looks like this:

```
$ cat /etc/kolla/inventory/overcloud/group_vars/cell0001
nova_cell_name: cell0001
nova_cell_ironic_cell_name: cell0001
nova_cell_novncproxy_group: cell0001-vnc
nova_cell_conductor_group: cell0001-control
nova_cell_compute_group: cell0001-compute
nova_cell_compute_ironic_group: cell0001-control

[control:children]
top-level-controllers

[cell-control:children]
cell-controllers

# Top level top-level-controllers group.
[top-level-controllers]
os-ctrl01
os-ctrl02
os-ctrl03

[cell-controllers]
os-cellctrl01
os-cellctrl02
os-cellctrl03

[cell0001]
os-cellctrl01

[cell0001-control]
os-cellctrl01

[cell0001-compute]
os-compute0001
os-compute0002 ..

[cell0001-vnc]
os-cellctrl01
```

[1] https://docs.openstack.org/kolla-ansible/latest/reference/compute/nova-cells-guide.html#dedicated-cell-controller-topology

Mark Goddard (mgoddard)
Changed in kolla-ansible:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Mark Goddard (mgoddard) wrote :

The task in the description is imported from two places, but only in discover_computes.yml is it delegated. In https://review.opendev.org/c/openstack/kolla-ansible/+/715474/, the following condition was removed from that task:

  when: inventory_hostname == groups[nova_conductor.group][0] | default(None)

At the same time, the following was added to the import of get_cell_settings.yml:

  delegate_to: "{{ groups[nova_cell_conductor_group][0] }}"

Finally, in deploy.yml, the include of discover_computes.yml was changed from:

  when: inventory_hostname in groups[nova_cell_conductor_group]

to:

    when:
    # Run discovery when one or more compute hosts are in the Ansible batch,
    # and there is a cell conductor in the inventory to delegate to.
    - all_computes_in_batch | length > 0
    - inventory_hostname == all_computes_in_batch[0]
    - groups[nova_cell_conductor_group] | length > 0

So the task is supposed to be executed on one of the compute nodes, and delegated to one of the cell conductors.
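
Putting those three changes together, the intended flow after the refactor looks roughly like this (a sketch reconstructed from the snippets above, not verbatim kolla-ansible code):

```
# deploy.yml (sketch): include the discovery tasks on the first compute host
# in the batch, only when there is a cell conductor to delegate to.
- import_tasks: discover_computes.yml
  when:
    - all_computes_in_batch | length > 0
    - inventory_hostname == all_computes_in_batch[0]
    - groups[nova_cell_conductor_group] | length > 0

# discover_computes.yml (sketch): the cell listing is delegated to the first
# cell conductor, and the task itself no longer carries a host guard.
- import_tasks: get_cell_settings.yml
  delegate_to: "{{ groups[nova_cell_conductor_group][0] }}"
```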

However, in the logs in the description, the task has been executed on what appear to be a controller and a cell controller.

Revision history for this message
Mark Goddard (mgoddard) wrote :

Some context for the variables:

    # List of virtualised compute hypervisors in this Ansible play batch.
    virt_computes_in_batch: >-
      {{ groups[nova_cell_compute_group] |
         intersect(ansible_play_batch) |
         list }}
    # List of ironic compute hosts in this Ansible play batch.
    ironic_computes_in_batch: >-
      {{ (groups[nova_cell_compute_ironic_group] |
          intersect(ansible_play_batch) |
          list)
         if nova_cell_services['nova-compute-ironic'].enabled | bool else [] }}
    all_computes_in_batch: "{{ virt_computes_in_batch + ironic_computes_in_batch }}"

Perhaps the issue is those ironic computes, which typically run on controllers.
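
If that is the cause, a quick way to check would be a debug task dropped into the nova-cell role (a hedged sketch, assuming the three variables above are in scope): it prints, per host, the computed lists and whether that host would pass the inventory_hostname == all_computes_in_batch[0] guard.

```
# Hypothetical debug task (not part of kolla-ansible): print, for every host,
# the computed compute lists and whether that host would trigger discovery.
# Assumes it runs inside the nova-cell role, where the variables above are
# defined.
- name: Show which hosts would run cell discovery
  debug:
    msg:
      virt_computes_in_batch: "{{ virt_computes_in_batch }}"
      ironic_computes_in_batch: "{{ ironic_computes_in_batch }}"
      all_computes_in_batch: "{{ all_computes_in_batch }}"
      would_run_discovery: >-
        {{ all_computes_in_batch | length > 0 and
           inventory_hostname == all_computes_in_batch[0] }}
```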
