Zed HAProxy updates happen ahead of dependent changes in services

Bug #2002645 reported by Andrew Bonney
This bug affects 1 person
Affects: OpenStack-Ansible · Status: Fix Released · Importance: Undecided · Assigned to: Unassigned

Bug Description

When upgrading from Yoga (25.0.0) to Zed (26.0.0), the HAProxy reconfiguration which happens during setup-infrastructure includes steps to:

- Enable proxy protocol support for connections to Galera
- Switch health checks on some services like Octavia and Ironic to use /healthcheck endpoints which didn't exist previously

As the HAProxy deployment happens early on, access to Galera, Octavia, Ironic and potentially other services goes down until we reach their upgrade steps. For Galera this happens fairly quickly afterwards, but for Octavia and Ironic it could be hours later. In the meantime these services will be inaccessible.
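For illustration, the health-check change amounts to something like the following HAProxy backend snippet (backend name, server and port are illustrative, not the exact rendered config):

```
# Yoga-style backend: generic check against the API root
backend octavia-back
    option httpchk GET /
    server controller01 10.0.0.10:9876 check

# Zed-style backend: check against the new /healthcheck endpoint, which
# only exists once the service's api-paste configuration is updated
backend octavia-back
    option httpchk GET /healthcheck
    server controller01 10.0.0.10:9876 check
```

Until the service-side api-paste change lands, the new check returns 404 and HAProxy marks every backend as down.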

Ideally the referenced config changes would be deployed first, or perhaps backported to Yoga with a requirement to run that minor release before attempting an upgrade to Zed.

(As a side note, I haven't got the Galera proxy protocol change working yet in our deployment, but this is probably a config issue that I'll either patch or add another bug report for).

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible (stable/zed)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible (master)
Changed in openstack-ansible:
status: New → In Progress
Revision history for this message
Andrew Bonney (andrewbonney) wrote :

Galera proxy protocol is now working for us, but it required a change to the default value of 'galera_server_proxy_protocol_networks', as we use separate networks for deployment and operation. This might warrant a release note advising deployers to check this variable: there doesn't appear to have been one in the original patches, and it has the potential to break database access: https://review.opendev.org/q/topic:proxyprotocol
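Such a release note could point deployers at an override along these lines (the CIDRs are illustrative; the variable name comes from the proxy-protocol patches linked above):

```yaml
# user_variables.yml (illustrative): accept proxy-protocol connections
# from both the deployment network and the operational network
galera_server_proxy_protocol_networks: "172.29.236.0/22 10.0.0.0/22"
```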

Revision history for this message
Dmitriy Rabotyagov (noonedeadpunk) wrote :

Tbh I'm thinking about possible implications of changing ansible_host to management_address there:
https://opendev.org/openstack/openstack-ansible/src/branch/master/inventory/group_vars/galera_all.yml#L46

At the same time, galera_monitoring_allowed_source might be absolutely valid as is, since monitoring can happen through the SSH network rather than mgmt.

I assume this will work for your scenario and should fit better overall.

Revision history for this message
Andrew Bonney (andrewbonney) wrote :

Looking at our inventory, ansible_host and container_address always seem to be the same, so if management_address comes from https://opendev.org/openstack/openstack-ansible/src/branch/master/inventory/group_vars/all/all.yml#L34 I don't think it'll change the outcome in our case.

Revision history for this message
Dmitriy Rabotyagov (noonedeadpunk) wrote (last edit ):

Oh, but that's a bug!

So I have dummy playbook:

```
- hosts: az1-controller_hosts,dev2-az1-os-control01_cinder_api_container-7a5de4f0
  tasks:
    - debug:
        msg: "{{ item }}"
      with_items:
        - "{{ management_address }}"
        - "{{ hostvars[inventory_hostname]['ansible_host'] }}"
```

In openstack_user_config:

```
cidr_networks:
  container: 10.21.8.0/22
.....
az1-controller_hosts: &controller_az1
  dev2-az1-os-control01:
    ip: 10.21.0.21
```

And its output is:

```
  ok: [dev2-az1-os-control01] => (item=10.21.0.21) => {
      "msg": "10.21.0.21"
  }
  ok: [dev2-az1-os-control01] => (item=10.21.0.21) => {
      "msg": "10.21.0.21"
  }
  ok: [dev2-az1-os-control01_cinder_api_container-7a5de4f0] => (item=10.21.9.245) => {
      "msg": "10.21.9.245"
  }
  ok: [dev2-az1-os-control01_cinder_api_container-7a5de4f0] => (item=10.21.9.245) => {
      "msg": "10.21.9.245"
  }
```

Note that I have /22 networks there, so the host's SSH address isn't even inside the container CIDR. This also means that services which may run on bare metal will listen not on the mgmt IP, but on the SSH one.
So it's an independent bug that needs fixing IMO.
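The mismatch in the output above can be checked directly with Python's `ipaddress` module, using the addresses from the reproduction:

```python
import ipaddress

# Values from the openstack_user_config and playbook output above
container_net = ipaddress.ip_network("10.21.8.0/22")
host_ssh_ip = ipaddress.ip_address("10.21.0.21")    # bare metal host `ip` (SSH address)
container_ip = ipaddress.ip_address("10.21.9.245")  # container's management address

# The bare metal host's management_address resolved to its SSH IP,
# which is not even inside the container CIDR:
print(host_ssh_ip in container_net)   # False
print(container_ip in container_net)  # True
```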

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to openstack-ansible (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/openstack-ansible/+/870113

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible (master)

Reviewed: https://review.opendev.org/c/openstack/openstack-ansible/+/869974
Committed: https://opendev.org/openstack/openstack-ansible/commit/befd8424e2efd4e1bebe89b5085032bf120de148
Submitter: "Zuul (22348)"
Branch: master

commit befd8424e2efd4e1bebe89b5085032bf120de148
Author: Dmitriy Rabotyagov <email address hidden>
Date: Thu Jan 12 13:52:52 2023 +0100

    Skip haproxy with setup-infrastructure for upgrades

    As we've enabled /healthcheck URIs for services [1] that we want to
    verify, we need to apply haproxy backend changes only once all services
    are re-configured; otherwise haproxy will fail to poll them and
    mark all backends as down, which would mean continuous downtime until
    the corresponding roles run to fix the service api-paste configuration.

    [1] https://review.opendev.org/q/topic:osa/healthcheck
    Closes-Bug: #2002645

    Change-Id: I004f1a86b4dba3c6e356ac14bb70a43abc17f538

Changed in openstack-ansible:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible (stable/zed)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/openstack-ansible/+/870068
Committed: https://opendev.org/openstack/openstack-ansible/commit/2dc11a39cc0aca97d30fd18e7d7d1d55f48d5950
Submitter: "Zuul (22348)"
Branch: stable/zed

commit 2dc11a39cc0aca97d30fd18e7d7d1d55f48d5950
Author: Dmitriy Rabotyagov <email address hidden>
Date: Thu Jan 12 13:52:52 2023 +0100

    Skip haproxy with setup-infrastructure for upgrades

    As we've enabled /healthcheck URIs for services [1] that we want to
    verify, we need to apply haproxy backend changes only once all services
    are re-configured; otherwise haproxy will fail to poll them and
    mark all backends as down, which would mean continuous downtime until
    the corresponding roles run to fix the service api-paste configuration.

    [1] https://review.opendev.org/q/topic:osa/healthcheck
    Closes-Bug: #2002645

    Change-Id: I004f1a86b4dba3c6e356ac14bb70a43abc17f538
    (cherry picked from commit befd8424e2efd4e1bebe89b5085032bf120de148)

tags: added: in-stable-zed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on openstack-ansible (stable/zed)

Change abandoned by "Dmitriy Rabotyagov <email address hidden>" on branch: stable/zed
Review: https://review.opendev.org/c/openstack/openstack-ansible/+/869967

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to openstack-ansible (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/openstack-ansible/+/871483

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/openstack-ansible 26.0.1

This issue was fixed in the openstack/openstack-ansible 26.0.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to openstack-ansible (master)

Reviewed: https://review.opendev.org/c/openstack/openstack-ansible/+/870113
Committed: https://opendev.org/openstack/openstack-ansible/commit/c8ecc9fa10dea222f928cbdc1fb090adebe28b81
Submitter: "Zuul (22348)"
Branch: master

commit c8ecc9fa10dea222f928cbdc1fb090adebe28b81
Author: Dmitriy Rabotyagov <email address hidden>
Date: Fri Jan 13 18:27:52 2023 +0100

    Add management_ip option for metal hosts

    In cases where the SSH and mgmt networks are different, it is important
    to have a valid management_address that services rely on when
    listening on interfaces. At the moment, for bare metal hosts
    management_address is equal to ansible_host, which leads to
    unpredictable behaviour in some scenarios. With management_ip we allow
    defining another IP address that will be used as the container/management
    address for a bare metal host, while `ip` will still represent
    ansible_host.

    Related-Bug: #2002645
    Change-Id: I3152ae7985319e85b9ea520021f9eea6f5850341
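With this option, a bare metal host whose SSH and management addresses differ can be declared roughly like this (addresses reuse the reproduction above; the management_ip value is illustrative):

```yaml
# openstack_user_config.yml (illustrative)
cidr_networks:
  container: 10.21.8.0/22

az1-controller_hosts:
  dev2-az1-os-control01:
    ip: 10.21.0.21             # SSH address -> ansible_host
    management_ip: 10.21.8.21  # management/container address for services
```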

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/openstack-ansible 27.0.0.0rc1

This issue was fixed in the openstack/openstack-ansible 27.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to openstack-ansible (master)

Reviewed: https://review.opendev.org/c/openstack/openstack-ansible/+/871483
Committed: https://opendev.org/openstack/openstack-ansible/commit/6a0646470a51a95586c2e5359f693c5f361ad1ba
Submitter: "Zuul (22348)"
Branch: master

commit 6a0646470a51a95586c2e5359f693c5f361ad1ba
Author: Dmitriy Rabotyagov <email address hidden>
Date: Mon Jan 23 16:07:48 2023 +0100

    Ensure management_address is used instead of ansible_host

    ansible_host in deployments is designed to represent the SSH address.
    In cases when the address used for SSH is different from the management
    network, this can lead to unpredictable results on bare metal hosts.

    Thus we ensure that the management (container) address is used
    for services to listen on and to interact through.

    Related-Bug: #2002645
    Depends-On: https://review.opendev.org/c/openstack/openstack-ansible/+/870113
    Change-Id: I8a5e817d024eef5453fa072d8fee5aeca9bed67b
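The shape of this change in group_vars can be sketched as follows (the variable name here is hypothetical; the real file is inventory/group_vars/galera_all.yml):

```yaml
# Before (hypothetical variable name):
# galera_bind_address: "{{ ansible_host }}"

# After: listen on the management address rather than the SSH address
galera_bind_address: "{{ management_address }}"
```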
