centos-8-standalone-upgrade-ussuri failing after upgrade: Router ip address unreachable

Bug #1940844 reported by Douglas Viroel
This bug affects 2 people
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

tripleo-ci-centos-8-standalone-upgrade-ussuri has been failing since 08/18:
https://zuul.openstack.org/builds?job_name=tripleo-ci-centos-8-standalone-upgrade-ussuri

The quickstart log shows the following error:
TASK [os_tempest : Ping router ip address] *************************************
Thursday 19 August 2021 09:16:10 +0000 (0:00:00.057) 0:51:45.643 *******
FAILED - RETRYING: Ping router ip address (5 retries left).
FAILED - RETRYING: Ping router ip address (4 retries left).
FAILED - RETRYING: Ping router ip address (3 retries left).
FAILED - RETRYING: Ping router ip address (2 retries left).
FAILED - RETRYING: Ping router ip address (1 retries left).
fatal: [undercloud]: FAILED! => {
    "attempts": 5,
    "changed": true,
    "cmd": "set -e\nping -c2 \"192.168.24.166\"\n",
    "delta": "0:00:03.117114",
    "end": "2021-08-19 09:17:22.351846",
    "rc": 1,
    "start": "2021-08-19 09:17:19.234732"
}

STDOUT:
PING 192.168.24.166 (192.168.24.166) 56(84) bytes of data.
From 192.168.24.1 icmp_seq=1 Destination Host Unreachable
From 192.168.24.1 icmp_seq=2 Destination Host Unreachable

https://ee891e3c169b2fbfc45a-6f76ca65e5801c05e86d356ab1b2046b.ssl.cf1.rackcdn.com/periodic/opendev.org/openstack/tripleo-heat-templates/stable/ussuri/tripleo-ci-centos-8-standalone-upgrade-ussuri/523eeee/logs/quickstart_install.log

More Host Unreachable errors can be seen here:
https://ee891e3c169b2fbfc45a-6f76ca65e5801c05e86d356ab1b2046b.ssl.cf1.rackcdn.com/periodic/opendev.org/openstack/tripleo-heat-templates/stable/ussuri/tripleo-ci-centos-8-standalone-upgrade-ussuri/523eeee/logs/undercloud/var/log/extra/errors.txt
for example:
oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '192.168.24.3' ([Errno 113] EHOSTUNREACH)")

Revision history for this message
Alex Schultz (alex-schultz) wrote :

I compared the RPMs from the last successful run and the first failing run and there were no RPM changes, so it looks like it may be a container or config change.

Revision history for this message
Alex Schultz (alex-schultz) wrote :

Actually, for the periodic job comparison the container hashes are the same, so it doesn't appear to be an RPM/container problem.

Revision history for this message
Alex Schultz (alex-schultz) wrote :

We see the port created in OVN. The MySQL errors occur earlier in the upgrade process than the failures from os_tempest.

Douglas Viroel (dviroel)
tags: added: promotion-blocker
Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I checked it on the test node and here is what I found.
When the job failed, ovs 2.12 was running:

$ sudo ovs-vsctl show
a05a7746-2638-49c9-8f3a-4c293fb6aff2
    Manager "ptcp:6640:127.0.0.1"
        is_connected: true
    Bridge br-ctlplane
        fail_mode: standalone
        Port br-ctlplane
            Interface br-ctlplane
                type: internal
    Bridge br-int
        fail_mode: secure
        Port br-int
            Interface br-int
                type: internal
    Bridge br-ex
        Port br-ex
            Interface br-ex
                type: internal
    ovs_version: "2.12.0"

Because of that, no patch ports were created between the bridges, which makes this the same bug as [CIX][BZ:1989974][osp17][rhel8][RHOS17][RHOS16.2]Multinode jobs are failing to ping the router ip address in os_tempest.
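The missing patch ports can also be checked programmatically. The following is a minimal sketch, not part of the job: it parses the text printed by `ovs-vsctl show` and lists the patch-type interfaces on each bridge. The function name `patch_ports_by_bridge` and the `FAILED_SHOW` sample (a shortened copy of the output above) are illustrative assumptions.

```python
# Shortened copy of the `ovs-vsctl show` output from the failing node.
FAILED_SHOW = """\
a05a7746-2638-49c9-8f3a-4c293fb6aff2
    Manager "ptcp:6640:127.0.0.1"
        is_connected: true
    Bridge br-ctlplane
        fail_mode: standalone
        Port br-ctlplane
            Interface br-ctlplane
                type: internal
    Bridge br-int
        fail_mode: secure
        Port br-int
            Interface br-int
                type: internal
    Bridge br-ex
        Port br-ex
            Interface br-ex
                type: internal
    ovs_version: "2.12.0"
"""


def patch_ports_by_bridge(show_output):
    """Map each bridge name to the names of its patch-type interfaces."""
    bridges = {}
    bridge = None
    interface = None
    for raw in show_output.splitlines():
        line = raw.strip()
        if line.startswith("Bridge "):
            bridge = line.split(None, 1)[1].strip('"')
            bridges[bridge] = []
            interface = None
        elif line.startswith("Interface "):
            interface = line.split(None, 1)[1].strip('"')
        elif line == "type: patch" and bridge is not None and interface is not None:
            # The "type:" line always follows the Interface it describes.
            bridges[bridge].append(interface)
    return bridges


# Every bridge has an empty list here, i.e. no patch ports exist.
print(patch_ports_by_bridge(FAILED_SHOW))
```

On a healthy node, br-ctlplane and br-int would each list one `patch-...` interface.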

But there was already installed ovs 2.13:
$ sudo su
bash-4.4# rpm -qa | grep openvswitch
rdo-openvswitch-2.13-3.el8.noarch
python3-rdo-openvswitch-2.13-3.el8.noarch
openvswitch-selinux-extra-policy-1.0-28.el8.noarch
python3-openvswitch2.13-2.13.0-122.el8.x86_64
network-scripts-openvswitch2.13-2.13.0-122.el8.x86_64
rdo-network-scripts-openvswitch-2.13-3.el8.noarch
openvswitch2.13-2.13.0-122.el8.x86_64

So I decided to restart openvswitch on the node, and then all seemed to be configured properly:

bash-4.4# systemctl restart openvswitch
bash-4.4# ovs-vsctl show
a05a7746-2638-49c9-8f3a-4c293fb6aff2
    Manager "ptcp:6640:127.0.0.1"
        is_connected: true
    Bridge br-ctlplane
        fail_mode: standalone
        Port patch-provnet-eb568cc3-4ac1-43a2-9ca8-080bc5a44aca-to-br-int
            Interface patch-provnet-eb568cc3-4ac1-43a2-9ca8-080bc5a44aca-to-br-int
                type: patch
                options: {peer=patch-br-int-to-provnet-eb568cc3-4ac1-43a2-9ca8-080bc5a44aca}
        Port br-ctlplane
            Interface br-ctlplane
                type: internal
    Bridge br-int
        fail_mode: secure
        datapath_type: system
        Port br-int
            Interface br-int
                type: internal
        Port patch-br-int-to-provnet-eb568cc3-4ac1-43a2-9ca8-080bc5a44aca
            Interface patch-br-int-to-provnet-eb568cc3-4ac1-43a2-9ca8-080bc5a44aca
                type: patch
                options: {peer=patch-provnet-eb568cc3-4ac1-43a2-9ca8-080bc5a44aca-to-br-int}
    Bridge br-ex
        Port br-ex
            Interface br-ex
                type: internal
    ovs_version: "2.13.5"

So it seems that we should either install ovs 2.13 instead of 2.12 at the beginning of the job, or restart ovs after that package is updated and before tempest is run.
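The mismatch described here (installed package at 2.13, running daemon still at 2.12) could be detected with a small check like the one below. This is only a sketch: the function names are hypothetical, and it compares only the major.minor series, because the node above reported a running 2.13.5 from a package named 2.13.0-122.

```python
import re


def running_ovs_series(show_output):
    """Return the major.minor of the ovs_version line in `ovs-vsctl show` output."""
    match = re.search(r'ovs_version:\s*"(\d+\.\d+)\.', show_output)
    return match.group(1) if match else None


def installed_ovs_series(rpm_output):
    """Return the major.minor of the main openvswitch package in `rpm -qa` output.

    Matches both naming schemes seen in this report:
    openvswitch-2.12.0-... and openvswitch2.13-2.13.0-...
    """
    for name in rpm_output.split():
        match = re.match(r"openvswitch(?:\d+\.\d+)?-(\d+\.\d+)\.\d+-", name)
        if match:
            return match.group(1)
    return None


def needs_restart(show_output, rpm_output):
    """True when both versions are known and the running series differs."""
    running = running_ovs_series(show_output)
    installed = installed_ovs_series(rpm_output)
    return running is not None and installed is not None and running != installed


# Values taken from the outputs above: daemon at 2.12.0, package at 2.13.
print(needs_restart('ovs_version: "2.12.0"',
                    "rdo-openvswitch-2.13-3.el8.noarch "
                    "openvswitch2.13-2.13.0-122.el8.x86_64"))
```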

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)
Changed in tripleo:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/805894

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/805895

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/ussuri)
Revision history for this message
Slawek Kaplonski (slaweq) wrote :

After the comments in https://review.opendev.org/c/openstack/tripleo-heat-templates/+/805980 I checked the logs from the failed and passed jobs once again. I think the difference is in the ovn package: when it passed it was ovn2.13.x86_64 20.12.0-135.el8 and when it failed it was ovn2.13-20.12.0-161.el8.x86_64. Most likely ovn2.13-20.12.0-161.el8.x86_64 can't work properly with openvswitch2.12, which was installed on the node initially (and, as it was never restarted, stayed in use the whole time).

IMO the best way to fix this issue would be to install a supported ovs version (openvswitch2.13) from the beginning of the job, so that it is used even without a restart.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-quickstart/+/806955

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-quickstart (master)

Change abandoned by "Ronelle Landy <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/tripleo-quickstart/+/806955
Reason: This should be in upgrades

Revision history for this message
Sofer Athlan-Guyot (sofer-athlan-guyot) wrote :

Hi,

I'm not convinced this belongs in the upgrade code, as the standalone upgrade doesn't include any reboot of the "controller", which would happen during a normal upgrade.

Here we can see that ovs gets special treatment during the upgrade and hence is not restarted.

In this log: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_80f/807057/1/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/80fa688/logs/undercloud/var/log/dnf.rpm.log

we have the installation of ovs (I guess that's the train version):

2021-09-03T10:19:04+0000 SUBDEBUG Installed: network-scripts-10.00.15-1.el8.x86_64
2021-09-03T10:19:04+0000 SUBDEBUG Installed: network-scripts-openvswitch-2.12.0-1.1.el8.x86_64
2021-09-03T10:19:04+0000 SUBDEBUG Installed: openvswitch-2.12.0-1.1.el8.x86_64
2021-09-03T10:19:05+0000 SUBDEBUG Installed: rdo-openvswitch-2.12-1.el8.noarch

then the upgrade of ovs:

2021-09-03T11:22:30+0000 SUBDEBUG Installed: openvswitch-selinux-extra-policy-1.0-28.el8.noarch
2021-09-03T11:23:07+0000 SUBDEBUG Installed: openvswitch2.13-2.13.0-122.el8.x86_64
2021-09-03T11:23:09+0000 SUBDEBUG Upgrade: rdo-openvswitch-1:2.13-3.el8.noarch
2021-09-03T11:23:09+0000 SUBDEBUG Upgraded: rdo-openvswitch-2.12-1.el8.noarch

which happens during the special treatment of ovs, hence it's certainly not restarted:

https://zuul.opendev.org/t/openstack/build/80fa688140e9461ba7094c558da6549f/log/logs/undercloud/home/zuul/standalone_upgrade.log?severity=0#1960-1962

2021-09-03 11:22:22 | 2021-09-03 11:22:22.604074 | fa163ece-c078-d86c-8770-000000000669 | TIMING | Set leapp facts | standalone | 0:02:41.006733 | 0.04s
2021-09-03 11:22:22 | 2021-09-03 11:22:22.631304 | fa163ece-c078-d86c-8770-00000000066a | TASK | Special treatment for OpenvSwitch
2021-09-03 11:23:13 | 2021-09-03 11:23:13.866375 | fa163ece-c078-d86c-8770-00000000066a | CHANGED | Special treatment for OpenvSwitch | standalone
2021-09-03 11:23:13 | 2021-09-03 11:23:13.868822 | fa163ece-c078-d86c-8770-00000000066a | TIMING | Special treatment for OpenvSwitch | standalone | 0:03:32.271480 | 51.24s

Then the ovn container is upgraded, so at that point:

 - ovn is alive with ussuri version
 - ovs is still at train version
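The package timeline quoted from dnf.rpm.log above can be pulled out of the log with a few lines of Python. This is a rough sketch under stated assumptions: the line layout is inferred only from the excerpts in this comment, and `ovs_events` and `SAMPLE_LOG` are illustrative names.

```python
import re
from datetime import datetime

# Layout assumed from the excerpts above: "<timestamp> SUBDEBUG <action>: <package>"
LINE_RE = re.compile(r"^(\S+) SUBDEBUG (\w+): (\S+)$")

# A few of the lines quoted above.
SAMPLE_LOG = """\
2021-09-03T10:19:04+0000 SUBDEBUG Installed: network-scripts-10.00.15-1.el8.x86_64
2021-09-03T10:19:04+0000 SUBDEBUG Installed: openvswitch-2.12.0-1.1.el8.x86_64
2021-09-03T11:23:07+0000 SUBDEBUG Installed: openvswitch2.13-2.13.0-122.el8.x86_64
2021-09-03T11:23:09+0000 SUBDEBUG Upgrade: rdo-openvswitch-1:2.13-3.el8.noarch
2021-09-03T11:23:09+0000 SUBDEBUG Upgraded: rdo-openvswitch-2.12-1.el8.noarch
"""


def ovs_events(log_text, needle="openvswitch"):
    """Return (timestamp, action, package) for packages whose name contains needle."""
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and needle in match.group(3):
            stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S%z")
            events.append((stamp, match.group(2), match.group(3)))
    return events


for stamp, action, package in ovs_events(SAMPLE_LOG):
    print(stamp.isoformat(), action, package)
```

Lining these timestamps up against the "Special treatment for OpenvSwitch" task times in standalone_upgrade.log shows which step touched the packages.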

I think it's safe to have the CI job restart ovs:

 - the ussuri upgrade path from train does reboot the node during the upgrade, so the restart of ovs would happen anyway;
 - catching a train/ussuri incompatibility between ovn and ovs may not be relevant at all for the standalone upgrade: this is open for debate, but it seems to me that such an incompatibility could happen here;
 - modifying the upgrade for the ussuri standalone upgrade path doesn't feel right, as:
   - it doesn't exist downstream, hence this change would get very little coverage.

Revision history for this message
Ronelle Landy (rlandy) wrote :

Going back to looking at putting this change in the CI

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

OK, so we have to restart it in CI, since the layered-product-specific treatment is not available here: https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/ansible_plugins/modules/tripleo_ovs_upgrade.py#L172-L186

Revision history for this message
Alfredo Moralejo (amoralej) wrote :

IIUC the problem is that the ovs-specific treatment script does not properly handle the ovs/ovn update from openvswitch-2.12 in train to openvswitch2.13 (note the different package naming, since from ussuri on we use NFV SIG builds).

If you think it's the best solution, we could move Train to openvswitch2.13 from the NFV SIG too. That would give train and ussuri the same version.

Revision history for this message
wes hayutin (weshayutin) wrote :

TRAIN to USSURI
https://logserver.rdoproject.org/11/35111/15/check/periodic-tripleo-ci-centos-8-standalone-upgrade-ussuri/b658743/logs/quickstart_files/playbook_executions.log

TRAIN REPO-SETUP
https://logserver.rdoproject.org/11/35111/15/check/periodic-tripleo-ci-centos-8-standalone-upgrade-ussuri/b658743/logs/undercloud/home/zuul/repo_setup.sh.txt.gz
* notes: yum update executes

USSURI REPO-SETUP
https://logserver.rdoproject.org/11/35111/15/check/periodic-tripleo-ci-centos-8-standalone-upgrade-ussuri/b658743/logs/undercloud/home/zuul/repo_setup_upgrade.sh.txt.gz
* notes: yum update DOES NOT EXECUTE
* notes: sudo systemctl restart openvswitch

yum update DOES NOT EXECUTE in n + 1 ( ussuri ) because:
https://opendev.org/openstack/tripleo-quickstart-extras/src/branch/master/playbooks/multinode-standalone-upgrade.yml#L18-L24
    - name: Standalone upgrade generate new dlrn repo-setup script.
      include_role:
        name: repo-setup
        tasks_from: create-repo-script
      vars:
        repo_setup_script: repo_setup_upgrade.sh
        repo_setup_run_update: false

Adding a patch to force the upgrade of openvswitch, restart it, and print the version.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-quickstart/+/806955
Committed: https://opendev.org/openstack/tripleo-quickstart/commit/11429159cb9277631d3d5141ad2a1421b31ea7a3
Submitter: "Zuul (22348)"
Branch: master

commit 11429159cb9277631d3d5141ad2a1421b31ea7a3
Author: Ronelle Landy <email address hidden>
Date: Wed Sep 1 12:33:53 2021 -0400

    Install and Restart openvswitch for upgrade to ussuri

    For ussuri only...
    * print openvswitch version
    * upgrade openvswitch from 2.12 -> 2.13-3
    * restart openvswitch
    * print openvswitch version

    Change-Id: Ia4b0f82f3b8fa9c63964610d4969a613d4c3b40f
    Related-Bug: #1940844
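The four steps listed in the commit message can be sketched as a command sequence. This is not the merged change, which lives in tripleo-quickstart. It is an illustration under assumptions: the package name `rdo-openvswitch` is taken from the rpm listing earlier in this report, the `dnf upgrade` invocation is a guess at how the upgrade is forced, and the function names are hypothetical. The version is read from the Open_vSwitch table because that value showed the running daemon (2.12.0) before the restart above.

```python
import subprocess


def ovs_refresh_commands(package="rdo-openvswitch", service="openvswitch"):
    """Build the print / upgrade / restart / print sequence as argv lists."""
    show_version = ["sudo", "ovs-vsctl", "get", "Open_vSwitch", ".", "ovs_version"]
    return [
        show_version,                               # version before
        ["sudo", "dnf", "upgrade", "-y", package],  # assumed upgrade command
        ["sudo", "systemctl", "restart", service],  # restart from the comments above
        show_version,                               # version after
    ]


def run_all(commands, runner=subprocess.run):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        runner(command, check=True)


# Dry run: only print what would be executed.
for command in ovs_refresh_commands():
    print(" ".join(command))
```

Passing a different `runner` makes it possible to test the sequence without touching the host.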

Revision history for this message
wes hayutin (weshayutin) wrote :

Job | Project | Branch | Pipeline | Change | Duration | Start time | Result
tripleo-ci-centos-8-standalone-upgrade-ussuri | openstack/puppet-tripleo | stable/ussuri | check | 805528,1 | 2 hrs 24 mins 59 secs | 2021-09-09 18:06:45 | SUCCESS
tripleo-ci-centos-8-standalone-upgrade-ussuri | openstack/python-tripleoclient | stable/ussuri | check | 807251,1 | 2 hrs 51 mins 20 secs | 2021-09-09 17:58:07 | SUCCESS
tripleo-ci-centos-8-standalone-upgrade-ussuri | openstack/os-net-config | stable/ussuri | check | 807649,2 | 2 hrs 38 mins 48 secs | 2021-09-09 17:54:23 | SUCCESS

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/805980

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (stable/ussuri)

Change abandoned by "Marios Andreou <email address hidden>" on branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/805896
Reason: abandoning per http://lists.openstack.org/pipermail/openstack-discuss/2022-April/028026.html - so we can move EOL https://review.opendev.org/c/openstack/releases/+/834049

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (stable/victoria)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: stable/victoria
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/805895

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (stable/wallaby)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/805894
