centos8 standalone-upgrade-ussuri fails tempest ping router IP

Bug #1895822 reported by Marios Andreou
This bug affects 2 people
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Alex Schultz

Bug Description

At [1] the tripleo-ci-centos-8-standalone-upgrade-ussuri job is failing tempest after a seemingly successful deployment [2] and upgrade [3]. The top-level error trace looks like:

        2020-09-15 15:02:07.057448 | primary | TASK [os_tempest : Ping router ip address] *************************************
        2020-09-15 15:02:07.057502 | primary | Tuesday 15 September 2020 15:02:07 +0000 (0:00:00.065) 1:10:33.351 *****
        2020-09-15 15:02:10.745010 | primary | FAILED - RETRYING: Ping router ip address (5 retries left).
        2020-09-15 15:02:24.365896 | primary | FAILED - RETRYING: Ping router ip address (4 retries left).
        2020-09-15 15:02:38.005903 | primary | FAILED - RETRYING: Ping router ip address (3 retries left).
        2020-09-15 15:02:51.638488 | primary | FAILED - RETRYING: Ping router ip address (2 retries left).
        2020-09-15 15:03:05.266932 | primary | FAILED - RETRYING: Ping router ip address (1 retries left).
        2020-09-15 15:03:18.902046 | primary | fatal: [undercloud]: FAILED! => {
        2020-09-15 15:03:18.902122 | primary | "attempts": 5,

Poking a little further, I see a few errors related to networking and mysql; however, I am not sure which is the original/root cause.

In the [4] neutron and [5] keystone container logs we see the following many times:

        2020-09-11 07:41:16.044 139 ERROR oslo_db.sqlalchemy.engines pymysql.err.OperationalError: (2006, "MySQL server has gone away (BrokenPipeError(32, 'Broken pipe'))")

In [6] ovn_controller.log the following is repeated many times:

        2020-09-15T14:07:34.872418652+00:00 stderr F 2020-09-15T14:07:34Z|00014|main|INFO|OVNSB commit failed, force recompute next time.

In [7] container-puppet-mysql the following is repeated many times:

        2020-09-15T13:05:17.914381835+00:00 stderr F <13>Sep 15 13:05:17 puppet-user: Init failed, could not perform requested operations

In the [8] pacemaker log we see the following a few times:

        Sep 15 14:12:32 standalone.localdomain pacemaker-controld [325957] (services_os_action_execute) warning: Cannot execute '/usr/lib/ocf/resource.d/ovn/ovndb-servers': No such file or directory (2)
        Sep 15 14:12:32 standalone.localdomain pacemaker-controld [325957] (lrmd_api_get_metadata_params) error: Failed to retrieve meta-data for ocf:ovn:ovndb-servers

[1] https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_ee1/739457/27/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/ee18aa6/job-output.txt
[2] https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_ee1/739457/27/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/ee18aa6/logs/undercloud/home/zuul/standalone_deploy.log
[3] https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_ee1/739457/27/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/ee18aa6/logs/undercloud/home/zuul/standalone_upgrade.log
[4] https://e453f1d8808c5b6bd184-223d8b88d73ea59070ac36b627fdc3bc.ssl.cf2.rackcdn.com/739457/27/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/ad8724d/logs/undercloud/var/log/containers/neutron/server.log
[5] https://e453f1d8808c5b6bd184-223d8b88d73ea59070ac36b627fdc3bc.ssl.cf2.rackcdn.com/739457/27/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/ad8724d/logs/undercloud/var/log/containers/keystone/keystone.log
[6] https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_ee1/739457/27/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/ee18aa6/logs/undercloud/var/log/containers/stdouts/ovn_controller.log
[7] https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_ee1/739457/27/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/ee18aa6/logs/undercloud/var/log/containers/stdouts/container-puppet-mysql.log
[8] https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_ee1/739457/27/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/ee18aa6/logs/undercloud/var/log/pacemaker/pacemaker.log

tags: added: promotion-blocker
Revision history for this message
Marios Andreou (marios-b) wrote :

Spent some more time digging through the logs. I am not clear yet whether this is an issue with HA/mysql or with ovs/ovn networking. I am leaning towards networking at the moment.

I'll reach out to the network and pidone squads to check here - adding pointers to some error messages in the logs I came across just now:

        * https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_ee1/739457/27/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/ee18aa6/logs/undercloud/var/log/containers/mysql/mysqld.log

                * 2020-09-15 14:32:08 0 [Note] InnoDB: Starting shutdown...
                * 2020-09-15 14:32:09 0 [Note] /usr/libexec/mysqld: Shutdown complete
                * 2020-09-15 14:32:30 0 [Note] WSREP: Found saved state: cebd6089-f754-11ea-ac23-9b5df17a204a:8702, safe_to_bootstrap: 1
                * 2020-09-15 14:32:30 0 [Note] /usr/libexec/mysqld: ready for connections.
        Version: '10.3.17-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 3306 MariaDB Server
                  2020-09-15 14:32:31 0 [Note] InnoDB: Buffer pool(s) load completed at 200915 14:32:31
                * 2020-09-15 14:38:29 259 [Warning] Aborted connection 259 to db: 'nova_api' user: 'nova_api' host: '192.168.24.1' (Got an error reading communication packets)

        * https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_ee1/739457/27/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/ee18aa6/logs/undercloud/var/log/containers/openvswitch/ovn-controller.log

                * 2020-09-15T14:41:08.575Z|00051|lflow|WARN|Dropped 19 log messages in last 1622 seconds (most recently, 1607 seconds ago) due to excessive rate
                * 2020-09-15T14:50:41.820Z|00004|fatal_signal(ovn_pinctrl0)|WARN|terminating with signal 15 (Terminated)

        * https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_ee1/739457/27/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/ee18aa6/logs/undercloud/var/log/containers/openvswitch/ovsdb-server-sb.log
                * 2020-09-15T13:16:17.971Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.12.0
                * 2020-09-15T13:16:19.947Z|00005|reconnect|WARN|unix#6: connection dropped (Connection reset by peer)
                * 2020-09-15T14:12:32.939Z|00005|reconnect|WARN|unix#0: connection dropped (Broken pipe)
                * 2020-09-15T14:59:56.235Z|00002|daemon_unix(monitor)|INFO|pid 152 died, exit status 0, exiting

        * https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_ee1/739457/27/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/ee18aa6/logs/undercloud/var/log/containers/stdouts/ovn-dbs-bundle.log

                * 2020-09-15T13:16:18.332804574+00:00 stderr F (operation_finished) notice: ovndb_servers_start_0:48:stderr [ ovn-nbctl: transaction error: {"details":"insert operation not allowed when database server is in read only mode","error":"not allowed"} ]

Revision history for this message
Martin Mágr (mmagr) wrote :

I'm facing the same issue with OVS-based networking:

Deploy failure:
2020-09-16 10:52:58,277 p=83091 u=mistral n=ansible | TASK [tripleo-keystone-resources : Check Keystone public endpoint status] ******
2020-09-16 10:52:58,277 p=83091 u=mistral n=ansible | Wednesday 16 September 2020 10:52:58 -0400 (0:00:07.246) 0:28:33.359 ***
2020-09-16 10:52:58,899 p=83091 u=mistral n=ansible | ok: [undercloud] => (item=None) => {"attempts": 1, "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
2020-09-16 10:52:59,255 p=83091 u=mistral n=ansible | ok: [undercloud] => (item=None) => {"attempts": 1, "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
2020-09-16 10:52:59,611 p=83091 u=mistral n=ansible | failed: [undercloud] (item=None) => {"attempts": 1, "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
2020-09-16 10:53:00,016 p=83091 u=mistral n=ansible | ok: [undercloud] => (item=None) => {"attempts": 1, "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
2020-09-16 10:53:00,271 p=83091 u=mistral n=ansible | ok: [undercloud] => (item=None) => {"attempts": 1, "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
2020-09-16 10:53:00,677 p=83091 u=mistral n=ansible | ok: [undercloud] => (item=None) => {"attempts": 1, "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
2020-09-16 10:53:00,981 p=83091 u=mistral n=ansible | ok: [undercloud] => (item=None) => {"attempts": 1, "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
2020-09-16 10:53:01,338 p=83091 u=mistral n=ansible | ok: [undercloud] => (item=None) => {"attempts": 1, "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
2020-09-16 10:53:01,695 p=83091 u=mistral n=ansible | ok: [undercloud] => (item=None) => {"attempts": 1, "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
2020-09-16 10:53:02,049 p=83091 u=mistral n=ansible | ok: [undercloud] => (item=None) => {"attempts": 1, "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
2020-09-16 10:53:02,050 p=83091 u=mistral n=ansible | fatal: [undercloud]: FAILED! => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
2020-09-16 10:53:02,055 p=83091 u=mistral n=ansible | NO MORE HOSTS LEFT *************************************************************

/var/log/containers/httpd/keystone/keystone_wsgi_error.log:
[Tue Sep 15 23:52:38.988613 2020] [wsgi:error] [pid 1293] [remote 172.17.1.149:47570] File "/usr/lib64/python3.6/site-packages/sqlalchemy/pool...

Revision history for this message
Marios Andreou (marios-b) wrote :

@martin, what in particular made you suspect you have the same issue? I can't tell from the logs in comment #2.

Revision history for this message
Marios Andreou (marios-b) wrote :

Did some more digging - leaning more towards this being a neutron issue, but still no closer to understanding why.

I used https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_62c/750456/3/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/62c4529/logs/ as an example and tried to line up the timings. After the upgrade starts I see a number of errors in the openvswitch/ovn logs, in particular "OVNSB commit failed, force recompute next time" followed by "WARN|tcp:192.168.24.1:6642: connection dropped (Broken pipe)":

"UPGRADE start/end": 20:33:00.699450 -> 21:33:44

        * https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_62c/750456/3/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/62c4529/job-output.txt
        * 2020-09-17 20:33:00.699450 | primary | TASK [standalone-upgrade : Upgrade the standalone] *****************************
        2020-09-17 20:33:00.699475 | primary | Thursday 17 September 2020 20:33:00 +0000 (0:00:02.279) 0:07:32.348 ****
        2020-09-17 21:33:44.045462 | primary | changed: [undercloud]

"MYSQL SHUTDOWN/STARTUP/aborted connection"

        * https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_62c/750456/3/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/62c4529/logs/undercloud/var/log/containers/mysql/mysqld.log
        * 2020-09-17 20:40:13 0 [Note] /usr/libexec/mysqld: Shutdown complete
        * 2020-09-17 20:45:28 0 [Note] WSREP: Read nil XID from storage engines, skipping position init
        * 2020-09-17 21:02:12 0 [Note] /usr/libexec/mysqld: Shutdown complete
        * 2020-09-17 21:02:32 0 [Note] WSREP: Read nil XID from storage engines, skipping position init
        * 2020-09-17 21:02:33 0 [Note] /usr/libexec/mysqld: ready for connections.
        * 2020-09-17 21:09:50 299 [Warning] Aborted connection 299 to db: 'glance' user: 'glance' host: '192.168.24.1' (Got an error reading communication packets)

"OVS/OVN logs:"

        * https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_62c/750456/3/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/62c4529/logs/undercloud/var/log/containers/openvswitch/ovn-controller.log

          * 2020-09-17T20:40:10.632Z|00087|reconnect|INFO|tcp:192.168.24.1:6642: connection closed by peer
            2020-09-17T20:40:11.633Z|00088|reconnect|INFO|tcp:192.168.24.1:6642: connecting...
            2020-09-17T20:40:11.634Z|00089|reconnect|INFO|tcp:192.168.24.1:6642: connected
            2020-09-17T20:40:11.634Z|00090|main|INFO|OVNSB commit failed, force recompute next time.
            2020-09-17T20:40:11.653Z|00091|main|INFO|OVNSB IDL reconnected, force recompute.
            2020-09-17T20:41:11.444Z|00092|jsonrpc|WARN|tcp:192.168.24.1:6642: send error: Broken pipe
            2020-09-17T20:41:11.445Z|00093|main|INFO|OVNSB commit failed, force recompute next time.
            2020-09-17T20:41:11.445Z|00094|reconnect|WARN|tcp:192.168.24.1:6642: connection dropped (Broken pipe)

        * https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_62c/7504...


Revision history for this message
Michele Baldessari (michele) wrote :

Here is my analysis of https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_ee1/739457/27/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/ee18aa6/job-output.txt

Timeline:
1) Standalone upgrade starts at 2020-09-15 13:59:31 and completes successfully at 2020-09-15 15:00:42

Note that towards the end of the upgrade we can observe a number of scary messages such as:
2020-09-15 15:00:03.942 8 ERROR nova.servicegroup.drivers.db [-] Unexpected error while reporting service status 2003, "Can't connect to MySQL server on '192.168.24.3'

The reason for these error messages is that one of the post-upgrade tasks in tht restarts the ovn-dbs-bundle (Ia7cf78e1f5e46235147bdf67c03b58d774244774), which brings down both VIPs (I expected only one VIP to go down but apparently both 24.1 and 24.3 get restarted; I will investigate that separately. It does not seem too important just yet).

2) The ovn-dbs restart is in any case fully completed at 15:00:04:
Sep 15 15:00:00 standalone.localdomain pacemaker-execd [325954] (log_finished) info: finished - rsc:ovn-dbs-bundle-podman-0 action:start call_id:100 pid:586688 exit-code:0 exec-time:2309ms queue-time:0ms
Sep 15 15:00:04 standalone.localdomain pacemaker-controld [325957] (process_lrm_event) notice: Result of start operation for ip-192.168.24.1 on standalone: 0 (ok) | call=102 key=ip-192.168.24.1_start_0 confirmed=true cib-update=385
Sep 15 15:00:04 standalone.localdomain pacemaker-controld [325957] (process_lrm_event) notice: Result of start operation for ip-192.168.24.3 on standalone: 0 (ok) | call=103 key=ip-192.168.24.3_start_0 confirmed=true cib-update=387

3) The router for the failing os_tempest ping gets successfully created at:
2020-09-15 15:02:01.467358 | primary | TASK [os_tempest : Create router] **********************************************
2020-09-15 15:02:01.467377 | primary | Tuesday 15 September 2020 15:02:01 +0000 (0:00:02.308) 1:10:27.761 *****
2020-09-15 15:02:04.475258 | primary | ok: [undercloud -> 127.0.0.2]
2020-09-15 15:02:04.504699 | primary |
2020-09-15 15:02:04.504764 | primary | TASK [os_tempest : Get router admin state and ip address] **********************
2020-09-15 15:02:04.504777 | primary | Tuesday 15 September 2020 15:02:04 +0000 (0:00:03.037) 1:10:30.799 *****
2020-09-15 15:02:04.557379 | primary | ok: [undercloud -> 127.0.0.2]

4) The ping itself fails at 15:02:07:
2020-09-15 15:02:07.057448 | primary | TASK [os_tempest : Ping router ip address] *************************************
2020-09-15 15:02:07.057502 | primary | Tuesday 15 September 2020 15:02:07 +0000 (0:00:00.065) 1:10:33.351 *****
2020-09-15 15:02:10.745010 | primary | FAILED - RETRYING: Ping router ip address (5 retries left).
2020-09-15 15:02:24.365896 | primary | FAILED - RETRYING: Ping router ip address (4 retries left).

After the failure, during the log collection, we do see in the OVN logs that we have a port corresponding to the IP tempest is pinging (192.168.24.122):
router 5e5e16a8-7c81-4aea-a56f-9edbb3343a34 (neutron-28fab0bc-bb0b-4e75-9204-cb19cc28246f) (aka router)
    port lrp-9b278e0...


Changed in tripleo:
assignee: nobody → Sergii Golovatiuk (sgolovatiuk)
Revision history for this message
Martin Mágr (mmagr) wrote :

The reason why I think I had the same issue is that when you check the mysqld.log in the job [1] you can see that all cloud services have issues using the DB. After the upgrade on my env the cloud more or less worked, but apparently there was a network issue, which in my case made a redeploy after FFU fail 100% of the time on the Keystone check step.

[1] https://e453f1d8808c5b6bd184-223d8b88d73ea59070ac36b627fdc3bc.ssl.cf2.rackcdn.com/739457/27/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/ad8724d/logs/undercloud/var/log/containers/mysql/mysqld.log

Revision history for this message
Jakub Libosvar (libosvar) wrote :

Sergii provided me an environment and that was extremely helpful, thanks for that. I looked there and saw that the external network subnet is the same as the subnet used for the control plane. That means there was a route for the FIP via br-ctlplane instead of br-ex, while br-ex was used in the bridge mappings for the given provider network.

When I changed the bridge mappings to use br-ctlplane instead, the ping works. However, this is not the solution, because I think the control plane subnet should differ from the external public subnet. The network configuration should be changed for the external network so this can work properly (separate external and control plane networks).
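
For reference, a minimal sketch of how that can be checked and tried on the standalone node, assuming the ML2/OVN setup this job uses (the router IP is the one from the failing tempest task; this is illustrative, not necessarily how it was changed on the environment):

    # which physical bridge does the 'datacentre' provider network map to? (br-ex here)
    sudo ovs-vsctl get Open_vSwitch . external_ids:ovn-bridge-mappings
    # which device would the kernel use to reach the router/FIP address?
    ip route get 192.168.24.122
    # workaround described above: point the mapping at br-ctlplane, then re-test
    sudo ovs-vsctl set Open_vSwitch . external_ids:ovn-bridge-mappings="datacentre:br-ctlplane"
    ping -c2 192.168.24.122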

Changed in tripleo:
assignee: Sergii Golovatiuk (sgolovatiuk) → Alex Schultz (alex-schultz)
Changed in tripleo:
assignee: Alex Schultz (alex-schultz) → nobody
Revision history for this message
Alex Schultz (alex-schultz) wrote :

This is a job config problem. The initial installation does not include --control-virtual-ip in its execution. The normal standalone job has this defined; the upgrade job is missing it as part of the initial deployment. The upgrade execution does include --control-virtual-ip.

Normal standalone tripleo_deploy.sh:
https://0522a0f118ced5ed6a93-4c6fae75ca48c2a9b52e94b381f06ed2.ssl.cf2.rackcdn.com/753546/6/check/tripleo-ci-centos-8-standalone/eabe4ae/logs/undercloud/home/zuul/tripleo_deploy.sh

#!/bin/bash
# This file is managed by ansible
set -xeo pipefail

export DEPLOY_CONTROL_VIP=192.168.24.3
export DEPLOY_DEPLOYMENT_USER=zuul
export DEPLOY_LOCAL_IP=192.168.24.1/24
export DEPLOY_OUTPUT_DIR=/home/zuul
export DEPLOY_ROLES_FILE=/usr/share/openstack-tripleo-heat-templates/roles/Standalone.yaml
export DEPLOY_STACK=standalone
export DEPLOY_STANDALONE_ROLE=Standalone
export DEPLOY_TEMPLATES=/usr/share/openstack-tripleo-heat-templates
export DEPLOY_TIMEOUT_ARG=90
openstack tripleo deploy --templates $DEPLOY_TEMPLATES --standalone --yes --output-dir $DEPLOY_OUTPUT_DIR --stack $DEPLOY_STACK --standalone-role $DEPLOY_STANDALONE_ROLE --timeout $DEPLOY_TIMEOUT_ARG -e /usr/share/openstack-tripleo-heat-templates/environments/standalone/standalone-tripleo.yaml -e /home/zuul/containers-prepare-parameters.yaml -e /home/zuul/standalone_parameters.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml -r $DEPLOY_ROLES_FILE --deployment-user $DEPLOY_DEPLOYMENT_USER --local-ip $DEPLOY_LOCAL_IP --control-virtual-ip $DEPLOY_CONTROL_VIP >/home/zuul/standalone_deploy.log 2>&1

Upgrade tripleo_deploy.sh:
https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_ee1/739457/27/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/ee18aa6/logs/undercloud/home/zuul/tripleo_deploy.sh

#!/bin/bash
# This file is managed by ansible
set -xeo pipefail

export DEPLOY_DEPLOYMENT_USER=zuul
export DEPLOY_LOCAL_IP=192.168.24.1/24
export DEPLOY_OUTPUT_DIR=/home/zuul
export DEPLOY_ROLES_FILE=/usr/share/openstack-tripleo-heat-templates/roles/Standalone.yaml
export DEPLOY_STACK=standalone
exp...


Revision history for this message
Alex Schultz (alex-schultz) wrote :

Likely caused by https://review.opendev.org/#/c/725782/, since the last successful run of this was on 8/18, when that patch was merged. The issue is likely that on train we don't have pacemaker enabled and on ussuri we do, so the upgrade from non-pacemaker to pacemaker is failing.
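
A quick, hedged way to confirm that mismatch on the node (podman-based standalone assumed; the commands are illustrative):

    # a non-HA train deploy has no pacemaker-managed bundle containers at all
    sudo podman ps --format '{{.Names}}' | grep bundle || echo 'no pacemaker bundles running'
    # once pacemaker is in place (what the ussuri upgrade expects), pcs lists them
    sudo pcs status | head -n 30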

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-quickstart (master)

Fix proposed to branch: master
Review: https://review.opendev.org/753817

Changed in tripleo:
assignee: nobody → Alex Schultz (alex-schultz)
status: Triaged → In Progress
Revision history for this message
Marios Andreou (marios-b) wrote :
Revision history for this message
Marios Andreou (marios-b) wrote :

OK, after more discussion just now on IRC with @Sergii... so the fix from comment #11 makes the *deployment* have HA, and then https://review.opendev.org/#/c/753817/ will add the docker-ha for the upgrade commands too.

So we need both. I added /#/c/753817/ to the test at https://review.opendev.org/739457; let's see if we get a green run.

Revision history for this message
Marios Andreou (marios-b) wrote :

Per comment #12, unfortunately the test still fails at:

        * https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_cbd/739457/30/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/cbdd981/job-output.txt
        * 2020-09-24 14:22:39.653261 | primary | FAILED - RETRYING: Ping router ip address (2 retries left).

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-quickstart (master)

Reviewed: https://review.opendev.org/753817
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart/commit/?id=e8567839a7f144cb00be141a176f50cacb48877e
Submitter: Zuul
Branch: master

commit e8567839a7f144cb00be141a176f50cacb48877e
Author: Alex Schultz <email address hidden>
Date: Wed Sep 23 12:26:39 2020 -0600

    Enable HA always for upgrade job

    In Ussuri we enabled pacemaker by default so when we landed a change[0]
    in quickstart to handle this logic, it broke the upgrade job because
    the ussuri job uses train initially and gets a non-ha standalone which
    it tries to upgrade to HA. This results in an incorrect network config.
    Since really the expectation is that we'd always be upgrading HA to HA,
    let's test that instead.

    [0] https://review.opendev.org/#/c/725782/
    Closes-Bug: #1895822

    Change-Id: I78c1a0cf68534e574b14ad505404139d93983324

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
Rafael Folco (rafaelfolco) wrote :

Apparently the issue hasn't been fixed yet:

https://zuul.openstack.org/builds?job_name=tripleo-ci-centos-8-standalone-upgrade-ussuri

https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_177/754366/1/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/1775d17/job-output.txt

2020-09-28 06:11:57.107266 | primary | TASK [os_tempest : Ping router ip address] *************************************
2020-09-28 06:11:57.107329 | primary | Monday 28 September 2020 06:11:57 +0000 (0:00:00.081) 1:08:12.313 ******
2020-09-28 06:12:00.758247 | primary | FAILED - RETRYING: Ping router ip address (5 retries left).
2020-09-28 06:12:14.388017 | primary | FAILED - RETRYING: Ping router ip address (4 retries left).
2020-09-28 06:12:28.019650 | primary | FAILED - RETRYING: Ping router ip address (3 retries left).
2020-09-28 06:12:41.588559 | primary | FAILED - RETRYING: Ping router ip address (2 retries left).
2020-09-28 06:12:55.219646 | primary | FAILED - RETRYING: Ping router ip address (1 retries left).
2020-09-28 06:13:08.863241 | primary | fatal: [undercloud]: FAILED! => {
2020-09-28 06:13:08.863322 | primary | "attempts": 5,
2020-09-28 06:13:08.863351 | primary | "changed": true,
2020-09-28 06:13:08.863375 | primary | "cmd": "set -e\nping -c2 \"192.168.24.129\"\n",
2020-09-28 06:13:08.863417 | primary | "delta": "0:00:03.101290",
2020-09-28 06:13:08.863443 | primary | "end": "2020-09-28 06:13:08.821880",
2020-09-28 06:13:08.863469 | primary | "rc": 1,
2020-09-28 06:13:08.863492 | primary | "start": "2020-09-28 06:13:05.720590"
2020-09-28 06:13:08.863516 | primary | }
2020-09-28 06:13:08.863540 | primary |
2020-09-28 06:13:08.863564 | primary | STDOUT:
2020-09-28 06:13:08.863605 | primary |
2020-09-28 06:13:08.863630 | primary | PING 192.168.24.129 (192.168.24.129) 56(84) bytes of data.
2020-09-28 06:13:08.863655 | primary | From 192.168.24.1 icmp_seq=1 Destination Host Unreachable
2020-09-28 06:13:08.863678 | primary | From 192.168.24.1 icmp_seq=2 Destination Host Unreachable

Changed in tripleo:
status: Fix Released → Triaged
Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

If we look at the job closely, it fails on https://opendev.org/openstack/openstack-ansible-os_tempest/src/branch/master/tasks/tempest_resources.yml#L283-L291
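
For reference, a hedged reproduction of that task from the standalone host (after sourcing the credentials file; the router created by os_tempest is named "router", per the analysis earlier in this bug, and jq is only used to pull out the gateway IP):

    ROUTER_IP=$(openstack router show router -f json \
      | jq -r '.external_gateway_info.external_fixed_ips[0].ip_address')
    ping -c2 "$ROUTER_IP"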

If we look at the interfaces before the upgrade, they are as follows:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:ef:f4:eb brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.23/24 brd 192.168.122.255 scope global dynamic noprefixroute ens3
       valid_lft 3147sec preferred_lft 3147sec
    inet6 fe80::5054:ff:feef:f4eb/64 scope link
       valid_lft forever preferred_lft forever
3: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:25:a0:f8 brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.229/24 brd 192.168.122.255 scope global dynamic noprefixroute ens4
       valid_lft 3271sec preferred_lft 3271sec
    inet6 fe80::5054:ff:fe25:a0f8/64 scope link
       valid_lft forever preferred_lft forever
4: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether c6:c0:26:55:35:d3 brd ff:ff:ff:ff:ff:ff
5: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 02:85:da:68:e5:4e brd ff:ff:ff:ff:ff:ff
    inet 192.168.24.2/24 scope global br-ex
       valid_lft forever preferred_lft forever
    inet6 fe80::85:daff:fe68:e54e/64 scope link
       valid_lft forever preferred_lft forever
6: br-ctlplane: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 02:85:da:68:e5:4e brd ff:ff:ff:ff:ff:ff
    inet 192.168.24.1/24 brd 192.168.24.255 scope global br-ctlplane
       valid_lft forever preferred_lft forever
    inet 192.168.24.3/32 brd 192.168.24.255 scope global br-ctlplane
       valid_lft forever preferred_lft forever
    inet6 fe80::85:daff:fe68:e54e/64 scope link
       valid_lft forever preferred_lft forever
7: br-int: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether fa:6f:46:31:8f:42 brd ff:ff:ff:ff:ff:ff

After the upgrade run they are as follows:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:ef:f4:eb brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.23/24 brd 192.168.122.255 scope global dynamic noprefixroute ens3
       valid_lft 2369sec preferred_lft 2369sec
    inet6 fe80::5054:ff:feef:f4eb/64 scope link
       valid_lft forever preferred_lft forever
3: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 q...


Revision history for this message
wes hayutin (weshayutin) wrote :
Revision history for this message
Marios Andreou (marios-b) wrote :

Digging some more today - per comment https://bugs.launchpad.net/tripleo/+bug/1895822/comments/8 there *is* a --control-virtual-ip being passed in both cases AFAICS.

Per https://bugs.launchpad.net/tripleo/+bug/1895822/comments/16, and as just discussed with Sergii on Freenode #oooq - I think the same env files are being passed in both cases?

I agree there is something off about the network config going train->ussuri. We are not getting this bug for the ussuri->master job - see the example at [1], which has the 'Ping router ip address' task executed twice (after deploy & after upgrade) without issue.

We changed something train -> ussuri - it might still be related to the switch to 'ha by default' [2].

I've been sanity-checking the deploy vs upgrade network config but haven't spotted anything yet, e.g. [3] vs [4]; the main diff is

    {"network_config": [{"addresses": [{"ip_netmask": "{{ ctlplane_ip }}/24"}],

vs

    {"network_config": [{"addresses": [{"ip_netmask": "{{ ctlplane_ip }}/24"}, {"ip_netmask": "192.168.24.3/32"}, {"ip_netmask": "192.168.24.1/32"}],

[1] https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_e30/752419/12/check/tripleo-ci-centos-8-standalone-upgrade/e30bfb8/job-output.txt
[2] https://review.opendev.org/#/c/359060/
[3] https://ac0934616631b7150b73-eba45f55476b984e1cae6ece42d21924.ssl.cf2.rackcdn.com/739457/30/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/3933f4c/logs/undercloud/home/zuul/standalone-ansible-i8z8idi8/Standalone/NetworkConfig
[4] https://ac0934616631b7150b73-eba45f55476b984e1cae6ece42d21924.ssl.cf2.rackcdn.com/739457/30/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/3933f4c/logs/undercloud/home/zuul/standalone-ansible-v1_y03ip/Standalone/NetworkConfig

Revision history for this message
Marios Andreou (marios-b) wrote :

Really not sure about this yet, but ...

I started looking for things that are in Ussuri but not in Train... I compared these:

        * https://opendev.org/openstack/tripleo-heat-templates/commits/branch/stable/ussuri/environments/standalone/standalone-tripleo.yaml
        * https://opendev.org/openstack/tripleo-heat-templates/commits/branch/stable/train/environments/standalone/standalone-tripleo.yaml

In particular, this commit seems interesting and is missing in train: https://opendev.org/openstack/tripleo-heat-templates/commit/c712355e4bae4ef2fc1b83e5603c0364dbd50a78 ("Deprecate Keepalived service", https://review.opendev.org/#/c/657067/)

I just cherry-picked it to train at https://review.opendev.org/755059 for testing/sanity, but as I said... not sure that's it yet.

Revision history for this message
Marios Andreou (marios-b) wrote :

This bit in particular is what caught my eye for comment #19: https://review.opendev.org/#/c/657067/49/net-config-standalone.j2.yaml

Revision history for this message
Marios Andreou (marios-b) wrote :

After the upgrade, the resulting network configuration has no br-ex [1], only br-ctlplane. This may explain what Jakub saw and commented on in comment #7 above.
Not sure if this is *why* there is no br-ex, but I noticed that the os-net-config data is different in the deployment [2] vs the upgrade [3]. In particular, the upgrade os-net-config config.json has the extra control_virtual_ip ("192.168.24.3/32") and public_virtual_ip ("192.168.24.1/32") passed in. I think the patch I attempted to cherry-pick to Train at [4] would add those, but I don't know if that's the reason or if it's somehow related (a small diff sketch of [2] vs [3] follows the links below).

I also see these 'martian source' br-ex messages in the journal [5], which make sense if we are removing br-ex during the upgrade:

        Sep 28 16:41:44 standalone.localdomain kernel: IPv4: martian source 192.168.24.119 from 192.168.24.119, on dev br-ex

I am hoping to point some of the DF folks at this; perhaps it will help get us closer to the issue. Grateful for any thoughts here, thank you.

[1] https://ac0934616631b7150b73-eba45f55476b984e1cae6ece42d21924.ssl.cf2.rackcdn.com/739457/30/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/3933f4c/logs/undercloud/var/log/extra/network.txt
[2] https://ac0934616631b7150b73-eba45f55476b984e1cae6ece42d21924.ssl.cf2.rackcdn.com/739457/30/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/3933f4c/logs/undercloud/etc/os-net-config/config.json.2020-09-28T16%3A05%3A24
[3] https://ac0934616631b7150b73-eba45f55476b984e1cae6ece42d21924.ssl.cf2.rackcdn.com/739457/30/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/3933f4c/logs/undercloud/etc/os-net-config/config.json
[4] https://review.opendev.org/#/c/755059/1/net-config-standalone.j2.yaml
[5] https://ac0934616631b7150b73-eba45f55476b984e1cae6ece42d21924.ssl.cf2.rackcdn.com/739457/30/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/3933f4c/logs/undercloud/var/log/extra/journal.txt
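
As a small aid, a hedged sketch of how the two renderings from [2] and [3] can be compared once downloaded locally (the file names below are made up; jq is only used to normalize key order):

    diff <(jq -S . config.json.deploy) <(jq -S . config.json.upgrade)
    # in this job the upgrade-side rendering adds the /32 VIP addresses
    # (192.168.24.3 and 192.168.24.1) to the bridge, per the comment above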

Revision history for this message
Alex Schultz (alex-schultz) wrote :

There is a br-ex in the network config.

https://799a0e5cbc4205600635-877a0f70cfb612990952d1399a198d7a.ssl.cf5.rackcdn.com/750595/1/check/tripleo-ci-centos-8-standalone-upgrade/2ecbaf6/logs/undercloud/etc/os-net-config/config.json

You'd want to check the NetworkConfig when comparing but br-ex is being configured in both.

For train we have:

https://799a0e5cbc4205600635-877a0f70cfb612990952d1399a198d7a.ssl.cf5.rackcdn.com/750595/1/check/tripleo-ci-centos-8-standalone-upgrade/2ecbaf6/logs/undercloud/home/zuul/standalone-ansible-s__he90q/Standalone/standalone/NetworkConfig

br-ex should get added via:
    sed -i "s/: \"bridge_name/: \"${bridge_name:-''}/g" /etc/os-net-config/config.json
    sed -i "s/interface_name/${interface_name:-''}/g" /etc/os-net-config/config.json

These values get invoked with br-ex from:

        tripleo_network_config_bridge_name: "{{ neutron_physical_bridge_name }}"
        tripleo_network_config_interface_name: "{{ neutron_public_interface_name }}"

This is defined in the core playbook. https://799a0e5cbc4205600635-877a0f70cfb612990952d1399a198d7a.ssl.cf5.rackcdn.com/750595/1/check/tripleo-ci-centos-8-standalone-upgrade/2ecbaf6/logs/undercloud/home/zuul/standalone-ansible-s__he90q/deploy_steps_playbook.yaml

In ussuri we don't have that stuff anymore because we backported the network config generation, so it no longer relies on that.

https://799a0e5cbc4205600635-877a0f70cfb612990952d1399a198d7a.ssl.cf5.rackcdn.com/750595/1/check/tripleo-ci-centos-8-standalone-upgrade/2ecbaf6/logs/undercloud/home/zuul/standalone-ansible-oewpzx4m/Standalone/NetworkConfig
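
For illustration, a self-contained sketch of what that substitution does; the JSON fragment below is made up for the example and is not the real CI file:

    printf '%s\n' '{"network_config": [{"type": "ovs_bridge", "name": "bridge_name", "members": [{"type": "interface", "name": "interface_name"}]}]}' > /tmp/config.json
    bridge_name=br-ex
    interface_name=eth1
    sed -i "s/: \"bridge_name/: \"${bridge_name:-''}/g" /tmp/config.json
    sed -i "s/interface_name/${interface_name:-''}/g" /tmp/config.json
    cat /tmp/config.json   # the bridge is now named "br-ex" with member interface "eth1"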

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart-extras (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/755607

Revision history for this message
Marios Andreou (marios-b) wrote :
Revision history for this message
Alex Schultz (alex-schultz) wrote :

It's there in [1], it just shows up with only an IPv6 address because there is no IPv4 address associated with it.

5: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UNKNOWN qlen 1000

The network config just seems wrong. Per the tripleo-docs we always use br-ctlplane instead of br-ex, but I'm trying to remember why it's configured that way. In upstream CI br-ex was always already configured via the undercloud-setup bits for multinode, so I'm wondering if we're just running into a poor CI setup vs. there actually being a problem.

Revision history for this message
Alex Schultz (alex-schultz) wrote :

Ok so in Train, the os-net-config network config only includes the ctlplane/24 for the address. In ussuri this changed to also include the ctlplane and public vips as /32 because of the deprecation of keepalived.

Because ctlplane_ip is listed twice with a /24 and a /32 we get both. I'm trying to track down what changed (or if this has always been broken). It seems like we need to exclude a vip address if it's already the ctlplane_ip.

It looks like in train, we only added the ctlplane vip as a vip and not the public vip like we do in ussuri.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.opendev.org/756521

Changed in tripleo:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-quickstart-extras (master)

Change abandoned by Alex Schultz (<email address hidden>) on branch: master
Review: https://review.opendev.org/755607
Reason: https://review.opendev.org/#/c/756521 is likely the correct fix. we can keep using br-ex, we just need to configure it correctly

tags: added: train-backport-potential ussuri-backport-potential
Revision history for this message
Alex Schultz (alex-schultz) wrote :

This shouldn't affect train; it was a change in ussuri to get rid of keepalived. Unless we backport that, this is not needed.

tags: removed: train-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/756532

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to python-tripleoclient (master)

Fix proposed to branch: master
Review: https://review.opendev.org/756562

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by Alex Schultz (<email address hidden>) on branch: master
Review: https://review.opendev.org/756521
Reason: https://review.opendev.org/#/c/756562/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (stable/ussuri)

Change abandoned by Alex Schultz (<email address hidden>) on branch: stable/ussuri
Review: https://review.opendev.org/756532

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to python-tripleoclient (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/756563

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/756577

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/756579

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/756706

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/756707

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-ansible (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/756715

Revision history for this message
Alex Schultz (alex-schultz) wrote :

Ok so I think it's the bridgemap & network config. Doing the upgrade using the documentation with the following configuration works fine:

  NeutronPublicInterface: eth1
  NeutronBridgeMappings: "datacentre:br-ctlplane"
  NeutronPhysicalBridge: br-ctlplane

In CI, we only configure NeutronPublicInterface: br-ex. NeutronPhysicalBridge is br-ex by default. So it appears that the br-ctlplane interface that we use with br-ex is not being properly connected.

I think what happens is we have:

br-ex -> br-ctlplane (with ips)
 ^- neutron

So routing gets weird. I think we need

br-ex -> br-ctlplane (with ips)
            ^- neutron
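
A hedged way to see that "weird" routing from the node itself (addresses taken from this job; the example outputs are illustrative):

    ip -4 addr show dev br-ex
    ip -4 addr show dev br-ctlplane
    ip route show 192.168.24.0/24
    # two connected routes for the same /24, one per bridge, e.g.:
    #   192.168.24.0/24 dev br-ctlplane proto kernel scope link src 192.168.24.1
    #   192.168.24.0/24 dev br-ex proto kernel scope link src 192.168.24.2
    ip route get 192.168.24.122   # which device does traffic to the router/FIP actually use?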

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart-extras (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/757119

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by Alex Schultz (<email address hidden>) on branch: master
Review: https://review.opendev.org/756577
Reason: we need this defined because the framework assumes there is always a control_virtual_ip

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Alex Schultz (<email address hidden>) on branch: master
Review: https://review.opendev.org/756706
Reason: we need this defined because the framework assumes there is always a control_virtual_ip

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (stable/ussuri)

Change abandoned by Alex Schultz (<email address hidden>) on branch: stable/ussuri
Review: https://review.opendev.org/756707

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Alex Schultz (<email address hidden>) on branch: stable/ussuri
Review: https://review.opendev.org/756579

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-quickstart-extras (master)

Change abandoned by Alex Schultz (<email address hidden>) on branch: master
Review: https://review.opendev.org/757119
Reason: we don't need to change ips if we can change the br-ex allocation

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on python-tripleoclient (master)

Change abandoned by Alex Schultz (<email address hidden>) on branch: master
Review: https://review.opendev.org/756562
Reason: This actually doesn't cause a problem (TM)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on python-tripleoclient (stable/ussuri)

Change abandoned by Alex Schultz (<email address hidden>) on branch: stable/ussuri
Review: https://review.opendev.org/756563

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart-extras (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/757756

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/757900

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-ansible (master)

Change abandoned by Alex Schultz (<email address hidden>) on branch: master
Review: https://review.opendev.org/756715

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/757900
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=c86fa4fb4ee6f4bf86252ca3e1fde25fa55c3677
Submitter: Zuul
Branch: master

commit c86fa4fb4ee6f4bf86252ca3e1fde25fa55c3677
Author: Alex Schultz <email address hidden>
Date: Tue Oct 13 09:42:28 2020 -0600

    Don't manage bridge mappings in scenario file

    The bridge mappings should be managed in the standalone parameters. This
    bridge mapping prevents us from being able to change the datacentre
    mapping in CI.

    Change-Id: I6b5b9db75a11c2347720258a39b03aa28702dbf1
    Related-Bug: #1895822

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart-extras (master)

Reviewed: https://review.opendev.org/757756
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart-extras/commit/?id=eaf976ae0e1795b47b372ed23125f35a86b518f0
Submitter: Zuul
Branch: master

commit eaf976ae0e1795b47b372ed23125f35a86b518f0
Author: yatinkarel <email address hidden>
Date: Tue Oct 13 13:55:13 2020 +0530

    Handle migration of br-ex network

    [1] changing br-ex network from 192.168.24 to 172.16.1
    for standalone jobs.

    In order to allow the migration we need to adjust
    the tempest configuration. This can be reverted once
    [1] and [2] land.

    [1] https://review.opendev.org/#/c/757605
    [2] https://review.opendev.org/#/c/755607

    Related-Bug: #1895822
    Change-Id: I1865db911661092debb133fd2638c9a6a9bd2e47

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/755607
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart-extras/commit/?id=fa1bd4ad28d4e7aaec7c2cde29603118d840247a
Submitter: Zuul
Branch: master

commit fa1bd4ad28d4e7aaec7c2cde29603118d840247a
Author: Alex Schultz <email address hidden>
Date: Thu Oct 1 12:03:33 2020 -0600

    Standalone configure neutron bridge correctly

    This change updates the default bridge mapping from datacentre:br-ex to
    datacentre:br-ctlplane. We're doing this because in the standalone in
    CI, we configure a br-ex before running the standalone (via
    undercloud-setup) and want to attach our br-ctlplane to it. We then want
    to ensure that we use br-ctlplane for the neutron access to the external
    network to prevent weird routing issues when we have two bridges on the
    same subnet.

    Depends-On: https://review.opendev.org/#/c/757605/
    Change-Id: I0e5aa3f58746dc0b92bd35ade7792f323b5647f7
    Related-Bug: #1895822

Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :

Hello Alex,

I know you are already working on sc12 along with this bug, but just in case:

tripleo-ci-centos-8-scenario012-standalone is now also failing at tempest for ussuri/train, for both check and periodic jobs, with the same error:

~~~
2020-10-22 07:40:54.215940 | primary | TASK [os_tempest : Ping router ip address] *************************************
2020-10-22 07:40:54.215980 | primary | Thursday 22 October 2020 07:40:54 +0000 (0:00:00.111) 0:46:43.145 ******
2020-10-22 07:40:58.032949 | primary | FAILED - RETRYING: Ping router ip address (5 retries left).
2020-10-22 07:41:11.729033 | primary | FAILED - RETRYING: Ping router ip address (4 retries left).
2020-10-22 07:41:25.488429 | primary | FAILED - RETRYING: Ping router ip address (3 retries left).
2020-10-22 07:41:39.185617 | primary | FAILED - RETRYING: Ping router ip address (2 retries left).
2020-10-22 07:41:52.881059 | primary | FAILED - RETRYING: Ping router ip address (1 retries left).
2020-10-22 07:42:06.576933 | primary | fatal: [undercloud]: FAILED! => {
2020-10-22 07:42:06.577289 | primary | "attempts": 5,
2020-10-22 07:42:06.577361 | primary | "changed": true,
2020-10-22 07:42:06.577407 | primary | "cmd": "set -e\nping -c2 \"192.168.24.120\"\n",
2020-10-22 07:42:06.577449 | primary | "delta": "0:00:03.066736",
2020-10-22 07:42:06.577492 | primary | "end": "2020-10-22 07:42:06.538032",
2020-10-22 07:42:06.577534 | primary | "rc": 1,
2020-10-22 07:42:06.577573 | primary | "start": "2020-10-22 07:42:03.471296"
2020-10-22 07:42:06.577612 | primary | }
~~~

Logs:- https://zuul.openstack.org/build/83960e41edca4d64b9d43cf8f168a08a/log/job-output.txt

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/759295

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/759296

Revision history for this message
Alex Schultz (alex-schultz) wrote :

cherry picked the CI change back for scenario012 https://review.opendev.org/759295 https://review.opendev.org/759296

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/train)

Reviewed: https://review.opendev.org/759296
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=f07ca38a82c659988d4630f9e5ce87da474a861d
Submitter: Zuul
Branch: stable/train

commit f07ca38a82c659988d4630f9e5ce87da474a861d
Author: Alex Schultz <email address hidden>
Date: Tue Oct 13 09:42:28 2020 -0600

    Don't manage bridge mappings in scenario file

    The bridge mappings should be managed in the standalone parameters. This
    bridge mapping prevents us from being able to change the datacentre
    mapping in CI.

    Change-Id: I6b5b9db75a11c2347720258a39b03aa28702dbf1
    Related-Bug: #1895822
    (cherry picked from commit c86fa4fb4ee6f4bf86252ca3e1fde25fa55c3677)

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/ussuri)

Reviewed: https://review.opendev.org/759295
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=805fe6e4196306883a66c0d1e3adc8a43d9702d2
Submitter: Zuul
Branch: stable/ussuri

commit 805fe6e4196306883a66c0d1e3adc8a43d9702d2
Author: Alex Schultz <email address hidden>
Date: Tue Oct 13 09:42:28 2020 -0600

    Don't manage bridge mappings in scenario file

    The bridge mappings should be managed in the standalone parameters. This
    bridge mapping prevents us from being able to change the datacentre
    mapping in CI.

    Change-Id: I6b5b9db75a11c2347720258a39b03aa28702dbf1
    Related-Bug: #1895822
    (cherry picked from commit c86fa4fb4ee6f4bf86252ca3e1fde25fa55c3677)

tags: added: in-stable-ussuri
Changed in tripleo:
milestone: victoria-3 → wallaby-1
Changed in tripleo:
milestone: wallaby-1 → wallaby-2
Changed in tripleo:
milestone: wallaby-2 → wallaby-3
Changed in tripleo:
milestone: wallaby-3 → wallaby-rc1
Changed in tripleo:
milestone: wallaby-rc1 → xena-1
Changed in tripleo:
milestone: xena-1 → xena-2
Changed in tripleo:
milestone: xena-2 → xena-3
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-ci (master)

Change abandoned by "Alex Schultz <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/tripleo-ci/+/757895

Revision history for this message
Rabi Mishra (rabi) wrote :
Revision history for this message
Ronelle Landy (rlandy) wrote :
Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart-extras (master)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart-extras (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/878747
Committed: https://opendev.org/openstack/tripleo-quickstart-extras/commit/6887104038ba5d0152c77abe4a38e98d72693b89
Submitter: "Zuul (22348)"
Branch: master

commit 6887104038ba5d0152c77abe4a38e98d72693b89
Author: yatinkarel <email address hidden>
Date: Tue Mar 28 14:03:21 2023 +0530

    Fix neutron_bridge_mappings default for standalone

    br-tenant is not created as part of standalone
    deployments but bridge_mapping was referring to
    it. This results in an unnecessary Warning [1] in
    ovn-controller logs; this patch fixes it.

    [1] Bridge 'br-tenant' not found for network 'tenant'

    Related-Bug: #1895822
    Change-Id: I9b23d6842cd518971b325ffd29b51d171c353b4f
