Comment 5 for bug 1895822

Michele Baldessari (michele) wrote:

Here is my analysis of https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_ee1/739457/27/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/ee18aa6/job-output.txt

Timeline:
1) Standalone upgrade starts at 2020-09-15 13:59:31 and completes successfully at 2020-09-15 15:00:42

Note that towards the end of the upgrade we can observe a number of scary messages such as:
2020-09-15 15:00:03.942 8 ERROR nova.servicegroup.drivers.db [-] Unexpected error while reporting service status 2003, "Can't connect to MySQL server on '192.168.24.3'

The reason for these error messages is that one of the post-upgrade tasks in tripleo-heat-templates restarts the ovn-dbs-bundle (Ia7cf78e1f5e46235147bdf67c03b58d774244774), which brings down both VIPs (I expected only one VIP to go down, but apparently both 24.1 and 24.3 get restarted; I will investigate that separately, it does not seem too important just yet).
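
Just to illustrate what the nova error above amounts to (this is not anything nova itself runs): a minimal Python sketch probing the MySQL VIP the way any client would, assuming the standard 3306 port; it fails while the VIP is down and succeeds again once pacemaker has brought the address back.

#!/usr/bin/env python3
# Illustrative sketch only (not anything nova runs): probe the MySQL VIP the
# way a client would, to show the errors above are transient while pacemaker
# moves the VIP. 3306 is assumed to be the usual MariaDB port.
import socket
import time

VIP = "192.168.24.3"   # the address from the nova error above
PORT = 3306            # standard MySQL/MariaDB port (assumption)

def vip_reachable(host, port, timeout=2.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for attempt in range(1, 11):
    if vip_reachable(VIP, PORT):
        print(f"attempt {attempt}: {VIP}:{PORT} reachable again")
        break
    print(f"attempt {attempt}: {VIP}:{PORT} still unreachable, retrying")
    time.sleep(3)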

2) The ovn-dbs restart is in any case fully completed at 15:00:04:
Sep 15 15:00:00 standalone.localdomain pacemaker-execd [325954] (log_finished) info: finished - rsc:ovn-dbs-bundle-podman-0 action:start call_id:100 pid:586688 exit-code:0 exec-time:2309ms queue-time:0ms
Sep 15 15:00:04 standalone.localdomain pacemaker-controld [325957] (process_lrm_event) notice: Result of start operation for ip-192.168.24.1 on standalone: 0 (ok) | call=102 key=ip-192.168.24.1_start_0 confirmed=true cib-update=385
Sep 15 15:00:04 standalone.localdomain pacemaker-controld [325957] (process_lrm_event) notice: Result of start operation for ip-192.168.24.3 on standalone: 0 (ok) | call=103 key=ip-192.168.24.3_start_0 confirmed=true cib-update=387
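
If one wants to double-check this on a live node rather than from the logs, something along these lines works; just a sketch, assuming the pcs CLI is available on the standalone node and it is run as root:

# Quick sketch to confirm both VIP resources are started again after the
# ovn-dbs-bundle restart; assumes the pcs CLI is installed on the node,
# as is the case on pacemaker-managed deployments.
import subprocess

status = subprocess.run(["pcs", "status", "resources"],
                        capture_output=True, text=True, check=True).stdout
for vip in ("ip-192.168.24.1", "ip-192.168.24.3"):
    line = next((l for l in status.splitlines() if vip in l), "<not found>")
    print(f"{vip}: {line.strip()}")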

3) The router for the failing os_tempest ping gets successfully created at:
2020-09-15 15:02:01.467358 | primary | TASK [os_tempest : Create router] **********************************************
2020-09-15 15:02:01.467377 | primary | Tuesday 15 September 2020 15:02:01 +0000 (0:00:02.308) 1:10:27.761 *****
2020-09-15 15:02:04.475258 | primary | ok: [undercloud -> 127.0.0.2]
2020-09-15 15:02:04.504699 | primary |
2020-09-15 15:02:04.504764 | primary | TASK [os_tempest : Get router admin state and ip address] **********************
2020-09-15 15:02:04.504777 | primary | Tuesday 15 September 2020 15:02:04 +0000 (0:00:03.037) 1:10:30.799 *****
2020-09-15 15:02:04.557379 | primary | ok: [undercloud -> 127.0.0.2]

4) The ping itself fails at 15:02:07:
2020-09-15 15:02:07.057448 | primary | TASK [os_tempest : Ping router ip address] *************************************
2020-09-15 15:02:07.057502 | primary | Tuesday 15 September 2020 15:02:07 +0000 (0:00:00.065) 1:10:33.351 *****
2020-09-15 15:02:10.745010 | primary | FAILED - RETRYING: Ping router ip address (5 retries left).
2020-09-15 15:02:24.365896 | primary | FAILED - RETRYING: Ping router ip address (4 retries left).
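
For reference, the failing check boils down to retrying a plain ping against the router's external IP; a rough stand-alone equivalent in Python (assuming the same 192.168.24.122 address and the five retries shown above):

# Rough stand-alone equivalent of the failing "Ping router ip address" task:
# ping the router's external IP with a handful of retries, like the role does.
# 192.168.24.122 is the address from the OVN dump further below.
import subprocess
import sys
import time

ROUTER_IP = "192.168.24.122"
RETRIES = 5          # matches the 5 retries visible in the task output

for attempt in range(1, RETRIES + 1):
    result = subprocess.run(["ping", "-c", "3", "-W", "2", ROUTER_IP],
                            capture_output=True, text=True)
    if result.returncode == 0:
        print(f"attempt {attempt}: {ROUTER_IP} answered")
        sys.exit(0)
    print(f"attempt {attempt}: no reply from {ROUTER_IP}, retrying")
    time.sleep(5)
sys.exit(1)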

After the failure, during log collection, we do see in the OVN logs a port corresponding to the IP tempest is pinging (192.168.24.122):
router 5e5e16a8-7c81-4aea-a56f-9edbb3343a34 (neutron-28fab0bc-bb0b-4e75-9204-cb19cc28246f) (aka router)
    port lrp-9b278e05-ae3c-49a9-b9ac-770319bb366e
        mac: "fa:16:3e:fc:fb:b2"
        networks: ["192.168.74.1/28"]
    port lrp-adaa967e-02bf-4207-8cbb-18637f0d7cac
        mac: "fa:16:3e:f7:51:b2"
        networks: ["192.168.24.122/24"]
        gateway chassis: [b4f013e5-31a0-44de-a8e7-a511ac4dc739]
    nat 9770fed2-490f-40d0-9c61-e8a62c7704b1
        external ip: "192.168.24.122"
        logical ip: "192.168.74.0/28"
        type: "snat"
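
The dump above looks like ovn-nbctl show output; to re-check just that address programmatically, something like the following sketch would do, assuming ovn-nbctl can reach the northbound DB from wherever it is run (e.g. inside the ovn-dbs container):

# Sketch: re-check that the OVN northbound DB really carries the pinged IP
# (logical router port / SNAT entry). Assumes ovn-nbctl is available and can
# reach the NB DB from where this runs.
import subprocess

IP = "192.168.24.122"
nb = subprocess.run(["ovn-nbctl", "show"],
                    capture_output=True, text=True, check=True).stdout
hits = [line.strip() for line in nb.splitlines() if IP in line]
print("\n".join(hits) if hits else f"{IP} not found in the NB DB")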

So I think there are two main possibilities here:
A) After Ia7cf78e1f5e46235147bdf67c03b58d774244774, or some unrelated change, we simply need more time for things to settle (this feels a bit less likely)
B) Something else more low-level is going on with the network (I kind of expected to see 192.168.24.122 somewhere in https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_ee1/739457/27/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/ee18aa6/logs/undercloud/var/log/extra/network-netns, but it might be that with OVN and Open vSwitch things work a bit differently than I expect)
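
As a possible follow-up for (B), a quick sketch to check on the node itself where (if anywhere) 192.168.24.122 is actually bound; this assumes iproute2 and root access, and note that with ML2/OVN the router IP may only exist as OVN logical flows on br-int rather than in any namespace, so an empty result here would not by itself be wrong:

# Sketch for (B): check whether 192.168.24.122 is bound to any interface,
# in the root namespace or in any network namespace on the node.
# Assumes iproute2 is available and the script runs as root.
import subprocess

IP = "192.168.24.122"

def addresses(ns=None):
    cmd = (["ip", "netns", "exec", ns] if ns else []) + ["ip", "-o", "addr"]
    return subprocess.run(cmd, capture_output=True, text=True).stdout

ns_list = subprocess.run(["ip", "netns", "list"],
                         capture_output=True, text=True).stdout
namespaces = [line.split()[0] for line in ns_list.splitlines() if line.strip()]

found = ["root namespace"] if IP in addresses() else []
found += [ns for ns in namespaces if IP in addresses(ns)]
print(f"{IP} bound in: {', '.join(found)}" if found
      else f"{IP} not bound to any kernel interface")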