Here is my analysis of https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_ee1/739457/27/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/ee18aa6/job-output.txt
Timeline:
1) Standalone upgrade starts at 2020-09-15 13:59:31 and completes successfully at 2020-09-15 15:00:42
Note that towards the end of the upgrade we can observe a number of scary messages such as:
2020-09-15 15:00:03.942 8 ERROR nova.servicegroup.drivers.db [-] Unexpected error while reporting service status 2003, "Can't connect to MySQL server on '192.168.24.3'
The reason for these error messages is that one of the post-upgrade tasks in tripleo-heat-templates (tht) restarts the ovn-dbs-bundle (Ia7cf78e1f5e46235147bdf67c03b58d774244774), which brings down both VIPs (I expected only one VIP to go down, but apparently both 24.1 and 24.3 get restarted; I will investigate that separately, as it does not seem too important just yet).
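To confirm that these MySQL errors line up with the VIP move rather than with a real database problem, a quick TCP reachability probe against the VIP is enough. Here is a minimal Python sketch, assuming the address 192.168.24.3 and the default MariaDB port 3306 taken from the error above; the poll interval is arbitrary:

import socket
import time

def vip_reachable(host="192.168.24.3", port=3306, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Poll once a second; the nova errors above correspond to the short window
    # while pacemaker moves the VIPs during the ovn-dbs-bundle restart.
    for _ in range(10):
        print(time.strftime("%H:%M:%S"), "reachable" if vip_reachable() else "unreachable")
        time.sleep(1)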
2) The ovn-dbs restart is in any case fully completed at 15:00:04:
Sep 15 15:00:00 standalone.localdomain pacemaker-execd [325954] (log_finished) info: finished - rsc:ovn-dbs-bundle-podman-0 action:start call_id:100 pid:586688 exit-code:0 exec-time:2309ms queue-time:0ms
Sep 15 15:00:04 standalone.localdomain pacemaker-controld [325957] (process_lrm_event) notice: Result of start operation for ip-192.168.24.1 on standalone: 0 (ok) | call=102 key=ip-192.168.24.1_start_0 confirmed=true cib-update=385
Sep 15 15:00:04 standalone.localdomain pacemaker-controld [325957] (process_lrm_event) notice: Result of start operation for ip-192.168.24.3 on standalone: 0 (ok) | call=103 key=ip-192.168.24.3_start_0 confirmed=true cib-update=387
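For reference, the VIP start results can be pulled out of the collected pacemaker log mechanically; a small sketch, where the log filename is an assumption and the pattern simply matches the messages quoted above:

import re

PATTERN = re.compile(
    r"^(?P<when>\w+ +\d+ [\d:]+) .*Result of start operation for "
    r"(?P<rsc>ip-\S+) on \S+: (?P<rc>\d+)"
)

# Hypothetical local copy of the collected pacemaker log.
with open("pacemaker.log") as log:
    for line in log:
        m = PATTERN.search(line)
        if m:
            # e.g. "Sep 15 15:00:04 ip-192.168.24.3 rc=0"
            print(m.group("when"), m.group("rsc"), "rc=" + m.group("rc"))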
3) The router for the failing os_tempest ping gets successfully created at:
2020-09-15 15:02:01.467358 | primary | TASK [os_tempest : Create router] *******
2020-09-15 15:02:01.467377 | primary | Tuesday 15 September 2020 15:02:01 +0000 (0:00:02.308) 1:10:27.761 *****
2020-09-15 15:02:04.475258 | primary | ok: [undercloud -> 127.0.0.2]
2020-09-15 15:02:04.504699 | primary |
2020-09-15 15:02:04.504764 | primary | TASK [os_tempest : Get router admin state and ip address] *******
2020-09-15 15:02:04.504777 | primary | Tuesday 15 September 2020 15:02:04 +0000 (0:00:03.037) 1:10:30.799 *****
2020-09-15 15:02:04.557379 | primary | ok: [undercloud -> 127.0.0.2]
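As a side note, the information that the "Get router admin state and ip address" task fetches can be re-checked by hand with openstacksdk. A minimal sketch, assuming a clouds.yaml entry named "standalone" and the router name "router" used by the role:

import openstack

# Cloud name is an assumption; use whatever clouds.yaml entry the job configures.
conn = openstack.connect(cloud="standalone")

router = conn.network.find_router("router")
if router is None:
    raise SystemExit("router not found")

print("admin_state_up:", router.is_admin_state_up)
# The external gateway info carries the fixed IP that tempest pings
# (192.168.24.122 in this run).
gateway = router.external_gateway_info or {}
for fixed_ip in gateway.get("external_fixed_ips", []):
    print("gateway ip:", fixed_ip.get("ip_address"))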
4) The ping itself fails at 15:02:07:
2020-09-15 15:02:07.057448 | primary | TASK [os_tempest : Ping router ip address] *******
2020-09-15 15:02:07.057502 | primary | Tuesday 15 September 2020 15:02:07 +0000 (0:00:00.065) 1:10:33.351 *****
2020-09-15 15:02:10.745010 | primary | FAILED - RETRYING: Ping router ip address (5 retries left).
2020-09-15 15:02:24.365896 | primary | FAILED - RETRYING: Ping router ip address (4 retries left).
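The failing task is essentially a ping with retries against the router gateway address. Here is a rough stand-alone equivalent in Python for reproducing the check locally; the retry count and delay are guesses, not the role's exact values:

import subprocess
import sys
import time

def ping_once(ip, timeout=5):
    """Return True if a single ICMP ping to ip succeeds."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout), ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    ip = "192.168.24.122"
    for attempt in range(1, 6):
        if ping_once(ip):
            print(f"ping to {ip} succeeded on attempt {attempt}")
            sys.exit(0)
        print(f"FAILED - retrying ({5 - attempt} retries left)")
        time.sleep(10)
    sys.exit(1)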
After the failure, during the log collection, we do see in the ovn logs that we have a port corresponding to the ip tempest is pinging (192.168.24.122):
router 5e5e16a8-7c81-4aea-a56f-9edbb3343a34 (neutron-28fab0bc-bb0b-4e75-9204-cb19cc28246f) (aka router)
    port lrp-9b278e05-ae3c-49a9-b9ac-770319bb366e
        mac: "fa:16:3e:fc:fb:b2"
        networks: ["192.168.74.1/28"]
    port lrp-adaa967e-02bf-4207-8cbb-18637f0d7cac
        mac: "fa:16:3e:f7:51:b2"
        networks: ["192.168.24.122/24"]
    gateway chassis: [b4f013e5-31a0-44de-a8e7-a511ac4dc739]
    nat 9770fed2-490f-40d0-9c61-e8a62c7704b1
        external ip: "192.168.24.122"
        logical ip: "192.168.74.0/28"
        type: "snat"
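To make that check mechanical, here is a small sketch that walks an "ovn-nbctl show" dump like the one above (as captured in the collected logs) and reports which logical router port's subnet contains a given address; the input filename is hypothetical:

import ipaddress
import re

def ports_containing(nbctl_show_text, ip):
    """Yield (port_name, cidr) for every port whose configured subnet contains ip."""
    addr = ipaddress.ip_address(ip)
    current_port = None
    for line in nbctl_show_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("port "):
            current_port = stripped.split()[1]
        match = re.search(r'networks:\s*\["([^"]+)"\]', stripped)
        if match and current_port:
            cidr = match.group(1)
            if addr in ipaddress.ip_network(cidr, strict=False):
                yield current_port, cidr

if __name__ == "__main__":
    with open("ovn-nbctl-show.txt") as dump:  # hypothetical dump of the NB DB
        for port, cidr in ports_containing(dump.read(), "192.168.24.122"):
            print(f"{port} carries {cidr}")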
So I think there are two main possibilities here:
A) After Ia7cf78e1f5e46235147bdf67c03b58d774244774 or some unrelated change, we need more time for things to settle (this feels a bit less likely)
B) Something else more low-level with the network is going on (I kinda expected to see 192.168.24.122 somewhere in https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_ee1/739457/27/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/ee18aa6/logs/undercloud/var/log/extra/network-netns but it might be that with OVN and openvswitch things are a bit different from what I expect)
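For possibility B, one way to rule out simply missing the address is to grep the whole collected network-netns directory for it. A minimal sketch; the local path mirrors the log URL above, and with OVN the address may sit on an OVS port rather than inside a qrouter namespace, so an empty result is not conclusive on its own:

import pathlib

NEEDLE = "192.168.24.122"
LOG_DIR = pathlib.Path("logs/undercloud/var/log/extra/network-netns")

for path in sorted(LOG_DIR.rglob("*")):
    if not path.is_file():
        continue
    try:
        text = path.read_text(errors="replace")
    except OSError:
        continue
    for lineno, line in enumerate(text.splitlines(), start=1):
        if NEEDLE in line:
            print(f"{path}:{lineno}: {line.strip()}")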