network controller validations are potentially racy

Bug #1913725 reported by Michele Baldessari
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Medium
Assigned to: Alex Schultz

Bug Description

So this happened to me while I had [A] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/772838 applied, but I don't think that should affect this issue. I saw this in a large control plane env (3xctrl,3xnet,3xmsg,3xdb,2xcmp).

TL;DR: the deployment sometimes (though not often) fails because one node is unable to ping all the controllers.

Failure:
2021-01-29 09:29:56.132773 | 525400b9-fe2e-5597-f2b1-000000004bb5 | FATAL | Check Controllers availability | networker-1 | item=172.17.1.52 | error={"ansible_loop_var": "controller", "changed": false, "cmd": ["ping", "-w", "10", "-c", "1", "172.17.1.52"], "controller": "172.17.1.52", "delta": "0:00:03.071556", "end": "2021-01-29 04:29:56.276464", "msg": "non-zero return code", "rc": 1, "start": "2021-01-29 04:29:53.204908", "stderr": "", "stderr_lines": [], "stdout": "PING 172.17.1.52 (172.17.1.52) 56(84) bytes of data.\nFrom 172.17.1.48 icmp_seq=1 Destination Host Unreachable\nFrom 172.17.1.48 icmp_seq=2 Destination Host Unreachable\nFrom 172.17.1.48 icmp_seq=3 Destination Host Unreachable\n\n--- 172.17.1.52 ping statistics ---\n3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 45ms\npipe 3", "stdout_lines": ["PING 172.17.1.52 (172.17.1.52) 56(84) bytes of data.", "From 172.17.1.48 icmp_seq=1 Destination Host Unreachable", "From 172.17.1.48 icmp_seq=2 Destination Host Unreachable", "From 172.17.1.48 icmp_seq=3 Destination Host Unreachable", "", "--- 172.17.1.52 ping statistics ---", "3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 45ms", "pipe 3"]}

I believe the reason is that with the tripleo_free strategy we have no guarantee that the controller ping verification runs on every node *after* the network has been configured on all nodes.

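In fact, in today's failure the ping check on networker-1 fails at 09:29:56, before controller-0, controller-1 and controller-2 report "NetworkConfig stdout" (09:30:08 to 09:30:15):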
2021-01-29 09:29:48.015479 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:29:48.073385 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | database-0 | result={
2021-01-29 09:29:48.074311 | 525400b9-fe2e-5597-f2b1-00000000005b | TIMING | tripleo-network-config : NetworkConfig stdout | database-0 | 0:07:20.495896 | 0.06s
2021-01-29 09:29:49.321407 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:29:49.430435 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | database-2 | result={
2021-01-29 09:29:49.431575 | 525400b9-fe2e-5597-f2b1-00000000005b | TIMING | tripleo-network-config : NetworkConfig stdout | database-2 | 0:07:21.853166 | 0.11s
2021-01-29 09:29:49.969115 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:29:50.029775 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | messaging-1 | result={
2021-01-29 09:29:50.031358 | 525400b9-fe2e-5597-f2b1-00000000005b | TIMING | tripleo-network-config : NetworkConfig stdout | messaging-1 | 0:07:22.452918 | 0.06s
2021-01-29 09:29:50.239686 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:29:50.408072 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | networker-1 | result={
2021-01-29 09:29:50.409420 | 525400b9-fe2e-5597-f2b1-00000000005b | TIMING | tripleo-network-config : NetworkConfig stdout | networker-1 | 0:07:22.830995 | 0.17s
2021-01-29 09:29:50.486275 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:29:50.615958 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | messaging-0 | result={
2021-01-29 09:29:50.617039 | 525400b9-fe2e-5597-f2b1-00000000005b | TIMING | tripleo-network-config : NetworkConfig stdout | messaging-0 | 0:07:23.038615 | 0.13s
2021-01-29 09:29:52.631912 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:29:52.728719 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | database-1 | result={
2021-01-29 09:29:52.730765 | 525400b9-fe2e-5597-f2b1-00000000005b | TIMING | tripleo-network-config : NetworkConfig stdout | database-1 | 0:07:25.152339 | 0.10s
2021-01-29 09:29:53.919091 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:29:54.015006 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | messaging-2 | result={
2021-01-29 09:29:54.016237 | 525400b9-fe2e-5597-f2b1-00000000005b | TIMING | tripleo-network-config : NetworkConfig stdout | messaging-2 | 0:07:26.437809 | 0.10s
2021-01-29 09:29:56.132773 | 525400b9-fe2e-5597-f2b1-000000004bb5 | FATAL | Check Controllers availability | networker-1 | item=172.17.1.52 | error={"ansible_loop_var": "controller", "changed": false, "cmd": ["ping", "-w", "10", "-c", "1", "172.17.1.52"], "controller": "172.17.1.52", "delta": "0:00:03.071556", "end": "2021-01-29 04:29:56.276464", "msg": "non-zero return code", "rc": 1, "start": "2021-01-29 04:29:53.204908", "stderr": "", "stderr_lines": [], "stdout": "PING 172.17.1.52 (172.17.1.52) 56(84) bytes of data.\nFrom 172.17.1.48 icmp_seq=1 Destination Host Unreachable\nFrom 172.17.1.48 icmp_seq=2 Destination Host Unreachable\nFrom 172.17.1.48 icmp_seq=3 Destination Host Unreachable\n\n--- 172.17.1.52 ping statistics ---\n3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 45ms\npipe 3", "stdout_lines": ["PING 172.17.1.52 (172.17.1.52) 56(84) bytes of data.", "From 172.17.1.48 icmp_seq=1 Destination Host Unreachable", "From 172.17.1.48 icmp_seq=2 Destination Host Unreachable", "From 172.17.1.48 icmp_seq=3 Destination Host Unreachable", "", "--- 172.17.1.52 ping statistics ---", "3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 45ms", "pipe 3"]}
2021-01-29 09:29:59.147481 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:29:59.206947 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | compute-0 | result={
2021-01-29 09:29:59.208049 | 525400b9-fe2e-5597-f2b1-00000000005b | TIMING | tripleo-network-config : NetworkConfig stdout | compute-0 | 0:07:31.629635 | 0.06s
2021-01-29 09:30:03.214422 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:30:03.274123 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | networker-0 | result={
2021-01-29 09:30:03.275102 | 525400b9-fe2e-5597-f2b1-00000000005b | TIMING | tripleo-network-config : NetworkConfig stdout | networker-0 | 0:07:35.696692 | 0.06s
2021-01-29 09:30:03.389089 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:30:03.449827 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | networker-2 | result={
2021-01-29 09:30:03.451547 | 525400b9-fe2e-5597-f2b1-00000000005b | TIMING | tripleo-network-config : NetworkConfig stdout | networker-2 | 0:07:35.873118 | 0.06s
2021-01-29 09:30:04.401723 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:30:04.460046 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | compute-1 | result={
2021-01-29 09:30:04.461073 | 525400b9-fe2e-5597-f2b1-00000000005b | TIMING | tripleo-network-config : NetworkConfig stdout | compute-1 | 0:07:36.882660 | 0.06s
2021-01-29 09:30:08.736141 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:30:08.796386 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | controller-0 | result={
2021-01-29 09:30:08.797481 | 525400b9-fe2e-5597-f2b1-00000000005b | TIMING | tripleo-network-config : NetworkConfig stdout | controller-0 | 0:07:41.219068 | 0.06s
2021-01-29 09:30:10.794523 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:30:10.853261 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | controller-1 | result={
2021-01-29 09:30:10.854398 | 525400b9-fe2e-5597-f2b1-00000000005b | TIMING | tripleo-network-config : NetworkConfig stdout | controller-1 | 0:07:43.275984 | 0.06s
2021-01-29 09:30:15.231534 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:30:15.288337 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | controller-2 | result={
2021-01-29 09:30:15.289439 | 525400b9-fe2e-5597-f2b1-00000000005b | TIMING | tripleo-network-config : NetworkConfig stdout | controller-2 | 0:07:47.711026 | 0.06s
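
The ordering in this log can be checked mechanically. The following is a minimal sketch, not part of TripleO: the function names (`parse_events`, `hosts_configured_after_failure`) are hypothetical, the sample is an abbreviated excerpt of the lines in this report, and the "OK | NetworkConfig stdout" timestamp is assumed to be a reasonable proxy for when a host finished its network configuration.

```python
from datetime import datetime

# Abbreviated excerpt of the log in this report (result/error payloads omitted).
SAMPLE_LOG = """\
2021-01-29 09:29:50.408072 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | networker-1 | result={
2021-01-29 09:29:56.132773 | 525400b9-fe2e-5597-f2b1-000000004bb5 | FATAL | Check Controllers availability | networker-1 | item=172.17.1.52
2021-01-29 09:30:08.796386 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | controller-0 | result={
2021-01-29 09:30:10.853261 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | controller-1 | result={
2021-01-29 09:30:15.288337 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | controller-2 | result={
"""


def parse_events(log_text):
    """Yield (timestamp, status, task, host) for every line that names a host."""
    for line in log_text.splitlines():
        fields = [f.strip() for f in line.split(" | ", 5)]
        if len(fields) < 5:
            continue  # bare TASK header lines and blank lines carry no host
        ts = datetime.strptime(fields[0], "%Y-%m-%d %H:%M:%S.%f")
        yield ts, fields[2], fields[3], fields[4]


def hosts_configured_after_failure(log_text):
    """Return hosts whose network config was reported after the first ping failure."""
    events = list(parse_events(log_text))
    failures = [ts for ts, status, task, _ in events
                if status == "FATAL" and task == "Check Controllers availability"]
    if not failures:
        return []
    first_failure = min(failures)
    late = [(ts, host) for ts, status, task, host in events
            if status == "OK" and task == "NetworkConfig stdout" and ts > first_failure]
    return [host for _, host in sorted(late)]


print(hosts_configured_after_failure(SAMPLE_LOG))
# → ['controller-0', 'controller-1', 'controller-2']
```

All three controllers show up as configured only after the validation had already failed, which is consistent with the race described in this report.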

I think we need to make sure validations happen *after* all hosts have completed the network config, as in:
https://review.opendev.org/c/openstack/tripleo-heat-templates/+/772888
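
The ordering guarantee being asked for is essentially a barrier between the network config phase and the validation phase. The sketch below only illustrates that idea with Python threads; it does not reproduce the linked change, and the host list and function names are hypothetical.

```python
import threading

HOSTS = ["controller-0", "networker-1", "database-0"]  # hypothetical subset of the env


def run_deployment(hosts):
    """Run each host in its own thread; validate only after all are configured."""
    events = []
    lock = threading.Lock()
    barrier = threading.Barrier(len(hosts))

    def deploy(host):
        with lock:
            events.append(("network_config", host))
        barrier.wait()  # block until every host has finished its network config
        with lock:
            events.append(("validate", host))

    threads = [threading.Thread(target=deploy, args=(h,)) for h in hosts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


events = run_deployment(HOSTS)
phases = [phase for phase, _ in events]
print(phases)
```

Whatever order the threads are scheduled in, every `network_config` event is recorded before the first `validate` event. Without the `barrier.wait()` call, a fast host could validate while a slow one is still configuring, which is the failure mode seen on networker-1.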

My only doubt is why I have seen this only with [A] applied, but that could be mere coincidence, since we're not talking about that many runs.

Revision history for this message
Alex Schultz (alex-schultz) wrote :
Changed in tripleo:
assignee: nobody → Alex Schultz (alex-schultz)
status: Triaged → In Progress
tags: added: train-backport-potential ussuri-backport-potential victoria-backport-potential
Changed in tripleo:
milestone: wallaby-2 → wallaby-3
Changed in tripleo:
milestone: wallaby-3 → wallaby-rc1
Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 13.2.0

This issue was fixed in the openstack/tripleo-heat-templates 13.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 11.5.0

This issue was fixed in the openstack/tripleo-heat-templates 11.5.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 12.4.3

This issue was fixed in the openstack/tripleo-heat-templates 12.4.3 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 14.1.0

This issue was fixed in the openstack/tripleo-heat-templates 14.1.0 release.
