This happened to me while I had [A] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/772838 applied, but I don't think that should affect this issue. I saw it in an environment with a large control plane (3xctrl, 3xnet, 3xmsg, 3xdb, 2xcmp).
TLDR: the deployment sometimes (though not often) fails because one node is unable to ping all the controllers.
Failure:
2021-01-29 09:29:56.132773 | 525400b9-fe2e-5597-f2b1-000000004bb5 | FATAL | Check Controllers availability | networker-1 | item=172.17.1.52 | error={"ansible_loop_var": "controller", "changed": false, "cmd": ["ping", "-w", "10", "-c", "1", "172.17.1.52"], "controller": "172.17.1.52", "delta": "0:00:03.071556", "end": "2021-01-29 04:29:56.276464", "msg": "non-zero return code", "rc": 1, "start": "2021-01-29 04:29:53.204908", "stderr": "", "stderr_lines": [], "stdout": "PING 172.17.1.52 (172.17.1.52) 56(84) bytes of data.\nFrom 172.17.1.48 icmp_seq=1 Destination Host Unreachable\nFrom 172.17.1.48 icmp_seq=2 Destination Host Unreachable\nFrom 172.17.1.48 icmp_seq=3 Destination Host Unreachable\n\n--- 172.17.1.52 ping statistics ---\n3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 45ms\npipe 3", "stdout_lines": ["PING 172.17.1.52 (172.17.1.52) 56(84) bytes of data.", "From 172.17.1.48 icmp_seq=1 Destination Host Unreachable", "From 172.17.1.48 icmp_seq=2 Destination Host Unreachable", "From 172.17.1.48 icmp_seq=3 Destination Host Unreachable", "", "--- 172.17.1.52 ping statistics ---", "3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 45ms", "pipe 3"]}
The reason, I believe, is that with the tripleo_free strategy we have no guarantee that the controller ping verification happens on all nodes *after* the network is configured on all nodes.
In fact in today's failure I see the following:
2021-01-29 09:29:48.015479 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:29:48.073385 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | database-0 | result={
2021-01-29 09:29:48.074311 | 525400b9-fe2e-5597-f2b1-00000000005b | TIMING | tripleo-network-config : NetworkConfig stdout | database-0 | 0:07:20.495896 | 0.06s
2021-01-29 09:29:49.321407 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:29:49.430435 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | database-2 | result={
2021-01-29 09:29:49.431575 | 525400b9-fe2e-5597-f2b1-00000000005b | TIMING | tripleo-network-config : NetworkConfig stdout | database-2 | 0:07:21.853166 | 0.11s
2021-01-29 09:29:49.969115 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:29:50.029775 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | messaging-1 | result={
2021-01-29 09:29:50.031358 | 525400b9-fe2e-5597-f2b1-00000000005b | TIMING | tripleo-network-config : NetworkConfig stdout | messaging-1 | 0:07:22.452918 | 0.06s
2021-01-29 09:29:50.239686 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:29:50.408072 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | networker-1 | result={
2021-01-29 09:29:50.409420 | 525400b9-fe2e-5597-f2b1-00000000005b | TIMING | tripleo-network-config : NetworkConfig stdout | networker-1 | 0:07:22.830995 | 0.17s
2021-01-29 09:29:50.486275 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:29:50.615958 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | messaging-0 | result={
2021-01-29 09:29:50.617039 | 525400b9-fe2e-5597-f2b1-00000000005b | TIMING | tripleo-network-config : NetworkConfig stdout | messaging-0 | 0:07:23.038615 | 0.13s
2021-01-29 09:29:52.631912 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:29:52.728719 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | database-1 | result={
2021-01-29 09:29:52.730765 | 525400b9-fe2e-5597-f2b1-00000000005b | TIMING | tripleo-network-config : NetworkConfig stdout | database-1 | 0:07:25.152339 | 0.10s
2021-01-29 09:29:53.919091 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:29:54.015006 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | messaging-2 | result={
2021-01-29 09:29:54.016237 | 525400b9-fe2e-5597-f2b1-00000000005b | TIMING | tripleo-network-config : NetworkConfig stdout | messaging-2 | 0:07:26.437809 | 0.10s
2021-01-29 09:29:56.132773 | 525400b9-fe2e-5597-f2b1-000000004bb5 | FATAL | Check Controllers availability | networker-1 | item=172.17.1.52 | error={"ansible_loop_var": "controller", "changed": false, "cmd": ["ping", "-w", "10", "-c", "1", "172.17.1.52"], "controller": "172.17.1.52", "delta": "0:00:03.071556", "end": "2021-01-29 04:29:56.276464", "msg": "non-zero return code", "rc": 1, "start": "2021-01-29 04:29:53.204908", "stderr": "", "stderr_lines": [], "stdout": "PING 172.17.1.52 (172.17.1.52) 56(84) bytes of data.\nFrom 172.17.1.48 icmp_seq=1 Destination Host Unreachable\nFrom 172.17.1.48 icmp_seq=2 Destination Host Unreachable\nFrom 172.17.1.48 icmp_seq=3 Destination Host Unreachable\n\n--- 172.17.1.52 ping statistics ---\n3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 45ms\npipe 3", "stdout_lines": ["PING 172.17.1.52 (172.17.1.52) 56(84) bytes of data.", "From 172.17.1.48 icmp_seq=1 Destination Host Unreachable", "From 172.17.1.48 icmp_seq=2 Destination Host Unreachable", "From 172.17.1.48 icmp_seq=3 Destination Host Unreachable", "", "--- 172.17.1.52 ping statistics ---", "3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 45ms", "pipe 3"]}
2021-01-29 09:29:59.147481 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:29:59.206947 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | compute-0 | result={
2021-01-29 09:29:59.208049 | 525400b9-fe2e-5597-f2b1-00000000005b | TIMING | tripleo-network-config : NetworkConfig stdout | compute-0 | 0:07:31.629635 | 0.06s
2021-01-29 09:30:03.214422 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:30:03.274123 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | networker-0 | result={
2021-01-29 09:30:03.275102 | 525400b9-fe2e-5597-f2b1-00000000005b | TIMING | tripleo-network-config : NetworkConfig stdout | networker-0 | 0:07:35.696692 | 0.06s
2021-01-29 09:30:03.389089 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:30:03.449827 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | networker-2 | result={
2021-01-29 09:30:03.451547 | 525400b9-fe2e-5597-f2b1-00000000005b | TIMING | tripleo-network-config : NetworkConfig stdout | networker-2 | 0:07:35.873118 | 0.06s
2021-01-29 09:30:04.401723 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:30:04.460046 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | compute-1 | result={
2021-01-29 09:30:04.461073 | 525400b9-fe2e-5597-f2b1-00000000005b | TIMING | tripleo-network-config : NetworkConfig stdout | compute-1 | 0:07:36.882660 | 0.06s
2021-01-29 09:30:08.736141 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:30:08.796386 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | controller-0 | result={
2021-01-29 09:30:08.797481 | 525400b9-fe2e-5597-f2b1-00000000005b | TIMING | tripleo-network-config : NetworkConfig stdout | controller-0 | 0:07:41.219068 | 0.06s
2021-01-29 09:30:10.794523 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:30:10.853261 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | controller-1 | result={
2021-01-29 09:30:10.854398 | 525400b9-fe2e-5597-f2b1-00000000005b | TIMING | tripleo-network-config : NetworkConfig stdout | controller-1 | 0:07:43.275984 | 0.06s
2021-01-29 09:30:15.231534 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:30:15.288337 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | controller-2 | result={
2021-01-29 09:30:15.289439 | 525400b9-fe2e-5597-f2b1-00000000005b | TIMING | tripleo-network-config : NetworkConfig stdout | controller-2 | 0:07:47.711026 | 0.06s
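To make the ordering in the log above easier to check, here is a rough sketch (not part of tripleo) that parses lines in this format and lists the hosts whose "NetworkConfig stdout" task only finished after the first FATAL entry. The helper name `late_hosts` and the `LOG` sample are hypothetical; the sample is a small subset of the lines above, with the `result=`/`error=` payloads truncated.

```python
from datetime import datetime

# Subset of the log above; payloads after the host column are truncated.
LOG = """\
2021-01-29 09:29:50.239686 | 525400b9-fe2e-5597-f2b1-00000000005b | TASK | NetworkConfig stdout
2021-01-29 09:29:50.408072 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | networker-1 | result={
2021-01-29 09:29:56.132773 | 525400b9-fe2e-5597-f2b1-000000004bb5 | FATAL | Check Controllers availability | networker-1 | item=172.17.1.52 | error={
2021-01-29 09:30:08.796386 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | controller-0 | result={
2021-01-29 09:30:10.853261 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | controller-1 | result={
2021-01-29 09:30:15.288337 | 525400b9-fe2e-5597-f2b1-00000000005b | OK | NetworkConfig stdout | controller-2 | result={
"""


def late_hosts(log_text):
    """Return hosts whose NetworkConfig finished after the first FATAL line."""
    config_done = {}   # host -> time its "NetworkConfig stdout" task reported OK
    first_fatal = None
    for line in log_text.splitlines():
        # Columns: timestamp | task uuid | status | task name | host | payload
        parts = [p.strip() for p in line.split("|", 5)]
        if len(parts) < 5:
            continue  # TASK header lines carry no host column
        stamp = datetime.strptime(parts[0], "%Y-%m-%d %H:%M:%S.%f")
        status, task, host = parts[2], parts[3], parts[4]
        if status == "OK" and task == "NetworkConfig stdout":
            config_done[host] = stamp
        elif status == "FATAL" and first_fatal is None:
            first_fatal = stamp
    if first_fatal is None:
        return []
    return sorted(h for h, t in config_done.items() if t > first_fatal)


print(late_hosts(LOG))
# prints ['controller-0', 'controller-1', 'controller-2']
```

On this subset, all three controllers finish their network config after networker-1 has already failed the ping check. Run against the full log above, the list would also include compute-0, compute-1, networker-0 and networker-2.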
I think we need to make sure validations happen *after* all hosts have completed the network config, like in:
https://review.opendev.org/c/openstack/tripleo-heat-templates/+/772888
My only doubt is why I have seen this only with [A] applied, but that could be mere coincidence, since we're not talking about that many runs.
https://review.opendev.org/c/openstack/tripleo-heat-templates/+/773069