Node Validations break when default route is pushed via bgp

Bug #1904711 reported by Michele Baldessari
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Michele Baldessari

Bug Description

When deploying on a train-based predeployed server that has default routes injected via BGP and ECMP, the deployment fails with:

TASK [AllNodesValidationConfig] ************************************************
fatal: [ctrl-1-0]: FAILED! => {"changed": true, "msg": "non-zero return code", "rc": 1, "stderr": "Shared connection to 99.99.1.1 closed.\r\n", "stderr_lines": ["Shared connection to 99.99.1.1 closed."], "stdout": "Trying to ping default gateway bgp...Ping to bgp failed. Retrying...\r\nPing to bgp failed. Retrying...\r\nPing to bgp failed. Retrying...\r\nPing to bgp failed. Retrying...\r\nPing to bgp failed. Retrying...\r\nPing to bgp failed. Retrying...\r\nPing to bgp failed. Retrying...\r\nPing to bgp failed. Retrying...\r\nPing to bgp failed. Retrying...\r\nPing to bgp failed. Retrying...\r\nFAILURE\r\nbgp is not pingable.\r\n", "stdout_lines": ["Trying to ping default gateway bgp...Ping to bgp failed. Retrying...", "Ping to bgp failed. Retrying...", "Ping to bgp failed. Retrying...", "Ping to bgp failed. Retrying...", "Ping to bgp failed. Retrying...", "Ping to bgp failed. Retrying...", "Ping to bgp failed. Retrying...", "Ping to bgp failed. Retrying...", "Ping to bgp failed. Retrying...", "Ping to bgp failed. Retrying...", "FAILURE", "bgp is not pingable."]}

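The reason is that the code at https://github.com/openstack/tripleo-heat-templates/blob/stable/train/validation-scripts/all-nodes.sh#L61 is not robust in this situation: the first line of the default route carries no "via" address, so the script ends up trying to ping "bgp" (the route's protocol name) instead of a gateway address. The ip r output in this case is as follows: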
[root@ctrl-1-0 ~]# ip r
default proto bgp src 99.99.1.1 metric 20
        nexthop via 100.65.1.1 dev eth0 weight 1
        nexthop via 100.64.0.1 dev eth1 weight 1
100.64.0.0/30 dev eth1 proto kernel scope link src 100.64.0.2
100.65.1.0/30 dev eth0 proto kernel scope link src 100.65.1.2
192.168.14.0/24 dev eth2 proto kernel scope link src 192.168.14.7

This is actually already fixed in victoria/master thanks to the move to the ansible tripleo_nodes_validation role done there. This LP is to track the work of bringing that fix to the older branches.
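
To illustrate the parsing problem described above, here is a minimal Python sketch. It is not the code from all-nodes.sh or from the tripleo_nodes_validation role. The function names are hypothetical, and naive_gateways only assumes that the old script took the third space-separated field of the default line, which is consistent with the "bgp" seen in the failure output. default_gateways shows one possible way to collect every next hop of a multipath default route.

```python
import re

# Sample taken from the `ip r` output in this report.
SAMPLE = """\
default proto bgp src 99.99.1.1 metric 20
        nexthop via 100.65.1.1 dev eth0 weight 1
        nexthop via 100.64.0.1 dev eth1 weight 1
100.64.0.0/30 dev eth1 proto kernel scope link src 100.64.0.2
100.65.1.0/30 dev eth0 proto kernel scope link src 100.65.1.2
192.168.14.0/24 dev eth2 proto kernel scope link src 192.168.14.7
"""


def naive_gateways(ip_route_output):
    """ASSUMPTION: mimic taking the 3rd space-separated field of each
    line that starts with 'default' (hypothetical stand-in for the old check)."""
    return [
        line.split(" ")[2]
        for line in ip_route_output.splitlines()
        if line.startswith("default")
    ]


def default_gateways(ip_route_output):
    """Collect every 'via <addr>' that belongs to a default route,
    including indented ECMP 'nexthop via ...' continuation lines."""
    gateways = []
    in_default = False
    for line in ip_route_output.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # A non-indented line starts a new route entry.
            in_default = line.startswith("default")
        if not in_default:
            continue
        match = re.search(r"\bvia\s+(\S+)", line)
        if match:
            gateways.append(match.group(1))
    return gateways


if __name__ == "__main__":
    print(naive_gateways(SAMPLE))
    print(default_gateways(SAMPLE))
```

With the sample above, the naive extraction yields the protocol name rather than an address, while the second function returns both next hops. A plain "default via <addr> dev <if>" route is handled by the same regular expression.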

Michele Baldessari (michele) wrote :

Tested on train on a BGP setup like the one above and it all worked correctly with the following patches applied:
Tripleo-ansible -> https://review.opendev.org/763053
THT -> https://review.opendev.org/763064

TASK [tripleo_nodes_validation : Check Default IPv4 Gateway availability] ******
Wednesday 18 November 2020 08:31:21 +0000 (0:00:00.765) 0:04:47.918 ****
ok: [ctrl-1-0] => {"changed": false, "cmd": ["ping", "-w", "10", "-c", "1", "100.64.0.1"], "delta": "0:00:00.006779", "end": "2020-11-18 08:31:21.924376", "rc": 0, "start": "2020-11-18 08:31:21.917597", "stderr": "", "stderr_lines": [], "stdout": "PING 100.64.0.1 (100.64.0.1) 56(84) bytes of data.\n64 bytes from 100.64.0.1: icmp_seq=1 ttl=64 time=0.250 ms\n\n--- 100.64.0.1 ping statistics ---\n1 packets transmitted, 1 received, 0% packet loss, time 0ms\nrtt min/avg/max/mdev = 0.250/0.250/0.250/0.000 ms", "stdout_lines": ["PING 100.64.0.1 (100.64.0.1) 56(84) bytes of data.", "64 bytes from 100.64.0.1: icmp_seq=1 ttl=64 time=0.250 ms", "", "--- 100.64.0.1 ping statistics ---", "1 packets transmitted, 1 received, 0% packet loss, time 0ms", "rtt min/avg/max/mdev = 0.250/0.250/0.250/0.000 ms"]}

Ussuri backports:
Tripleo-ansible -> https://review.opendev.org/763052
THT -> https://review.opendev.org/763058

Train backports:
Tripleo-ansible -> https://review.opendev.org/763053
THT -> https://review.opendev.org/763064
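
The successful task output above runs ping with "-w 10 -c 1" against a real gateway address. As a rough sketch of that kind of check (not the role's actual implementation), the Python below builds the same command line and reports which gateways did not answer. The names ping_cmd and unreachable_gateways, and the injectable run parameter, are assumptions made for illustration.

```python
import subprocess


def ping_cmd(gateway, deadline=10, count=1):
    """Build the same argument list seen in the task output:
    ping -w <deadline> -c <count> <gateway>."""
    return ["ping", "-w", str(deadline), "-c", str(count), gateway]


def unreachable_gateways(gateways, run=subprocess.run):
    """Ping each gateway once and return those whose ping exited non-zero.

    `run` is injectable (hypothetical design choice) so the logic can be
    exercised without sending real ICMP traffic.
    """
    failed = []
    for gateway in gateways:
        result = run(ping_cmd(gateway), capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(gateway)
    return failed


if __name__ == "__main__":
    # Addresses taken from the routing table in this report.
    print(unreachable_gateways(["100.64.0.1", "100.65.1.1"]))
```

Keeping the command construction separate from the execution makes it easy to confirm that an address, and not a keyword such as "bgp", is what gets passed to ping.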

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (stable/ussuri)

Reviewed: https://review.opendev.org/763052
Committed: https://opendev.org/openstack/tripleo-ansible/commit/277cfee98b5cfa2b89489fedb7ac49334855e2ab
Submitter: Zuul
Branch: stable/ussuri

commit 277cfee98b5cfa2b89489fedb7ac49334855e2ab
Author: Alex Schultz <email address hidden>
Date: Fri Aug 21 13:31:29 2020 -0600

    Add tripleo_nodes_validation role

    Convert the all-nodes-validation.sh to a native ansible role. This role
    pings the default gateway, checks that controllers are reachable and can
    validate that the hostname matches what is in /etc/hosts

    Partial-Bug: #1904711

    Change-Id: I5c1109780f007849c5306adf21fd54b0e9a31494
    (cherry picked from commit a19b9195fcae28f3790710645eee8a4dd531658d)

tags: added: in-stable-ussuri
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (stable/train)

Reviewed: https://review.opendev.org/763053
Committed: https://opendev.org/openstack/tripleo-ansible/commit/0cfeb91eb172814a37e7a5e6b048e0594a6bcd2b
Submitter: Zuul
Branch: stable/train

commit 0cfeb91eb172814a37e7a5e6b048e0594a6bcd2b
Author: Alex Schultz <email address hidden>
Date: Fri Aug 21 13:31:29 2020 -0600

    Add tripleo_nodes_validation role

    Convert the all-nodes-validation.sh to a native ansible role. This role
    pings the default gateway, checks that controllers are reachable and can
    validate that the hostname matches what is in /etc/hosts

    Note: Conflicts in zuul.d/molecule.yaml due to context diff
          Also renamed molecule/default/converge.yml to molecule/default/playbook.yml

    Partial-Bug: #1904711

    Change-Id: I5c1109780f007849c5306adf21fd54b0e9a31494
    (cherry picked from commit a19b9195fcae28f3790710645eee8a4dd531658d)
    (cherry picked from commit 51ad59a6779012cf68c8105b63a86cf26abf692a)

tags: added: in-stable-train
Changed in tripleo:
milestone: wallaby-1 → wallaby-2
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 12.4.2

This issue was fixed in the openstack/tripleo-heat-templates 12.4.2 release.

Changed in tripleo:
milestone: wallaby-2 → wallaby-3
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 11.4.0

This issue was fixed in the openstack/tripleo-heat-templates 11.4.0 release.

Marios Andreou (marios-b) wrote :

Bug status has been set to 'Fix-Released' based on the discussion and/or patches above. If you disagree, please reset it to 'Triaged' and reach out to us on freenode #tripleo. Thank you!

Changed in tripleo:
status: Triaged → Fix Released