ping br-ctlplane is failing too often, "Trying to ping default gateway"

Bug #1878101 reported by wes hayutin
This bug affects 2 people
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Harald Jensås
Milestone: (none)

Bug Description

TASK [AllNodesValidationConfig] ************************************************
Monday 11 May 2020 15:01:33 +0000 (0:00:01.766) 0:00:55.370 ************
fatal: [undercloud]: FAILED! => changed=true
  msg: non-zero return code
  rc: 1
  stderr: ''
  stderr_lines: <omitted>
  stdout: |-
    Trying to ping default gateway 198.72.124.1...Ping to 198.72.124.1 failed. Retrying...
    Ping to 198.72.124.1 failed. Retrying...
    Ping to 198.72.124.1 failed. Retrying...
    Ping to 198.72.124.1 failed. Retrying...
    Ping to 198.72.124.1 failed. Retrying...
    Ping to 198.72.124.1 failed. Retrying...
    Ping to 198.72.124.1 failed. Retrying...
    Ping to 198.72.124.1 failed. Retrying...
    Ping to 198.72.124.1 failed. Retrying...
    Ping to 198.72.124.1 failed. Retrying...
    FAILURE
    198.72.124.1 is not pingable.
  stdout_lines: <omitted>

https://e453f1d8808c5b6bd184-223d8b88d73ea59070ac36b627fdc3bc.ssl.cf2.rackcdn.com/722662/1/gate/tripleo-ci-centos-8-undercloud-containers/ad8d1e6/logs/undercloud/home/zuul/undercloud_install.log

another example:
https://6d806783ed4dfdd971c5-158a33a5449e12f6f494625dd8517fb1.ssl.cf5.rackcdn.com/726374/1/gate/tripleo-ci-centos-8-undercloud-containers/2e3a947/logs/undercloud/home/zuul/undercloud_install.log

https://e453f1d8808c5b6bd184-223d8b88d73ea59070ac36b627fdc3bc.ssl.cf2.rackcdn.com/722662/1/gate/tripleo-ci-centos-8-undercloud-containers/ad8d1e6/logs/undercloud/var/log/extra/network.txt

5: br-ctlplane: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 192.168.24.1/24 brd 192.168.24.255 scope global br-ctlplane
       valid_lft forever preferred_lft forever
br-ctlplane 1500 0 0 0 0 67 0 0 0 BMRU

5: br-ctlplane: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UNKNOWN qlen 1000
    inet6 fe80::18f8:89ff:fec2:e44e/64 scope link

The mtu is 1500, is that an issue?

Could be an infra issue...
https://zuul.openstack.org/builds?job_name=tripleo-ci-centos-8-undercloud-containers&pipeline=gate

http://dashboard-ci.tripleo.org/d/3sUHfh9Wz/jobs-exploration?orgId=1&var-influxdb_filter=job_name%7C%3D%7Ctripleo-ci-centos-8-undercloud-containers&var-influxdb_filter=pipeline%7C%3D%7Cgate&var-influxdb_filter=passed%7C%3D%7CFalse

Source:
https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/validation-scripts/all-nodes.sh#L61-L74

Not sure how much we can do here..
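
For readers who have not opened all-nodes.sh: the log above is consistent with a simple retry loop around ping. The sketch below is a Python paraphrase, not the script itself. The function names, the retry count of 10 (taken from the ten failure lines in the log), and the delay are assumptions.

```python
import subprocess
import time
from typing import Callable


def system_ping(host: str) -> bool:
    """Send one ICMP echo with the system ping binary (assumes a Unix-like ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_for_ping(host: str, ping: Callable[[str], bool],
                  retries: int = 10, delay: float = 0.0) -> bool:
    """Return True as soon as ping(host) succeeds, False once all retries fail."""
    for attempt in range(retries):
        if ping(host):
            return True
        print(f"Ping to {host} failed. Retrying...")
        if attempt < retries - 1:
            time.sleep(delay)  # hypothetical pause between attempts
    print(f"{host} is not pingable.")
    return False
```

Calling wait_for_ping("198.72.124.1", system_ping, delay=1.0) would exercise the real network. Passing the ping function in as an argument keeps the loop testable without one.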

Revision history for this message
wes hayutin (weshayutin) wrote :

Since we're resetting the gate multiple times on this issue, getting additional eyes on it may help.

Revision history for this message
wes hayutin (weshayutin) wrote :
Revision history for this message
wes hayutin (weshayutin) wrote :
Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

not 100% related, but shouldn't this terrible bash script be converted to some ansible and, well, proper validation role using the validation framework and some in-flight validation process?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (master)

Fix proposed to branch: master
Review: https://review.opendev.org/727207

Changed in tripleo:
assignee: nobody → wes hayutin (weshayutin)
status: Triaged → In Progress
Revision history for this message
wes hayutin (weshayutin) wrote :
Revision history for this message
wes hayutin (weshayutin) wrote :
Revision history for this message
wes hayutin (weshayutin) wrote :

OK...

Looking at the following for example...
https://1ee15b38e5321bd48ce9-8aefe49572c9ad91eb217e6fb5236bb6.ssl.cf2.rackcdn.com/712464/20/gate/tripleo-ci-centos-8-undercloud-containers/0639b3e/logs/undercloud/var/log/extra/network.txt

Jobs that fail end up with an IP conflict between a NIC device and a bridge device. For example:

2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    inet 10.4.70.16/24 brd 10.4.70.255 scope global dynamic noprefixroute ens3
       valid_lft 84143sec preferred_lft 84143sec

5: br-ctlplane: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 10.4.70.16/24 brd 10.4.70.255 scope global br-ctlplane
       valid_lft forever preferred_lft forever
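
A check like this can be automated over the collected network.txt files. The sketch below assumes the text has the usual `ip addr` layout shown above. The function name is hypothetical, and only IPv4 `inet` lines are considered.

```python
import re
from collections import defaultdict


def find_shared_addresses(ip_addr_output: str) -> dict:
    """Map each IPv4 address/prefix to its interfaces, keeping only
    addresses that appear on two or more interfaces."""
    owners = defaultdict(set)
    current = None
    for line in ip_addr_output.splitlines():
        # Interface header, e.g. "2: ens3: <BROADCAST,...> mtu 1500 ..."
        header = re.match(r"^\d+:\s+([^:@\s]+)", line)
        if header:
            current = header.group(1)
            continue
        # Address line, e.g. "    inet 10.4.70.16/24 brd ..." (inet6 is skipped)
        inet = re.match(r"^\s+inet\s+(\S+)", line)
        if inet and current:
            owners[inet.group(1)].add(current)
    return {addr: sorted(devs) for addr, devs in owners.items() if len(devs) > 1}


sample = """\
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    inet 10.4.70.16/24 brd 10.4.70.255 scope global dynamic noprefixroute ens3
       valid_lft 84143sec preferred_lft 84143sec
5: br-ctlplane: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 10.4.70.16/24 brd 10.4.70.255 scope global br-ctlplane
       valid_lft forever preferred_lft forever
"""
print(find_shared_addresses(sample))
```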

This is probably causing things to fail SOME of the time..
In this case it passes..
https://746b9b3552a84dbbbf46-a133153af68a0459f51ed4b0ea995900.ssl.cf5.rackcdn.com/726374/1/gate/tripleo-ci-centos-8-undercloud-containers/6125543/logs/undercloud/home/zuul/undercloud_install.log

In this case it fails..
https://1ee15b38e5321bd48ce9-8aefe49572c9ad91eb217e6fb5236bb6.ssl.cf2.rackcdn.com/712464/20/gate/tripleo-ci-centos-8-undercloud-containers/0639b3e/logs/undercloud/var/log/extra/network.txt

2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    inet 10.4.70.16/24 brd 10.4.70.255 scope global dynamic noprefixroute ens3
       valid_lft 84143sec preferred_lft 84143sec

5: br-ctlplane: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 10.4.70.16/24 brd 10.4.70.255 scope global br-ctlplane
       valid_lft forever preferred_lft forever

https://1ee15b38e5321bd48ce9-8aefe49572c9ad91eb217e6fb5236bb6.ssl.cf2.rackcdn.com/712464/20/gate/tripleo-ci-centos-8-undercloud-containers/0639b3e/logs/undercloud/home/zuul/undercloud_install.log

TASK [AllNodesValidationConfig] ************************************************
Tuesday 12 May 2020 00:18:52 +0000 (0:00:02.022) 0:00:53.280 ***********
fatal: [undercloud]: FAILED! => changed=true
  msg: non-zero return code
  rc: 1
  stderr: ''
  stderr_lines: <omitted>
  stdout: |-
    Trying to ping default gateway 10.4.70.1...Ping to 10.4.70.1 failed. Retrying...
    Ping to 10.4.70.1 failed. Retrying...
    Ping to 10.4.70.1 failed. Retrying...
    Ping to 10.4.70.1 failed. Retrying...
    Ping to 10.4.70.1 failed. Retrying...

Some upstream nodes have two network devices rather than one, and so far we have not seen any node with two NICs fail.

for example:
https://746b9b3552a84dbbbf46-a133153af68a0459f51ed4b0ea995900.ssl.cf5.rackcdn.com/726374/1/gate/tripleo-ci-centos-8-undercloud-containers/6125543/logs/undercloud/home/zuul/undercloud_install.log

https://746b9b3552a84dbbbf46-a133153af68a0459f51ed4b0ea995900.ssl.cf5.rackcdn.com/726374/1/gate/tripleo-ci-centos-8-undercloud-containers/6125543/logs/undercloud/var/log/extra/network.txt

Revision history for this message
wes hayutin (weshayutin) wrote :

Kevin Carter put up the following WIP patch to see if it improves the pass rate.
https://review.opendev.org/727424

Revision history for this message
wes hayutin (weshayutin) wrote :
Revision history for this message
Rafael Folco (rafaelfolco) wrote :

I noticed that br-ctlplane is misconfigured with the IP of the physical device (ens3) instead of the bridge's IP (br-ex). See 198.72.124.67 in the network config below:

bad run:
https://ea4b2d4f8d5c187742d9-e77ce0ffc4ffea95c693e8c73a486dc2.ssl.cf2.rackcdn.com/725513/2/gate/tripleo-ci-centos-8-undercloud-containers/b556251/logs/undercloud/var/log/extra/network.txt
5: br-ctlplane: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc inet 192.168.24.1/24 brd 192.168.24.255 scope global br-ctlplane inet 198.72.124.67/24 brd 198.72.124.255 scope global br-ctlplane inet 192.168.24.3/24 brd 192.168.24.255 scope global secondary br-

good run:
https://80af50af6486b35b73a5-594c8dbcc2892c36fc68c95ad460eff0.ssl.cf1.rackcdn.com/710493/5/check/tripleo-ci-centos-8-undercloud-containers/54415ea/logs/undercloud/var/log/extra/network.txt
5: br-ctlplane: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc inet 192.168.24.1/24 brd 192.168.24.255 scope global br-ctlplane inet 192.168.24.3/24 brd 192.168.24.255 scope global secondary br-ctlplane inet 192.168.24.2/24 brd 192.168.24.255 scope global secondary br-ctlplane

Revision history for this message
Harald Jensås (harald-jensas) wrote :

https://ea4b2d4f8d5c187742d9-e77ce0ffc4ffea95c693e8c73a486dc2.ssl.cf2.rackcdn.com/725513/2/gate/tripleo-ci-centos-8-undercloud-containers/b556251/logs/undercloud/var/log/extra/network.txt

2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    inet 198.72.124.67/24 brd 198.72.124.255 scope global noprefixroute ens3
       valid_lft forever preferred_lft forever
4: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1350 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 192.168.24.2/24 scope global br-ex
       valid_lft forever preferred_lft forever
5: br-ctlplane: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 192.168.24.1/24 brd 192.168.24.255 scope global br-ctlplane
       valid_lft forever preferred_lft forever
    inet 198.72.124.67/24 brd 198.72.124.255 scope global br-ctlplane
       valid_lft forever preferred_lft forever
    inet 192.168.24.3/24 brd 192.168.24.255 scope global secondary br-ctlplane
       valid_lft forever preferred_lft forever

Why is there br-ex on the undercloud?

https://ea4b2d4f8d5c187742d9-e77ce0ffc4ffea95c693e8c73a486dc2.ssl.cf2.rackcdn.com/725513/2/gate/tripleo-ci-centos-8-undercloud-containers/b556251/logs/undercloud/etc/os-net-config/config.json

https://ea4b2d4f8d5c187742d9-e77ce0ffc4ffea95c693e8c73a486dc2.ssl.cf2.rackcdn.com/725513/2/gate/tripleo-ci-centos-8-undercloud-containers/b556251/logs/undercloud/home/zuul/undercloud.conf
local_interface = br-ex

It seems to me that local_interface should be ens3 in this case?

Revision history for this message
Harald Jensås (harald-jensas) wrote :

In #11 Rafael Folco (rafaelfolco) found that good vs bad runs end up with different IPs.

In undercloud.conf we set undercloud_public_host to a hostname.

undercloud_public_host = centos-8-inap-mtl01-0016518524-unique

This name seems to resolve to a different public_virtual_ip in a good vs a bad job.

Good run:
https://80af50af6486b35b73a5-594c8dbcc2892c36fc68c95ad460eff0.ssl.cf1.rackcdn.com/710493/5/check/tripleo-ci-centos-8-undercloud-containers/54415ea/logs/undercloud/home/zuul/tripleo-heat-installer-templates/tripleoclient-hosts-portmaps.yaml

Bad run:
https://ea4b2d4f8d5c187742d9-e77ce0ffc4ffea95c693e8c73a486dc2.ssl.cf2.rackcdn.com/725513/2/gate/tripleo-ci-centos-8-undercloud-containers/b556251/logs/undercloud/home/zuul/tripleo-heat-installer-templates/tripleoclient-hosts-portmaps.yaml

Looking at the hosts file in the good vs bad run shows why:

Good run
https://80af50af6486b35b73a5-594c8dbcc2892c36fc68c95ad460eff0.ssl.cf1.rackcdn.com/710493/5/check/tripleo-ci-centos-8-undercloud-containers/54415ea/logs/undercloud/etc/hosts

192.168.24.2 centos-8-inap-mtl01-0016518524-unique <<- in /etc/hosts

Bad run:
https://ea4b2d4f8d5c187742d9-e77ce0ffc4ffea95c693e8c73a486dc2.ssl.cf2.rackcdn.com/725513/2/gate/tripleo-ci-centos-8-undercloud-containers/b556251/logs/undercloud/etc/hosts

198.72.124.67 centos-8-inap-mtl01-0016504709-unique <<- in /etc/hosts

Seems https://github.com/openstack/tripleo-quickstart-extras/blob/9cc7489cca85eee6ce9950c1ee1d5f01c8251efc/roles/undercloud-setup/tasks/hostname.yml#L19-L27 is what would set up that entry?
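
To compare good and bad runs in bulk, the lookup that the hosts file performs can be mimicked with a few lines of Python. This is only a sketch: the function name is made up, and it handles just the plain "address name [aliases]" layout with optional `#` comments.

```python
from typing import Optional


def resolve_from_hosts(hosts_text: str, name: str) -> Optional[str]:
    """Return the first address that hosts_text maps to name, or None."""
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) >= 2 and name in fields[1:]:
            return fields[0]
    return None


good_hosts = "127.0.0.1 localhost\n192.168.24.2 centos-8-inap-mtl01-0016518524-unique\n"
bad_hosts = "127.0.0.1 localhost\n198.72.124.67 centos-8-inap-mtl01-0016504709-unique\n"

print(resolve_from_hosts(good_hosts, "centos-8-inap-mtl01-0016518524-unique"))
print(resolve_from_hosts(bad_hosts, "centos-8-inap-mtl01-0016504709-unique"))
```

The localhost lines in the sample data are an assumption added for illustration. The two unique-hostname entries are the ones quoted above.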

Revision history for this message
chandan kumar (chkumar246) wrote :
Revision history for this message
Emilien Macchi (emilienm) wrote :

I wonder if it's related to https://review.opendev.org/#/c/657067/49/net-config-undercloud.j2.yaml
It merged 6 days ago, and this error in CI has been visible for the last 6 days...

Revision history for this message
Harald Jensås (harald-jensas) wrote :

What I don't get is why 192.168.24.2 is the IP of the centos-8-inap-mtl01-0016504709-unique entry for some jobs, while for other jobs it is 198.72.124.67 (i.e. the actual address in the ansible var/fact used to set the entry in etc/hosts).

Can we simply flip tripleo_set_unique_hostname: true?

Revision history for this message
Harald Jensås (harald-jensas) wrote :

Hmm, the patch Emilien mentioned is interesting: https://review.opendev.org/#/c/657067/49/net-config-undercloud.j2.yaml

Sticking the address of 'undercloud_public_host' on br-ctlplane won't work if the network where the 'undercloud_public_host' IP resides is actually on a different interface.

For example:

eth0 = 172.20.0.20

In undercloud.conf I set
local_ip = 192.168.24.1/24
undercloud_admin_host = 192.168.24.3
undercloud_public_host = 172.20.0.21
local_interface = eth1

Now, br-ctlplane will get all three addresses:

5: br-ctlplane: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 192.168.24.1/24 brd 192.168.24.255 scope global br-ctlplane
       valid_lft forever preferred_lft forever
    inet 172.20.0.21/24 brd 172.20.0.255 scope global br-ctlplane
       valid_lft forever preferred_lft forever
    inet 192.168.24.3/24 brd 192.168.24.255 scope global secondary br-ctlplane
       valid_lft forever preferred_lft forever

And the routing table ends up like this:

default via 172.20.0.1 dev eth0 proto static metric 100
192.168.24.0/24 dev br-ex proto kernel scope link src 192.168.24.2
192.168.24.0/24 dev br-ctlplane proto kernel scope link src 192.168.24.1
172.20.0.0/24 dev br-ctlplane proto kernel scope link src 172.20.0.21
172.20.0.0/24 dev eth0 proto kernel scope link src 172.20.0.20 metric 100

 ^^ Note that the first entry for network 172.20.0.0/24 is on 'dev br-ctlplane', i.e. on eth1, while 172.20.0.0/24 is actually only reachable via eth0. Traffic for the gateway will be sent out the wrong device, br-ctlplane, instead of eth0.

I think deprecating keepalived actually introduced a regression here.
Using an IP (or a name resolving to an IP not on br-ctlplane) was working prior to this change. I assume KeepaliveD would find an appropriate interface for the VIPs via some magic?
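
The effect of the prefix length on the connected routes can be checked with Python's ipaddress module. The sketch below uses the example addresses from this comment. The helper names are hypothetical, and it only models which devices get a kernel "scope link" route covering the gateway, not metrics or full route selection.

```python
import ipaddress


def connected_route(address: str, prefix_len: int):
    """Network for which the kernel adds a connected route when address/prefix_len is assigned."""
    return ipaddress.ip_interface(f"{address}/{prefix_len}").network


def devices_claiming(gateway: str, assignments) -> list:
    """Devices whose connected routes cover gateway; assignments is (dev, address, prefix_len)."""
    gw = ipaddress.ip_address(gateway)
    return [dev for dev, addr, plen in assignments if gw in connected_route(addr, plen)]


with_24 = [
    ("br-ctlplane", "192.168.24.1", 24),
    ("br-ctlplane", "172.20.0.21", 24),  # public VIP added with the subnet mask
    ("eth0", "172.20.0.20", 24),
]
with_32 = [
    ("br-ctlplane", "192.168.24.1", 24),
    ("br-ctlplane", "172.20.0.21", 32),  # public VIP added as a host address
    ("eth0", "172.20.0.20", 24),
]

print(devices_claiming("172.20.0.1", with_24))  # both br-ctlplane and eth0 claim the gateway
print(devices_claiming("172.20.0.1", with_32))  # only eth0 claims it
```

With a /24 on the VIP, two devices advertise a direct route to the gateway's network, which is the ambiguity described above. With a /32 the VIP no longer pulls the whole subnet onto br-ctlplane.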

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.opendev.org/727942

Changed in tripleo:
assignee: wes hayutin (weshayutin) → Harald Jensås (harald-jensas)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/727942
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=1ebf115f8580f0cd2aceccda6615e396df113c9d
Submitter: Zuul
Branch: master

commit 1ebf115f8580f0cd2aceccda6615e396df113c9d
Author: Harald Jensås <email address hidden>
Date: Thu May 14 08:33:16 2020 +0200

    Use /32 netmask for VIPs

    Prior to commit c712355e4bae4ef2fc1b83e5603c0364dbd50a78
    KeepaliveD created the VIP addresses. KeepaliveD created
    the VIPs with /32 netmask, when moving the VIPs to the
    DeployedServerPortMap and adding them to the br-ctlplane
    interface the netmask of the ctlplane subnet was used
    (typically /24). The result is a routing table that
    potentially uses the incorrect device for traffic when
    the public VIP is not on in the ctlplane subnet.

    This change hard-codes the netmask for the VIP addresses
    to /32.

    blueprint replace-keepalived-undercloud
    Closes-Bug: #1878101
    Change-Id: I873e925d2250677f25b9ae51ed0b87bd1b8e6b32

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/729965

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/729965
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=2f38880744e0281b0c00b41ab690cafd7b4eb7ed
Submitter: Zuul
Branch: master

commit 2f38880744e0281b0c00b41ab690cafd7b4eb7ed
Author: Harald Jensås <email address hidden>
Date: Thu May 21 15:42:54 2020 +0200

    Use /32 or /128 netmask for VIPs

    Commit 1ebf115f8580f0cd2aceccda6615e396df113c9d hard code
    the netmask for VIPs to /32. This will not work for IPv6.

    Add a conditional checking for ':' in the IP addresses for
    control_virtual_ip and public_virtual_ip and set netmask
    correctly based on IP version.

    Related-Bug: #1878101
    Change-Id: I00718cf436ba438ef19c1a42aa2d2004fe73dcd2
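
The conditional described in this commit message amounts to the following, paraphrased in Python for illustration. The actual change lives in the Heat templates and is not reproduced here. The function names are hypothetical.

```python
def vip_prefix_len(ip: str) -> int:
    """Host-sized prefix length: 128 if the address contains ':' (IPv6), else 32."""
    return 128 if ":" in ip else 32


def vip_cidr(ip: str) -> str:
    """Render a VIP as address/prefix with a host-sized netmask."""
    return f"{ip}/{vip_prefix_len(ip)}"


print(vip_cidr("192.168.24.3"))
print(vip_cidr("fd00::3"))
```

Checking for ':' is a cheap way to tell the two families apart, since a colon never appears in a dotted-quad IPv4 address.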

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (master)

Reviewed: https://review.opendev.org/727207
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=9f176285c7d6fa7c8c214806fffc0a28c0112b96
Submitter: Zuul
Branch: master

commit 9f176285c7d6fa7c8c214806fffc0a28c0112b96
Author: Wes Hayutin <email address hidden>
Date: Tue May 12 08:31:17 2020 -0600

    move undercloud-containers to nv

    Partial-Bug: #1878101
    Change-Id: If1bd15276156c6386f93e3c0f2853d7043054264

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/731046

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/ussuri)

Reviewed: https://review.opendev.org/731046
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=4dba85d815b606d25ed31bd5a47ffc057a635e69
Submitter: Zuul
Branch: stable/ussuri

commit 4dba85d815b606d25ed31bd5a47ffc057a635e69
Author: Harald Jensås <email address hidden>
Date: Thu May 21 15:42:54 2020 +0200

    Use /32 or /128 netmask for VIPs

    Commit 1ebf115f8580f0cd2aceccda6615e396df113c9d hard code
    the netmask for VIPs to /32. This will not work for IPv6.

    Add a conditional checking for ':' in the IP addresses for
    control_virtual_ip and public_virtual_ip and set netmask
    correctly based on IP version.

    Related-Bug: #1878101
    Change-Id: I00718cf436ba438ef19c1a42aa2d2004fe73dcd2
    (cherry picked from commit 2f38880744e0281b0c00b41ab690cafd7b4eb7ed)

tags: added: in-stable-ussuri