[queens promotion] fs035 is failing to deploy the overcloud - undercloud has no UDP FORWARD rules

Bug #1750880 reported by Ronelle Landy on 2018-02-21
Affects: tripleo
Status: In Progress
Importance: Critical
Assigned to: Michele Baldessari
Milestone: queens-rc1

Bug Description

In the latest Queens promotion run, the featureset035 test fails to deploy the overcloud at overcloud.AllNodesDeploySteps.ControllerDeployment_Step1.1 - it appears Pacemaker fails to start:

            "Error: Systemd start for pcsd failed!",
            "journalctl log for pcsd:",
            "-- Logs begin at Wed 2018-02-21 16:13:21 UTC, end at Wed 2018-02-21 16:28:07 UTC. --",
            "Feb 21 16:26:14 overcloud-controller-0 systemd[1]: Starting PCS GUI and remote configuration interface...",
            "Feb 21 16:27:44 overcloud-controller-0 systemd[1]: pcsd.service start operation timed out. Terminating.",
            "Feb 21 16:28:07 overcloud-controller-0 systemd[1]: Failed to start PCS GUI and remote configuration interface.",
            "Feb 21 16:28:07 overcloud-controller-0 systemd[1]: Unit pcsd.service entered failed state.",
            "Feb 21 16:28:07 overcloud-controller-0 systemd[1]: pcsd.service failed.",
            "Error: /Stage[main]/Pacemaker::Service/Service[pcsd]/ensure: change from stopped to running failed: Systemd start for pcsd failed!",
            "Error: /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]: Failed to call refresh: /sbin/pcs cluster auth overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 -u hacluster -p sVXxFE9BMTyX8KKb --force returned 1 instead of one of [0]",
            "Error: /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]: /sbin/pcs cluster auth overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 -u hacluster -p sVXxFE9BMTyX8KKb --force returned 1 instead of one of [0]",

See the failed deployment list linked below:

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-queens/5084413/undercloud/home/jenkins/failed_deployment_list.log.txt.gz

Full overcloud deploy log:

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-queens/5084413/undercloud/home/jenkins/overcloud_deploy.log.txt.gz
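For reference, the per-resource failures behind these logs can usually be pulled out of Heat on the undercloud with something along these lines (a sketch, assuming the python-tripleoclient of the Queens era; the failed_deployment_list log above is essentially this output):

source ~/stackrc
# list failed resources of the overcloud stack together with the deployment output
openstack stack failures list --long overcloud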

Ronelle Landy (rlandy) on 2018-02-21
tags: added: ci promotion-blocker
Changed in tripleo:
milestone: none → queens-rc1
importance: Undecided → Critical
status: New → Triaged
Michele Baldessari (michele) wrote :

So the failure is due to the pcsd service not starting because it times out:
Feb 21 16:26:14 overcloud-controller-0 systemd[1]: Starting PCS GUI and remote configuration interface...
Feb 21 16:27:44 overcloud-controller-0 systemd[1]: pcsd.service start operation timed out. Terminating.
Feb 21 16:28:07 overcloud-controller-0 systemd[1]: Failed to start PCS GUI and remote configuration interface.
Feb 21 16:28:07 overcloud-controller-0 systemd[1]: Unit pcsd.service entered failed state.
Feb 21 16:28:07 overcloud-controller-0 systemd[1]: pcsd.service failed.
Error: /Stage[main]/Pacemaker::Service/Service[pcsd]/ensure: change from stopped to running failed: Systemd start for pcsd failed!

The only time I have seen this happen is when pcsd could not resolve DNS properly; the queries it made slowed the startup down until it eventually timed out. Not sure this is the case here yet, though. We do not seem to collect /var/log/pcsd any longer?
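
A quick way to check whether name resolution is what is eating the startup time would be to time a lookup on the controller, something like this (just a sketch; the names are whatever pcsd actually tries to resolve):

# each unreachable nameserver adds roughly (resolv.conf timeout x attempts) seconds
time getent hosts overcloud-controller-0
time getent hosts $(hostname -f)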

Is this a reproducible failure?

Michele Baldessari (michele) wrote :

Added an option to puppet-pacemaker so in the future we have the possibility to explicitly enable pcsd_debug: https://review.openstack.org/546760
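
Until that lands, pcsd debug logging can also be turned on by hand on a controller through its sysconfig file (this is the stock pcsd mechanism, not anything tripleo-specific):

# /etc/sysconfig/pcsd
PCSD_DEBUG=true
# then restart and check the log:
# systemctl restart pcsd && less /var/log/pcsd/pcsd.log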

Ronelle Landy (rlandy) on 2018-02-22
summary: - [queen promotion] fs035 is failing to deploy the overcloud - pacemaker
+ [queens promotion] fs035 is failing to deploy the overcloud - pacemaker
is not starting

So in the past we have seen something related to this in https://bugzilla.redhat.com/show_bug.cgi?id=1438870 (improvements on that are tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1452561).

My suspicion is that DNS queries are timing out, since pcsd performs a few of them, and /etc/hosts may not be the only source that gets consulted.
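
Whether /etc/hosts is enough depends on the resolver order, which is worth double-checking on the controller (a quick sketch; on these images the hosts line normally reads "files dns"):

grep '^hosts' /etc/nsswitch.conf
# if "dns" is listed, any name not covered by /etc/hosts still hits the
# unreachable nameservers and burns the full resolver timeout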

Ideally we could start pcsd like this:
[root@rh73-node1:~]# cd /usr/lib/pcsd/
[root@rh73-node1:/usr/lib/pcsd]# export GEM_HOME=/usr/lib/pcsd/vendor/bundle/ruby
[root@rh73-node1:/usr/lib/pcsd]# export GEM_PATH="$GEM_HOME:"`gem env gempath`
[root@rh73-node1:/usr/lib/pcsd]# export RACK_ENV=production
[root@rh73-node1:/usr/lib/pcsd]# export PCSD_DEBUG=true
[root@rh73-node1:/usr/lib/pcsd]# ruby -I/usr/lib/pcsd/ /usr/lib/pcsd/ssl.rb

I'll poke at this some more, but it is a bit hard without more logs.

summary: - [queens promotion] fs035 is failing to deploy the overcloud - pacemaker
- is not starting
+ [queens promotion] fs035 is failing to deploy the overcloud - pcsd times
+ out while starting
Michele Baldessari (michele) wrote :

Ok, so maybe I am on to something here. On controller-0, /etc/resolv.conf has:
nameserver 38.145.33.91
nameserver 38.145.32.66
nameserver 38.145.32.79

+ ip route
default via 192.168.24.1 dev eth0

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1350 qdisc pfifo_fast state UP qlen 1000
    link/ether fa:16:3e:f4:4b:6f brd ff:ff:ff:ff:ff:ff
    inet 192.168.24.15/24 brd 192.168.24.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fef4:4b6f/64 scope link
       valid_lft forever preferred_lft forever

So we reach the internet (i.e. our DNS servers) via the undercloud.
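
A simple way to confirm whether the DNS traffic actually makes it out would be to watch for the queries on the undercloud while resolving something on the controller, roughly like this (a sketch only; interface and addresses depend on the environment):

# on the undercloud
sudo tcpdump -ni any 'udp port 53 and host 192.168.24.15'
# on controller-0, in parallel
dig +time=2 +tries=1 @38.145.33.91 overcloud-controller-0.localdomain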

On the undercloud we have the correct nat rules to break out:
*nat
:PREROUTING ACCEPT [3471:247475]
:INPUT ACCEPT [856:52186]
:OUTPUT ACCEPT [129545:7804244]
:POSTROUTING ACCEPT [129545:7804244]
:BOOTSTACK_MASQ - [0:0]
:DOCKER - [0:0]
-A PREROUTING -d 169.254.169.254/32 -i br-ctlplane -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 8775
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -j BOOTSTACK_MASQ
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A BOOTSTACK_MASQ -s 192.168.24.0/24 -d 192.168.24.0/24 -j RETURN
-A BOOTSTACK_MASQ -s 192.168.24.0/24 -j MASQUERADE
-A DOCKER -i docker0 -j RETURN

*but* in the FORWARD chain we have this:
*filter
:INPUT ACCEPT [0:0]
:FORWARD DROP [2899:203965]
:OUTPUT ACCEPT [5947438:33329856685]
-A FORWARD -j neutron-filter-top
-A FORWARD -j neutron-openvswi-FORWARD
-A FORWARD -d 192.168.24.0/24 -p tcp -m state --state NEW -m comment --comment "140 ctlplane-subnet cidr nat ipv4" -j ACCEPT
-A FORWARD -j DOCKER-ISOLATION
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT

Is it me, or are we missing the UDP forwarding rule?
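
For the record, the UDP counterpart of the existing TCP rule would look roughly like this (only a sketch of the shape of the rule; the actual change is in the review below):

-A FORWARD -d 192.168.24.0/24 -p udp -m state --state NEW -m comment --comment "140 ctlplane-subnet cidr nat ipv4" -j ACCEPT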

Fix proposed to branch: master
Review: https://review.openstack.org/547066

Changed in tripleo:
assignee: nobody → Michele Baldessari (michele)
status: Triaged → In Progress
summary: - [queens promotion] fs035 is failing to deploy the overcloud - pcsd times
- out while starting
+ [queens promotion] fs035 is failing to deploy the overcloud - undercloud
+ has no UDP FORWARD rules
Michele Baldessari (michele) wrote :

Ok, I need to tweak the review to let some more traffic through; we're currently only letting -d <ctlplane> through, which is clearly not enough.
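
In other words, to also cover traffic originating from the ctlplane (DNS queries going out and their replies coming back) the rules probably need to be shaped more like this (a sketch under that assumption, not the exact content of the review):

-A FORWARD -s 192.168.24.0/24 -m state --state NEW -j ACCEPT
-A FORWARD -m state --state RELATED,ESTABLISHED -j ACCEPT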

Michele Baldessari (michele) wrote :

Hohum, the last pass did not fix the pcsd timeout. I still suspect some DNS timeouts (maybe due to ipv6/ipv4). I tried the following to get a local reproducer:

export WD=/var/tmp/fs001; ./quickstart.sh --clean --no-clone --release master-tripleo-ci --tags all --config $WD/config/general_config/featureset035.yml -w $WD --teardown all haa-09

But that fails differently, so I'd appreciate any hints on how to reproduce it. If nothing comes up I'll try a series of patches to increase debug logging in pcsd.

Michele Baldessari (michele) wrote :

Also this bug might be the same as https://bugs.launchpad.net/tripleo/+bug/1750874

Michele Baldessari (michele) wrote :

So I went ahead and straced a pcsd process in an environment that is similar to fs035 (ipv6, network isolation, unreachable DNS servers):
107999 12:41:27.305418 socket(AF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 7<socket:[483166]>
107999 12:41:27.305591 connect(7<socket:[483166]>, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.23.1")}, 16) = 0
107999 12:41:27.305649 poll([{fd=7<socket:[483166]>, events=POLLOUT}], 1, 0) = 1 ([{fd=7, revents=POLLOUT}])
107999 12:41:27.305684 sendto(7<socket:[483166]>, "\345%\1\0\0\1\0\0\0\0\0\0\26overcloud-controller-0\vlocaldomain\vlocaldomain\0\0\1\0\1", 64, MSG_NOSIGNAL, NULL, 0) = 64
107999 12:41:27.305771 poll([{fd=7<socket:[483166]>, events=POLLIN}], 1, 5000) = 0 (Timeout)
107999 12:41:32.310922 socket(AF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 8<socket:[484381]>
107999 12:41:32.311021 connect(8<socket:[484381]>, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.11.5.19")}, 16) = 0
107999 12:41:32.311064 poll([{fd=8<socket:[484381]>, events=POLLOUT}], 1, 0) = 1 ([{fd=8, revents=POLLOUT}])
107999 12:41:32.311096 sendto(8<socket:[484381]>, "\345%\1\0\0\1\0\0\0\0\0\0\26overcloud-controller-0\vlocaldomain\vlocaldomain\0\0\1\0\1", 64, MSG_NOSIGNAL, NULL, 0) = 64
107999 12:41:32.311151 poll([{fd=8<socket:[484381]>, events=POLLIN}], 1, 5000) = 0 (Timeout)
107999 12:41:37.313031 poll([{fd=7<socket:[483166]>, events=POLLOUT}], 1, 0) = 1 ([{fd=7, revents=POLLOUT}])
107999 12:41:37.313095 sendto(7<socket:[483166]>, "\345%\1\0\0\1\0\0\0\0\0\0\26overcloud-controller-0\vlocaldomain\vlocaldomain\0\0\1\0\1", 64, MSG_NOSIGNAL, NULL, 0) = 64
107999 12:41:37.313159 poll([{fd=7<socket:[483166]>, events=POLLIN}], 1, 5000) = 0 (Timeout)
107999 12:41:42.314021 poll([{fd=8<socket:[484381]>, events=POLLOUT}], 1, 0) = 1 ([{fd=8, revents=POLLOUT}])
107999 12:41:42.314169 sendto(8<socket:[484381]>, "\345%\1\0\0\1\0\0\0\0\0\0\26overcloud-controller-0\vlocaldomain\vlocaldomain\0\0\1\0\1", 64, MSG_NOSIGNAL, NULL, 0) = 64
107999 12:41:42.314420 poll([{fd=8<socket:[484381]>, events=POLLIN}], 1, 5000) = 0 (Timeout)
107999 12:41:47.318989 close(7<socket:[483166]>) = 0
107999 12:41:47.319058 close(8<socket:[484381]>) = 0

In my environment pcsd eventually managed to start before the systemd timeout (but it got very close to hitting it), so I suspect we are seeing the same thing in this CI environment.
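
The strace also lines up with the systemd timings, assuming the default glibc resolver settings (timeout:5, attempts:2):

5s poll timeout x 2 attempts x 2 nameservers ≈ 20s per failed lookup (12:41:27 -> 12:41:47 above)
90s between "Starting" and "start operation timed out" in the original journal, so 4-5 such lookups are enough to kill pcsd.service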

Michele Baldessari (michele) wrote :

I am now moderately certain that this is a clone of https://bugs.launchpad.net/tripleo/+bug/1750874

wes hayutin (weshayutin) on 2018-02-28
tags: removed: alert ci promotion-blocker

Change abandoned by Michele Baldessari (<email address hidden>) on branch: master
Review: https://review.openstack.org/547066
