periodic master scen1 standalone fails/timeout 'manage firewall rules'

Bug #1983718 reported by Marios Andreou
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

At [1][2] the periodic-tripleo-ci-centos-9-scenario001-standalone-master job is failing during the standalone deployment. Although the Zuul status is Failed, this is actually a timeout - from [3]:

        * 2022-08-04 23:28:27.803309 | primary | fatal: [undercloud]: FAILED! => {"changed": false, "msg": "async task did not complete within the requested time - 5700s"}

Digging a bit, in both cases it looks like the "Manage firewall rules" task is taking over 40 minutes, causing the timeout.

From [1]

        * 2022-08-04 13:37:31.522883 | fa163e5e-0e24-8c7a-9d13-000000000a6a | TASK | Manage firewall rules
2022-08-04 14:19:39.965488 | fa163e5e-0e24-8c7a-9d13-000000000a6a | CHANGED | Manage firewall rules | standalone
2022-08-04 14:19:39.966843 | fa163e5e-0e24-8c7a-9d13-000000000a6a | TIMING | tripleo_firewall : Manage firewall rules | standalone | 0:43:20.423767 | 2528.44s

From [2]

        * 2022-08-04 21:49:42.321436 | fa163ece-2845-eadd-a47b-000000000a64 | TASK | Manage firewall rules
2022-08-04 22:31:48.302603 | fa163ece-2845-eadd-a47b-000000000a64 | CHANGED | Manage firewall rules | standalone
2022-08-04 22:31:48.304079 | fa163ece-2845-eadd-a47b-000000000a64 | TIMING | tripleo_firewall : Manage firewall rules | standalone | 0:43:16.075926 | 2525.98s

This is a promotion blocker for the master centos9 integration line

[1] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-scenario001-standalone-master/2a01470/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz
[2] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-scenario001-standalone-master/e7a9cbd/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz
[3] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-scenario001-standalone-master/e7a9cbd/job-output.txt

Tags: alert ci
Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

I've quickly checked the host logs[1].

We can indeed see the first ansible-ansible.builtin.iptables call at 17:37:31, and the last one at
18:19:39. There are a lot of things in-between, but I don't think those unrelated actions have any impact on the rule application.
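
For reference, a quick way to pull those first/last timestamps out of the journal dump from [1] (a sketch; the grep pattern is an assumption based on the module name quoted above):

~~~
# Print the first and the last ansible iptables module invocation recorded
# in the downloaded journal dump (journal.txt.gz from [1]).
zcat journal.txt.gz | grep 'ansible-ansible.builtin.iptables' | sed -n '1p;$p'
~~~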

AFAIK, here are the changes related to the firewall:
- tripleo_iptables action allows filtering out "nft_"-prefixed parameters: https://review.opendev.org/c/openstack/tripleo-ansible/+/850221
This is needed in order to allow better rules within nftables

- actually accept connections on lo and dhcpv6: https://review.opendev.org/c/openstack/tripleo-ansible/+/850620
While this has no actual effect on the way we manage iptables, it is needed for nftables because nftables drops packets via a chain policy instead of the (terrible) "DROP" rule that only affects NEW states (a minimal sketch follows this list)

- better logging for nftables: https://review.opendev.org/c/openstack/tripleo-ansible/+/850222
This only affects nftables-based deployments
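
To illustrate the chain-policy point from the second item above, here is a minimal nftables sketch (not the actual tripleo-ansible template; chain layout and the DHCPv6 port are assumptions):

~~~
# With nftables, the chain *policy* drops anything not explicitly accepted,
# so loopback and DHCPv6-client traffic need their own accept rules.
nft -f - <<'EOF'
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    iif "lo" accept                       # loopback must be accepted explicitly
    udp dport 546 accept                  # dhcpv6-client (assumed port)
    ct state established,related accept
  }
}
EOF
~~~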

So I don't think any of the mentioned patches has an actual impact on the time it takes to apply the rules :(. Maybe an issue with the VM resources?

[1] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-scenario001-standalone-master/2a01470/logs/undercloud/var/log/extra/journal.txt.gz

Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

Some more digging:

We can see a first "bunch" of rules being applied pretty quickly:
17:37:34 -> 17:37:40

Then, apparently, the zuul user disconnects from SSH and reconnects multiple times:

Aug 04 17:37:41 standalone.localdomain sshd[107893]: Accepted publickey for zuul from 127.0.0.1 port 52108 ssh2: RSA SHA256:6Me712WCMk6cyNoQdOq7KDuyNc+l5q2dV1a6LVZJ0EM
Aug 04 17:37:41 standalone.localdomain systemd-logind[742]: New session 792 of user zuul.
Aug 04 17:37:41 standalone.localdomain systemd[1]: Started Session 792 of User zuul.
Aug 04 17:37:41 standalone.localdomain sshd[107893]: pam_unix(sshd:session): session opened for user zuul(uid=1000) by (uid=0)
Aug 04 17:37:41 standalone.localdomain sshd[107896]: Received disconnect from 127.0.0.1 port 52108:11: disconnected by user
Aug 04 17:37:41 standalone.localdomain sshd[107896]: Disconnected from user zuul 127.0.0.1 port 52108
Aug 04 17:37:41 standalone.localdomain sshd[107893]: pam_unix(sshd:session): session closed for user zuul
Aug 04 17:37:41 standalone.localdomain systemd[1]: session-792.scope: Deactivated successfully.
Aug 04 17:37:41 standalone.localdomain systemd-logind[742]: Session 792 logged out. Waiting for processes to exit.
Aug 04 17:37:41 standalone.localdomain systemd-logind[742]: Removed session 792.
Aug 04 17:37:41 standalone.localdomain sshd[107928]: Accepted publickey for zuul from 127.0.0.1 port 52110 ssh2: RSA SHA256:6Me712WCMk6cyNoQdOq7KDuyNc+l5q2dV1a6LVZJ0EM
Aug 04 17:37:41 standalone.localdomain systemd-logind[742]: New session 793 of user zuul.
Aug 04 17:37:41 standalone.localdomain systemd[1]: Started Session 793 of User zuul.
Aug 04 17:37:41 standalone.localdomain sshd[107928]: pam_unix(sshd:session): session opened for user zuul(uid=1000) by (uid=0)
Aug 04 17:37:41 standalone.localdomain sshd[107931]: Received disconnect from 127.0.0.1 port 52110:11: disconnected by user
Aug 04 17:37:41 standalone.localdomain sshd[107931]: Disconnected from user zuul 127.0.0.1 port 52110
Aug 04 17:37:41 standalone.localdomain sshd[107928]: pam_unix(sshd:session): session closed for user zuul
Aug 04 17:37:41 standalone.localdomain systemd[1]: session-793.scope: Deactivated successfully.
Aug 04 17:37:41 standalone.localdomain systemd-logind[742]: Session 793 logged out. Waiting for processes to exit.
Aug 04 17:37:41 standalone.localdomain systemd-logind[742]: Removed session 793.
Aug 04 17:37:41 standalone.localdomain sshd[107964]: Accepted publickey for zuul from 127.0.0.1 port 52116 ssh2: RSA SHA256:6Me712WCMk6cyNoQdOq7KDuyNc+l5q2dV1a6LVZJ0EM
Aug 04 17:37:41 standalone.localdomain systemd-logind[742]: New session 794 of user zuul.
Aug 04 17:37:41 standalone.localdomain systemd[1]: Started Session 794 of User zuul.
Aug 04 17:37:41 standalone.localdomain sshd[107964]: pam_unix(sshd:session): session opened for user zuul(uid=1000) by (uid=0)
Aug 04 17:37:42 standalone.localdomain kernel: IN=lo OUT= MAC=00:00:00:00:00:00:00:00:00:00:00:00:08:00 SRC=127.0.0.1 DST=127.0.0.2 LEN=52 TOS=0x08 PREC=0x40 TTL=64 ID=46664 DF PROTO=TCP SPT=52116 DPT=22 WINDOW=512 RES=0x00 ACK URGP=0
Aug 04 17:37:42 standalone.localdomain sudo[108000]: zuul : TTY=pts/2 ; PWD=/home/zuul ; USER=root ; COMMAND=...


Revision history for this message
Marios Andreou (marios-b) wrote (last edit ):

The last few runs here are green [1], but I think the issue persists.

See [2] for example ("2 hrs 4 mins 12 secs 2022-08-08 01:18:12 SUCCESS"), but the firewall rules still take over 2500 seconds (more than 40 minutes):

2022-08-07 21:44:41.288713 | fa163e1e-ddbc-4db8-8d27-000000000a64 | TASK | Manage firewall rules
2022-08-07 22:26:47.265020 | fa163e1e-ddbc-4db8-8d27-000000000a64 | CHANGED | Manage firewall rules | standalone
2022-08-07 22:26:47.266632 | fa163e1e-ddbc-4db8-8d27-000000000a64 | TIMING | tripleo_firewall : Manage firewall rules | standalone | 0:43:16.595542 | 2525.98s

[1] https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-9-scenario001-standalone-master
[2] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-scenario001-standalone-master/a873844/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz

Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

We're close to getting TripleO switched to nftables. The way the rules are applied with that engine should avoid this kind of headache, since it basically generates templates and applies *all* the rules in a single transaction.

The current tripleo_iptables implementation still relies on the ansible "iptables" module, and that one runs one transaction per rule. tripleo_iptables makes it faster, but still - ansible may take the liberty of splitting the transactions, apparently.
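
As a rough illustration of the difference (a hedged sketch, not the actual role code; rule and file names are made up):

~~~
# iptables path: roughly one check + one insert per managed rule, i.e. one
# iptables invocation and one table transaction per rule.
iptables -C INPUT -p tcp --dport 22 -j ACCEPT 2>/dev/null \
  || iptables -I INPUT -p tcp --dport 22 -j ACCEPT
# ...repeated for every rule in the list.

# nftables path: render the whole ruleset from templates, then load it atomically.
nft -f /etc/nftables/tripleo-rules.nft   # single transaction for all rules (path assumed)
~~~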

I'm wondering whether an ansible update triggered those timeouts - would it be possible to check?

Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :

With nftables, the job finished within 1 hr 37 mins; earlier it used to take more than 2 hrs on average.

The tripleo_firewall role finished within 2 seconds, whereas it was taking ~40 mins earlier.

The comparison file is attached.

Revision history for this message
Soniya Murlidhar Vyas (svyas) wrote :

The job is indeed green[1] and the tripleo_firewall role finished within 2 seconds.

2022-08-18 22:30:50.883767 | fa163e65-29ea-c05d-ef8e-000000000a6a | TIMING | tripleo_firewall : Enable tripleo-iptables service (and do a daemon-reload systemd) | standalone | 0:43:21.087435 | 1.10s
2022-08-18 22:30:50.891823 | fa163e65-29ea-c05d-ef8e-000000000a6a | TIMING | tripleo_firewall : Enable tripleo-iptables service (and do a daemon-reload systemd) | standalone | 0:43:21.095490 | 1.10s

[1]https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-scenario001-standalone-master/3ba2d4a/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz

Revision history for this message
Takashi Kajinami (kajinamit) wrote :

Thanks to Sandeep we managed to get access to the test node in RDO infra to look into the issue.

I'll summarize our observations below.

* `iptables -t filter -L INPUT` takes almost 40 seconds on the node[1].
* The command gets stuck for several seconds before it shows the line with 192.168.24.0/24.
* Adding the -n option allows the command to complete quickly[1].
* We tested removing 127.0.0.1 from resolv.conf[2], which resolved the slowness without -n.

[1] https://paste.opendev.org/show/beGwwFUdaaaO0R2ccOdy/

[2]
nameserver 127.0.0.1
nameserver 1.1.1.1

Because the ansible iptables module executes the command for every single rule it inserts, the slow command can cause a huge delay when many rules are managed (see the back-of-envelope sketch below).

At this moment we are concluding that the issue is caused by a wrong DNS setting, which we'll look into further.
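
A hedged back-of-envelope, assuming each per-rule invocation stalls on reverse DNS roughly like the -L listing did (the rule count is a guess):

~~~
# ~40s of DNS stalls per iptables invocation, one invocation per managed rule:
echo $(( 40 * 60 ))   # ≈ 2400s for ~60 rules, same order as the observed ~2526s
~~~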

Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote (last edit ):

Hello Team,

Thanks to Takashi for debugging help.

We are not hitting this issue upstream. In RDO, besides scenario001, all other scenarios are affected as well, e.g. vanilla standalone[0].

tl;dr: slow DNS resolution. Because 127.0.0.1 was the first entry in resolv.conf, every lookup would first be tried against 127.0.0.1, time out after a couple of seconds (since localhost doesn't know how to answer a DNS query), and only then be retried against the second DNS server.

Detailed debugging steps:
=========================

We noticed that simply running "iptables -t filter -L INPUT" was very slow, taking 40+ seconds; however, with the extra -n flag it returns quickly, details in [1]. "-n" means that IP addresses and port numbers are printed in numeric format; by default, the program tries to display them as host names, network names, or services (whenever applicable).

~~~
time iptables -t filter -L INPUT
.
.
real 0m40.197s
~~~

We came across article [2], which pointed to slow DNS/reverse DNS resolution, and that was indeed the case:

Resolving www.google.com was taking more than 3 seconds:

~~~
[root@standalone ~]# time host www.google.com
www.google.com has address 142.251.41.68
www.google.com has IPv6 address 2607:f8b0:400b:804::2004

real 0m3.037s
user 0m0.015s
sys 0m0.014s
~~~

This was our /etc/resolv.conf content

~~~
[root@standalone ~]# cat /etc/resolv.conf
# Generated by NetworkManager
search openstacklocal localdomain
nameserver 127.0.0.1
nameserver 1.1.1.1
~~~
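
A quick way to confirm which of the two nameservers is the bottleneck is to query each directly (a sketch; assumes dig is available on the node):

~~~
time dig +tries=1 +time=2 @127.0.0.1 www.google.com   # the local resolver that has nothing behind it
time dig +tries=1 +time=2 @1.1.1.1   www.google.com   # the upstream resolver
~~~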

Article [3] suggests that if you don't have a name server running on the local box, you should leave out 127.0.0.1.

We commented out 127.0.0.1 in /etc/resolv.conf and that resolved the issue.

~~~
[root@standalone ~]# cat /etc/resolv.conf
# Generated by NetworkManager
search openstacklocal localdomain
#nameserver 127.0.0.1
nameserver 1.1.1.1
~~~

DNS resolution became faster.
~~~
[root@standalone ~]# time host www.google.com
www.google.com has address 142.251.33.164
www.google.com has IPv6 address 2607:f8b0:400b:80c::2004

real 0m0.059s
user 0m0.010s
sys 0m0.015s
~~~

~~~
time iptables -t filter -L INPUT
.
.
real 0m0.188s
~~~

[0] https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-standalone-wallaby/bbc1dd7/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz
~~~
took ~30 mins: 2022-08-25 18:51:02.273791 | fa163e1b-009f-a236-db4b-000000000889 | TIMING | tripleo_firewall : Manage firewall rules | standalone | 0:29:08.429153 | 1684.59s
~~~

[1] https://paste.opendev.org/show/beGwwFUdaaaO0R2ccOdy/
[2] https://www.fir3net.com/UNIX/Linux/iptables-l-output-displays-slowly.html
[3] https://serverfault.com/questions/791911/centos-extremely-slow-dns-lookup

Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :

* "127.0.0.1" entry in resolv.conf comes during deployment, we have this entry in standalone_parameters.yaml.txt [1]. We create standalone_parameters.yaml using below Jinja file [2] based on `undercloud_undercloud_nameserver` var.

We use standalone.yml[3] for standalone deployment which is basically symlink to multinode.yml and `undercloud_undercloud_nameserver` var is defined in multinode.yml[4].

We need to investigate on why 127.0.0.1 was added in `undercloud_undercloud_nameserver` var and if it can be removed/other possible solutions.

[1] https://logserver.rdoproject.org/48/38348/38/check/periodic-tripleo-ci-centos-9-scenario001-standalone-wallaby/70fd0da/logs/undercloud/home/zuul/standalone_parameters.yaml.txt.gz
~~~
  DnsServers: ["127.0.0.1", "1.1.1.1"]
~~~
[2] https://opendev.org/openstack/tripleo-quickstart-extras/src/branch/master/roles/standalone/templates/standalone_config.yaml.j2#L20
~~~
  DnsServers: {{ undercloud_undercloud_nameservers | to_json }}
~~~

[3] https://github.com/openstack/tripleo-ci/blob/master/toci-quickstart/config/testenv/standalone.yml
~~~
lrwxrwxrwx. 1 sandeepyadav sandeepyadav 13 Aug 26 09:46 standalone.yml -> multinode.yml
~~~

[4] https://opendev.org/openstack/tripleo-ci/src/branch/master/toci-quickstart/config/testenv/multinode.yml#L36

~~~
undercloud_undercloud_nameservers:
  - 127.0.0.1
  - 1.1.1.1
~~~

Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote (last edit ):

Hello,

While checking why we added 127.0.0.1 to resolv.conf, we came across the upstream bug and patch below, which suggest using a local unbound service.

~~~
Bug: https://bugs.launchpad.net/tripleo/+bug/1711262 (comment#3)
Patch: https://opendev.org/openstack/tripleo-ci/commit/5d612318c95b9b3ff78e66e91d5c225274cb8b09
~~~

One observation I want to add is that the unbound service is not installed/running on the RDO CentOS 9 node, while it is present on the upstream CentOS 9 node and the RDO CentOS 8 node.
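
A quick sanity check for this (a sketch; assumes ss and systemd are available on the node):

~~~
# 127.0.0.1 in resolv.conf only helps if a resolver actually listens on the
# local DNS port, which is what unbound provides.
ss -ulpn 'sport = :53'                 # should show unbound bound to 127.0.0.1:53
systemctl is-active unbound.service    # "active" on upstream nodes, not present on RDO C9
~~~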

Upstream Node:-

https://b31788cff7a995401de2-45518917cf8baf33fe991d0324b9a061.ssl.cf1.rackcdn.com/846947/2/gate/tripleo-ci-centos-9-scenario001-standalone/4cac8c0/logs/undercloud/var/log/extra/package-list-installed.txt

~~~
unbound.x86_64 1.16.0-3.el9 @appstream
unbound-libs.x86_64 1.16.0-3.el9 @appstream
~~~

https://b31788cff7a995401de2-45518917cf8baf33fe991d0324b9a061.ssl.cf1.rackcdn.com/846947/2/gate/tripleo-ci-centos-9-scenario001-standalone/4cac8c0/logs/undercloud/var/log/extra/services.txt
~~~
unbound.service loaded active running Unbound recursive Domain Name Server
~~~

RDO node:-

https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-scenario001-standalone-wallaby/1c1c2d4/logs/undercloud/var/log/extra/package-list-installed.txt.gz
~~~
unbound-libs.x86_64 1.16.0-3.el9 @appstream
~~~

On the RDO node, unbound.service is not present.

https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-scenario001-standalone-wallaby/1c1c2d4/logs/undercloud/var/log/extra/services.txt.gz

On CentOS 8 Wallaby we have unbound installed, and we did not see the same issue on the CentOS 8 node.
https://logserver.rdoproject.org/openstack-periodic-integration-stable1-cs8/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-standalone-wallaby/8cb7e4a/logs/undercloud/etc/unbound/

On the affected node, installing unbound and starting the unbound service solved the issue.

[root@standalone conf.d]# time host www.google.com
www.google.com has address 142.251.41.68
www.google.com has IPv6 address 2607:f8b0:400b:807::2004

real 0m3.046s
user 0m0.013s
sys 0m0.015s

[root@standalone conf.d]# dnf install unbound
.
.
Installed:
  unbound-1.16.2-2.el9.x86_64

[root@standalone conf.d]# systemctl start unbound.service
[root@standalone conf.d]# time host www.google.com
www.google.com has address 142.251.41.68
www.google.com has IPv6 address 2607:f8b0:400b:804::2004

real 0m0.133s
user 0m0.007s
sys 0m0.014s

Next step: investigate why the RDO CentOS 9 nodepool image doesn't have unbound installed, and add the unbound service to be consistent with upstream.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-ci (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-ci/+/854751

Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :

I spoke with the Infra team (Daniel) about why the unbound service is missing in the RDO C9 image; it turns out we were building the C9 image via virt-customize rather than dib (as upstream does), and that seems to explain the difference.

Thanks to Daniel, who proposed a patch for the C9 dib image[1]. We have proposed [2] as a workaround and will revert it later, once we get the C9 image built with dib, which will contain the unbound service by default.

[1] https://softwarefactory-project.io/r/c/config/+/25941
[2] https://review.opendev.org/c/openstack/tripleo-ci/+/854751

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-ci (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-ci/+/854751
Committed: https://opendev.org/openstack/tripleo-ci/commit/343f8689418a1fd312b3d85f086233fa8d94d9ac
Submitter: "Zuul (22348)"
Branch: master

commit 343f8689418a1fd312b3d85f086233fa8d94d9ac
Author: Sandeep Yadav <email address hidden>
Date: Fri Aug 26 13:19:40 2022 +0530

    Override Nameserver for vexxhost cloud

    We are hitting slow DNS resolution for the jobs running in the vexxhost
    cloud because 127.0.0.1 is the first entry in resolv.conf but the image
    we use doesn't have the unbound service installed. The unbound service is
    present in upstream images; details in the related bug.

    With this patch, we override the nameserver values used in the vexxhost cloud.

    Related-Bug: #1983718
    Change-Id: Ie95ed0e6b67afc28d124b2cf91fe67b57f18e654

Revision history for this message
Alan Pevec (apevec) wrote :

@dpawlik re: comment #12 - IIUC the upstream-centos-9-stream DIB base image was reverted; do we have an update to virt-customize to add local unbound to the CS9 cloud-qcow-based nodepool image?

Revision history for this message
daniel.pawlik (daniel-pawlik) wrote :

We removed the upstream-centos-9-stream image and went back to cloud-centos-9-stream because test results were much, much better on the cloud image.

Alan Pevec (apevec)
Changed in tripleo:
status: Triaged → Fix Released
tags: removed: promotion-blocker