periodic master scen1 standalone fails/timeout 'manage firewall rules'

Bug #1983718 reported by Marios Andreou
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

At [1][2] the periodic-tripleo-ci-centos-9-scenario001-standalone-master job is failing during the standalone deployment. Although the Zuul status is Failed, this is actually a timeout - from [3]:

        * 2022-08-04 23:28:27.803309 | primary | fatal: [undercloud]: FAILED! => {"changed": false, "msg": "async task did not complete within the requested time - 5700s"}

Digging a bit, in both cases it looks like the "Manage firewall rules" task is taking over 40 minutes, causing the timeout.

From [1]

        * 2022-08-04 13:37:31.522883 | fa163e5e-0e24-8c7a-9d13-000000000a6a | TASK | Manage firewall rules
2022-08-04 14:19:39.965488 | fa163e5e-0e24-8c7a-9d13-000000000a6a | CHANGED | Manage firewall rules | standalone
2022-08-04 14:19:39.966843 | fa163e5e-0e24-8c7a-9d13-000000000a6a | TIMING | tripleo_firewall : Manage firewall rules | standalone | 0:43:20.423767 | 2528.44s

From [2]

        * 2022-08-04 21:49:42.321436 | fa163ece-2845-eadd-a47b-000000000a64 | TASK | Manage firewall rules
2022-08-04 22:31:48.302603 | fa163ece-2845-eadd-a47b-000000000a64 | CHANGED | Manage firewall rules | standalone
2022-08-04 22:31:48.304079 | fa163ece-2845-eadd-a47b-000000000a64 | TIMING | tripleo_firewall : Manage firewall rules | standalone | 0:43:16.075926 | 2525.98s

This is a promotion blocker for the master centos9 integration line

[1] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-scenario001-standalone-master/2a01470/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz
[2] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-scenario001-standalone-master/e7a9cbd/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz
[3] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-scenario001-standalone-master/e7a9cbd/job-output.txt

Tags: alert ci
Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

I've quickly checked the host logs[1].

We can indeed see the first ansible-ansible.builtin.iptables call at 17:37:31, and the last one at
18:19:39. There are a lot of things in-between, but I don't think those unrelated actions have any impact on the rule application.
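
For reference, a quick way to pull those first/last timestamps out of the journal dump from [1] (a sketch; the grep pattern is an assumption based on the module name quoted above):

~~~
# Print the first and the last ansible iptables module invocation recorded
# in the downloaded journal dump (journal.txt.gz from [1]).
zcat journal.txt.gz | grep 'ansible-ansible.builtin.iptables' | sed -n '1p;$p'
~~~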

AFAIK, here are the changes related to the firewall:
- tripleo_iptables action allows filtering out "nft_"-prefixed parameters: https://review.opendev.org/c/openstack/tripleo-ansible/+/850221
This is needed in order to allow better rules within nftables

- actually accept connections on lo and dhcpv6: https://review.opendev.org/c/openstack/tripleo-ansible/+/850620
While this has no actual effect on the way we manage iptables, it is needed for nftables because nftables drops packets via a chain policy instead of the (terrible) "DROP" rule that only affects NEW states (a minimal sketch follows this list)

- better logging for nftables: https://review.opendev.org/c/openstack/tripleo-ansible/+/850222
This only affects nftables-based deployments
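
To illustrate the chain-policy point from the second item above, here is a minimal nftables sketch (not the actual tripleo-ansible template; chain layout and the DHCPv6 port are assumptions):

~~~
# With nftables, the chain *policy* drops anything not explicitly accepted,
# so loopback and DHCPv6-client traffic need their own accept rules.
nft -f - <<'EOF'
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    iif "lo" accept                       # loopback must be accepted explicitly
    udp dport 546 accept                  # dhcpv6-client (assumed port)
    ct state established,related accept
  }
}
EOF
~~~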

So I don't think any of the mentioned patches has an actual impact on the time it takes to apply the rules :(. Maybe an issue with the VM resources?

[1] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-scenario001-standalone-master/2a01470/logs/undercloud/var/log/extra/journal.txt.gz

Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

Some more digging:

We can see a first "bunch" of rules being applied pretty quickly:
17:37:34 -> 17:37:40

Then, apparently, the zuul user disconnects from SSH and reconnects multiple times:

Aug 04 17:37:41 standalone.localdomain sshd[107893]: Accepted publickey for zuul from 127.0.0.1 port 52108 ssh2: RSA SHA256:6Me712WCMk6cyNoQdOq7KDuyNc+l5q2dV1a6LVZJ0EM
Aug 04 17:37:41 standalone.localdomain systemd-logind[742]: New session 792 of user zuul.
Aug 04 17:37:41 standalone.localdomain systemd[1]: Started Session 792 of User zuul.
Aug 04 17:37:41 standalone.localdomain sshd[107893]: pam_unix(sshd:session): session opened for user zuul(uid=1000) by (uid=0)
Aug 04 17:37:41 standalone.localdomain sshd[107896]: Received disconnect from 127.0.0.1 port 52108:11: disconnected by user
Aug 04 17:37:41 standalone.localdomain sshd[107896]: Disconnected from user zuul 127.0.0.1 port 52108
Aug 04 17:37:41 standalone.localdomain sshd[107893]: pam_unix(sshd:session): session closed for user zuul
Aug 04 17:37:41 standalone.localdomain systemd[1]: session-792.scope: Deactivated successfully.
Aug 04 17:37:41 standalone.localdomain systemd-logind[742]: Session 792 logged out. Waiting for processes to exit.
Aug 04 17:37:41 standalone.localdomain systemd-logind[742]: Removed session 792.
Aug 04 17:37:41 standalone.localdomain sshd[107928]: Accepted publickey for zuul from 127.0.0.1 port 52110 ssh2: RSA SHA256:6Me712WCMk6cyNoQdOq7KDuyNc+l5q2dV1a6LVZJ0EM
Aug 04 17:37:41 standalone.localdomain systemd-logind[742]: New session 793 of user zuul.
Aug 04 17:37:41 standalone.localdomain systemd[1]: Started Session 793 of User zuul.
Aug 04 17:37:41 standalone.localdomain sshd[107928]: pam_unix(sshd:session): session opened for user zuul(uid=1000) by (uid=0)
Aug 04 17:37:41 standalone.localdomain sshd[107931]: Received disconnect from 127.0.0.1 port 52110:11: disconnected by user
Aug 04 17:37:41 standalone.localdomain sshd[107931]: Disconnected from user zuul 127.0.0.1 port 52110
Aug 04 17:37:41 standalone.localdomain sshd[107928]: pam_unix(sshd:session): session closed for user zuul
Aug 04 17:37:41 standalone.localdomain systemd[1]: session-793.scope: Deactivated successfully.
Aug 04 17:37:41 standalone.localdomain systemd-logind[742]: Session 793 logged out. Waiting for processes to exit.
Aug 04 17:37:41 standalone.localdomain systemd-logind[742]: Removed session 793.
Aug 04 17:37:41 standalone.localdomain sshd[107964]: Accepted publickey for zuul from 127.0.0.1 port 52116 ssh2: RSA SHA256:6Me712WCMk6cyNoQdOq7KDuyNc+l5q2dV1a6LVZJ0EM
Aug 04 17:37:41 standalone.localdomain systemd-logind[742]: New session 794 of user zuul.
Aug 04 17:37:41 standalone.localdomain systemd[1]: Started Session 794 of User zuul.
Aug 04 17:37:41 standalone.localdomain sshd[107964]: pam_unix(sshd:session): session opened for user zuul(uid=1000) by (uid=0)
Aug 04 17:37:42 standalone.localdomain kernel: IN=lo OUT= MAC=00:00:00:00:00:00:00:00:00:00:00:00:08:00 SRC=127.0.0.1 DST=127.0.0.2 LEN=52 TOS=0x08 PREC=0x40 TTL=64 ID=46664 DF PROTO=TCP SPT=52116 DPT=22 WINDOW=512 RES=0x00 ACK URGP=0
Aug 04 17:37:42 standalone.localdomain sudo[108000]: zuul : TTY=pts/2 ; PWD=/home/zuul ; USER=root ; COMMAND=...


Revision history for this message
Marios Andreou (marios-b) wrote (last edit ):

The last few runs here are green [1], but I think the issue persists.

See [2] for example ("2 hrs 4 mins 12 secs 2022-08-08 01:18:12 SUCCESS"), but the firewall rules still take over 2500 seconds (more than 40 minutes):

2022-08-07 21:44:41.288713 | fa163e1e-ddbc-4db8-8d27-000000000a64 | TASK | Manage firewall rules
2022-08-07 22:26:47.265020 | fa163e1e-ddbc-4db8-8d27-000000000a64 | CHANGED | Manage firewall rules | standalone
2022-08-07 22:26:47.266632 | fa163e1e-ddbc-4db8-8d27-000000000a64 | TIMING | tripleo_firewall : Manage firewall rules | standalone | 0:43:16.595542 | 2525.98s

[1] https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-9-scenario001-standalone-master
[2] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-scenario001-standalone-master/a873844/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz

Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

We're close to getting TripleO switched to nftables. The way the rules are applied with that engine should avoid this kind of headache, since it basically generates templates and applies *all* the rules in a single transaction.

The current tripleo_iptables implementation still relies on the ansible "iptables" module, and that one runs one transaction per rule. tripleo_iptables makes it faster, but still - ansible may take the liberty of splitting the transactions, apparently.
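
As a rough illustration of the difference (a hedged sketch, not the actual role code; rule and file names are made up):

~~~
# iptables path: roughly one check + one insert per managed rule, i.e. one
# iptables invocation and one table transaction per rule.
iptables -C INPUT -p tcp --dport 22 -j ACCEPT 2>/dev/null \
  || iptables -I INPUT -p tcp --dport 22 -j ACCEPT
# ...repeated for every rule in the list.

# nftables path: render the whole ruleset from templates, then load it atomically.
nft -f /etc/nftables/tripleo-rules.nft   # single transaction for all rules (path assumed)
~~~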

I'm wondering whether an ansible update triggered those timeouts - would it be possible to check?

Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :

With nftables, the job finished within 1 hr 37 mins; earlier it used to take more than 2 hrs on average.

The tripleo_firewall role finished within 2 seconds, whereas it was taking ~40 mins earlier.

The comparison file is attached.

Revision history for this message
Soniya Murlidhar Vyas (svyas) wrote :

The job is indeed green[1] and the tripleo_firewall role finished within 2 seconds.

2022-08-18 22:30:50.883767 | fa163e65-29ea-c05d-ef8e-000000000a6a | TIMING | tripleo_firewall : Enable tripleo-iptables service (and do a daemon-reload systemd) | standalone | 0:43:21.087435 | 1.10s
2022-08-18 22:30:50.891823 | fa163e65-29ea-c05d-ef8e-000000000a6a | TIMING | tripleo_firewall : Enable tripleo-iptables service (and do a daemon-reload systemd) | standalone | 0:43:21.095490 | 1.10s

[1]https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-scenario001-standalone-master/3ba2d4a/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz

Revision history for this message
Takashi Kajinami (kajinamit) wrote :

Thanks to Sandeep we managed to get access to the test node in RDO infra to look into the issue.

I'll summarize our observations below.

* `iptables -t filter -L INPUT` takes almost 40 seconds on the node[1].
* The command gets stuck for several seconds before it shows the line with 192.168.24.0/24.
* Adding the -n option allows the command to complete quickly[1].
* We tested removing 127.0.0.1 from resolv.conf[2], which resolved the slowness without -n.

[1] https://paste.opendev.org/show/beGwwFUdaaaO0R2ccOdy/

[2]
nameserver 127.0.0.1
nameserver 1.1.1.1

Because the ansible iptables module executes the command for every single rule it inserts, the slow command can cause a huge delay when many rules are managed (see the back-of-envelope sketch below).

At this moment we are concluding that the issue is caused by a wrong DNS setting, which we'll look into further.
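
A hedged back-of-envelope, assuming each per-rule invocation stalls on reverse DNS roughly like the -L listing did (the rule count is a guess):

~~~
# ~40s of DNS stalls per iptables invocation, one invocation per managed rule:
echo $(( 40 * 60 ))   # ≈ 2400s for ~60 rules, same order as the observed ~2526s
~~~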

Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote (last edit ):

Hello Team,

Thanks to Takashi for debugging help.

We are not hitting this issue upstream. In RDO, besides scenario001, all other scenarios are affected as well, e.g. vanilla standalone[0].

tl;dr: slow DNS resolution. Because 127.0.0.1 was the first entry in resolv.conf, every lookup would first be tried against 127.0.0.1, time out after a couple of seconds (since localhost doesn't know how to answer a DNS query), and only then be retried against the second DNS server.

Detailed debugging steps:
=========================

We noticed that simply running "iptables -t filter -L INPUT" was very slow, taking 40+ seconds; however, with the extra -n flag it returns quickly, details in [1]. "-n" means that IP addresses and port numbers are printed in numeric format; by default, the program tries to display them as host names, network names, or services (whenever applicable).

~~~
time iptables -t filter -L INPUT
.
.
real 0m40.197s
~~~

We came across article [2], which pointed to slow DNS/reverse DNS resolution, and that was indeed the case:

Resolving www.google.com was taking more than 3 seconds:

~~~
[root@standalone ~]# time host www.google.com
www.google.com has address 142.251.41.68
www.google.com has IPv6 address 2607:f8b0:400b:804::2004

real 0m3.037s
user 0m0.015s
sys 0m0.014s
~~~

This was our /etc/resolv.conf content

~~~
[root@standalone ~]# cat /etc/resolv.conf
# Generated by NetworkManager
search openstacklocal localdomain
nameserver 127.0.0.1
nameserver 1.1.1.1
~~~
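
A quick way to confirm which of the two nameservers is the bottleneck is to query each directly (a sketch; assumes dig is available on the node):

~~~
time dig +tries=1 +time=2 @127.0.0.1 www.google.com   # the local resolver that has nothing behind it
time dig +tries=1 +time=2 @1.1.1.1   www.google.com   # the upstream resolver
~~~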

Article [3] suggests that if you don't have a name server running on the local box, you should leave out 127.0.0.1.

We commented out 127.0.0.1 in /etc/resolv.conf and that resolved the issue.

~~~
[root@standalone ~]# cat /etc/resolv.conf
# Generated by NetworkManager
search openstacklocal localdomain
#nameserver 127.0.0.1
nameserver 1.1.1.1
~~~

DNS resolution became faster.
~~~
[root@standalone ~]# time host www.google.com
www.google.com has address 142.251.33.164
www.google.com has IPv6 address 2607:f8b0:400b:80c::2004

real 0m0.059s
user 0m0.010s
sys 0m0.015s
~~~

~~~
time iptables -t filter -L INPUT
.
.
real 0m0.188s
~~~

[0] https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-standalone-wallaby/bbc1dd7/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz
~~~
took ~30 mins: 2022-08-25 18:51:02.273791 | fa163e1b-009f-a236-db4b-000000000889 | TIMING | tripleo_firewall : Manage firewall rules | standalone | 0:29:08.429153 | 1684.59s
~~~

[1] https://paste.opendev.org/show/beGwwFUdaaaO0R2ccOdy/
[2] https://www.fir3net.com/UNIX/Linux/iptables-l-output-displays-slowly.html
[3] https://serverfault.com/questions/791911/centos-extremely-slow-dns-lookup

Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :

* "127.0.0.1" entry in resolv.conf comes during deployment, we have this entry in standalone_parameters.yaml.txt [1]. We create standalone_parameters.yaml using below Jinja file [2] based on `undercloud_undercloud_nameserver` var.

We use standalone.yml[3] for standalone deployment which is basically symlink to multinode.yml and `undercloud_undercloud_nameserver` var is defined in multinode.yml[4].

We need to investigate on why 127.0.0.1 was added in `undercloud_undercloud_nameserver` var and if it can be removed/other possible solutions.

[1] https://logserver.rdoproject.org/48/38348/38/check/periodic-tripleo-ci-centos-9-scenario001-standalone-wallaby/70fd0da/logs/undercloud/home/zuul/standalone_parameters.yaml.txt.gz
~~~
  DnsServers: ["127.0.0.1", "1.1.1.1"]
~~~
[2] https://opendev.org/openstack/tripleo-quickstart-extras/src/branch/master/roles/standalone/templates/standalone_config.yaml.j2#L20
~~~
  DnsServers: {{ undercloud_undercloud_nameservers | to_json }}
~~~

[3] https://github.com/openstack/tripleo-ci/blob/master/toci-quickstart/config/testenv/standalone.yml
~~~
lrwxrwxrwx. 1 sandeepyadav sandeepyadav 13 Aug 26 09:46 standalone.yml -> multinode.yml
~~~

[4] https://opendev.org/openstack/tripleo-ci/src/branch/master/toci-quickstart/config/testenv/multinode.yml#L36

~~~
undercloud_undercloud_nameservers:
  - 127.0.0.1
  - 1.1.1.1
~~~

Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote (last edit ):

Hello,

While checking why we added 127.0.0.1 to resolv.conf, we came across the upstream bug and patch below, which suggest using a local unbound service.

~~~
Bug: https://bugs.launchpad.net/tripleo/+bug/1711262 (comment#3)
Patch: https://opendev.org/openstack/tripleo-ci/commit/5d612318c95b9b3ff78e66e91d5c225274cb8b09
~~~

One observation I want to add is that the unbound service is not installed/running on the RDO CentOS 9 node, while it is present on the upstream CentOS 9 node and the RDO CentOS 8 node.
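
A quick sanity check for this (a sketch; assumes ss and systemd are available on the node):

~~~
# 127.0.0.1 in resolv.conf only helps if a resolver actually listens on the
# local DNS port, which is what unbound provides.
ss -ulpn 'sport = :53'                 # should show unbound bound to 127.0.0.1:53
systemctl is-active unbound.service    # "active" on upstream nodes, not present on RDO C9
~~~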

Upstream Node:-

https://b31788cff7a995401de2-45518917cf8baf33fe991d0324b9a061.ssl.cf1.rackcdn.com/846947/2/gate/tripleo-ci-centos-9-scenario001-standalone/4cac8c0/logs/undercloud/var/log/extra/package-list-installed.txt

~~~
unbound.x86_64 1.16.0-3.el9 @appstream
unbound-libs.x86_64 1.16.0-3.el9 @appstream
~~~

https://b31788cff7a995401de2-45518917cf8baf33fe991d0324b9a061.ssl.cf1.rackcdn.com/846947/2/gate/tripleo-ci-centos-9-scenario001-standalone/4cac8c0/logs/undercloud/var/log/extra/services.txt
~~~
unbound.service loaded active running Unbound recursive Domain Name Server
~~~

RDO node:-

https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-scenario001-standalone-wallaby/1c1c2d4/logs/undercloud/var/log/extra/package-list-installed.txt.gz
~~~
unbound-libs.x86_64 1.16.0-3.el9 @appstream
~~~

On the RDO node, unbound.service is not present.

https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-scenario001-standalone-wallaby/1c1c2d4/logs/undercloud/var/log/extra/services.txt.gz

On CentOS 8 Wallaby we have unbound installed, and we did not see the same issue on the CentOS 8 node.
https://logserver.rdoproject.org/openstack-periodic-integration-stable1-cs8/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-standalone-wallaby/8cb7e4a/logs/undercloud/etc/unbound/

On the affected node, installing unbound and starting the unbound service solved the issue.

[root@standalone conf.d]# time host www.google.com
www.google.com has address 142.251.41.68
www.google.com has IPv6 address 2607:f8b0:400b:807::2004

real 0m3.046s
user 0m0.013s
sys 0m0.015s

[root@standalone conf.d]# dnf install unbound
.
.
Installed:
  unbound-1.16.2-2.el9.x86_64

[root@standalone conf.d]# systemctl start unbound.service
[root@standalone conf.d]# time host www.google.com
www.google.com has address 142.251.41.68
www.google.com has IPv6 address 2607:f8b0:400b:804::2004

real 0m0.133s
user 0m0.007s
sys 0m0.014s

Next step: investigate why the RDO CentOS 9 nodepool image doesn't have unbound installed, and add the unbound service to be consistent with upstream.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-ci (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-ci/+/854751

Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :

I spoke with the Infra team (Daniel) about why the unbound service is missing in the RDO C9 image; it turns out we were building the C9 image via virt-customize rather than dib (as upstream does), and that seems to explain the difference.

Thanks to Daniel, who proposed a patch for the C9 dib image[1]. We have proposed [2] as a workaround and will revert it later, once we get the C9 image built with dib, which will contain the unbound service by default.

[1] https://softwarefactory-project.io/r/c/config/+/25941
[2] https://review.opendev.org/c/openstack/tripleo-ci/+/854751

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-ci (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-ci/+/854751
Committed: https://opendev.org/openstack/tripleo-ci/commit/343f8689418a1fd312b3d85f086233fa8d94d9ac
Submitter: "Zuul (22348)"
Branch: master

commit 343f8689418a1fd312b3d85f086233fa8d94d9ac
Author: Sandeep Yadav <email address hidden>
Date: Fri Aug 26 13:19:40 2022 +0530

    Override Nameserver for vexxhost cloud

    We are hitting slow DNS resolution for the jobs running in the vexxhost
    cloud because 127.0.0.1 is the first entry in resolv.conf but the image
    we use doesn't have the unbound service installed. The unbound service is
    present in upstream images; details in the related bug.

    With this patch, we override the nameserver values used in the vexxhost cloud.

    Related-Bug: #1983718
    Change-Id: Ie95ed0e6b67afc28d124b2cf91fe67b57f18e654

Revision history for this message
Alan Pevec (apevec) wrote :

@dpawlik re: comment #12 - IIUC the upstream-centos-9-stream DIB base image was reverted; do we have an update to virt-customize to add local unbound to the CS9 cloud-qcow-based nodepool image?

Revision history for this message
daniel.pawlik (daniel-pawlik) wrote :

We removed the upstream-centos-9-stream image and went back to cloud-centos-9-stream because test results were much, much better on the cloud image.

Alan Pevec (apevec)
Changed in tripleo:
status: Triaged → Fix Released
tags: removed: promotion-blocker