Periodic rocky fs020 job fails tempest tests tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_cross_tenant_traffic and tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_multiple_security_groups

Bug #1843259 reported by Gabriele Cerami on 2019-09-09
Affects: tripleo
Importance: Critical
Assigned to: Gabriele Cerami

Bug Description

logs at https://logs.rdoproject.org/openstack-periodic-24hr/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-1ctlr_2comp-featureset020-rocky/e9d92b4/logs/undercloud/home/zuul/tempest.log.txt.gz#_2019-09-09_09_12_54


2019-09-09 09:12:54 | tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_cross_tenant_traffic[compute,id-e79f879e-debb-440c-a7e4-efeda05b6848,network]
2019-09-09 09:12:54 | -------------------------------------------------------------------------------------------------------------------------------------------------------------
2019-09-09 09:12:54 |
2019-09-09 09:12:54 | Captured traceback:
2019-09-09 09:12:54 | ~~~~~~~~~~~~~~~~~~~
2019-09-09 09:12:54 | Traceback (most recent call last):
2019-09-09 09:12:54 | File "/usr/lib/python2.7/site-packages/tempest/common/utils/__init__.py", line 89, in wrapper
2019-09-09 09:12:54 | return f(*func_args, **func_kwargs)
2019-09-09 09:12:54 | File "/usr/lib/python2.7/site-packages/tempest/scenario/test_security_groups_basic_ops.py", line 488, in test_cross_tenant_traffic
2019-09-09 09:12:54 | self._test_cross_tenant_block(source_tenant, dest_tenant)
2019-09-09 09:12:54 | File "/usr/lib/python2.7/site-packages/tempest/scenario/test_security_groups_basic_ops.py", line 406, in _test_cross_tenant_block
2019-09-09 09:12:54 | should_succeed=False)
2019-09-09 09:12:54 | File "/usr/lib/python2.7/site-packages/tempest/scenario/manager.py", line 960, in check_remote_connectivity
2019-09-09 09:12:54 | self.fail(msg)
2019-09-09 09:12:54 | File "/usr/lib/python2.7/site-packages/unittest2/case.py", line 690, in fail
2019-09-09 09:12:54 | raise self.failureException(msg)
2019-09-09 09:12:54 | AssertionError: 10.0.0.105 is reachable from 10.0.0.106

2019-09-09 09:12:54 | tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_multiple_security_groups[compute,id-d2f77418-fcc4-439d-b935-72eca704e293,network,slow]
2019-09-09 09:12:54 | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
2019-09-09 09:12:54 |
2019-09-09 09:12:54 | Captured traceback:
2019-09-09 09:12:54 | ~~~~~~~~~~~~~~~~~~~
2019-09-09 09:12:54 | Traceback (most recent call last):
2019-09-09 09:12:54 | File "/usr/lib/python2.7/site-packages/tempest/common/utils/__init__.py", line 89, in wrapper
2019-09-09 09:12:54 | return f(*func_args, **func_kwargs)
2019-09-09 09:12:54 | File "/usr/lib/python2.7/site-packages/tempest/scenario/test_security_groups_basic_ops.py", line 575, in test_multiple_security_groups
2019-09-09 09:12:54 | should_connect=False)
2019-09-09 09:12:54 | File "/usr/lib/python2.7/site-packages/tempest/scenario/manager.py", line 622, in check_vm_connectivity
2019-09-09 09:12:54 | msg=msg)
2019-09-09 09:12:54 | File "/usr/lib/python2.7/site-packages/unittest2/case.py", line 702, in assertTrue
2019-09-09 09:12:54 | raise self.failureException(msg)
2019-09-09 09:12:54 | AssertionError: False is not true : ip address 10.0.0.105 is reachable

These failures indicate errors in the security-group setup.
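Both tracebacks are negative connectivity checks: tempest expects the ping to be blocked by security groups, but it succeeds. A minimal sketch of that semantics (`expect_unreachable` is a hypothetical helper, not tempest code; `PING_CMD` is overridable so the check can be exercised without a live network):

```shell
# Assert that an address is NOT reachable, mirroring tempest's
# check_remote_connectivity(..., should_succeed=False) semantics.
# PING_CMD can be overridden for dry runs; defaults to a real ping.
expect_unreachable() {
  local ping_cmd="${PING_CMD:-ping -c 1 -W 2}"
  if $ping_cmd "$1" >/dev/null 2>&1; then
    echo "FAIL: $1 is reachable"   # this is what the failing job hits
    return 1
  fi
  echo "OK: $1 is not reachable"
}
```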

Changed in tripleo:
importance: Undecided → Critical
tags: added: tempest
Nate Johnston (nate-johnston) wrote :

Can we get access to the CI nodes where those tests failed? It would make debugging much easier, if that is possible. If so, please let me know, plus <email address hidden> - thanks!

Just some info to review the logs:
Port:
- id: 427e785f-...
- ip: 10.0.0.105
- mac: fa:16:3e:d2:16:58
- subnet: 0a891adb-...
- net: 90e0670a-...
SG:
- id: 929f0211-...
- rule(ssh): 524aa39b-...

The port (in the compute-1 OVS agent logs) is:
- bond: 08:53:16.262
- processed by the OVS agent: 08:53:18.384
- preparing filters for port: 08:53:19.657
- iptables finishes applying 83 rules: 08:53:19.742

The main problem here is that, unlike with the OVS firewall driver, the iptables rules are not logged (even at DEBUG level). I'm going to propose a patch to add this output to the logs.
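Until such logging exists, the applied rules can be pulled by hand from an `iptables-save` dump. A sketch of that filtering (the `neutron-openvswi-` chain prefix is what the OVS hybrid firewall driver uses for per-port chains; treat the exact naming as an assumption):

```shell
# Extract the per-port security-group chains for one port from an
# iptables-save dump fed on stdin. $1 is the port-id prefix, e.g. 427e785f.
port_chains() {
  grep -E "neutron-openvswi-[io]$1"
}

# Typical use on a compute node (needs root):
#   iptables-save -t filter | port_chains 427e785f
```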

Slawek Kaplonski (slaweq) wrote :

I was checking the logs from the failed jobs and I don't see anything that could cause this issue.
I also checked patches merged to stable/rocky in those days, as suggested by Sagi, and even with a couple more days: https://review.opendev.org/#/q/status:merged+AND+branch:stable/rocky+before:2019-09-09+after:2019-09-05 - there is nothing really suspicious there.

So for now I think it may be some change in the CentOS 7 image used for the tests, e.g. a different docker, iptables, or kernel version.
Can we somehow compare which versions of this software were used in those different jobs?

Slawek Kaplonski (slaweq) wrote :

Ok, I found the list of packages in https://logs.rdoproject.org/openstack-periodic-24hr/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-1ctlr_2comp-featureset020-rocky/e9d92b4/logs/overcloud-novacompute-1/var/log/extra/ and it looks like the packages installed in the passing job (Sept 6) are exactly the same as those installed on e.g. Sept 9, when the job failed.
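Assuming the per-host logs contain a plain `rpm -qa`-style dump per run, this comparison can be scripted rather than eyeballed (the file names below are placeholders):

```shell
# Print packages that differ between two captured "rpm -qa" dumps.
# comm -3 suppresses the lines common to both files, leaving only
# packages unique to one side.
compare_pkgs() {
  comm -3 <(sort "$1") <(sort "$2")
}

# e.g.: compare_pkgs passing-job-rpms.txt failing-job-rpms.txt
```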

Slawek Kaplonski (slaweq) wrote :

I think I might have found the reason. In the job where the tests are failing, I see:

net.bridge.bridge-nf-call-arptables = 0
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0

While in the "passing" job those values are set to 1.
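These sysctls decide whether traffic crossing a Linux bridge is passed through iptables at all; with the hybrid firewall driver the security-group rules live in iptables, so a value of 0 means they are simply skipped. A quick check, sketched as a filter over `sysctl` output:

```shell
# Flag any bridge-nf-call sysctl that is 0 (iptables won't see bridged
# traffic). Reads "key = value" lines on stdin.
check_bridge_nf() {
  awk -F' = ' '$2 == 0 { print $1 " is 0: iptables is bypassed for bridged traffic" }'
}

# On a node:
#   sysctl net.bridge.bridge-nf-call-arptables \
#          net.bridge.bridge-nf-call-iptables \
#          net.bridge.bridge-nf-call-ip6tables | check_bridge_nf
```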

Slawek Kaplonski (slaweq) wrote :

So I have no idea why those settings are switched to 0, but I'm pretty sure that this is the reason why this job is failing.
If we could get access to the CI nodes that run this job, we could then switch those settings to 1 and run the tests again to see if this really helps.
I also think someone much more familiar with those CI jobs and TripleO should take a look to check where those values are changed.

Emilien Macchi (emilienm) wrote :

sysctl settings managed by Puppet are visible in this hieradata:

https://logs.rdoproject.org/openstack-periodic-24hr/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-1ctlr_2comp-featureset020-rocky/e9d92b4/logs/overcloud-novacompute-1/etc/puppet/hieradata/service_configs.json.txt.gz

Pasting here:

    "sysctl_settings": {
        "fs.inotify.max_user_instances": {
            "value": 1024
        },
        "fs.suid_dumpable": {
            "value": 0
        },
        "kernel.dmesg_restrict": {
            "value": 1
        },
        "kernel.pid_max": {
            "value": 1048576
        },
        "net.core.netdev_max_backlog": {
            "value": 10000
        },
        "net.ipv4.conf.all.arp_accept": {
            "value": 1
        },
        "net.ipv4.conf.all.arp_notify": {
            "value": 1
        },
        "net.ipv4.conf.all.log_martians": {
            "value": 1
        },
        "net.ipv4.conf.all.secure_redirects": {
            "value": 0
        },
        "net.ipv4.conf.all.send_redirects": {
            "value": 0
        },
        "net.ipv4.conf.default.accept_redirects": {
            "value": 0
        },
        "net.ipv4.conf.default.log_martians": {
            "value": 1
        },
        "net.ipv4.conf.default.secure_redirects": {
            "value": 0
        },
        "net.ipv4.conf.default.send_redirects": {
            "value": 0
        },
        "net.ipv4.ip_forward": {
            "value": 1
        },
        "net.ipv4.ip_nonlocal_bind": {
            "value": 0
        },
        "net.ipv4.neigh.default.gc_thresh1": {
            "value": 1024
        },
        "net.ipv4.neigh.default.gc_thresh2": {
            "value": 2048
        },
        "net.ipv4.neigh.default.gc_thresh3": {
            "value": 4096
        },
        "net.ipv4.tcp_keepalive_intvl": {
            "value": 1
        },
        "net.ipv4.tcp_keepalive_probes": {
            "value": 5
        },
        "net.ipv4.tcp_keepalive_time": {
            "value": 5
        },
        "net.ipv6.conf.all.accept_ra": {
            "value": 0
        },
        "net.ipv6.conf.all.accept_redirects": {
            "value": 0
        },
        "net.ipv6.conf.all.autoconf": {
            "value": 0
        },
        "net.ipv6.conf.all.disable_ipv6": {
            "value": 0
        },
        "net.ipv6.conf.all.ndisc_notify": {
            "value": 1
        },
        "net.ipv6.conf.default.accept_ra": {
            "value": 0
        },
        "net.ipv6.conf.default.accept_redirects": {
            "value": 0
        },
        "net.ipv6.conf.default.autoconf": {
            "value": 0
        },
        "net.ipv6.conf.default.disable_ipv6": {
            "value": 0
        },
        "net.ipv6.conf.lo.disable_ipv6": {
            "value": 0
        },
        "net.ipv6.ip_nonlocal_bind": {
            "value": 0
        },
        "net.netfilter.nf_conntrack_max": {
            "value": 500000
        },
        "net.nf_conntrack_max": {
            "value": 500000
        }
    },

As you can see, there is nothing about net.bridge.*, so I suspect this is done outside of TripleO.
Maybe in the RDO node...
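One way to narrow down where the values come from is to grep the standard sysctl configuration locations on the node (directory layout per sysctl.d(5); the helper takes its search root as a parameter so it can be pointed anywhere):

```shell
# Search sysctl config files under a root path for a given key.
find_sysctl_setting() {
  grep -rs "$1" "$2" || echo "$1 not set under $2"
}

# On a node, the usual suspects:
#   for d in /etc/sysctl.conf /etc/sysctl.d /usr/lib/sysctl.d /run/sysctl.d; do
#     find_sysctl_setting bridge-nf-call "$d"
#   done
```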

