Fullstack tests neutron.tests.fullstack.test_securitygroup.TestSecurityGroupsSameNetwork fails often

Bug #1742401 reported by Slawek Kaplonski
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
High
Slawek Kaplonski

Bug Description

Fullstack tests from group neutron.tests.fullstack.test_securitygroup.TestSecurityGroupsSameNetwork are often failing in gate with error like:

ft1.1: neutron.tests.fullstack.test_securitygroup.TestSecurityGroupsSameNetwork.test_securitygroup(ovs-hybrid)_StringException: Traceback (most recent call last):
  File "neutron/tests/base.py", line 132, in func
    return f(self, *args, **kwargs)
  File "neutron/tests/fullstack/test_securitygroup.py", line 193, in test_securitygroup
    net_helpers.assert_no_ping(vms[0].namespace, vms[1].ip)
  File "neutron/tests/common/net_helpers.py", line 155, in assert_no_ping
    {'ns': src_namespace, 'destination': dst_ip})
  File "neutron/tests/tools.py", line 144, in fail
    raise unittest2.TestCase.failureException(msg)
AssertionError: destination ip 20.0.0.9 is replying to ping from namespace test-dbbb4045-363f-44cb-825b-17090f28df11, but it shouldn't

Example gate logs: http://logs.openstack.org/43/529143/3/check/neutron-fullstack/d031a6b/logs/testr_results.html.gz

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I spot this issue once locally for now. From analyze of logs I can suspect that there can be some issue with neutron.agent.resource_cache:

- in test logs I can see call made to neutron-server to apply SG on port:
2018-01-17 22:08:11.956 3981 DEBUG neutronclient.client [-] REQ: curl -i http://127.0.0.1:15603/v2.0/ports/70cd10e8-c511-4bfa-a11f-f287f1aeafd8 -X PUT -H "X-Auth-Token: {SHA1}da39a3ee5e6b4b0d3255bfef95601890afd80709" -H "User-Agent: python-neutronclient" -d '{"port": {"security_groups": ["d7c4f90c-8c73-4dd9-8472-297ee226e7ae"], "port_security_enabled": true}}' http_log_req /opt/stack/neutron/.tox/dsvm-fullstack/local/lib/python2.7/site-packages/neutronclient/common/utils.py:195

- in openvswitch agent I can see, just after this call something like:
2018-01-17 22:08:12.707 6083 DEBUG neutron.agent.resource_cache [req-5086edc1-0bd2-427a-a352-409130a50b67 - - - - -] Received resource Port update without any changes: 70cd10e8-c511-4bfa-a11f-f287f1aeafd8 record_resource_update /opt/stack/neutron/neutron/agent/resource_cache.py:177

And after that port still has in iptables rules to accept all traffic as it is for port with disabled port security:

   21 3672 ACCEPT all -- * * 0.0.0.0/0 0.0.0.0/0 PHYSDEV match --physdev-out tap70cd10e8-c5 --physdev-is-bridged /* Accept all packets when port security is disabled. */
   20 1949 ACCEPT all -- * * 0.0.0.0/0 0.0.0.0/0 PHYSDEV match --physdev-in tap70cd10e8-c5 --physdev-is-bridged /* Accept all packets when port security is disabled. */

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/535368

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

There is also another problem quite often in this test when "ncat" process don't starts properly:

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "neutron/tests/base.py", line 132, in func
        return f(self, *args, **kwargs)
      File "neutron/tests/fullstack/test_securitygroup.py", line 163, in test_securitygroup
        net_helpers.NetcatTester.TCP)
      File "neutron/tests/fullstack/test_securitygroup.py", line 68, in assert_connection
        self.assertTrue(netcat.test_connectivity())
      File "neutron/tests/common/net_helpers.py", line 509, in test_connectivity
        self.client_process.writeline(testing_string)
      File "neutron/tests/common/net_helpers.py", line 459, in client_process
        self.establish_connection()
      File "neutron/tests/common/net_helpers.py", line 489, in establish_connection
        address=self.address)
      File "neutron/tests/common/net_helpers.py", line 537, in _spawn_nc_in_namespace
        proc = RootHelperProcess(cmd, namespace=namespace)
      File "neutron/tests/common/net_helpers.py", line 288, in __init__
        self._wait_for_child_process()
      File "neutron/tests/common/net_helpers.py", line 321, in _wait_for_child_process
        "in %d seconds" % (self.cmd, timeout)))
      File "neutron/common/utils.py", line 649, in wait_until_true
        raise exception
    RuntimeError: Process ['ncat', u'20.0.0.5', '3333', '-w', '20'] hasn't been spawned in 20 seconds

But I will try to debug it later...

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

So after more debugging I think that issue is due to some race condition in updating local resource_cache on agent.
What I think that happens:
1. Ports are updated one by one to set SG on it and turn on port_security,
2. Agent receives "port update" push notification for one of ports and stores it's object in resource cache,
2. Then (I don't know why exactly) resource_cache.RemoteResourceCache.get_resources("Ports", filters={'security_group_ids':(SG_ID,)}); This method receives info about our "broken" port and updates it's values for such with configured SG - see method _flood_cache_for_query in RemoteResourceCache class: https://github.com/openstack/neutron/blob/master/neutron/agent/resource_cache.py#L83
3. Then (few milliseconds later) neutron server sends push notification with "new" data about our broken port. But same date is already stored in cache so it don't triggers local notification and SG are not applied for this port (it's in https://github.com/openstack/neutron/blob/master/neutron/agent/resource_cache.py#L176)

Revision history for this message
Slawek Kaplonski (slaweq) wrote :
Download full text (4.7 KiB)

It looks that OVO Port.get_objects() don't filter properly on security_group_ids. I checked it in /opt/stack/neutron/neutron/api/rpc/handlers/resources_rpc.py with local pdb and I got results like:

 140 @oslo_messaging.expected_exceptions(rpc_exc.CallbackNotFound)
 141 def bulk_pull(self, context, resource_type, version, filter_kwargs=None):
 142 if resource_type == "Port":
 143 import remote_pdb; remote_pdb.set_trace()
 144 filter_kwargs = filter_kwargs or {}
 145 resource_type_cls = _resource_to_class(resource_type)
 146 # TODO(kevinbenton): add in producer registry so producers can add
 147 # hooks to mangle these things like they can with 'pull'.
 148 -> return [obj.obj_to_primitive(target_version=version)
 149 for obj in resource_type_cls.get_objects(context, _pager=None,
 150 **filter_kwargs)]
(Pdb++) pp filter_kwargs
{u'security_group_ids': [u'b7a14ec4-36be-4fcc-ba9e-d58287a3ef91']}
(Pdb++) pp resource_type_cls
<class 'neutron.objects.ports.Port'>
(Pdb++) pp resource_type_cls.get_objects(context, _pager=None, **filter_kwargs)
[Port(admin_state_up=True,allowed_address_pairs=[],binding=PortBinding,binding_levels=[PortBindingLevel],created_at=2018-01-19T22:17:19Z,data_plane_status=<?>,description='',device_id='',device_owner='',dhcp_options=[],distributed_binding=None,dns=None,fixed_ips=[IPAllocation],id=1e624c28-343d-4bd7-994d-8d1cec1944ba,mac_address=fa:16:3e:a1:05:15,name='',network_id=1cf64024-3a22-4c6b-b914-4ee4cd4bee52,project_id='c0749e3c-9724-499c-9e6a-ff2f59449909',qos_policy_id=None,revision_number=7,security=PortSecurity(1e624c28-343d-4bd7-994d-8d1cec1944ba),security_group_ids=set([b7a14ec4-36be-4fcc-ba9e-d58287a3ef91]),status='ACTIVE',updated_at=2018-01-19T22:17:43Z),
 Port(admin_state_up=True,allowed_address_pairs=[],binding=PortBinding,binding_levels=[PortBindingLevel],created_at=2018-01-19T22:17:23Z,data_plane_status=<?>,description='',device_id='',device_owner='',dhcp_options=[],distributed_binding=None,dns=None,fixed_ips=[IPAllocation],id=8b00f7d9-f66b-4a1a-a47c-620cc6b8caed,mac_address=fa:16:3e:bc:c2:32,name='',network_id=1cf64024-3a22-4c6b-b914-4ee4cd4bee52,project_id='c0749e3c-9724-499c-9e6a-ff2f59449909',qos_policy_id=None,revision_number=7,security=PortSecurity(8b00f7d9-f66b-4a1a-a47c-620cc6b8caed),security_group_ids=set([7fcda581-965f-4802-b9af-09091f929531]),status='ACTIVE',updated_at=2018-01-19T22:17:45Z),
 Port(admin_state_up=True,allowed_address_pairs=[],binding=PortBinding,binding_levels=[PortBindingLevel],created_at=2018-01-19T22:17:18Z,data_plane_status=<?>,description='',device_id='dhcp4d86b788-a506-5af1-b7cc-b9464f818a8f-1cf64024-3a22-4c6b-b914-4ee4cd4bee52',device_owner='network:dhcp',dhcp_options=[],distributed_binding=None,dns=None,fixed_ips=[IPAllocation],id=9e6978e5-14b7-41f1-86d4-8c3d6461d6b5,mac_address=fa:16:3e:df:c2:b0,name='',network_id=1cf64024-3a22-4c6b-b914-4ee4cd4bee52,project_id='c0749e3c-9724-499c-9e6a-ff2f59449909',qos_policy_id=None,revision_number=5,security=PortSecurity(9e6978e5-14b7-41f1-86d4-8c3d6461d6b5),...

Read more...

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

For issue with "ncat" described in comment #3 I opened separate bug: https://bugs.launchpad.net/neutron/+bug/1744402

This bug is only related to issue with pinging VM when it shoudn't

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.openstack.org/535368

Changed in neutron:
status: Confirmed → In Progress
Miguel Lavalle (minsel)
Changed in neutron:
milestone: none → queens-rc1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/536367
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=725df3e0382e048391fac109ea57920683eaf4d0
Submitter: Zuul
Branch: master

commit 725df3e0382e048391fac109ea57920683eaf4d0
Author: Sławek Kapłoński <email address hidden>
Date: Mon Jan 22 14:01:30 2018 +0100

    Fix race condition with enabling SG on many ports at once

    When there are many calls to enable security groups on ports there
    can be sometimes race condition between refresh recource_cache
    with data get by "pull" call to neutron server and data received
    with "push" rpc message from neutron server.
    In such case when "push" message comes with information about
    updated port (with enabled port_security), in local cache this port
    is already updated so local AFTER_UPDATE call is not called for
    such port and its rules in firewall are not updated.

    It happend quite often in fullstack security groups test because
    there are 4 ports created in this test and all 4 are updated to
    apply SG to it one by one.
    And here's what happen then in details:
    1. port 1 was updated in neutron-server so it sends push notification
       to L2 agent to update security groups,
    2. port 1 info was saved in resource cache on L2 agent's side and agent
       started to configure security groups for this port,
    3. as one of steps L2 agent called
       SecurityGroupServerAPIShim._select_ips_for_remote_group() method;
       In that method RemoteResourceCache.get_resources() is called and this
       method asks neutron-server for details about ports from given
       security_group,
    4. in the meantime neutron-server got port update call for second port
       (with same security group) so it sends to L2 agent informations about 2
       ports (as a reply to request sent from L2 agent in step 3),
    5. resource cache updates informations about two ports in local cache,
       returns its data to
       SecurityGroupServerAPIShim._select_ips_for_remote_group() and all
       looks fine,
    6. but now L2 agent receives push notification with info that port 2 is
       updated (changed security groups), so it checks info about this port
       in local cache,
    7. in local cache info about port 2 is already WITH updated security
       group so RemoteResourceCache doesn't trigger local notification about
       port AFTER UPDATE and L2 agent doesn't know that security groups for this
       port should be changed

    This patch fixes it by changing way how items are updated in
    the resource_cache.
    For now it is done with record_resource_update() method instead of
    writing new values directly to resource_cache._type_cache dict.
    Due to that if resource will be updated during "pull" call to neutron
    server, local AFTER_UPDATE will still be triggered for such resource.

    Change-Id: I5a62cc5731c5ba571506a3aa26303a1b0290d37b
    Closes-Bug: #1742401

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.0.0.0rc1

This issue was fixed in the openstack/neutron 12.0.0.0rc1 release candidate.

tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/573638

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/573638
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b22a9d5ad502055529e9c62607c88e0fa1ec28ba
Submitter: Zuul
Branch: stable/pike

commit b22a9d5ad502055529e9c62607c88e0fa1ec28ba
Author: Sławek Kapłoński <email address hidden>
Date: Mon Jan 22 14:01:30 2018 +0100

    Fix race condition with enabling SG on many ports at once

    When there are many calls to enable security groups on ports there
    can be sometimes race condition between refresh recource_cache
    with data get by "pull" call to neutron server and data received
    with "push" rpc message from neutron server.
    In such case when "push" message comes with information about
    updated port (with enabled port_security), in local cache this port
    is already updated so local AFTER_UPDATE call is not called for
    such port and its rules in firewall are not updated.

    It happend quite often in fullstack security groups test because
    there are 4 ports created in this test and all 4 are updated to
    apply SG to it one by one.
    And here's what happen then in details:
    1. port 1 was updated in neutron-server so it sends push notification
       to L2 agent to update security groups,
    2. port 1 info was saved in resource cache on L2 agent's side and agent
       started to configure security groups for this port,
    3. as one of steps L2 agent called
       SecurityGroupServerAPIShim._select_ips_for_remote_group() method;
       In that method RemoteResourceCache.get_resources() is called and this
       method asks neutron-server for details about ports from given
       security_group,
    4. in the meantime neutron-server got port update call for second port
       (with same security group) so it sends to L2 agent informations about 2
       ports (as a reply to request sent from L2 agent in step 3),
    5. resource cache updates informations about two ports in local cache,
       returns its data to
       SecurityGroupServerAPIShim._select_ips_for_remote_group() and all
       looks fine,
    6. but now L2 agent receives push notification with info that port 2 is
       updated (changed security groups), so it checks info about this port
       in local cache,
    7. in local cache info about port 2 is already WITH updated security
       group so RemoteResourceCache doesn't trigger local notification about
       port AFTER UPDATE and L2 agent doesn't know that security groups for this
       port should be changed

    This patch fixes it by changing way how items are updated in
    the resource_cache.
    For now it is done with record_resource_update() method instead of
    writing new values directly to resource_cache._type_cache dict.
    Due to that if resource will be updated during "pull" call to neutron
    server, local AFTER_UPDATE will still be triggered for such resource.

    Change-Id: I5a62cc5731c5ba571506a3aa26303a1b0290d37b
    Closes-Bug: #1742401
    (cherry picked from commit 725df3e0382e048391fac109ea57920683eaf4d0)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.6

This issue was fixed in the openstack/neutron 11.0.6 release.

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I will reopen this bug as this is still failing in u/s master branch.

Changed in neutron:
status: Fix Released → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/730580

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/730580
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e748e01656e6d2db1510d85e4c25e21ff394c971
Submitter: Zuul
Branch: master

commit e748e01656e6d2db1510d85e4c25e21ff394c971
Author: Slawek Kaplonski <email address hidden>
Date: Mon May 25 10:32:07 2020 +0200

    [Fullstack] Mark security groups test as unstable

    It's still failing from time to time and we don't know the root
    cause of that failure. So lets mark it as unstable for now until we
    will finally find out what's going on in this test and until we will
    finally fix it.

    Change-Id: I2b6b50fb14856997d886daed5cf7bb4898d57130
    Related-bug: #1742401

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/737741

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/737741
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1e9a3bffd2cf565568ac479710104cd3a4cdae53
Submitter: Zuul
Branch: master

commit 1e9a3bffd2cf565568ac479710104cd3a4cdae53
Author: Slawek Kaplonski <email address hidden>
Date: Wed Jun 24 12:32:14 2020 +0200

    [Fullstack] SG test - first wait until all vms can ping each other

    In fullstack security group test, after fake VMs are created there
    check if connectivity to some custom port is working fine with port
    security disabled.
    After that there is called "block_until_ping" method for each vms.

    This patch changes that to first wait if we can ping vms and later do
    netcat tests.
    Even if that will not solve problems with failures of this test, we
    may know more if the issue is caused by netcat or it's just no
    ICMP connectivity between VMs at all.

    Change-Id: Ie9e2170c761c9a10f3daa991c3fb77f304bb07e2
    Related-Bug: #1742401

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/736034
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers