security group listing race

Bug #1262566 reported by Attila Fazekas on 2013-12-19
26
This bug affects 4 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Critical
Aaron Rosen
Icehouse
Critical
Sean Dague

Bug Description

In this grenade job
http://logs.openstack.org/63/62463/3/gate/gate-grenade-dsvm/4a94d81/console.html

One process tries to list ALL security groups :
"2013-12-19 06:42:11.519 15319 INFO tempest.common.rest_client [-] Request: GET http://127.0.0.1:8774/v2/1fb77a9f8ccc417498161dbad4eeabda/os-security-groups?all_tenants=true"
(pid=15319)

http://logs.openstack.org/63/62463/3/gate/gate-grenade-dsvm/4a94d81/logs/tempest.txt.gz#_2013-12-19_06_42_11_519

While another process deletes one:
http://logs.openstack.org/63/62463/3/gate/gate-grenade-dsvm/4a94d81/logs/tempest.txt.gz#_2013-12-19_06_42_11_510
"2013-12-19 06:42:11.510 15315 INFO tempest.common.rest_client [-] Request: DELETE http://127.0.0.1:8774/v2/c64e11f31d25473b91f7e1124f41f2a1/os-security-groups/27"
(pid=15315)

The test case failed on listing ALL security group request , which requested closely in time to a security group deletion.
'<message>Security group 27 not found.</message></itemNotFound>'

$ nova secgroup-list --all-tenants 1 # is the cli equivalent of the failing request, this call should not fail with 'Security group 27 not found.'

Matt Riedemann (mriedem) on 2014-01-27
tags: added: network
Gary Kotton (garyk) wrote :

Is this with Neutron or traditional nova networks?
Thanks
Gary

Attila Fazekas (afazekas) wrote :

n-net, so traditional nova network.

Sean Dague (sdague) on 2014-04-27
Changed in nova:
status: New → Confirmed
importance: Undecided → High
Sean Dague (sdague) wrote :

Confirmed this issue in the code. The crux of the problem is as follows:

After listing the security groups, we call a list comprehension to format it (https://github.com/openstack/nova/blob/master/nova/api/openstack/compute/contrib/security_groups.py#L292-L293)

        with translate_exceptions():
            project_id = context.project_id
            raw_groups = self.security_group_api.list(context,
                                                      project=project_id,
                                                      search_opts=search_opts)

        limited_list = common.limited(raw_groups, req)
        result = [self._format_security_group(context, group)
                    for group in limited_list]

The _format call goes through and indirection then gets to:

    def _format_security_group_rule(self, context, rule):
        sg_rule = {}
        sg_rule['id'] = rule['id']
        sg_rule['parent_group_id'] = rule['parent_group_id']
        sg_rule['ip_protocol'] = rule['protocol']
        sg_rule['from_port'] = rule['from_port']
        sg_rule['to_port'] = rule['to_port']
        sg_rule['group'] = {}
        sg_rule['ip_range'] = {}
        if rule['group_id']:
            with translate_exceptions():
                source_group = self.security_group_api.get(context,
                                                           id=rule['group_id'])
            sg_rule['group'] = {'name': source_group.get('name'),
                             'tenant_id': source_group.get('project_id')}
        else:
            sg_rule['ip_range'] = {'cidr': rule['cidr']}
        return sg_rule

The security_group_api.get means we've got another trip to the db (for every rule), and any one of those could fail with the sg having been deleted in the time allotted.

Aaron Rosen (arosen) on 2014-05-08
Changed in nova:
assignee: nobody → Aaron Rosen (arosen)

Fix proposed to branch: master
Review: https://review.openstack.org/92893

Changed in nova:
status: Confirmed → In Progress
Sean Dague (sdague) wrote :

This is a substantial cause of gate resets and random fails

Changed in nova:
importance: High → Critical

Reviewed: https://review.openstack.org/92893
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=eb84cfb4ef7376401f48488d4810714c2664e5fa
Submitter: Jenkins
Branch: master

commit eb84cfb4ef7376401f48488d4810714c2664e5fa
Author: Aaron Rosen <email address hidden>
Date: Thu May 8 11:25:02 2014 -0700

    Fix security group race condition while listing and deleting rules

    Previously, there was a race condition that could occur if one was listing
    a security group which contained a rule which referenced another security
    group as an additional api call is used to look up the security group's
    name which is returned (rather than id) via the API. The problem occurs if
    this security group is deleted in which case a 404 was raised.
    This patch fixes this issue by catching the 404 and ignoring the rule as
    its already deleted as there is a constraint enforced that you cannot
    delete a security group if another security group has a rule that references
    that security group.

    Change-Id: I3855b8d3a1bd2f7c88901da071f9bd5581bd56e4
    Closes-bug: 1262566

Changed in nova:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/97845
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=867341ff27cf91503e6919a2702298ce39f679ad
Submitter: Jenkins
Branch: stable/icehouse

commit 867341ff27cf91503e6919a2702298ce39f679ad
Author: Aaron Rosen <email address hidden>
Date: Thu May 8 11:25:02 2014 -0700

    Fix security group race condition while listing and deleting rules

    Previously, there was a race condition that could occur if one was listing
    a security group which contained a rule which referenced another security
    group as an additional api call is used to look up the security group's
    name which is returned (rather than id) via the API. The problem occurs if
    this security group is deleted in which case a 404 was raised.
    This patch fixes this issue by catching the 404 and ignoring the rule as
    its already deleted as there is a constraint enforced that you cannot
    delete a security group if another security group has a rule that references
    that security group.

    Change-Id: I3855b8d3a1bd2f7c88901da071f9bd5581bd56e4
    Closes-bug: 1262566
    (cherry picked from commit eb84cfb4ef7376401f48488d4810714c2664e5fa)

tags: added: in-stable-icehouse
Alan Pevec (apevec) on 2014-06-05
tags: removed: in-stable-icehouse
Thierry Carrez (ttx) on 2014-06-11
Changed in nova:
milestone: none → juno-1
status: Fix Committed → Fix Released

Reviewed: https://review.openstack.org/92898
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2620a7c74e2d7e9c904af364a59efcd69cb5947f
Submitter: Jenkins
Branch: master

commit 2620a7c74e2d7e9c904af364a59efcd69cb5947f
Author: Aaron Rosen <email address hidden>
Date: Thu May 8 12:07:27 2014 -0700

    Fix security group race condition while creating rule

    Previously, it was possible for someone to create a security group rule
    where the rule references another security group. To do this nova would
    first create the security group rule and then look up the referenced group
    (group_id's) name in order to return that via the api. During this time
    it's possible for someone to delete this security group rule and the
    referenced group before the call returned resulting in a 404 error being
    raised. This patch addresses this issue by looking up the group name
    first and then creating the security group rule in order to avoid this
    from occuring.

    Note: this patch also adds a test to cover the case where an invalid group
    id was passed though this does not really add any real additional coverage
    to the change largely because the code is currently using global stubs. That
    said, the previous coverage should hopefully be sufficient

    Change-Id: If58ffa5629ba5166f260379ac47922974de31be0
    Related-bug: 1262566

Thierry Carrez (ttx) on 2014-10-16
Changed in nova:
milestone: juno-1 → 2014.2
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers