race condition in quark-agent can lead to inconsistent updates

Bug #1713860 reported by Kyle Haley
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
quark
Fix Released
High
Kyle Haley

Bug Description

I originally thought that the Redis store would return updates to security groups as separate "update" messages, but it appears that any updates to security groups is actually overriding the previous redis entry.

Each redis entry representing an instance has a key tagged as device_id:mac_address. This key has two fields attached, "security group ack" and "security group rules". Whenever an update occurs to the "security group rules" field, the information within that field is overwritten and the "security group ack" field is set to False.

I now think the problem could _potentially_ be a race condition. When quark-agent polls redis, it pulls the "security group ack" field from the key, and if it is False, it'll populate lists about whether a SG needs to be added, updated, or removed. It doesn't pass the security group rule information to the hypervisor, it only tells the scripts on the hypervisor that THEY need to poll redis and retrieve the new information. It functionally passes a list of VIF's that need to be updated, whether that's add/update/remove.

Now, this information is only ack'd by quark-agent AFTER the call to update the interface on the instance has been made. It's entirely possible that the field in Redis was updated between the time it was retrieved by the xen scripts and quark-agent ack's the field. In that way, the field is ack'd even though the information has changed.

I think two pieces of info would verify this:
# Reverse the test. Instead of doing block/allow, do the opposite, allow/block. If after the test we see instances are still ALLOWING traffic when they should be blocking, then that may highlight that this is a race condition.
# IF we find instances that our out of sync with the state they should be in, I should be able to locate their field info in redis. If the new SG rule is there, but it's ack'd as True and obviously hasn't been applied to the instance, then we know this is a race condition.

A few ways to solve this:

# quark-agent will pull in information from the "security group rules" field and then poll it again once the flows scripts are done. If the information doesn't match, it'll re-execute. Potential for looping if something goes wrong here.
# quark-agent will ack the "security group ack" field with a third option, like "update". It'll poll redis again to see if the field is False. If False, that means the information was updated and it can either reapply the scripts or just let the next polling cycle take care of it.
# quark-agent ack's the field as soon as the security group information is retrieved. This would allow the next polling cycle to pick up the latest change. This would only lessen the window for a race condition, however, and not remove it.

Revision history for this message
Kyle Haley (quadewarren) wrote :

It was determined that solution #1 under "A few ways to solve this" was selected and it is working.

Changed in neutron-quark:
importance: Undecided → High
status: New → Confirmed
status: Confirmed → In Progress
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to quark (master)

Reviewed: https://review.openstack.org/498973
Committed: https://git.openstack.org/cgit/openstack/quark/commit/?id=500817ab95875b08360f6a4facbd452271797078
Submitter: Jenkins
Branch: master

commit 500817ab95875b08360f6a4facbd452271797078
Author: Kyle Haley <email address hidden>
Date: Thu Aug 3 11:30:32 2017 -0700

    Adding updates to quark-agent to fix race condition

    These updates to redis_base allow quark-agent to check for a potential
    race condition that can occur from the time quark-agent acknowledges a
    security group rule set needs to be updated and when it acknowledges
    that rule set has actually been applied. quark-agent will now store
    security group rules before and after it has executed scripts on the
    hypervisor. If the rule sets do not match, that means an update has
    occurred while the hypervisor scripts were executing. quark-agent will
    not ack those changes and allows the next cycle to pick them up.

    Change-Id: Ieae13f7b22b8e463cba2ccce82fb94699838926a
    Closes-Bug: 1713860

Changed in neutron-quark:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.