race condition in quark-agent can lead to inconsistent updates
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
quark |
Fix Released
|
High
|
Kyle Haley |
Bug Description
I originally thought that the Redis store would return updates to security groups as separate "update" messages, but it appears that any updates to security groups is actually overriding the previous redis entry.
Each redis entry representing an instance has a key tagged as device_
I now think the problem could _potentially_ be a race condition. When quark-agent polls redis, it pulls the "security group ack" field from the key, and if it is False, it'll populate lists about whether a SG needs to be added, updated, or removed. It doesn't pass the security group rule information to the hypervisor, it only tells the scripts on the hypervisor that THEY need to poll redis and retrieve the new information. It functionally passes a list of VIF's that need to be updated, whether that's add/update/remove.
Now, this information is only ack'd by quark-agent AFTER the call to update the interface on the instance has been made. It's entirely possible that the field in Redis was updated between the time it was retrieved by the xen scripts and quark-agent ack's the field. In that way, the field is ack'd even though the information has changed.
I think two pieces of info would verify this:
# Reverse the test. Instead of doing block/allow, do the opposite, allow/block. If after the test we see instances are still ALLOWING traffic when they should be blocking, then that may highlight that this is a race condition.
# IF we find instances that our out of sync with the state they should be in, I should be able to locate their field info in redis. If the new SG rule is there, but it's ack'd as True and obviously hasn't been applied to the instance, then we know this is a race condition.
A few ways to solve this:
# quark-agent will pull in information from the "security group rules" field and then poll it again once the flows scripts are done. If the information doesn't match, it'll re-execute. Potential for looping if something goes wrong here.
# quark-agent will ack the "security group ack" field with a third option, like "update". It'll poll redis again to see if the field is False. If False, that means the information was updated and it can either reapply the scripts or just let the next polling cycle take care of it.
# quark-agent ack's the field as soon as the security group information is retrieved. This would allow the next polling cycle to pick up the latest change. This would only lessen the window for a race condition, however, and not remove it.
It was determined that solution #1 under "A few ways to solve this" was selected and it is working.