Race condition while processing security_groups_member_updated events (ipset)
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ubuntu Cloud Archive | Fix Released | Undecided | Unassigned |
Ussuri | Fix Released | Undecided | Unassigned |
Victoria | Fix Released | Undecided | Unassigned |
neutron | Fix Released | Medium | Charles Farquhar |
neutron (Ubuntu) | Fix Released | Undecided | Unassigned |
Focal | Fix Released | Undecided | Unassigned |
Groovy | Fix Released | Undecided | Unassigned |
Hirsute | Fix Released | Undecided | Unassigned |
Bug Description
# Summary
Race condition while processing security_
# Overview
We have a customer who uses Heat templates to deploy large environments (e.g. 21 instances) with a significant number of security groups (e.g. 60) that use bi-directional remote group references for both ingress and egress filtering. These Heat stacks are deployed by a CI pipeline and intermittently suffer application-layer failures due to broken network connectivity. We found that the cause was the ipsets used to implement remote_group memberships missing IPs from their member lists. Troubleshooting suggests a race condition, which I've attempted to describe in detail below.
Version: `54e1a6b1bc378c
I'm working on getting some multi-node environments deployed (I don't think it's possible to reproduce this with a single hypervisor) and hope to provide reproduction steps on Rocky and master soon. I wanted to get this report submitted as-is in the hope that an experienced Neutron dev might spot possible solutions or provide diagnostic insight that I am not yet able to produce.
I suspect this report may be easier to read with some markdown, so please feel free to read it in a gist: https:/
Also, this diagram is probably critical to following along: https:/
# Race condition symptoms
Given the following security groups/rules:
```
| secgroup name | secgroup id | direction | remote group | dest port |
|------
| server | fcd6cf12-
| client | b52c8c54-
```
And the following instances:
```
| instance name | hypervisor | ip          | secgroup assignment |
|---------------|------------|-------------|---------------------|
| server01      | compute01  | 192.168.0.1 | server              |
| server02      | compute02  | 192.168.0.2 | server              |
| server03      | compute03  | 192.168.0.3 | server              |
| client01      | compute04  | 192.168.0.4 | client              |
```
We would expect to find the following ipset representing the `server` security group members on `compute04`:
```
# ipset list NIPv4fcd6cf12-
Name: NIPv4fcd6cf12-
Type: hash:net
Revision: 6
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 536
References: 4
Number of entries: 3
Members:
192.168.0.1
192.168.0.2
192.168.0.3
```
What we actually get when the race condition is triggered is an incomplete list of members in the ipset. The member list could contain anywhere between zero and two of the expected IPs.
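To check a hypervisor for this symptom, something like the following could compare the `Members:` section of `ipset list` output against the IPs we expect. This is only a rough sketch: it assumes the text comes from listing a single set, the set name `example-set` in the sample is made up (the real name is not reproduced here), and `parse_ipset_members` and `missing_members` are hypothetical helper names, not part of Neutron or ipset.

```python
def parse_ipset_members(ipset_output):
    """Return the entries that follow the 'Members:' line.

    Assumes the text is the output of `ipset list` for a single set.
    """
    members = set()
    in_members = False
    for line in ipset_output.splitlines():
        stripped = line.strip()
        if in_members:
            if stripped:
                members.add(stripped)
        elif stripped == "Members:":
            in_members = True
    return members


def missing_members(ipset_output, expected_ips):
    """Return the expected IPs that are absent from the ipset."""
    return set(expected_ips) - parse_ipset_members(ipset_output)


# Sample shaped like the listing above, with one member missing.
# The set name is a placeholder, not a real Neutron ipset name.
SAMPLE = """Name: example-set
Type: hash:net
Revision: 6
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 536
References: 4
Number of entries: 2
Members:
192.168.0.1
192.168.0.3
"""

if __name__ == "__main__":
    expected = ["192.168.0.1", "192.168.0.2", "192.168.0.3"]
    print(sorted(missing_members(SAMPLE, expected)))  # ['192.168.0.2']
```

An empty result means the ipset contains every expected member; anything else is a candidate for the race described in this report.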
# Triggering the race condition
The problem occurs when `security_
- `port_update` step 12 retrieves the remote security groups' member lists, which are not necessarily complete yet.
- `port_update` step 22 adds the port to `IptablesFirewa
This results in `security_
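To make the ordering problem concrete, here is a toy model of the interleaving. It is not Neutron code: `ToyAgent` and all of its methods are hypothetical, and it rests on one loudly stated assumption, namely that a member-update event is only applied when the agent already has a registered port referencing that group, and is otherwise dropped. Under that assumption, an update arriving between the member snapshot (step 12) and the port registration (step 22) leaves the ipset stale.

```python
class ToyAgent:
    """Hypothetical, heavily simplified stand-in for the agent (not Neutron code)."""

    def __init__(self):
        self.registered_ports = {}  # port id -> remote group id it references
        self.ipsets = {}            # group id -> set of member IPs

    def fetch_members(self, server_view, group_id):
        # Analogous to port_update step 12: a snapshot taken at this moment.
        return set(server_view[group_id])

    def apply_snapshot(self, group_id, snapshot):
        self.ipsets[group_id] = set(snapshot)

    def register_port(self, port_id, group_id):
        # Analogous to port_update step 22: the port becomes known to the firewall.
        self.registered_ports[port_id] = group_id

    def on_member_updated(self, server_view, group_id):
        # ASSUMPTION: the update is ignored unless a registered port uses the group.
        if group_id not in self.registered_ports.values():
            return False  # event effectively discarded
        self.ipsets[group_id] = set(server_view[group_id])
        return True


def run_scenario(update_arrives_before_registration):
    server_view = {"server": {"192.168.0.1"}}  # membership still incomplete
    agent = ToyAgent()

    # port_update starts: take the (incomplete) snapshot and build the ipset.
    agent.apply_snapshot("server", agent.fetch_members(server_view, "server"))

    # Meanwhile the remaining instances come up and membership becomes complete.
    server_view["server"] |= {"192.168.0.2", "192.168.0.3"}

    if update_arrives_before_registration:
        agent.on_member_updated(server_view, "server")  # dropped in this model
        agent.register_port("client01-port", "server")
    else:
        agent.register_port("client01-port", "server")
        agent.on_member_updated(server_view, "server")  # applied
    return agent.ipsets["server"]


if __name__ == "__main__":
    print(sorted(run_scenario(True)))   # stale: only the snapshot member
    print(sorted(run_scenario(False)))  # complete member list
```

The only difference between the two runs is whether the update event lands before or after the port is registered, which matches the intermittent nature of the failures we see.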
# Race condition details
The race condition occurs in the linuxbridge agent between the two following operations:
1) Processing a `port_update` event when an instance is first created
2) Processing `security_
Either of these operations can result in creating or mutating an ipset from `IpsetManager.
## Processing a `port_update` event:
1) We receive an RPC port_update event via `LinuxBridgeRpc
2) Sleep until the next `CommonAgentLoo
3) `CommonAgentLoo
4) `CommonAgentLoo
5) `CommonAgentLoo
6) `CommonAgentLoo
7) `CommonAgentLoo
8) `SecurityGroupA
9) `SecurityGroupA
10) `SecurityGroupA
11) `SecurityGroupA
12) `SecurityGroupS
13) `SecurityGroupS
14) `SecurityGroupA
15) `SecurityGroupA
16) `IptablesFirewa
17) `IptablesFirewa
18) `IpsetManager.
19) The stack unwinds back up to `SecurityGroupA
20) `SecurityGroupA
21) `IptablesFirewa
22) `IptablesFirewa
## Processing a `security_
1) We receive an RPC security_
2) `SecurityGroupA
3) `SecurityGroupA
4) `SecurityGroupA
Changed in neutron:
importance: Undecided → Medium
affects: ubuntu → neutron (Ubuntu)
Hi, thanks for the detailed bug report. If you provide some reproduction steps (config details are welcome as well), I can try to reproduce this in my local environment (I can start a multihost devstack with 1 extra compute, or perhaps even 2 extra computes).