Race condition while processing security_groups_member_updated events (ipset)

Bug #1887405 reported by Charles Farquhar
This bug affects 12 people
```
| Affects                         | Status       | Importance | Assigned to      | Milestone |
|---------------------------------|--------------|------------|------------------|-----------|
| Ubuntu Cloud Archive            | Fix Released | Undecided  | Unassigned       |           |
| Ubuntu Cloud Archive (Ussuri)   | Fix Released | Undecided  | Unassigned       |           |
| Ubuntu Cloud Archive (Victoria) | Fix Released | Undecided  | Unassigned       |           |
| neutron                         | Fix Released | Medium     | Charles Farquhar |           |
| neutron (Ubuntu)                | Fix Released | Undecided  | Unassigned       |           |
| neutron (Ubuntu Focal)          | Fix Released | Undecided  | Unassigned       |           |
| neutron (Ubuntu Groovy)         | Fix Released | Undecided  | Unassigned       |           |
| neutron (Ubuntu Hirsute)        | Fix Released | Undecided  | Unassigned       |           |
```

Bug Description

# Summary

Race condition while processing security_groups_member_updated events (ipset)

# Overview

We have a customer that uses heat templates to deploy large environments (e.g. 21 instances) with a significant number of security groups (e.g. 60) that use bi-directional remote group references for both ingress and egress filtering. These heat stacks are deployed using a CI pipeline and intermittently suffer from application layer failures due to broken network connectivity. We found that this was caused by the ipsets used to implement remote_group memberships missing IPs from their member lists. Troubleshooting suggests this is caused by a race condition, which I've attempted to describe in detail below.

Version: `54e1a6b1bc378c0745afc03987d0fea241b826ae` (HEAD of stable/rocky as of Jan 26, 2020), though I suspect this issue persists through master.

I'm working on getting some multi-node environments deployed (I don't think it's possible to reproduce this with a single hypervisor) and hope to provide reproduction steps on Rocky and master soon. I wanted to get this report submitted as-is with the hopes that an experienced Neutron dev might be able to spot possible solutions or provide diagnostic insight that I am not yet able to produce.

I suspect this report may be easier to read with some markdown, so please feel free to read it in a gist: https://gist.github.com/cfarquhar/20fddf2000a83216021bd15b512f772b

Also, this diagram is probably critical to following along: https://user-images.githubusercontent.com/1253665/87317744-0a75b180-c4ed-11ea-9bad-085019c0f954.png

# Race condition symptoms

Given the following security groups/rules:

```
| secgroup name | secgroup id                          | direction | remote group                         | dest port |
|---------------|--------------------------------------|-----------|--------------------------------------|-----------|
| server        | fcd6cf12-2ac9-4704-9208-7c6cb83d1a71 | ingress   | b52c8c54-b97a-477d-8b68-f4075e7595d9 | 9092      |
| client        | b52c8c54-b97a-477d-8b68-f4075e7595d9 | egress    | fcd6cf12-2ac9-4704-9208-7c6cb83d1a71 | 9092      |
```

And the following instances:

```
| instance name | hypervisor | ip          | secgroup assignment |
|---------------|------------|-------------|---------------------|
| server01      | compute01  | 192.168.0.1 | server              |
| server02      | compute02  | 192.168.0.2 | server              |
| server03      | compute03  | 192.168.0.3 | server              |
| client01      | compute04  | 192.168.0.4 | client              |
```

We would expect to find the following ipset representing the `server` security group members on `compute04`:

```
# ipset list NIPv4fcd6cf12-2ac9-4704-9208-
Name: NIPv4fcd6cf12-2ac9-4704-9208-
Type: hash:net
Revision: 6
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 536
References: 4
Number of entries: 3
Members:
192.168.0.1
192.168.0.2
192.168.0.3
```

What we actually get when the race condition is triggered is an incomplete list of members in the ipset. The member list could contain anywhere between zero and two of the expected IPs.
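
A quick way to confirm the symptom on an affected compute node is to diff the ipset's actual members against the addresses you expect. The snippet below is a hypothetical diagnostic helper (not part of Neutron or the attached scripts); the ipset name and expected IPs are the ones from the example above, and it needs to run as root on the compute node.

```
#!/usr/bin/env python3
# Hypothetical diagnostic helper: compare an ipset's members against the IPs we
# expect the remote group to contain.  Run as root on the compute node.
import subprocess

IPSET_NAME = "NIPv4fcd6cf12-2ac9-4704-9208-"
EXPECTED = {"192.168.0.1", "192.168.0.2", "192.168.0.3"}

out = subprocess.run(["ipset", "list", IPSET_NAME],
                     capture_output=True, text=True, check=True).stdout
# Everything after the "Members:" header is one address per line.
members = set(out.split("Members:\n", 1)[1].split())

missing = sorted(EXPECTED - members)
print("missing from ipset:", missing if missing else "none")
```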

# Triggering the race condition

The problem occurs when `security_group_member_updated` events arrive between `port_update` steps 12 and 22 (see diagram and process details below).
  - `port_update` step 12 retrieves the remote security groups' member lists, which are not necessarily complete yet.
  - `port_update` step 22 adds the port to `IptablesFirewallDriver.ports()`.

This results in `security_group_member_updated` step 3 looking for the port to apply the updated member list to (in `IptablesFirewallDriver.ports()`) BEFORE it has been added by `port_update`'s step 22. This causes the membership update event to effectively be discarded. We are then left with whatever the remote security group's member list was when the `port_update` process retrieved it at step 12. This state persists until something triggers the port being re-added to the `updated_ports` list (e.g. agent restart, another remote group membership change, local security group addition/removal, etc).
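
To make the interleaving concrete, here is a toy, self-contained sketch of the lost-update pattern (plain Python with invented names, not Neutron code); the step numbers in the comments refer to the walkthroughs in the next section.

```
# Toy illustration only: invented names, not Neutron code.  A membership update
# that arrives between the member-list snapshot (port_update step 12) and the
# port registration (port_update step 22) is silently dropped.

class ToyAgent:
    def __init__(self):
        self.filtered_ports = {}        # what ports() would return
        self.ipsets = {}                # remote group id -> set of member IPs
        self.devices_to_refilter = set()

    # --- port_update path, split at the race window ---
    def port_update_step_12(self, group_id, members_seen_over_rpc):
        # snapshot of the remote group's members at this instant
        self.ipsets[group_id] = set(members_seen_over_rpc)

    def port_update_step_22(self, port_id, group_id):
        # only now does the port become visible to the firewall driver
        self.filtered_ports[port_id] = {"security_groups": [group_id]}

    # --- security_group_member_updated path (steps 3 and 4) ---
    def member_updated(self, group_id):
        affected = {p for p, d in self.filtered_ports.items()
                    if group_id in d["security_groups"]}
        self.devices_to_refilter |= affected   # empty during the window


agent = ToyAgent()
agent.port_update_step_12("server-sg", ["192.168.0.1"])  # stale member list
agent.member_updated("server-sg")                        # lands in the window
agent.port_update_step_22("client01-port", "server-sg")  # too late

print(agent.ipsets["server-sg"])       # {'192.168.0.1'} -- .2 and .3 missing
print(agent.devices_to_refilter)       # set() -- the update was dropped
```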

# Race condition details

The race condition occurs in the linuxbridge agent between the two following operations:
  1) Processing a `port_update` event when an instance is first created
  2) Processing `security_group_member_updated` events for the instance's remote security groups.

Either of these operations can create or mutate an ipset via `IpsetManager.set_members()`. The relevant control flow for each operation is listed below (a condensed, illustrative sketch follows each list). I've left out any branches that did not seem relevant to the race condition.

## Processing a `port_update` event:
  1) We receive an RPC port_update event via `LinuxBridgeRpcCallbacks.port_update()`, which adds the tap device to the `LinuxBridgeRpcCallbacks.updated_devices` list
  2) Sleep until the next `CommonAgentLoop.daemon_loop()` iteration
  3) `CommonAgentLoop.daemon_loop()` calls `CommonAgentLoop.scan_devices()`
  4) `CommonAgentLoop.scan_devices()` calls `LinuxBridgeRpcCallbacks.get_and_clear_updated_devices()` to retrieve and clear the `updated_devices` list from step 1.
  5) `CommonAgentLoop.scan_devices()` performs some calculations and returns control to `CommonAgentLoop.daemon_loop()` along with a list of added or updated devices.
  6) `CommonAgentLoop.daemon_loop()` calls `CommonAgentLoop.process_network_devices()` when there are added or updated devices from step 5.
  7) `CommonAgentLoop.process_network_devices()` calls `SecurityGroupAgentRpc.setup_port_filters()` with the added or updated devices from step 5.
  8) `SecurityGroupAgentRpc.setup_port_filters()` checks for devices that were added to the `SecurityGroupAgentRpc.devices_to_refilter` list (which happens in step 4 of the `security_group_member_updated` process) and appends them to the list of updated devices. This is where devices from the `security_group_member_updated` process would have been processed if they weren't lost due to the race condition.
  9) `SecurityGroupAgentRpc.setup_port_filters()` calls `SecurityGroupAgentRpc.prepare_devices_filter()` with the updated port IDs
  10) `SecurityGroupAgentRpc.prepare_devices_filter()` calls `SecurityGroupAgentRpc._apply_port_filter()` with the updated port IDs
  11) `SecurityGroupAgentRpc._apply_port_filter()` calls `SecurityGroupServerRpcApi.security_group_info_for_devices()` with the tap device id
  12) `SecurityGroupServerRpcApi.security_group_info_for_devices()` retrieves detailed information about the port via RPC. The important detail here is a `sg_member_ips` dict which contains a list of member IPs for each remote security group applicable to the port.
  13) `SecurityGroupServerRpcApi.security_group_info_for_devices()` returns control to `SecurityGroupAgentRpc._apply_port_filter()` along with the port and security group details.
  14) `SecurityGroupAgentRpc._apply_port_filter()` calls `SecurityGroupAgentRpc._update_security_group_info()` with the remote security groups and their member IP list.
  15) `SecurityGroupAgentRpc._update_security_group_info()` iterates over the remote security groups and calls `IptablesFirewallDriver.update_security_group_members()` for each one
  16) `IptablesFirewallDriver.update_security_group_members()` calls `IptablesFirewallDriver._update_ipset_members()`
  17) `IptablesFirewallDriver._update_ipset_members()` calls `IpsetManager.set_members()`
  18) `IpsetManager.set_members()` calls a number of methods to create or mutate the ipset using linux's `ipset` commands.
  19) The stack unwinds back up to `SecurityGroupAgentRpc._apply_port_filter()` (last seen in step 11)
  20) `SecurityGroupAgentRpc._apply_port_filter()` calls `IptablesFirewallDriver.prepare_port_filter()` with the port
  21) `IptablesFirewallDriver.prepare_port_filter()` calls `IptablesFirewallDriver._set_ports()` with the port details retrieved in step 12
  22) `IptablesFirewallDriver._set_ports()` adds the port to `IptablesFirewallDriver.filtered_ports`, which indicates the port is known to exist on the hypervisor. It will now be returned by `IptablesFirewallDriver.ports()`.
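
Condensed, this whole path looks roughly like the sketch below. The names mirror the walkthrough, but the body is illustrative only, not the actual Neutron implementation.

```
# Illustrative condensation of port_update steps 11-22 (not the real code).
def apply_port_filter(plugin_rpc, firewall, context, device_ids):
    # steps 11-13: one RPC round trip; 'sg_member_ips' is a point-in-time snapshot
    info = plugin_rpc.security_group_info_for_devices(context, list(device_ids))
    # steps 14-18: program the ipsets from that (possibly already stale) snapshot
    for sg_id, member_ips in info['sg_member_ips'].items():
        firewall.update_security_group_members(sg_id, member_ips)
    # steps 20-22: only now are the ports registered with the firewall driver,
    # i.e. visible via ports(); everything before this line is the race window
    for device in info['devices'].values():
        firewall.prepare_port_filter(device)
```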

## Processing a `security_group_member_updated` event:
  1) We receive an RPC security_group_member_updated event via `SecurityGroupAgentRpc.security_groups_member_updated()`
  2) `SecurityGroupAgentRpc.security_groups_member_updated()` calls `SecurityGroupAgentRpc._security_group_updated()` with the list of security groups whose membership was updated.
  3) `SecurityGroupAgentRpc._security_group_updated()` consults `IptablesFirewallDriver.ports()` to find ports affected by the membership updates. In the race condition, `port_update` has not yet reached step 22 to add the port we need to find, so we stop here and the following step does not occur.
  4) `SecurityGroupAgentRpc._security_group_updated()` adds each affected port from the previous step to the `SecurityGroupAgentRpc.devices_to_refilter` list. This list will be processed the next time we reach `port_update` step 8, but in the race condition we never get here.
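
The member-update side, equally condensed and equally illustrative:

```
# Illustrative condensation of security_group_member_updated steps 2-4
# (not the real code).
def security_group_updated(firewall, devices_to_refilter, updated_group_ids):
    affected = set()
    for device in firewall.ports().values():   # step 3: the racing port is absent
        remote_groups = set(device.get('security_group_source_groups', []))
        if remote_groups & set(updated_group_ids):
            affected.add(device['device'])
    # step 4: nothing is queued for a port that is not yet registered, so the
    # membership update is effectively discarded for that port
    devices_to_refilter |= affected
```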

Tags: sg-fw
Revision history for this message
Charles Farquhar (cfarquhar) wrote :
Revision history for this message
Lajos Katona (lajos-katona) wrote :

Hi, thanks for the detailed bug report. If you provide some reproduction steps (some config details are welcomed as well) I can do a try in my local env (I can start a multihost devstack, with 1 extra compute, or perhaps even 2 extra computes).

Revision history for this message
Charles Farquhar (cfarquhar) wrote :

Hi Lajos,

Thanks! I was able to reproduce this in a lab - please see my notes below. A markdown version is here: https://gist.github.com/cfarquhar/927a989e5a325adfa948d4388c2525d3

# Overview

Two changes were required to reproduce this consistently in a lab:
  1. Expand the window where the race condition can occur (time spent between `port_update` steps 12 and 22).
  2. Slow port creation for the remote secgroup members

I picked some extreme delays here, but in production the difference between hitting and missing the race condition is measured in milliseconds.

These steps were tested on a multi-node environment deployed from https://github.com/openstack/openstack-ansible/tree/436e777, but should work on any deployment with 2x compute nodes.

We will designate one compute node as the "server" hypervisor and the other as the "client" hypervisor. The distinction is important when we get to the "Expand race condition window" step below. I've chosen `compute1` as the "client" hypervisor and `compute2` as the "server" hypervisor.

Please see the attachment for helper scripts and configuration from `neutron-server` and `neutron-linuxbridge-agent`.

# Setup

## OpenStack resources

```
# Image
wget http://download.cirros-cloud.net/0.3.4/cirros-0.3.4-x86_64-disk.img
openstack image create "cirros" --file cirros-0.3.4-x86_64-disk.img --disk-format qcow2 --container-format bare --public

# Flavor
openstack flavor create --disk 10 --vcpus 1 --ram 128 default

# Network
openstack network create network
openstack subnet create --network network --subnet-range 10.0.0.0/24 --allocation-pool start=10.0.0.10,end=10.0.0.254 subnet
openstack subnet set --dns-nameserver 8.8.8.8 subnet

# Secgroups
openstack security group create server
openstack security group create client

# Create server security group rules
openstack security group rule create --remote-group client --dst-port 9092 --protocol tcp server
openstack security group rule create --remote-group server --dst-port 9092 --protocol tcp server

# Create client security group rules
openstack security group rule create --egress --remote-group server --dst-port 9092 --protocol tcp client

```

## Expand race condition window

1. Patch `SecurityGroupAgentRpc._apply_port_filter()` in `neutron/agent/securitygroups_rpc.py` where `neutron-linuxbridge-agent` will run on the "server" hypervisor (compute2 in this example).

This extends the window for `port_update` steps 12 and 22 and makes it easier to land `security_group_member_updated` events in it.

```
root@compute2:~# diff -u /openstack/venvs/neutron-18.1.19.dev2/lib/python2.7/site-packages/neutron/agent/securitygroups_rpc.py{-orig,}
--- /openstack/venvs/neutron-18.1.19.dev2/lib/python2.7/site-packages/neutron/agent/securitygroups_rpc.py-orig 2020-07-15 12:12:29.585918642 -0500
+++ /openstack/venvs/neutron-18.1.19.dev2/lib/python2.7/site-packages/neutron/agent/securitygroups_rpc.py 2020-07-15 12:12:58.741670522 -0500
@@ -142,6 +142,11 @@
                 devices.update(devices_info['devices'])
                 security_groups.update(devices_info['security_groups'])
                 security_group_member_ips.update(devices_info['sg_member_ips'])
+ ...
```

Revision history for this message
Charles Farquhar (cfarquhar) wrote :
Changed in neutron:
importance: Undecided → Medium
Revision history for this message
Charles Farquhar (cfarquhar) wrote :

I was able to reproduce this on master using devstack. Reproduction steps are included below. A markdown version is available here: https://gist.github.com/cfarquhar/a3a0cd087bd7c8645bb18fa4e48972e3.

# Devstack setup

I used two Rackspace public cloud VMs. The VMs had a private network (192.168.23.0/24) in addition to their public IPs, but that was probably unnecessary.

## Common

Perform these steps on both nodes

```
useradd -s /bin/bash -d /opt/stack -m stack
echo "stack ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers

su - stack
git clone https://opendev.org/openstack/devstack
cd devstack
```

## Node 1: Combined controller+compute node

Create `local.conf` in `/opt/stack/devstack`. Be sure to update `HOST_IP` with the public IP.

```
[[local|localrc]]
HOST_IP=<PUBLIC_IP>
LOGFILE=/opt/stack/logs/stack.sh.log
ADMIN_PASSWORD=redacted
DATABASE_PASSWORD=redacted
RABBIT_PASSWORD=redacted
SERVICE_PASSWORD=redacted

# Neutron
FIXED_RANGE=10.4.128.0/20
FLOATING_RANGE=192.168.23.128/25
Q_USE_SECGROUP=True
Q_AGENT=linuxbridge
IP_VERSION=4

# This was required to avoid errors when creating instances. It may not be
# necessary in other environments.
[[post-config|$NOVA_CONF]]
[libvirt]
cpu_mode=host-model
```

## Node 2: Additional compute node

Create `local.conf` in `/opt/stack/devstack`. Be sure to update `HOST_IP` with the public IP of this node and `SERVICE_HOST` with the public IP of node 1 (controller+compute).

`local.conf`:
```
[[local|localrc]]
HOST_IP=<PUBLIC_IP>
LOGFILE=/opt/stack/logs/stack.sh.log
ADMIN_PASSWORD=redacted
DATABASE_PASSWORD=redacted
RABBIT_PASSWORD=redacted
SERVICE_PASSWORD=redacted
DATABASE_TYPE=mysql
SERVICE_HOST=<CONTROLLER+COMPUTE_PUBLIC_IP>
MYSQL_HOST=$SERVICE_HOST
RABBIT_HOST=$SERVICE_HOST
GLANCE_HOSTPORT=$SERVICE_HOST:9292
ENABLED_SERVICES=n-cpu,q-agt,c-vol,placement-client
NOVA_VNC_ENABLED=True
NOVNCPROXY_URL="http://$SERVICE_HOST:6080/vnc_lite.html"
VNCSERVER_LISTEN=$HOST_IP
VNCSERVER_PROXYCLIENT_ADDRESS=$VNCSERVER_LISTEN

# Neutron
FIXED_RANGE=10.4.128.0/20
FLOATING_RANGE=192.168.23.128/25
Q_USE_SECGROUP=True
Q_AGENT=linuxbridge
IP_VERSION=4
```

## Registering node 2 (additional compute node)

On node 1 (controller+compute):
```
/opt/stack/devstack/tools/discover_hosts.sh
```

# Create security groups and rules
```
# Source credentials
source ~/openrc admin

# Create secgroups
openstack security group create server
openstack security group create client

# Create server security group rules
openstack security group rule create --remote-group client --dst-port 9092 --protocol tcp server

# Create client security group rules
openstack security group rule create --egress --remote-group server --dst-port 9092 --protocol tcp client

# Save the client secgroup's ID
CLIENT_SG_ID=$(openstack security group show client -cid -fvalue)
```

# Patch `neutron-server` and `neutron-linuxbridge-agent`

These patches just make it easier to trigger the race condition. The `neutron-server` patch slows down port updates for the "client" VM, and the `neutron-linuxbridge-agent` patch widens the window in which the condition can occur.

These steps should be performed on node 1 (the combined controller+compute node).

## Patch `ne...


Revision history for this message
Charles Farquhar (cfarquhar) wrote :

Here's a proof-of-concept fix for a devstack environment.

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

Hi Charles,

Can you send your patch through Gerrit? We will then be able to review and merge it.

tags: added: sg-fw
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/743624

Changed in neutron:
assignee: nobody → Charles Farquhar (cfarquhar)
status: New → In Progress
Revision history for this message
Hemanth Nakkina (hemanth-n) wrote :

Observed similar issues in an Ussuri environment with the openvswitch firewall, so the change was cherry-picked back to stable/ussuri.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Fix released and backported to stable/victoria and stable/ussuri - https://review.opendev.org/q/I6e4abe13a7541c21399466c5eb0d61ff5780c887

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in neutron (Ubuntu):
status: New → Confirmed
affects: ubuntu → neutron (Ubuntu)
Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "POC patch for devstack environment 20200723a" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags: added: patch
Revision history for this message
Edward Hope-Morley (hopem) wrote :

The Victoria and Ussuri backports are released (upstream) as part of point releases 17.1.1 and 16.3.1 respectively.

tags: removed: patch
Revision history for this message
Hemanth Nakkina (hemanth-n) wrote :

The fix is available in the Ubuntu neutron packages in Hirsute, Groovy and Focal, and in UCA Victoria and Ussuri. Marking them as Fix Released.

Changed in neutron (Ubuntu Hirsute):
status: New → Fix Released
Changed in neutron (Ubuntu Groovy):
status: New → Fix Released
Changed in neutron (Ubuntu Focal):
status: New → Fix Released
Changed in cloud-archive:
status: New → Fix Released
Changed in neutron (Ubuntu):
status: New → Fix Released