oslo.cache exponentially increasing connections to memcached

Bug #1888394 reported by Michal Arbet
This bug affects 2 people
Affects                       Status         Importance   Assigned to   Milestone
OpenStack Security Advisory   Won't Fix      Undecided    Unassigned
oslo.cache                    Fix Released   Undecided    Unassigned

Bug Description

Hi,

The commit below, merged to master (and also backported to stein, rocky, and queens), causes a severe performance issue when connections to the memcached servers are dropped, for whatever reason.

https://review.opendev.org/#/c/644774/

Note in the change:

+ # NOTE(morgan): Explicitly set flush_on_reconnect for pooled
+ # connections. This should ensure that stale data is never consumed
+ # from a server that pops in/out due to a network partition
+ # or disconnect.
+ #
+ # See the help from python-memcached:
+ #
+ # param flush_on_reconnect: optional flag which prevents a
+ # scenario that can cause stale data to be read: If there's more
+ # than one memcached server and the connection to one is
+ # interrupted, keys that mapped to that server will get
+ # reassigned to another. If the first server comes back, those
+ # keys will map to it again. If it still has its data, get()s
+ # can read stale data that was overwritten on another server.
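
For context, the change above effectively hard-codes this flag when oslo.cache creates its pooled python-memcached clients. A minimal sketch of what that amounts to, with an illustrative server list (the real code lives in oslo.cache's memcache_pool backend):

import memcache  # python-memcached, the library wrapped by oslo.cache's pooled backend

# Illustrative addresses; oslo.cache takes these from the memcache_servers option.
servers = ["192.168.205.10:11211", "192.168.205.11:11211"]

# The effect of the change: every pooled client is created with
# flush_on_reconnect enabled, so when a server that was marked dead comes
# back, the client sends flush_all to it on reconnect.
client = memcache.Client(servers, flush_on_reconnect=True)

client.set("some_key", "some_value")
print(client.get("some_key"))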

In the real world, when a memcached client's connection is broken for whatever reason (typically a network issue), the client will reconnect to memcached, but it will also flush it.

If several clients do this at the same time, the number of connections to memcached climbs rapidly.
Memcached becomes severely overloaded, the flushing is repeated again and again, and the connection count keeps going up.
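
To see why a single flush is so expensive for everyone, here is a minimal sketch, assuming a local memcached on 127.0.0.1:11211 and the python-memcached library: one client's flush_all empties the cache for every other client, which then all start missing and hitting the backing services again.

import memcache

# Two independent clients sharing the same memcached (address is illustrative).
writer = memcache.Client(["127.0.0.1:11211"])
reader = memcache.Client(["127.0.0.1:11211"])

writer.set("token-abc", "cached validation result")
print(reader.get("token-abc"))  # hit: "cached validation result"

# This is what a flush_on_reconnect client does after its connection is restored.
writer.flush_all()

print(reader.get("token-abc"))  # miss: None -- every client must now repopulate the cache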

A simple test can reproduce the problem:

Terminal window 1, checking how many connections are currently open (the exact command is not included in the report; see the sketch after this output):

Tue Jul 21 12:38:17 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:18 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:19 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:20 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:21 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:22 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:23 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:24 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:25 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:26 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:27 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:28 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:30 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:31 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:32 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:33 UTC 2020 >>> Number of connections : 67
Tue Jul 21 12:38:34 UTC 2020 >>> Number of connections : 162
Tue Jul 21 12:38:35 UTC 2020 >>> Number of connections : 261
Tue Jul 21 12:38:36 UTC 2020 >>> Number of connections : 357
Tue Jul 21 12:38:37 UTC 2020 >>> Number of connections : 456
Tue Jul 21 12:38:38 UTC 2020 >>> Number of connections : 556
Tue Jul 21 12:38:39 UTC 2020 >>> Number of connections : 660
Tue Jul 21 12:38:40 UTC 2020 >>> Number of connections : 752
Tue Jul 21 12:38:41 UTC 2020 >>> Number of connections : 851
Tue Jul 21 12:38:43 UTC 2020 >>> Number of connections : 952
Tue Jul 21 12:38:44 UTC 2020 >>> Number of connections : 1050
Tue Jul 21 12:38:45 UTC 2020 >>> Number of connections : 1150
Tue Jul 21 12:38:46 UTC 2020 >>> Number of connections : 1248
Tue Jul 21 12:38:47 UTC 2020 >>> Number of connections : 1343
Tue Jul 21 12:38:48 UTC 2020 >>> Number of connections : 1440
Tue Jul 21 12:38:49 UTC 2020 >>> Number of connections : 1536
Tue Jul 21 12:38:50 UTC 2020 >>> Number of connections : 1636
Tue Jul 21 12:38:51 UTC 2020 >>> Number of connections : 1738
Tue Jul 21 12:38:52 UTC 2020 >>> Number of connections : 1838
Tue Jul 21 12:38:53 UTC 2020 >>> Number of connections : 1943
Tue Jul 21 12:38:55 UTC 2020 >>> Number of connections : 2045
Tue Jul 21 12:38:56 UTC 2020 >>> Number of connections : 2145
Tue Jul 21 12:38:57 UTC 2020 >>> Number of connections : 2239
Tue Jul 21 12:38:58 UTC 2020 >>> Number of connections : 2335
Tue Jul 21 12:38:59 UTC 2020 >>> Number of connections : 2434
Tue Jul 21 12:39:00 UTC 2020 >>> Number of connections : 2534
Tue Jul 21 12:39:01 UTC 2020 >>> Number of connections : 2629
Tue Jul 21 12:39:02 UTC 2020 >>> Number of connections : 2723
Tue Jul 21 12:39:03 UTC 2020 >>> Number of connections : 2819
Tue Jul 21 12:39:04 UTC 2020 >>> Number of connections : 2916
Tue Jul 21 12:39:06 UTC 2020 >>> Number of connections : 3013
Tue Jul 21 12:39:07 UTC 2020 >>> Number of connections : 3111
Tue Jul 21 12:39:08 UTC 2020 >>> Number of connections : 3210
Tue Jul 21 12:39:09 UTC 2020 >>> Number of connections : 3314
Tue Jul 21 12:39:10 UTC 2020 >>> Number of connections : 3417
Tue Jul 21 12:39:11 UTC 2020 >>> Number of connections : 3519
Tue Jul 21 12:39:12 UTC 2020 >>> Number of connections : 3623
Tue Jul 21 12:39:13 UTC 2020 >>> Number of connections : 3723
Tue Jul 21 12:39:15 UTC 2020 >>> Number of connections : 3822
Tue Jul 21 12:39:16 UTC 2020 >>> Number of connections : 3920
Tue Jul 21 12:39:17 UTC 2020 >>> Number of connections : 4013
Tue Jul 21 12:39:18 UTC 2020 >>> Number of connections : 4109
Tue Jul 21 12:39:19 UTC 2020 >>> Number of connections : 4206
Tue Jul 21 12:39:20 UTC 2020 >>> Number of connections : 4304
Tue Jul 21 12:39:21 UTC 2020 >>> Number of connections : 4407
Tue Jul 21 12:39:22 UTC 2020 >>> Number of connections : 4511
Tue Jul 21 12:39:24 UTC 2020 >>> Number of connections : 4611
Tue Jul 21 12:39:25 UTC 2020 >>> Number of connections : 4712
Tue Jul 21 12:39:26 UTC 2020 >>> Number of connections : 4817
Tue Jul 21 12:39:27 UTC 2020 >>> Number of connections : 4918
Tue Jul 21 12:39:28 UTC 2020 >>> Number of connections : 5025
Tue Jul 21 12:39:29 UTC 2020 >>> Number of connections : 5131
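
A rough equivalent of that counting loop, as a sketch that assumes a memcached on 192.168.205.10:11211 and reads the curr_connections counter from the "stats" command, could look like this:

import socket
import time


def curr_connections(host, port=11211):
    """Return memcached's curr_connections counter from its 'stats' output."""
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(b"stats\r\nquit\r\n")
        data = b""
        while not data.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    for line in data.decode().splitlines():
        # Lines look like: "STAT curr_connections 25"
        if line.startswith("STAT curr_connections "):
            return int(line.split()[-1])
    raise RuntimeError("curr_connections not found in stats output")


while True:
    now = time.strftime("%a %b %d %H:%M:%S UTC %Y", time.gmtime())
    print(now, ">>> Number of connections :", curr_connections("192.168.205.10"))
    time.sleep(1)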

Terminal window 2, simulating clients sending flush_all to both memcached servers:

root@deploy:/home/ubuntu# while true; do cat <( echo "flush_all";echo "quit" ) | nc 192.168.205.10 11211; cat <( echo "flush_all";echo "quit" ) | nc 192.168.205.11 11211;done
OK
OK
OK
OK
OK
OK
OK
OK
OK
OK
OK
OK
OK
...

OK, I understand that this is meant to address a security issue, but in my opinion it only makes sense for keystone, so I really don't agree with enabling it by default in the code.

Especially in larger production deployments it can cause serious damage.

I will prepare a patch that makes flush_on_reconnect configurable via backend_argument.
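
For illustration, the operator-facing configuration could then look roughly like this (the flush_on_reconnect key under backend_argument is an assumption; the final name depends on the patch):

[cache]
enabled = true
backend = oslo_cache.memcache_pool
memcache_servers = 192.168.205.10:11211,192.168.205.11:11211
# Hypothetical: disable the automatic flush on reconnect for non-keystone services.
backend_argument = flush_on_reconnect:false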

Revision history for this message
Michal Arbet (michalarbet) wrote :

This was tested on a clean OpenStack installation deployed via kolla-ansible; the issue was found on a real production system, and reverting the oslo.cache change fixed it.

Revision history for this message
Michal Arbet (michalarbet) wrote :

NOTE: the second terminal window only simulates what is really going on. In practice you can hit a state where one memcache node dies, its connections are then closed, the second node gets flushed, and so on.

Revision history for this message
Jeremy Stanley (fungi) wrote :

It looks like this may also be the same as public security bug 1883659 and its duplicate bug 1892852.

Changed in ossa:
status: New → Incomplete
information type: Public → Public Security
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.cache (master)

Change abandoned by Michal Arbet (<email address hidden>) on branch: master
Review: https://review.opendev.org/742193

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

This one does not mention Neutron - but was it Neutron here as well?

Revision history for this message
Mike Bayer (zzzeek) wrote :

we are seeing this occur with Neutron in two environments.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.cache 2.7.0

This issue was fixed in the openstack/oslo.cache 2.7.0 release.

Revision history for this message
Jeremy Stanley (fungi) wrote :

At this point there's no clear exploit scenario and the description of this and the other two presumed related reports seems to be of a normal (albeit potentially crippling) bug. As such, the vulnerability management team is going to treat this as a class D report per our taxonomy and not issue an advisory once it's fixed, but if anyone disagrees we can reconsider the position: https://security.openstack.org/vmt-process.html#incident-report-taxonomy

Changed in ossa:
status: Incomplete → Won't Fix
Revision history for this message
Ben Nemec (bnemec) wrote :

Per comment 7, this was released. Not sure why the bug status wasn't updated.

Changed in oslo.cache:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/keystonemiddleware 9.3.0

This issue was fixed in the openstack/keystonemiddleware 9.3.0 release.
