Oslo.cache exponentially raising connections to memcached

Bug #1888394 reported by Michal Arbet on 2020-07-21
This bug affects 2 people
Affects                      Importance  Assigned to
OpenStack Security Advisory  Undecided   Unassigned
oslo.cache                   Undecided   Unassigned

Bug Description

Hi,

The commit below, merged to master (and also to stein, rocky, and queens), causes a severe performance issue when connections to memcached servers are dropped, whatever the reason.

https://review.opendev.org/#/c/644774/

The NOTE in the change:

+ # NOTE(morgan): Explicitly set flush_on_reconnect for pooled
+ # connections. This should ensure that stale data is never consumed
+ # from a server that pops in/out due to a network partition
+ # or disconnect.
+ #
+ # See the help from python-memcached:
+ #
+ # param flush_on_reconnect: optional flag which prevents a
+ # scenario that can cause stale data to be read: If there's more
+ # than one memcached server and the connection to one is
+ # interrupted, keys that mapped to that server will get
+ # reassigned to another. If the first server comes back, those
+ # keys will map to it again. If it still has its data, get()s
+ # can read stale data that was overwritten on another
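
The scenario in the quoted help text can be illustrated with a toy model. This is only a sketch: `FakeServer`, `FakeClient`, and `stale_read_demo` are hypothetical names invented here, the key-to-server mapping is a simplified stand-in for what python-memcached really does, and no real memcached is involved.

```python
import zlib


class FakeServer:
    """In-memory stand-in for one memcached server (illustration only)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True


class FakeClient:
    """Toy client that hashes keys across servers, skipping dead ones."""

    def __init__(self, servers, flush_on_reconnect=False):
        self.servers = servers
        self.flush_on_reconnect = flush_on_reconnect
        self._seen_dead = set()

    def _pick(self, key):
        start = zlib.crc32(key.encode()) % len(self.servers)
        for offset in range(len(self.servers)):
            server = self.servers[(start + offset) % len(self.servers)]
            if server.alive:
                if server.name in self._seen_dead:
                    # The server came back: optionally wipe it before reuse.
                    self._seen_dead.discard(server.name)
                    if self.flush_on_reconnect:
                        server.data.clear()
                return server
            self._seen_dead.add(server.name)
        raise RuntimeError("no live servers")

    def set(self, key, value):
        self._pick(key).data[key] = value

    def get(self, key):
        return self._pick(key).data.get(key)


def stale_read_demo(flush_on_reconnect):
    """Replay the partition scenario and return what the final get() sees."""
    a, b = FakeServer("a"), FakeServer("b")
    client = FakeClient([a, b], flush_on_reconnect=flush_on_reconnect)
    client.set("token", "old")        # lands on the key's home server
    home = a if "token" in a.data else b
    home.alive = False                # connection to the home server is lost
    client.set("token", "new")        # key is reassigned to the other server
    home.alive = True                 # home server returns with its old data
    return client.get("token")
```

In this model, `stale_read_demo(False)` returns the overwritten value `"old"`, while `stale_read_demo(True)` returns `None` (a cache miss) because the returning server is wiped first. That wipe is exactly the flush whose cost this report is about.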

In the real world, when a memcached client's connection is broken, whatever the reason (typically a network issue), the client reconnects to memcached but also flushes it.

If several clients do this, the number of connections to memcached rises rapidly.
As a result, memcached is severely overloaded, the flush is repeated again and again, and the connection count keeps climbing.

A simple test reproduces the problem:

Terminal window 1, showing how many connections there currently are:

Tue Jul 21 12:38:17 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:18 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:19 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:20 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:21 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:22 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:23 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:24 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:25 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:26 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:27 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:28 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:30 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:31 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:32 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:33 UTC 2020 >>> Number of connections : 67
Tue Jul 21 12:38:34 UTC 2020 >>> Number of connections : 162
Tue Jul 21 12:38:35 UTC 2020 >>> Number of connections : 261
Tue Jul 21 12:38:36 UTC 2020 >>> Number of connections : 357
Tue Jul 21 12:38:37 UTC 2020 >>> Number of connections : 456
Tue Jul 21 12:38:38 UTC 2020 >>> Number of connections : 556
Tue Jul 21 12:38:39 UTC 2020 >>> Number of connections : 660
Tue Jul 21 12:38:40 UTC 2020 >>> Number of connections : 752
Tue Jul 21 12:38:41 UTC 2020 >>> Number of connections : 851
Tue Jul 21 12:38:43 UTC 2020 >>> Number of connections : 952
Tue Jul 21 12:38:44 UTC 2020 >>> Number of connections : 1050
Tue Jul 21 12:38:45 UTC 2020 >>> Number of connections : 1150
Tue Jul 21 12:38:46 UTC 2020 >>> Number of connections : 1248
Tue Jul 21 12:38:47 UTC 2020 >>> Number of connections : 1343
Tue Jul 21 12:38:48 UTC 2020 >>> Number of connections : 1440
Tue Jul 21 12:38:49 UTC 2020 >>> Number of connections : 1536
Tue Jul 21 12:38:50 UTC 2020 >>> Number of connections : 1636
Tue Jul 21 12:38:51 UTC 2020 >>> Number of connections : 1738
Tue Jul 21 12:38:52 UTC 2020 >>> Number of connections : 1838
Tue Jul 21 12:38:53 UTC 2020 >>> Number of connections : 1943
Tue Jul 21 12:38:55 UTC 2020 >>> Number of connections : 2045
Tue Jul 21 12:38:56 UTC 2020 >>> Number of connections : 2145
Tue Jul 21 12:38:57 UTC 2020 >>> Number of connections : 2239
Tue Jul 21 12:38:58 UTC 2020 >>> Number of connections : 2335
Tue Jul 21 12:38:59 UTC 2020 >>> Number of connections : 2434
Tue Jul 21 12:39:00 UTC 2020 >>> Number of connections : 2534
Tue Jul 21 12:39:01 UTC 2020 >>> Number of connections : 2629
Tue Jul 21 12:39:02 UTC 2020 >>> Number of connections : 2723
Tue Jul 21 12:39:03 UTC 2020 >>> Number of connections : 2819
Tue Jul 21 12:39:04 UTC 2020 >>> Number of connections : 2916
Tue Jul 21 12:39:06 UTC 2020 >>> Number of connections : 3013
Tue Jul 21 12:39:07 UTC 2020 >>> Number of connections : 3111
Tue Jul 21 12:39:08 UTC 2020 >>> Number of connections : 3210
Tue Jul 21 12:39:09 UTC 2020 >>> Number of connections : 3314
Tue Jul 21 12:39:10 UTC 2020 >>> Number of connections : 3417
Tue Jul 21 12:39:11 UTC 2020 >>> Number of connections : 3519
Tue Jul 21 12:39:12 UTC 2020 >>> Number of connections : 3623
Tue Jul 21 12:39:13 UTC 2020 >>> Number of connections : 3723
Tue Jul 21 12:39:15 UTC 2020 >>> Number of connections : 3822
Tue Jul 21 12:39:16 UTC 2020 >>> Number of connections : 3920
Tue Jul 21 12:39:17 UTC 2020 >>> Number of connections : 4013
Tue Jul 21 12:39:18 UTC 2020 >>> Number of connections : 4109
Tue Jul 21 12:39:19 UTC 2020 >>> Number of connections : 4206
Tue Jul 21 12:39:20 UTC 2020 >>> Number of connections : 4304
Tue Jul 21 12:39:21 UTC 2020 >>> Number of connections : 4407
Tue Jul 21 12:39:22 UTC 2020 >>> Number of connections : 4511
Tue Jul 21 12:39:24 UTC 2020 >>> Number of connections : 4611
Tue Jul 21 12:39:25 UTC 2020 >>> Number of connections : 4712
Tue Jul 21 12:39:26 UTC 2020 >>> Number of connections : 4817
Tue Jul 21 12:39:27 UTC 2020 >>> Number of connections : 4918
Tue Jul 21 12:39:28 UTC 2020 >>> Number of connections : 5025
Tue Jul 21 12:39:29 UTC 2020 >>> Number of connections : 5131
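
The report does not show the command that produced the counter above. One possible way to get a similar reading, sketched here as an assumption rather than the reporter's method, is to poll memcached's `stats` command and read its `curr_connections` field. The function names are hypothetical.

```python
import socket
import time


def parse_curr_connections(stats_text):
    """Return the curr_connections value from a memcached `stats` reply."""
    for line in stats_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[:2] == ["STAT", "curr_connections"]:
            return int(parts[2])
    raise ValueError("curr_connections not found in stats output")


def read_stats(host, port=11211, timeout=2.0):
    """Send `stats` to memcached and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stats\r\n")
        chunks = []
        # The reply is a list of STAT lines terminated by "END".
        while not b"".join(chunks).endswith(b"END\r\n"):
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace")


def watch(host, interval=1.0):
    """Print one timestamped connection count per interval, forever."""
    while True:
        count = parse_curr_connections(read_stats(host))
        print(f"{time.strftime('%c')} >>> Number of connections : {count}")
        time.sleep(interval)
```

Calling `watch("192.168.205.10")` would print one line per second in roughly the format shown above. Note that the polling connection itself is included in the count.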

Terminal window 2, reproducing the case where clients send flushes to memcached:
while true; do cat <( echo "flush_all";echo "quit" ) | nc 192.168.205.10 11211; cat <( echo "flush_all";echo "quit" ) | nc 192.168.205.11 11211;done

root@deploy:/home/ubuntu# while true; do cat <( echo "flush_all";echo "quit" ) | nc 192.168.205.10 11211; cat <( echo "flush_all";echo "quit" ) | nc 192.168.205.11 11211;done
OK
OK
OK
OK
OK
OK
OK
OK
OK
OK
OK
OK
OK
...

OK, I understand that this is a security matter, but to me it makes sense only in keystone, so I really don't agree with enabling it by default in the code.

Especially in larger production deployments, it can cause serious damage.

I will prepare a patch that makes flush_on_reconnect configurable via backend_argument.
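
As a rough idea of what such a switch could look like, here is a minimal sketch. It assumes that backend_argument values arrive as a list of "<argname>:<value>" strings and that the flag would be named `flush_on_reconnect`. The helper names are hypothetical, and this is not the code of the proposed patch.

```python
def parse_backend_arguments(raw_arguments):
    """Turn ["name:value", ...] strings into a dict of backend arguments."""
    parsed = {}
    for item in raw_arguments:
        # Split on the first colon only, so values such as host:port survive.
        name, sep, value = item.partition(":")
        if not sep or not name.strip():
            raise ValueError(f"expected '<argname>:<value>', got {item!r}")
        parsed[name.strip()] = value.strip()
    return parsed


def as_bool(text):
    """Interpret common textual spellings of a boolean."""
    lowered = text.lower()
    if lowered in ("1", "true", "yes", "on"):
        return True
    if lowered in ("0", "false", "no", "off"):
        return False
    raise ValueError(f"not a boolean: {text!r}")


def pool_flush_on_reconnect(raw_arguments, default=True):
    """Return the flush setting, keeping the current behaviour by default."""
    arguments = parse_backend_arguments(raw_arguments)
    if "flush_on_reconnect" not in arguments:
        return default
    return as_bool(arguments["flush_on_reconnect"])
```

With this shape, an operator who passes `flush_on_reconnect:false` opts out of the flush, while deployments that pass nothing keep the behaviour introduced by the change under discussion.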

Michal Arbet (michalarbet) wrote :

This was tested on a clean OpenStack installation deployed via kolla-ansible. The issue was found in real production, and reverting the oslo.cache change fixed it.

Michal Arbet (michalarbet) wrote :

NOTE: The second terminal window just simulates what is really going on. In practice, you can reach a state where one memcached node is dead, after which connections are closed, the second node is flushed, and so on.

Jeremy Stanley (fungi) wrote :

It looks like this may also be the same as public security bug 1883659 and its duplicate bug 1892852.

Changed in ossa:
status: New → Incomplete
information type: Public → Public Security

Change abandoned by Michal Arbet on branch: master
Review: https://review.opendev.org/742193

Radosław Piliszek (yoctozepto) wrote :

This one does not mention Neutron, but was Neutron involved here as well?
