oslo.cache exponentially increases connections to memcached
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Security Advisory | Won't Fix | Undecided | Unassigned |
oslo.cache | Fix Released | Undecided | Unassigned |
Bug Description
Hi,
The commit below, merged to master (and also to stein, rocky, and queens), causes a severe performance issue whenever connections to the memcached servers are dropped, whatever the reason.
https:/
NOTE in the change:
+ # NOTE(morgan): Explicitly set flush_on_reconnect for pooled
+ # connections. This should ensure that stale data is never consumed
+ # from a server that pops in/out due to a network partition
+ # or disconnect.
+ #
+ # See the help from python-memcached:
+ #
+ # param flush_on_reconnect: optional flag which prevents a
+ # scenario that can cause stale data to be read: If there's more
+ # than one memcached server and the connection to one is
+ # interrupted, keys that mapped to that server will get
+ # reassigned to another. If the first server comes back, those
+ # keys will map to it again. If it still has its data, get()s
+ # can read stale data that was overwritten on another
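To make the scenario in the quoted help text concrete, here is a toy in-memory simulation. It is only a sketch: `FakeServer`, `pick_server`, and `run_scenario` are hypothetical names invented for illustration, and the key-to-server mapping is deliberately simplified rather than being python-memcached's real hashing.

```python
class FakeServer:
    """In-memory stand-in for one memcached server (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True


def pick_server(key, servers):
    """Map a key to a live server, falling over to the next one if needed."""
    # Simplified deterministic slot; NOT the real client hashing.
    start = sum(key.encode()) % len(servers)
    for offset in range(len(servers)):
        server = servers[(start + offset) % len(servers)]
        if server.alive:
            return server
    raise RuntimeError("no live servers")


def run_scenario(flush_on_reconnect):
    """Replay the partition scenario and return what a final get() sees."""
    servers = [FakeServer("a"), FakeServer("b")]
    key = "token"

    home = pick_server(key, servers)
    home.data[key] = "old"            # value written before the partition

    home.alive = False                # connection to the home server drops
    fallback = pick_server(key, servers)
    fallback.data[key] = "new"        # key is reassigned and overwritten

    home.alive = True                 # home server comes back with old data
    if flush_on_reconnect:
        home.data.clear()             # models the flush issued on reconnect

    return pick_server(key, servers).data.get(key)


print(run_scenario(flush_on_reconnect=False))
print(run_scenario(flush_on_reconnect=True))
```

Without the flush, the returning server hands back the stale value. With it, the read is a cache miss, at the cost of wiping that server's whole cache on every reconnect, which is the behavior this report is about.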
In the real world, when a memcached client's connection is broken, whatever the reason (typically a network issue), the client reconnects to memcached but also flushes it.
If several clients do this, the number of connections to memcached climbs rapidly.
As a result, memcached is severely overloaded, the flush is repeated again and again, and the connection count keeps climbing.
A simple test reproduces the problem:
Terminal window 1, showing how many connections there currently are:
Tue Jul 21 12:38:17 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:18 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:19 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:20 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:21 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:22 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:23 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:24 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:25 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:26 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:27 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:28 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:30 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:31 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:32 UTC 2020 >>> Number of connections : 25
Tue Jul 21 12:38:33 UTC 2020 >>> Number of connections : 67
Tue Jul 21 12:38:34 UTC 2020 >>> Number of connections : 162
Tue Jul 21 12:38:35 UTC 2020 >>> Number of connections : 261
Tue Jul 21 12:38:36 UTC 2020 >>> Number of connections : 357
Tue Jul 21 12:38:37 UTC 2020 >>> Number of connections : 456
Tue Jul 21 12:38:38 UTC 2020 >>> Number of connections : 556
Tue Jul 21 12:38:39 UTC 2020 >>> Number of connections : 660
Tue Jul 21 12:38:40 UTC 2020 >>> Number of connections : 752
Tue Jul 21 12:38:41 UTC 2020 >>> Number of connections : 851
Tue Jul 21 12:38:43 UTC 2020 >>> Number of connections : 952
Tue Jul 21 12:38:44 UTC 2020 >>> Number of connections : 1050
Tue Jul 21 12:38:45 UTC 2020 >>> Number of connections : 1150
Tue Jul 21 12:38:46 UTC 2020 >>> Number of connections : 1248
Tue Jul 21 12:38:47 UTC 2020 >>> Number of connections : 1343
Tue Jul 21 12:38:48 UTC 2020 >>> Number of connections : 1440
Tue Jul 21 12:38:49 UTC 2020 >>> Number of connections : 1536
Tue Jul 21 12:38:50 UTC 2020 >>> Number of connections : 1636
Tue Jul 21 12:38:51 UTC 2020 >>> Number of connections : 1738
Tue Jul 21 12:38:52 UTC 2020 >>> Number of connections : 1838
Tue Jul 21 12:38:53 UTC 2020 >>> Number of connections : 1943
Tue Jul 21 12:38:55 UTC 2020 >>> Number of connections : 2045
Tue Jul 21 12:38:56 UTC 2020 >>> Number of connections : 2145
Tue Jul 21 12:38:57 UTC 2020 >>> Number of connections : 2239
Tue Jul 21 12:38:58 UTC 2020 >>> Number of connections : 2335
Tue Jul 21 12:38:59 UTC 2020 >>> Number of connections : 2434
Tue Jul 21 12:39:00 UTC 2020 >>> Number of connections : 2534
Tue Jul 21 12:39:01 UTC 2020 >>> Number of connections : 2629
Tue Jul 21 12:39:02 UTC 2020 >>> Number of connections : 2723
Tue Jul 21 12:39:03 UTC 2020 >>> Number of connections : 2819
Tue Jul 21 12:39:04 UTC 2020 >>> Number of connections : 2916
Tue Jul 21 12:39:06 UTC 2020 >>> Number of connections : 3013
Tue Jul 21 12:39:07 UTC 2020 >>> Number of connections : 3111
Tue Jul 21 12:39:08 UTC 2020 >>> Number of connections : 3210
Tue Jul 21 12:39:09 UTC 2020 >>> Number of connections : 3314
Tue Jul 21 12:39:10 UTC 2020 >>> Number of connections : 3417
Tue Jul 21 12:39:11 UTC 2020 >>> Number of connections : 3519
Tue Jul 21 12:39:12 UTC 2020 >>> Number of connections : 3623
Tue Jul 21 12:39:13 UTC 2020 >>> Number of connections : 3723
Tue Jul 21 12:39:15 UTC 2020 >>> Number of connections : 3822
Tue Jul 21 12:39:16 UTC 2020 >>> Number of connections : 3920
Tue Jul 21 12:39:17 UTC 2020 >>> Number of connections : 4013
Tue Jul 21 12:39:18 UTC 2020 >>> Number of connections : 4109
Tue Jul 21 12:39:19 UTC 2020 >>> Number of connections : 4206
Tue Jul 21 12:39:20 UTC 2020 >>> Number of connections : 4304
Tue Jul 21 12:39:21 UTC 2020 >>> Number of connections : 4407
Tue Jul 21 12:39:22 UTC 2020 >>> Number of connections : 4511
Tue Jul 21 12:39:24 UTC 2020 >>> Number of connections : 4611
Tue Jul 21 12:39:25 UTC 2020 >>> Number of connections : 4712
Tue Jul 21 12:39:26 UTC 2020 >>> Number of connections : 4817
Tue Jul 21 12:39:27 UTC 2020 >>> Number of connections : 4918
Tue Jul 21 12:39:28 UTC 2020 >>> Number of connections : 5025
Tue Jul 21 12:39:29 UTC 2020 >>> Number of connections : 5131
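The report does not show how the connection count above was collected. One possible way, sketched here under that assumption, is to poll memcached's `stats` command and read the `curr_connections` counter. The function names and the one-second interval are hypothetical choices, not taken from the report.

```python
import socket
import time


def parse_curr_connections(stats_text):
    """Return the curr_connections value from a memcached `stats` response."""
    for line in stats_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "STAT" and parts[1] == "curr_connections":
            return int(parts[2])
    raise ValueError("curr_connections not found in stats output")


def fetch_stats(host, port=11211, timeout=2.0):
    """Send `stats` over the text protocol and return the raw response."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stats\r\n")
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
            # The stats response is terminated by an END line.
            if b"".join(chunks).endswith(b"END\r\n"):
                break
        return b"".join(chunks).decode("ascii", errors="replace")


def watch(host, interval=1.0):
    """Print the connection count once per interval until interrupted."""
    while True:
        count = parse_curr_connections(fetch_stats(host))
        stamp = time.strftime("%a %b %d %H:%M:%S UTC %Y", time.gmtime())
        print(stamp, ">>> Number of connections :", count)
        time.sleep(interval)
```

Calling `watch("192.168.205.10")` would poll one of the servers used in the test. Note that the polling connection itself is included in the reported count.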
Terminal window 2, reproducing the case where clients send flushes to memcached:
while true; do cat <( echo "flush_all";echo "quit" ) | nc 192.168.205.10 11211; cat <( echo "flush_all";echo "quit" ) | nc 192.168.205.11 11211;done
root@deploy:
OK
OK
OK
OK
OK
OK
OK
OK
OK
OK
OK
OK
OK
.
.
.
.
.
.
.
OK, I understand that this is a security issue, but to me it only makes sense for keystone, so I really don't agree with enabling it by default in the code.
Especially in larger production deployments it can cause serious damage.
I will prepare a patch that makes flush_on_reconnect configurable via backend_argument.
This was tested on a clean OpenStack installation deployed via kolla-ansible. The issue was found in real production, and reverting the oslo.cache change fixed it.
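As a rough illustration of the proposed direction, the sketch below reads a `flush_on_reconnect` flag out of `backend_argument` entries in the `<argname>:<value>` form. It is not the actual patch: the helper names are hypothetical, and defaulting to "off" is an assumption that simply mirrors the preference stated above.

```python
def parse_backend_arguments(raw_arguments):
    """Turn ["name:value", ...] entries into a dict, skipping malformed ones."""
    parsed = {}
    for item in raw_arguments:
        # Split on the first colon only, so values such as host:port survive.
        name, sep, value = item.partition(":")
        if not sep or not name.strip():
            continue
        parsed[name.strip()] = value.strip()
    return parsed


def flush_on_reconnect_enabled(raw_arguments, default=False):
    """Return whether flush_on_reconnect was requested (hypothetical helper)."""
    value = parse_backend_arguments(raw_arguments).get("flush_on_reconnect")
    if value is None:
        return default
    return value.lower() in ("1", "true", "yes", "on")


# Example: a deployment that opts in explicitly, and one that does not.
print(flush_on_reconnect_enabled(["url:127.0.0.1:11211", "flush_on_reconnect:true"]))
print(flush_on_reconnect_enabled(["url:127.0.0.1:11211"]))
```

With this shape, services such as keystone could opt in to the flush while other services keep it disabled.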