Conflicting documentation for HA memcached

Bug #1903226 reported by Niko Smeds

Affects: OpenStack-Ansible
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

A document at https://docs.openstack.org/openstack-ansible-memcached_server/ussuri/configure-ha.html nicely describes how to make memcached highly available, with examples.

However, Keystone has issues talking to memcached via the HAProxy frontend. There's actually an inline comment recommending that Keystone's `memcache_servers` value be a comma-separated list; see https://github.com/openstack/openstack-ansible-os_keystone/blob/stable/ussuri/templates/keystone.conf.j2#L27

We had to update the Keystone configuration to bypass HAProxy; otherwise we would frequently hit "keystone.exception.TokenNotFound: Failed to validate token" errors, which log the user out of Horizon.
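
To illustrate the difference (addresses below are made up), the form that works for us is the direct comma-separated list rather than the single HAProxy frontend address:

```
# /etc/keystone/keystone.conf (illustrative addresses)
[cache]
# Direct, comma-separated list: Keystone distributes cache entries itself
memcache_servers = 172.29.236.11:11211,172.29.236.12:11211,172.29.236.13:11211

# Single HAProxy frontend, per the HA doc: this is what caused problems for us
# memcache_servers = 172.29.236.9:11211
```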

Perhaps the first document should be updated to include this caveat? I'm happy to help troubleshoot or provide additional details to better understand this problem.

Dmitriy Rabotyagov (noonedeadpunk) wrote:

I can imagine this happening in two cases:
1. You've changed haproxy_balance_alg from source to something else
2. You have dynamic IPs which can change during the session

The `memcache_servers` value is a comma-separated list only when it really is a list of servers, which is not the case when a balancer is used in front of them. With a list, Keystone itself decides which memcached server a specific cache entry is stored on. The problem is that Keystone has no option to mark one of those servers DOWN when you have an issue with a controller, which results in timeouts.

When haproxy is used, we rely on the source IP of the request to route a service to the exact same memcached server as before. Otherwise you will get issues like the ones described.

I've been using this config in several production deployments on Train for a year or so, and have never experienced any issues with it.

Niko Smeds (nsmeds) wrote:

Thanks for the reply. We have not changed the balance mode from `source` - I'll paste our backend options below:

```
backend memcached-back
    mode tcp
    balance source
    stick store-request src
    stick-table type ip size 256k expire 30m
    option tcplog
    option tcp-check
```

The physical hosts, LXC containers, and clients are all statically assigned IPs.

I should have mentioned in my original post: we have also been running this successfully with Train for the past 4+ months. It was only upon a recent upgrade to Ussuri that the issue began.

Niko Smeds (nsmeds) wrote:

My colleague just mentioned that he went into the HAProxy stats dashboard and marked 2 of the 3 memcached backends as down (forcing all traffic to a single memcached service). The issue still occurred.
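
For reference, the equivalent of what he did, done via the HAProxy admin socket, would be something like the following (assuming the runtime socket is enabled at /var/run/haproxy.sock; the server names are illustrative):

```
# Put two of the three memcached backends into maintenance mode,
# forcing all traffic to the remaining server (names are illustrative)
echo "disable server memcached-back/memcached2" | socat stdio /var/run/haproxy.sock
echo "disable server memcached-back/memcached3" | socat stdio /var/run/haproxy.sock

# Bring them back afterwards
echo "enable server memcached-back/memcached2" | socat stdio /var/run/haproxy.sock
echo "enable server memcached-back/memcached3" | socat stdio /var/run/haproxy.sock
```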

So for some reason Keystone (perhaps as of the Ussuri release) is not happy when talking to memcached via HAProxy, even when hitting the same memcached instance.

Dmitriy Rabotyagov (noonedeadpunk) wrote:

Can you please also share your OS and memcached versions? Just so that I can properly reproduce the environment.

Niko Smeds (nsmeds) wrote:

Sure - here are the OS and version details from both the memcached and HAProxy containers.

Memcached

```
# hostnamectl
[...]
    Virtualization: lxc
  Operating System: Ubuntu 18.04.5 LTS
            Kernel: Linux 5.4.0-51-generic
      Architecture: x86-64

# memcached --version
memcached 1.5.6
```

HAProxy

```
$ hostnamectl
[...]
    Virtualization: systemd-nspawn
  Operating System: Ubuntu 18.04.3 LTS
            Kernel: Linux 5.3.0-53-generic
      Architecture: x86-64

$ haproxy -v
HA-Proxy version 1.8.8-1ubuntu0.11 2020/06/22
```

For clarity: we deploy some of our own services into nspawn containers (monitoring, LBs, PXE, etc.) and OSA deploys the OpenStack services into LXC containers.

Niko Smeds (nsmeds) wrote:

Something else we recently read was Django's documentation on memcached at https://docs.djangoproject.com/en/3.1/topics/cache/#memcached.

The document describes Django taking advantage of multiple memcached backends when they are provided in a comma-separated list.

After reading this we decided to try reverting https://docs.openstack.org/openstack-ansible-memcached_server/ussuri/configure-ha.html entirely - all services now talk to memcached directly.
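
For anyone else hitting this, the revert amounted to pointing `memcached_servers` back at the memcached containers instead of the internal VIP, roughly like this in user_variables.yml (container IPs are illustrative):

```
# user_variables.yml: talk to the memcached containers directly, not the VIP
# (IPs are illustrative; the HA doc sets this to the internal LB VIP instead)
memcached_servers: "172.29.236.11:11211,172.29.236.12:11211,172.29.236.13:11211"
```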

We've also been having issues with Horizon frequently logging users out - often within 5 minutes of logging in - and since reverting the HA memcached settings users are no longer being incorrectly logged out. So that's promising!

Dmitriy Rabotyagov (noonedeadpunk) wrote:

The problem here is that once one memcached goes down, you might get 504 timeouts from services, since there's no option to mark a memcached server down and stop pushing requests to it.

What makes me wonder is whether this might be related to having haproxy inside an nspawn container - some MTU or double-NAT issue, for example - as I was not able to reproduce this problem in a sandbox.

Dmitriy Rabotyagov (noonedeadpunk) wrote:

The ideal solution for memcached HA is mcrouter: https://github.com/facebook/mcrouter

But we can't recommend using it, since they provide only Bionic packages - on other distributions you're supposed to build from source.
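
For reference, a minimal mcrouter config that would stand in for the haproxy frontend might look roughly like this (hosts are illustrative; mcrouter then handles marking dead servers and failing over itself):

```
{
  "pools": {
    "memcached": {
      "servers": [
        "172.29.236.11:11211",
        "172.29.236.12:11211",
        "172.29.236.13:11211"
      ]
    }
  },
  "route": "PoolRoute|memcached"
}
```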
