Multiple failed tempest tests with error message "tempest.lib.exceptions.UnexpectedContentType: Unexpected content type provided Details: 502"

Bug #1987104 reported by Przemyslaw Hausman
Affects: OpenStack Keystone Charm
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

The problem is experienced on focal-yoga and focal-ussuri clouds with a disaggregated architecture (3 control nodes, 9 compute nodes, 9 storage nodes). Control nodes have 2x 20-core CPUs with hyperthreading enabled (80 threads in total).

Example of a failed tempest test:

```
    <testcase classname="tempest.api.compute.servers.test_list_servers_negative.ListServersNegativeTestJSON" id="74745ad8-b346-45b5-b9b8-509d7447fc1f" name="test_list_servers_by_changes_since_future_date" time="0.000">
      <failure>Traceback (most recent call last):
  File "/home/ubuntu/snap/fcbtest/29/.rally/verification/verifier-258c7e54-6d53-41d0-a73d-67f88ca8a0a4/repo/tempest/test.py", line 181, in setUpClass
    raise value.with_traceback(trace)
  File "/home/ubuntu/snap/fcbtest/29/.rally/verification/verifier-258c7e54-6d53-41d0-a73d-67f88ca8a0a4/repo/tempest/test.py", line 166, in setUpClass
    cls.setup_credentials()
  File "/home/ubuntu/snap/fcbtest/29/.rally/verification/verifier-258c7e54-6d53-41d0-a73d-67f88ca8a0a4/repo/tempest/api/compute/base.py", line 71, in setup_credentials
    super(BaseV2ComputeTest, cls).setup_credentials()
  File "/home/ubuntu/snap/fcbtest/29/.rally/verification/verifier-258c7e54-6d53-41d0-a73d-67f88ca8a0a4/repo/tempest/test.py", line 388, in setup_credentials
    credential_type=credentials_type)
  File "/home/ubuntu/snap/fcbtest/29/.rally/verification/verifier-258c7e54-6d53-41d0-a73d-67f88ca8a0a4/repo/tempest/test.py", line 704, in get_client_manager
    creds = getattr(cred_provider, credentials_method)()
  File "/home/ubuntu/snap/fcbtest/29/.rally/verification/verifier-258c7e54-6d53-41d0-a73d-67f88ca8a0a4/repo/tempest/lib/common/dynamic_creds.py", line 432, in get_primary_creds
    return self.get_credentials('primary')
  File "/home/ubuntu/snap/fcbtest/29/.rally/verification/verifier-258c7e54-6d53-41d0-a73d-67f88ca8a0a4/repo/tempest/lib/common/dynamic_creds.py", line 399, in get_credentials
    credentials = self._create_creds(admin=is_admin)
  File "/home/ubuntu/snap/fcbtest/29/.rally/verification/verifier-258c7e54-6d53-41d0-a73d-67f88ca8a0a4/repo/tempest/lib/common/dynamic_creds.py", line 251, in _create_creds
    self.creds_client.assign_user_role(user, project, role)
  File "/home/ubuntu/snap/fcbtest/29/.rally/verification/verifier-258c7e54-6d53-41d0-a73d-67f88ca8a0a4/repo/tempest/lib/common/cred_client.py", line 82, in assign_user_role
    role['id'])
  File "/home/ubuntu/snap/fcbtest/29/.rally/verification/verifier-258c7e54-6d53-41d0-a73d-67f88ca8a0a4/repo/tempest/lib/services/identity/v3/roles_client.py", line 82, in create_user_role_on_project
    (project_id, user_id, role_id), None)
  File "/home/ubuntu/snap/fcbtest/29/.rally/verification/verifier-258c7e54-6d53-41d0-a73d-67f88ca8a0a4/repo/tempest/lib/common/rest_client.py", line 363, in put
    return self.request('PUT', url, extra_headers, headers, body, chunked)
  File "/home/ubuntu/snap/fcbtest/29/.rally/verification/verifier-258c7e54-6d53-41d0-a73d-67f88ca8a0a4/repo/tempest/lib/common/rest_client.py", line 703, in request
    self._error_checker(resp, resp_body)
  File "/home/ubuntu/snap/fcbtest/29/.rally/verification/verifier-258c7e54-6d53-41d0-a73d-67f88ca8a0a4/repo/tempest/lib/common/rest_client.py", line 794, in _error_checker
    resp=resp)
tempest.lib.exceptions.UnexpectedContentType: Unexpected content type provided
Details: 502
</failure>
    </testcase>
```

I increased the `LogLevel` to `debug` for apache2 on the keystone units and noticed the following entries in /var/log/apache2/error.log:

```
[Thu Aug 18 19:04:12.106381 2022] [mpm_event:debug] [pid 1652978:tid 139648898684672] event.c(1806): Too many open connections (36), not accepting new conns in this process

[Thu Aug 18 19:04:12.107231 2022] [mpm_event:debug] [pid 1652978:tid 139648898684672] event.c(485): AH00457: Accepting new connections again: 36 active conns (13 lingering/0 clogged/0 suspended), 2 idle workers

[Thu Aug 18 19:04:12.214808 2022] [mpm_event:debug] [pid 1652978:tid 139648898684672] event.c(1806): Too many open connections (36), not accepting new conns in this process

[Thu Aug 18 19:04:12.223126 2022] [mpm_event:debug] [pid 1652978:tid 139648898684672] event.c(485): AH00457: Accepting new connections again: 36 active conns (12 lingering/0 clogged/0 suspended), 1 idle workers
```

Meanwhile, /var/log/apache2/other_vhosts_access.log shows requests failing with a 502 error code:

```
keystone.example.net:443 10.46.21.21 - - [18/Aug/2022:19:04:12 +0000] "DELETE /v3/projects/647aa0e721304a029b2c7cb5ede5067c HTTP/1.1" 502 4528 "-" "python-urllib3/1.26.9"
```

For this deployment the `worker-multiplier` config option of the keystone charm is not set, so by default the charm configures 3 workers for the public endpoint and 1 worker for the admin endpoint:

```
ubuntu@juju-04347a-3-lxd-12:~$ grep processes /etc/apache2/sites-enabled/wsgi-openstack-api.conf -B 1
<VirtualHost *:35337>
    WSGIDaemonProcess keystone-admin processes=1 threads=1 user=keystone group=keystone \
--
<VirtualHost *:4980>
    WSGIDaemonProcess keystone-public processes=3 threads=1 user=keystone group=keystone \
```

By default, each worker process is able to handle 25 connections, as configured by `ThreadsPerChild` in /etc/apache2/mods-enabled/mpm_event.conf:

```
ubuntu@juju-04347a-3-lxd-12:~$ cat /etc/apache2/mods-enabled/mpm_event.conf
# event MPM
# StartServers: initial number of server processes to start
# MinSpareThreads: minimum number of worker threads which are kept spare
# MaxSpareThreads: maximum number of worker threads which are kept spare
# ThreadsPerChild: constant number of worker threads in each server process
# MaxRequestWorkers: maximum number of worker threads
# MaxConnectionsPerChild: maximum number of requests a server process serves
<IfModule mpm_event_module>
        StartServers 2
        MinSpareThreads 25
        MaxSpareThreads 75
        ThreadLimit 64
        ThreadsPerChild 25
        MaxRequestWorkers 150
        MaxConnectionsPerChild 0
</IfModule>

# vim: syntax=apache ts=4 sw=4 sts=4 sr noet
```

So, in theory, each keystone unit should be able to handle 3 × 25 = 75 simultaneous connections on the public endpoint, plus 1 × 25 = 25 on the admin endpoint, i.e. 100 in total.
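
Just to spell that out (a sketch that simply restates the arithmetic above, assuming one connection per thread and that these worker counts are the only limit):

```
# rough per-unit capacity estimate, restating the arithmetic above
threads_per_worker = 25   # ThreadsPerChild
public_workers = 3        # keystone-public
admin_workers = 1         # keystone-admin

print(public_workers * threads_per_worker)                    # 75 (public endpoint)
print((public_workers + admin_workers) * threads_per_worker)  # 100 (per keystone unit)
```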

To find out how many connections are open during the tempest run, I enabled the mod_status apache2 module and logged server-status to a file, converting it to CSV on the fly as follows:

```
# Save server status as CSV file
# tail -n +2 # Read input starting from second line
# sed "s/^.*: //" # Remove key names, leave only values
# paste -d ";" -s # Add ";" between values
while true; do
   echo -n "$(date +'%Y-%m-%d %H:%M:%S.%N');" | \
     tee -a apache-$(hostname).txt
   curl -s http://localhost/server-status?auto | \
     tail -n +2 | \
     sed "s/^.*: //" | \
     paste -d ";" -s | \
     tee -a apache-$(hostname).txt
  sleep 0.5
done
```

See the attached CSV files and the example graph of TotalConns over time.
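
In case it helps to reproduce the graph, here is a minimal plotting sketch (assuming pandas/matplotlib are available; the filename and the column index of ConnsTotal are placeholders -- the `?auto` key order varies between Apache versions, so check one raw line of the log first):

```
# minimal sketch: plot ConnsTotal over time from the CSV produced above
import pandas as pd
import matplotlib.pyplot as plt

CONNS_TOTAL_COL = 16  # placeholder index, adjust to match your server-status output

df = pd.read_csv("apache-keystone-unit.txt", sep=";", header=None)
df[0] = pd.to_datetime(df[0])  # first field is the timestamp written by the loop

plt.plot(df[0], df[CONNS_TOTAL_COL])
plt.xlabel("time")
plt.ylabel("ConnsTotal")
plt.savefig("conns-total.png")
```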

As a result, I found that ConnsTotal on each keystone unit peaked at ~90, which is still lower than the allowed 100 connections (25 connections per worker, 4 workers). Maybe the split of connections between the public and admin endpoints was not exactly 0.75:0.25, so some connections could not be handled. It seems that 3 workers for the public API and 1 worker for the admin API are not enough.

So I bumped up `worker-multiplier` as much as I could while staying within the limits enforced by the configuration in mpm_event.conf.

With MaxRequestWorkers=150 and ThreadsPerChild=25, we can have at most 150/25 = 6 worker processes, handling at most 25 connections each, i.e. 150 connections in total. Splitting those workers 0.75:0.25 and rounding up gives 5 public API workers and 2 admin API workers.

The target `worker-multiplier` is therefore 0.075, because math.ceil(80 CPUs * 0.075 worker-multiplier * 0.75 public_process_weight) == 5, and math.ceil(80 CPUs * 0.075 worker-multiplier * 0.25 admin_process_weight) == 2.
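
To make that explicit (the ceil() and the 0.75/0.25 weights mirror my understanding of how the charm splits workers between the two endpoints):

```
import math

cpus = 80           # 2x 20 cores with hyperthreading
multiplier = 0.075  # proposed worker-multiplier

public_workers = math.ceil(cpus * multiplier * 0.75)  # == 5
admin_workers = math.ceil(cpus * multiplier * 0.25)   # == 2
print(public_workers, admin_workers)
```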

Even if we went higher than 0.075 for `worker-multiplier` -- as calculated above -- it would have no effect as we are limited by the configuration in mpm_event.conf. We can't go higher than 150 simultaneous connections in total (MaxRequestWorkers 150).

I configured `worker-multiplier` to 0.075 and re-ran the tempest tests. This time I expected all tests to pass, because keystone was now configured with enough workers to handle 150 simultaneous connections.

Unfortunately, some tempest tests still failed, and Apache's error.log still contained new "Too many open connections" entries, such as:

```
[Thu Aug 18 21:16:19.543428 2022] [mpm_event:debug] [pid 1873553:tid 140431379646208] event.c(1806): Too many open connections (33), not accepting new conns in this process

[Thu Aug 18 21:16:19.577759 2022] [mpm_event:debug] [pid 1873553:tid 140431379646208] event.c(485): AH00457: Accepting new connections again: 33 active conns (9 lingering/0 clogged/0 suspended), 1 idle workers
```

I'm not sure, but I think these entries are not errors. They just mean that a particular process cannot accept more connections at that moment, and new connections should be handed to another worker process that can handle them.

Looking at server-status again reveals that the maximum number of processes at peak load was 3, each able to handle 25 connections. I don't understand why more processes were not spawned when more connections needed to be handled.

As an experiment, I bumped ThreadsPerChild up from 25 to 50 in mpm_event.conf, restarted apache2 on all keystone units and re-ran tempest. This time I found that:
- each keystone unit is exposed to ~120 simultaneous connections (at the peak),
- apache's error.log does not contain new "Too many open connections" entries,
- all tempest tests pass.

At this point I'm not sure what exactly is causing the problem, as I have successfully run tempest tests in much smaller environments, even in my Orange Box lab.

I think that the config options in mpm_event.conf should somehow be adjusted when `worker-multiplier` is modified, as already mentioned in bug https://bugs.launchpad.net/charm-keystone/+bug/1811379.
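
For illustration only, this is roughly the kind of derivation I would expect (the directive relationships follow the Apache mpm_event documentation, but the headroom factor and the resulting numbers are just assumptions on my part):

```
# sketch: derive mpm_event limits from the WSGI worker count instead of keeping
# the Debian defaults; ServerLimit * ThreadsPerChild must cover MaxRequestWorkers,
# and ThreadLimit must be >= ThreadsPerChild
import math

threads_per_child = 25
wsgi_workers = 5 + 2   # public + admin workers per keystone unit
headroom = 2           # arbitrary margin for other clients of this Apache

max_request_workers = (wsgi_workers + headroom) * threads_per_child  # 225
server_limit = math.ceil(max_request_workers / threads_per_child)    # 9

print(f"MaxRequestWorkers {max_request_workers}")
print(f"ServerLimit {server_limit}")
print(f"ThreadsPerChild {threads_per_child}")
```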

Any hints / ideas would be greatly appreciated!

Revision history for this message
Przemyslaw Hausman (phausman) wrote :

Subscribing field-high as it is being encountered on a customer deployment and no reasonable workaround exist yet.

