Multiple failed tempest tests with error message "tempest.lib.exceptions.UnexpectedContentType: Unexpected content type provided Details: 502"
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Keystone Charm | New | Undecided | Unassigned |
Bug Description
The problem is experienced on focal-yoga and focal-ussuri clouds with a disaggregated architecture (3 control nodes, 9 compute nodes, 9 storage nodes). Control nodes have 2 × 20-core CPUs with hyperthreading enabled (80 logical CPUs).
Example of failed tempest test:
```
<testcase classname=
<
File "/home/
raise value.with_
File "/home/
cls.
File "/home/
super(
File "/home/
credential_
File "/home/
creds = getattr(
File "/home/
return self.get_
File "/home/
credentials = self._create_
File "/home/
self.
File "/home/
role['id'])
File "/home/
(project_id, user_id, role_id), None)
File "/home/
return self.request('PUT', url, extra_headers, headers, body, chunked)
File "/home/
self.
File "/home/
resp=resp)
tempest.lib.exceptions.UnexpectedContentType: Unexpected content type provided
Details: 502
</failure>
</testcase>
```
I have increased the `LogLevel` to `debug` for apache2 on keystone units and noticed the following entries in /var/log/
```
[Thu Aug 18 19:04:12.106381 2022] [mpm_event:debug] [pid 1652978:tid 139648898684672] event.c(1806): Too many open connections (36), not accepting new conns in this process
[Thu Aug 18 19:04:12.107231 2022] [mpm_event:debug] [pid 1652978:tid 139648898684672] event.c(485): AH00457: Accepting new connections again: 36 active conns (13 lingering/0 clogged/0 suspended), 2 idle workers
[Thu Aug 18 19:04:12.214808 2022] [mpm_event:debug] [pid 1652978:tid 139648898684672] event.c(1806): Too many open connections (36), not accepting new conns in this process
[Thu Aug 18 19:04:12.223126 2022] [mpm_event:debug] [pid 1652978:tid 139648898684672] event.c(485): AH00457: Accepting new connections again: 36 active conns (12 lingering/0 clogged/0 suspended), 1 idle workers
```
Meanwhile, /var/log/
```
keystone.
```
For this deployment the `worker-multiplier` config option of the keystone charm is not set, so by default the charm configures 3 workers for the public endpoint and 1 worker for the internal endpoint:
```
ubuntu@
<VirtualHost *:35337>
WSGIDaemonP
--
<VirtualHost *:4980>
WSGIDaemonP
```
By default, each worker is able to handle 25 connections, as configured in `ThreadsPerChild` in /etc/apache2/
```
ubuntu@
# event MPM
# StartServers: initial number of server processes to start
# MinSpareThreads: minimum number of worker threads which are kept spare
# MaxSpareThreads: maximum number of worker threads which are kept spare
# ThreadsPerChild: constant number of worker threads in each server process
# MaxRequestWorkers: maximum number of worker threads
# MaxConnectionsPerChild: maximum number of requests a server process serves
<IfModule mpm_event_module>
	StartServers		2
	MinSpareThreads		25
	MaxSpareThreads		75
	ThreadLimit		64
	ThreadsPerChild		25
	MaxRequestWorkers	150
	MaxConnectionsPerChild	0
</IfModule>
# vim: syntax=apache ts=4 sw=4 sts=4 sr noet
```
So, in theory, each keystone unit should be able to handle 3 × 25 = 75 simultaneous connections on the public endpoint (100 across all 4 workers).
To find out how many connections are requested during the tempest run, I enabled the mod_status apache2 module and logged the server-status output to a file, converting it to CSV on the fly as follows:
```
# Save server status as CSV file
# tail -n +2 # Read input starting from second line
# sed "s/^.*: //" # Remove key names, leave only values
# paste -d ";" -s # Add ";" between values
while true; do
echo -n "$(date +'%Y-%m-%d %H:%M:%S.%N');" | \
tee -a apache-
curl -s http://
tail -n +2 | \
sed "s/^.*: //" | \
paste -d ";" -s | \
tee -a apache-
sleep 0.5
done
```
See attached csv files and the example graph of TotalConns over time.
As a result, I found out that the maximum ConnsTotal on each keystone unit was ~90 at the peak, which is still lower than the allowed 100 connections (25 connections per worker; 4 workers). Maybe the ratio of connections between the public and internal endpoints was not exactly 0.75:0.25, so some connections could not be handled. It seems that 3 workers for the public API and 1 worker for the internal API are not enough.
So I bumped up `worker-multiplier` as much as possible while staying within the limits enforced by the configuration in mpm_event.conf.
With MaxRequestWorkers set to 150 and ThreadsPerChild set to 25, at most 150 / 25 = 6 child processes can serve connections at the same time.
The target `worker-multiplier` is therefore 0.075. This is because math.ceil(80 CPUs * 0.075 worker-multiplier * 0.75 public_api_share) = 5 workers for the public endpoint, plus math.ceil(80 * 0.075 * 0.25) = 2 workers for the internal endpoint.
Even if we went higher than 0.075 for `worker-multiplier` -- as calculated above -- it would have no effect as we are limited by the configuration in mpm_event.conf. We can't go higher than 150 simultaneous connections in total (MaxRequestWorkers 150).
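The arithmetic above can be replayed in Python. The 0.75/0.25 split between public and internal workers is the ratio assumed throughout this report, not something read from the charm source:

```python
import math

# Replay the worker-count arithmetic from this report.
cpus = 80                  # 2 x 20 cores with hyperthreading
multiplier = 0.075         # proposed worker-multiplier
threads_per_child = 25     # ThreadsPerChild in mpm_event.conf
max_request_workers = 150  # MaxRequestWorkers in mpm_event.conf

public_workers = math.ceil(cpus * multiplier * 0.75)    # ceil(4.5) = 5
internal_workers = math.ceil(cpus * multiplier * 0.25)  # ceil(1.5) = 2

# Raw thread capacity across all workers...
capacity = (public_workers + internal_workers) * threads_per_child  # 175
# ...but mpm_event never serves more than MaxRequestWorkers connections.
effective = min(capacity, max_request_workers)
print(public_workers, internal_workers, effective)  # 5 2 150
```

This also illustrates why pushing the multiplier higher changes nothing: the effective ceiling stays pinned at MaxRequestWorkers.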
I configured `worker-multiplier` to 0.075 and re-ran the tempest tests. This time I expected all tests to pass, because keystone was now configured with enough workers to handle 150 simultaneous connections.
Unfortunately some tempest tests still failed, and apache's error.log still contained new "Too many open connections" entries, such as:
```
[Thu Aug 18 21:16:19.543428 2022] [mpm_event:debug] [pid 1873553:tid 140431379646208] event.c(1806): Too many open connections (33), not accepting new conns in this process
[Thu Aug 18 21:16:19.577759 2022] [mpm_event:debug] [pid 1873553:tid 140431379646208] event.c(485): AH00457: Accepting new connections again: 33 active conns (9 lingering/0 clogged/0 suspended), 1 idle workers
```
I'm not sure but I think these entries are not errors. They just mean that a particular process is not able to accept more connections. And new connections should be passed to another worker process that is able to handle them.
Looking at the server-status output again reveals that the maximum number of processes at peak load was 3, each able to handle 25 connections. I don't understand why more processes were not spawned when there was a need to handle more connections.
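One bound worth keeping in mind: with the stock limits, mpm_event cannot usefully run more than MaxRequestWorkers / ThreadsPerChild serving children, and MaxSpareThreads causes idle children to be reaped, which may contribute to seeing so few processes. A rough calculation:

```python
# Upper bound on the number of mpm_event child processes that can be
# serving connections at once, given the stock limits quoted earlier.
max_request_workers = 150
threads_per_child = 25
max_children = max_request_workers // threads_per_child
print(max_children)  # -> 6
```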
As an experiment, I bumped ThreadsPerChild from 25 to 50 in mpm_event.conf, restarted apache2 on all keystone units, and re-ran tempest. This time I found that:
- each keystone unit is exposed to ~120 simultaneous connections (at the peak),
- apache's error.log does not contain new "Too many open connections" entries,
- all tempest tests pass.
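For reference, the experiment above corresponds to an override along these lines (a sketch only, using the values from this report; note that ThreadsPerChild must stay at or below ThreadLimit):

```
# Experimental override: double ThreadsPerChild so that
# 3 child processes can serve 3 x 50 = 150 connections.
<IfModule mpm_event_module>
	ThreadLimit		64
	ThreadsPerChild		50
	MaxRequestWorkers	150
</IfModule>
```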
At this point I'm not sure what exactly is causing the problem as I was successfully running tempest tests in much smaller environments, even in my Orange Box lab.
I think that config options in mpm_event.conf should be somehow adjusted when `worker-multiplier` is modified, as already mentioned in bug https:/
Any hints / ideas would be greatly appreciated!
Subscribing field-high as this is being encountered on a customer deployment and no reasonable workaround exists yet.