mariadbcheck@ service is called with a unique parameter every time, causing high cardinality

Bug #2030748 reported by George Shuklin
This bug affects 2 people
Affects: OpenStack-Ansible
Status: Triaged
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

There is a systemd service template, mariadbcheck@, which is being called with a unique parameter every time.

Aug 08 12:03:44 infra2-osa-mvp systemd[1]: Starting mariadbcheck@ service (10.33.177.164:53204)...
Aug 08 12:03:44 infra2-osa-mvp systemd[1]: mariadbcheck@126371-10.33.177.164:9200-10.33.177.100:50808.service: Succeeded.
Aug 08 12:03:44 infra2-osa-mvp systemd[1]: Finished mariadbcheck@ service (10.33.177.100:50808).
Aug 08 12:03:44 infra2-osa-mvp systemd[1]: mariadbcheck@126372-10.33.177.164:9200-10.33.177.228:58408.service: Succeeded.
Aug 08 12:03:44 infra2-osa-mvp systemd[1]: Finished mariadbcheck@ service (10.33.177.228:58408).
Aug 08 12:03:44 infra2-osa-mvp systemd[1]: mariadbcheck@126373-10.33.177.164:9200-10.33.177.164:53204.service: Succeeded.
Aug 08 12:03:44 infra2-osa-mvp systemd[1]: Finished mariadbcheck@ service (10.33.177.164:53204).

The problem is that this instance suffix ('tail') creates a new UNIT value on every call: in journald, in Prometheus (node-exporter with the systemd collector enabled), and in Loki if systemd logs are properly parsed.

In five days it generated about a third of a million unique values for the service name:

time journalctl -F UNIT|wc -l
311751

real 1m6.551s
user 0m11.799s
sys 0m51.035s

I believe that having O(t) systemd unit names (a count that grows linearly with uptime) will break many monitoring systems, because they assume that the unit list stays more or less the same.
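
To confirm which template is responsible for the growth, the output of `journalctl -F UNIT` can be grouped by template name. The following is a minimal Python sketch; the helper names `unit_template` and `count_by_template` are hypothetical and not part of any existing tool, and it assumes that everything between `@` and the final unit-type suffix is the instance part.

```python
from collections import Counter


def unit_template(unit: str) -> str:
    """Collapse an instantiated unit name to its template.

    'foo@bar.service' -> 'foo@.service'; names without '@' are returned unchanged.
    """
    name, dot, suffix = unit.rpartition(".")
    if not dot or "@" not in name:
        return unit  # not an instantiated unit, keep as-is
    prefix = name.split("@", 1)[0]
    return f"{prefix}@.{suffix}"


def count_by_template(units):
    """Count how many distinct unit names map to each template."""
    return Counter(unit_template(u) for u in set(units))
```

Feeding it the lines printed by `journalctl -F UNIT` and looking at `count_by_template(lines).most_common(5)` should show whether `mariadbcheck@.service` dominates the list.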

Proposition: remove timestamp from parametrization for calling mariadbcheck@ service.

Dmitriy Rabotyagov (noonedeadpunk) wrote :

Hi,

Essentially, a templated service name is required here, because the service is started by a socket unit with Accept=yes:
https://www.freedesktop.org/software/systemd/man/systemd.socket.html#Accept=

To overcome this issue, we'd need to disable `Accept`, which requires a proper daemon behind the socket rather than just a script that was moved over from xinetd.

At the moment we don't have time to refactor clustercheck [1], as we wanted to work on replacing HAProxy with ProxySQL for MariaDB balancing.
So contributions to that in the meantime are welcome.

[1] https://opendev.org/openstack/openstack-ansible-galera_server/src/branch/master/templates/clustercheck.j2

Changed in openstack-ansible:
status: New → Triaged
importance: Undecided → Medium
Jonathan Rosser (jrosser) wrote :

A solution here might be to construct a regex that excludes the mariadbcheck units from node-exporter using the --collector.systemd.unit-exclude flag. See the documentation here: https://github.com/prometheus/node_exporter/blob/master/README.md#include--exclude-flags
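
Such a regex can be checked against the unit names from the journal before it is deployed. The sketch below is only an illustration: the pattern value is a hypothetical candidate, and it assumes the flag matches the pattern against the whole unit name, which `re.fullmatch` imitates here.

```python
import re

# Hypothetical candidate value for --collector.systemd.unit-exclude
EXCLUDE_PATTERN = r"mariadbcheck@.+\.service"


def is_excluded(unit: str, pattern: str = EXCLUDE_PATTERN) -> bool:
    """Return True if the whole unit name matches the exclude pattern."""
    return re.fullmatch(pattern, unit) is not None
```

Note that setting the flag replaces whatever exclude pattern node-exporter would otherwise use, so an existing value would have to be merged into the new one.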

George Shuklin (george-shuklin) wrote :

Yes, that's one of the workarounds. It can be a skip setting in node-exporter, or a label drop or rewrite in Prometheus (and in Loki, which suffers from the same cardinality problem), but none of them is a proper solution.

Jonathan Rosser (jrosser) wrote :

I'm not sure there is an alternative - xinetd is deprecated and systemd socket activated services are the modern replacement.

You suggested that we "remove timestamp from parametrization for calling mariadbcheck@ service", but I cannot see a way to do that with systemd.socket.

This will happen in more places. I see Ironic starting to use socket-activated services for tftp, which will behave in exactly the same way.

George Shuklin (george-shuklin) wrote :

So the cardinality problem will become more and more of an issue. Even local journalctl struggles if the server has been running for a long time (which it should).

Maybe we can escalate it to systemd and ask them not to create a new name for each socket activation...

George Shuklin (george-shuklin) wrote :

Dmitriy Rabotyagov (noonedeadpunk) wrote :

Well, as I said, the only alternative is to rewrite the check from the xinetd format into a proper daemon, such as a dummy flask or aiohttp application.
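
As a rough illustration of what such a daemon could look like, here is a standard-library-only Python sketch that accepts a listening socket passed by systemd (file descriptor 3, as used with Accept=no) and answers HTTP checks from one long-running process. Everything in it is an assumption made for illustration: `cluster_is_synced` is a hypothetical placeholder for the real clustercheck logic, and the 200/503 responses and message text are not taken from the existing script.

```python
import os
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer

SD_LISTEN_FDS_START = 3  # first file descriptor passed by systemd socket activation


def cluster_is_synced() -> bool:
    """Hypothetical placeholder: a real check would query MariaDB for its wsrep state."""
    return True


class CheckHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        ok = cluster_is_synced()
        body = b"node is synced\n" if ok else b"node is not synced\n"
        self.send_response(200 if ok else 503)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # do not log every health check to the journal


def socket_from_systemd():
    """Return the listening socket passed by systemd, or None if there is none."""
    if os.environ.get("LISTEN_PID") != str(os.getpid()):
        return None
    if int(os.environ.get("LISTEN_FDS", "0")) < 1:
        return None
    return socket.socket(fileno=SD_LISTEN_FDS_START)


def make_server(listen_sock=None):
    """Build the HTTP server on an inherited socket, or on an ephemeral local port."""
    server = HTTPServer(("127.0.0.1", 0), CheckHandler, bind_and_activate=False)
    if listen_sock is not None:
        server.socket.close()  # discard the unbound socket created by the constructor
        server.socket = listen_sock  # already bound and listening
        server.server_address = listen_sock.getsockname()
    else:
        server.server_bind()
        server.server_activate()
    return server


def run():
    """Entry point for the service: serve checks until the process is stopped."""
    make_server(socket_from_systemd()).serve_forever()
```

With a daemon of this shape the socket unit no longer needs Accept=yes, so a single fixed unit name handles all connections instead of one instance per connection.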

As systemd writes in their docs for the Accept property [1]: "For performance reasons, it is recommended to write new daemons only in a way that is suitable for Accept=no."

So they are pretty much aware that such use cases are far from perfect.

[1] https://www.freedesktop.org/software/systemd/man/systemd.socket.html#Accept=
