Mysql needs some tuning in order to work faster

Bug #1728577 reported by Cédric Jeanneret deactivated on 2017-10-30
This bug affects 1 person
Affects Status Importance Assigned to Milestone

Bug Description

Dear Stacker,

A bit astonished by the slowness of both Horizon and API calls, we searched a bit in order to find what could be so slow.

After some checks, it appears the Galera cluster is the main culprit. The following output is from the mysql tuner script¹:

It would be good to have a look at that output and, if possible, provide some nice tuning in the tripleO deploy process - most of them are static one, and some are memory-dependent, and might be calculated based on puppet's fact like $::memorysize_mb (it's a float).

Thank you!




Steven Hardy (shardy) on 2017-10-30
Changed in tripleo:
status: New → Triaged
milestone: none → queens-2
importance: Undecided → High

As the paste will expire, here's the same log output.

Mike Bayer (zzzeek) wrote :

I will go through this in more detail soon however the toplevel things here that are most applicable to us can be addresed with these features: - Enable innodb_buffer_pool_size configuration - Add option for innodb_flush_log_at_trx_commit = 2 for Galera only

the second option historically has been the one that really provides a quick speed boost as disk slowness is a frequent issue, especially that we're now using innodb_file_per_table.

the "should be N * number of CPU cores" stuff is more tricky since the galera node shares the CPU with *lots* of other python services that are also doing the "set workers to N * number of cores". Openstack does not in fact draw upon really deep performance needs from Galera with the possible exception of disk writes. A real analysis of "why is openstack so slow" as always needs detailed profiling of individual services as well as in the database space things like slow query logging, looking for services running too many queries, excessive data being sent over the network, looking for services that are flapping on too many connections / galera write conflicts, which you'd see as errors in the logs of individual services.

But really why I often see things as slow is that often to just run a simple command requires a dozen service calls between keystone and whatnot, which all bounce around the overcloud nodes and all go through HAProxy as well - there's no logic to try to keep these API calls local to one controller or anything like that (we've done some work internally on this and it would require large, dramatic architectural changes to tripleo to work). so network latency between nodes is a *big* factor.

Mike Bayer (zzzeek) wrote :

basically, I need to see the detailed data backing up this statement:

    After some checks, it appears the Galera cluster is the main culprit.

that data would be very helpful. I would be skeptical of this claim on the read side, that is, Galera is causing the Python applications to spend lots of time waiting on IO. On the write side, this is more plausible, but not all horizion commands write anything.

Hello Mike,

sorry for the delay, I was a bit overwhelmed with other stuff.

So. apparently, the haproxy metrics doesn't show slow things with its http backends, at least the answers are fast there. But indeed, the "bouncing balancer in all directions" might also be an issue.

Fact is, among all the proposal to """correct""" mysql/galera, one might really do a difference:
"InnoDB flush log at each commit should be disabled"

Disabling that would prevent unnecessary fsync to the disk - although it might be requested by galera as it's a "synchronized replication", meaning all members must confirm the write before the lock is removed.

Also, the lack of index in ovs_neutron.pending_ucast_macs_remotes table might be a small issue, although I confess I didn't check its content nor its usage.

Fact is: it's hard to find the bottleneck with all those crossed queries (VIP is on node A, but queries bounces through node B and C without any distinctions).

I'm trying to understand what's going on with all those services, and apparently, the galera was a good culprit, especially since the mysqlTunner script raised some issues with its configuration.
Of course, all the memory-related issues are hard to "correct", since mysql isn't the only service running there.

I'm wondering what would happen if we could push the galera cluster on dedicated nodes, btw… Not that we actually can do that, but it might be interesting.

That said, seeing all the wsgi stuff, those might as well have some perfs issues.

We have to create some grafana dashboards with all the metrics, and (re)activate the haproxy exporter so that we can get the lower stats from the endpoint/loadbalancer.

I propose to keep that issue open so that I can add new comments/information, but to not stress too much with it. It would be great to find where's the bottleneck, although I'm pretty sure there are more than one.

We might rename this issue as well, something like "global perfs are bad" or something like that, as it appears it's maybe not only the galera cluster cause.



tags: added: pike-backport-potential
Changed in tripleo:
milestone: queens-2 → queens-3
Changed in tripleo:
milestone: queens-3 → queens-rc1
Changed in tripleo:
milestone: queens-rc1 → rocky-1
Changed in tripleo:
milestone: rocky-1 → rocky-2
Changed in tripleo:
milestone: rocky-2 → rocky-3
Changed in tripleo:
milestone: rocky-3 → rocky-rc1
Changed in tripleo:
milestone: rocky-rc1 → stein-1
Changed in tripleo:
milestone: stein-1 → stein-2
Changed in tripleo:
milestone: stein-2 → stein-3
Changed in tripleo:
milestone: stein-3 → stein-rc1
Changed in tripleo:
milestone: stein-rc1 → train-1
Changed in tripleo:
milestone: train-1 → train-2
Changed in tripleo:
milestone: train-2 → train-3
Changed in tripleo:
milestone: train-3 → ussuri-1
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Bug attachments