Bug #1728577 “Mysql needs some tuning in order to work faster” : Bugs : tripleo

Steven Hardy (shardy) on 2017-10-30

Changed in tripleo:
status:	New → Triaged
milestone:	none → queens-2
importance:	Undecided → High

Revision history for this message

Cédric Jeanneret deactivated (cjeanneret-c2c-deactivated) wrote on 2017-10-30:

#1

mysql-tuner.txt Edit (9.2 KiB, text/plain)

As the paste will expire, here's the same log output.

Revision history for this message

Mike Bayer (zzzeek) wrote on 2017-10-30:

#2

I will go through this in more detail soon however the toplevel things here that are most applicable to us can be addresed with these features:

https://review.openstack.org/#/c/490046/ - Enable innodb_buffer_pool_size configuration

https://review.openstack.org/#/c/479849/ - Add option for innodb_flush_log_at_trx_commit = 2 for Galera only

the second option historically has been the one that really provides a quick speed boost as disk slowness is a frequent issue, especially that we're now using innodb_file_per_table.

the "should be N * number of CPU cores" stuff is more tricky since the galera node shares the CPU with *lots* of other python services that are also doing the "set workers to N * number of cores". Openstack does not in fact draw upon really deep performance needs from Galera with the possible exception of disk writes. A real analysis of "why is openstack so slow" as always needs detailed profiling of individual services as well as in the database space things like slow query logging, looking for services running too many queries, excessive data being sent over the network, looking for services that are flapping on too many connections / galera write conflicts, which you'd see as errors in the logs of individual services.

But really why I often see things as slow is that often to just run a simple command requires a dozen service calls between keystone and whatnot, which all bounce around the overcloud nodes and all go through HAProxy as well - there's no logic to try to keep these API calls local to one controller or anything like that (we've done some work internally on this and it would require large, dramatic architectural changes to tripleo to work). so network latency between nodes is a *big* factor.

Revision history for this message

Mike Bayer (zzzeek) wrote on 2017-10-30:

#3

basically, I need to see the detailed data backing up this statement:

After some checks, it appears the Galera cluster is the main culprit.

that data would be very helpful. I would be skeptical of this claim on the read side, that is, Galera is causing the Python applications to spend lots of time waiting on IO. On the write side, this is more plausible, but not all horizion commands write anything.

Revision history for this message

Cédric Jeanneret deactivated (cjeanneret-c2c-deactivated) wrote on 2017-11-15:

#4

Hello Mike,

sorry for the delay, I was a bit overwhelmed with other stuff.

So. apparently, the haproxy metrics doesn't show slow things with its http backends, at least the answers are fast there. But indeed, the "bouncing balancer in all directions" might also be an issue.

Fact is, among all the proposal to """correct""" mysql/galera, one might really do a difference:
"InnoDB flush log at each commit should be disabled"

Disabling that would prevent unnecessary fsync to the disk - although it might be requested by galera as it's a "synchronized replication", meaning all members must confirm the write before the lock is removed.

Also, the lack of index in ovs_neutron.pending_ucast_macs_remotes table might be a small issue, although I confess I didn't check its content nor its usage.

Fact is: it's hard to find the bottleneck with all those crossed queries (VIP is on node A, but queries bounces through node B and C without any distinctions).

I'm trying to understand what's going on with all those services, and apparently, the galera was a good culprit, especially since the mysqlTunner script raised some issues with its configuration.
Of course, all the memory-related issues are hard to "correct", since mysql isn't the only service running there.

I'm wondering what would happen if we could push the galera cluster on dedicated nodes, btw… Not that we actually can do that, but it might be interesting.

That said, seeing all the wsgi stuff, those might as well have some perfs issues.

We have to create some grafana dashboards with all the metrics, and (re)activate the haproxy exporter so that we can get the lower stats from the endpoint/loadbalancer.

I propose to keep that issue open so that I can add new comments/information, but to not stress too much with it. It would be great to find where's the bottleneck, although I'm pretty sure there are more than one.

We might rename this issue as well, something like "global perfs are bad" or something like that, as it appears it's maybe not only the galera cluster cause.

Cheers,

C.

Hello Mike,

sorry for the delay, I was a bit overwhelmed with other stuff.

So. apparently, the haproxy metrics doesn't show slow things with its http backends, at least the answers are fast there. But indeed, the "bouncing balancer in all directions" might also be an issue.

Fact is, among all the proposal to """correct""" mysql/galera, one might really do a difference: 
"InnoDB flush log at each commit should be disabled"