tripleo

Bug #1728577
Comment #4

Cédric Jeanneret deactivated (cjeanneret-c2c-deactivated) wrote on 2017-11-15:

Hello Mike,

sorry for the delay, I was a bit overwhelmed with other stuff.

So, apparently the haproxy metrics don't show any slowness on the HTTP backends; at least the responses are fast there. But indeed, the "balancer bouncing in all directions" might also be an issue.

The fact is, among all the proposals to "correct" mysql/galera, one might really make a difference:
"InnoDB flush log at each commit should be disabled"

Disabling that would prevent an unnecessary fsync to disk at every commit, although galera might require it since it's a "synchronous replication", meaning all members must confirm the write before the lock is released.
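
For reference, the option behind that recommendation is innodb_flush_log_at_trx_commit. Here is a rough Python sketch to see which mode a given config fragment selects. The sample fragment and the flush_mode helper are made up for illustration (they are not taken from our nodes), and configparser only copes with simple fragments, not with a full my.cnf using !includedir directives.

```python
import configparser

# Hypothetical my.cnf fragment; the values are illustrative only.
SAMPLE_CNF = """
[mysqld]
innodb_flush_log_at_trx_commit = 1
innodb_buffer_pool_size = 1G
"""

# Meaning of each value of innodb_flush_log_at_trx_commit.
FLUSH_MODES = {
    "0": "log written and flushed to disk about once per second",
    "1": "log written and flushed to disk at every commit (default)",
    "2": "log written at every commit, flushed about once per second",
}


def flush_mode(cnf_text):
    """Return (value, meaning) for the [mysqld] section of a config fragment."""
    parser = configparser.ConfigParser(
        allow_no_value=True, strict=False, interpolation=None
    )
    parser.read_string(cnf_text)
    # MySQL falls back to 1 when the option is not set.
    value = parser.get("mysqld", "innodb_flush_log_at_trx_commit", fallback="1")
    return value, FLUSH_MODES.get(value, "unknown value")


value, meaning = flush_mode(SAMPLE_CNF)
print(f"innodb_flush_log_at_trx_commit={value}: {meaning}")
```

Values 0 and 2 both skip the fsync at each commit, which is the trade-off discussed above: fewer disk flushes, at the price of possibly losing about a second of transactions on a crash.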

Also, the lack of an index on the ovs_neutron.pending_ucast_macs_remotes table might be a small issue, although I confess I checked neither its content nor its usage.

The fact is: it's hard to find the bottleneck with all those crossed queries (the VIP is on node A, but queries bounce through nodes B and C without any distinction).

I'm trying to understand what's going on with all those services, and apparently galera was a likely culprit, especially since the MySQLTuner script raised some issues with its configuration.
Of course, all the memory-related issues are hard to "correct", since mysql isn't the only service running there.

I'm wondering what would happen if we could push the galera cluster on dedicated nodes, btw… Not that we actually can do that, but it might be interesting.

That said, seeing all the wsgi stuff, those services might well have some performance issues too.

We have to create some grafana dashboards with all the metrics, and (re)activate the haproxy exporter so that we can get the low-level stats from the endpoints/load balancer.
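
Until the exporter is back, the same numbers can be read from haproxy's CSV stats output. Here is a rough Python sketch that ranks servers by rtime (the average response time in ms that haproxy reports). The sample data is invented and trimmed to four columns, the proxy and node names are placeholders, and the slowest_servers helper is only an illustration.

```python
import csv
import io

# Invented sample in the layout of haproxy's CSV stats, trimmed to a few
# columns. The real output has many more fields per line.
SAMPLE_STATS = """
# pxname,svname,scur,rtime
keystone_public,FRONTEND,3,
keystone_public,node-a,1,12
keystone_public,node-b,1,9
keystone_public,BACKEND,2,11
neutron,FRONTEND,5,
neutron,node-a,2,480
neutron,node-c,3,35
neutron,BACKEND,5,210
"""


def slowest_servers(csv_text, top=3):
    """Return the `top` (proxy, server, rtime_ms) tuples with the highest rtime."""
    text = csv_text.lstrip()
    # The header line starts with "# ", which is not part of the first field name.
    if text.startswith("# "):
        text = text[2:]
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        # Skip the aggregated FRONTEND/BACKEND lines and rows without a response time.
        if row["svname"] in ("FRONTEND", "BACKEND") or not row["rtime"]:
            continue
        rows.append((row["pxname"], row["svname"], int(row["rtime"])))
    return sorted(rows, key=lambda r: r[2], reverse=True)[:top]


for proxy, server, rtime in slowest_servers(SAMPLE_STATS, top=2):
    print(f"{proxy}/{server}: {rtime} ms")
```

Ranking per server rather than per backend should help tell whether one node is slow or a whole service is.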

I propose to keep this issue open so that I can add new comments/information, but not to stress too much about it. It would be great to find where the bottleneck is, although I'm pretty sure there is more than one.

We might rename this issue as well, to something like "global performance is bad", as it appears the galera cluster may not be the only cause.

Cheers,