Comment 2 for bug 1669456

David Ames (thedac) wrote :

Francis,

Just FYI, I do not have access to ci.landscape.net, but I did look at the attached logs.

In this instance, can we rule out the high-load issues that have been a problem in the past?

The logs seem to indicate a cascading failure, including not being able to reach the rabbit hosts at all:

2017-02-28 21:58:42.312 108426 ERROR oslo.messaging._drivers.impl_rabbit [req-f7be21f4-f063-4bad-a15a-1fb70e6de6d0 - - - - -] [f3ef7b8b-322a-48d9-a27f-631e2bd0ca32] AMQP server on 10.96.65.40:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: 57718

I suspect the rabbit connection issues are a symptom and not the cause.

When you see this, can you gather some memory/CPU load information on the hosts?
What does this hardware look like? How many cores? We have mentioned the worker-multiplier issue before.
Percona-cluster can at times be misconfigured to take too much memory, so check on its health. We have also made some improvements to percona in next that help guarantee this does not occur.
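
For the memory/CPU load information, something along these lines run on each host while the failure is happening would be enough. This is only a rough sketch: it assumes a Linux host with /proc/meminfo available, and the function names (parse_meminfo, load_snapshot) are made up for illustration, not part of any charm or tool.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Most entries are reported in kB; lines without a numeric value
    or without a colon are skipped.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def load_snapshot(meminfo_path="/proc/meminfo"):
    """Collect core count, load averages and basic memory figures."""
    with open(meminfo_path) as f:
        mem = parse_meminfo(f.read())
    load1, load5, load15 = os.getloadavg()
    cores = os.cpu_count() or 1
    return {
        "cores": cores,
        "load1": load1,
        "load5": load5,
        "load15": load15,
        # A sustained value well above 1.0 suggests the host is overloaded.
        "load1_per_core": load1 / cores,
        "mem_total_kb": mem.get("MemTotal"),
        "mem_available_kb": mem.get("MemAvailable"),
        "swap_free_kb": mem.get("SwapFree"),
    }


if __name__ == "__main__":
    for name, value in load_snapshot().items():
        print(f"{name}: {value}")
```

The per-core load figure and MemAvailable are the two numbers I would look at first. Output from top or free captured at the same moment would serve just as well.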