Francis,

Just FYI, I do not have access to ci.landscape.net, but I did look at the attached logs.
Can we rule out, in this instance, the high-load issues that have been a problem in the past?
The logs seem to indicate a cascading failure, including an inability to reach the rabbit hosts at all:
2017-02-28 21:58:42.312 108426 ERROR oslo.messaging._drivers.impl_rabbit [req-f7be21f4-f063-4bad-a15a-1fb70e6de6d0 - - - - -] [f3ef7b8b-322a-48d9-a27f-631e2bd0ca32] AMQP server on 10.96.65.40:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: 57718
I suspect rabbit connection issues are the symptom and not the cause.
When you see this, can you gather some memory/CPU load information on the hosts?
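For example, something like the following on each affected host around the time of the failure would help (a rough sketch; use whatever tooling you already have there):

    uptime                          # load averages vs. core count
    free -m                         # memory and swap usage
    vmstat 5 3                      # CPU, memory and swap activity over ~15s
    ps aux --sort=-%mem | head -15  # top memory consumers

If the load averages sit well above the core count, that supports the high-load theory.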
What does this hardware look like? How many cores? We have mentioned the worker-multiplier issue before.
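As a reminder, the charms size their API workers as cores x worker-multiplier, so a high multiplier on a small box spawns far more workers than it can sustain. A hedged example of checking and capping it, assuming Juju 2.x and that nova-cloud-controller is one of the affected applications (on Juju 1.25 it would be 'juju set'):

    nproc                                                 # core count on the unit
    juju config nova-cloud-controller worker-multiplier   # current value
    juju config nova-cloud-controller worker-multiplier=1 # roughly one worker per core

The same option exists on the other OpenStack API charms.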
Percona-cluster can at times be misconfigured to take too much memory. Check on its health. We have also made some improvements to percona in next that help ensure this does not occur.
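To sanity-check percona, something along these lines from the percona-cluster unit (a sketch; adjust credentials for your deployment):

    mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_%';"                    # Galera cluster state and size
    mysql -u root -p -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';" # configured buffer pool
    ps -o rss=,args= -C mysqld                                           # actual resident memory of mysqld

If mysqld's RSS plus per-connection overhead approaches physical RAM, that would explain hosts falling over under load.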