StarlingX

Bug #1851287
Comment #3

Kevin Smith (kevin.smith.wrs) wrote on 2019-11-13:

According to the kernel logs, the data and master pods are blocked on I/O while writing to their respective RBD volumes. A Ceph OSD bounce occurs just before the block, which is suspicious. In some cases the situation resolves itself over time: the pods terminate and the lock completes. If it does not, a host-lock --force with the accompanying reboot will forcibly resolve the problem. The exact same symptoms can be reproduced if the volumes are allowed to fill up (which the elasticsearch-curator normally prevents). Unfortunately, the ability to reproduce the problem on the lab that reported it has been lost.

I currently suspect the index shard re-sync between the data pods that occurs after node recovery. It is possible to disable routing (at least temporarily) before a pod starts up, which could help the situation, but without the ability to reproduce the problem it will be hard to verify whether that helps.
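
For reference, a minimal sketch of what temporarily disabling routing could look like. It assumes that "routing" here refers to Elasticsearch's cluster.routing.allocation.enable cluster setting, applied as a transient setting through the _cluster/settings API. The endpoint address and the function names are hypothetical and not taken from the StarlingX deployment, and this has not been verified against the problem described above.

```python
import json
import urllib.request

# Assumed endpoint: the real Elasticsearch client service address in a
# StarlingX deployment will differ.
ES_URL = "http://localhost:9200"

# Values accepted by cluster.routing.allocation.enable.
VALID_MODES = ("all", "primaries", "new_primaries", "none")


def build_allocation_request(mode, base_url=ES_URL):
    """Build (without sending) a PUT that sets shard allocation transiently.

    A transient setting does not survive a full cluster restart, which
    matches the "at least temporarily" intent.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported allocation mode: {mode!r}")
    body = json.dumps(
        {"transient": {"cluster.routing.allocation.enable": mode}}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_cluster/settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def set_allocation(mode, base_url=ES_URL, timeout=10):
    """Send the request and return the decoded JSON response."""
    req = build_allocation_request(mode, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Under these assumptions, set_allocation("none") would be called before the pod starts and set_allocation("all") once the node has recovered, so that shard re-sync only begins when allocation is re-enabled.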