Comment 4 for bug 1900845

clayg (clay-gerrard) wrote :

So the traceback isn't helpful - the Receiver is probably reading an empty string from the socket (because of a broken pipe or similar) and then trying to split it according to the ssync protocol. So one bug that could be fixed would be making the parsing more robust to network errors and timeouts.
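Roughly what "more robust parsing" could look like - this is a hypothetical sketch, not Swift's actual ssync code; the function name and error types are mine:

```python
# Hypothetical sketch of defensive ssync-style line parsing.
# An empty read means the peer hung up (broken pipe / timeout), so we
# raise a clear connection error instead of letting split() blow up
# with a confusing ValueError/IndexError deep in the protocol code.

def parse_ssync_line(raw):
    """Split a 'key:value' protocol line; fail loudly on bad input."""
    if not raw:
        # empty string from recv() == peer closed the connection
        raise ConnectionError('peer closed connection mid-exchange')
    parts = raw.rstrip('\r\n').split(':', 1)
    if len(parts) != 2:
        raise ValueError('malformed ssync line: %r' % raw)
    return parts[0], parts[1]
```

The point is just that the two failure modes (dead socket vs. garbage on the wire) get distinct, actionable exceptions instead of a generic parse traceback.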

But the more immediate issue in the cluster is probably the timeout/broken-pipe breaking replication.

Can you verify with iostat (e.g. `iostat -x 1` and watch %util) which disks are most busy - is it the sender's or the receiver's disks that are underwater on iops? The ideal situation is all disks a little busy - but without tuning of replication workers and concurrency we often get clusters that have a few disks that are TOO busy (better default tunings are actually something we're hoping to discuss at the virtual PTG next week! https://etherpad.opendev.org/p/swift-ptg-wallaby)

Can you configure separate replication server processes [1] on a different port than the proxy/client traffic for better i/o shaping?

1. https://docs.openstack.org/swift/latest/replication_network.html
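For the dedicated replication server, the shape of the config is roughly this - treat it as a sketch (ports and section names should be checked against the replication_network doc above and your deployment):

```ini
# Second object-server instance dedicated to replication traffic,
# on its own port (6200 is just an example). The ring also needs
# replication_ip/replication_port set per device - see the doc above.
[DEFAULT]
bind_port = 6200

[app:object-server]
use = egg:swift#object
# only serve REPLICATE/ssync traffic on this instance
replication_server = true
```

With that in place, client-facing i/o on the normal port and replication i/o stop competing for the same server workers, which is what gives you the i/o shaping.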