Comment 2 for bug 1434262

Patrick Skerrett (pitrick) wrote :

Clay,

Thank you for the concise response. Let me start with a few clarifications & then I can present the results of this morning's tests.

First, let me clarify that I am using 3 nodes in this test environment so I can quickly & clearly see the results. In actual production I have > 20 object servers with the account/container services bundled in, so there are plenty of nodes available (a configuration that was recommended to us after a few days of on-site consulting with SwiftStack engineers).

Secondly, I know you're not splitting hairs, but I do want to stress that my workload is significantly write-heavy, with a target rate of 3000 object writes/sec. The default 500ms timeout is actually enough to bring us to a "halt" :) It's bad enough that if it were allowed to continue long enough, we'd have major problems catching back up and working through the massive backlog.

Last, I wanted to clear up one item. I believe the proxy->object error/outage handling is ideal in 2.2.2 at this point. For some reason, the behavior was different in 2.2.1 (which we were running in production up until this week); that is what originally set me down this path of enhanced redundancy testing. From my observations right now, an outage of an object server only adds additional latency for a few seconds until the error limit triggers, and then the affected machine does not appear to be retried again until a set number of seconds has passed. In 2.2.1, I would see that latency continuously throughout the outage window.
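For reference, I believe the behavior I'm describing is governed by the proxy's error suppression settings; assuming I have the option names and defaults right, they look something like this in proxy-server.conf:

    [app:proxy-server]
    # errors counted against a node before the proxy error-limits it
    error_suppression_limit = 10
    # seconds the error-limited node is skipped before it is tried again
    error_suppression_interval = 60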

In regards to your suggestions:

As expected, setting conn_timeout lower on the object server does indeed cut the overall write latency when a container node is offline. I originally had mixed feelings about lowering it, but after sleeping on it I do not really see a downside. It certainly helps the overall situation, but with so many concurrent writes going on, a large number will inevitably still be slower than normal. This is expected given the configuration options that exist today.
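For completeness, this is roughly what I changed in object-server.conf this morning (the 0.1 value below is just illustrative, not a recommendation):

    [app:object-server]
    # default is 0.5 (500ms); lowering it trims the time each PUT spends
    # waiting on a dead container node before the update is deferred to
    # async_pending for the object-updater to replay later
    conn_timeout = 0.1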

Setting post_quorum_timeout on the proxy, across numerous attempts with different values, does not seem to have any effect on reducing the write latency when the cluster is experiencing a container node outage. I believe this is expected as well, since we can pretty definitively pinpoint the cause of the slowdown: the object server calling out to the down container nodes and hitting conn_timeout.
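For reference, the proxy-side setting I was experimenting with (again, the value shown is only one of several I tried):

    [app:proxy-server]
    # default is 0.5; none of the values I tried changed the write latency
    # observed during a container node outage
    post_quorum_timeout = 0.1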

In closing, I feel better about setting conn_timeout lower today, and that certainly helps to cut down the overall risk of a container outage. There is still a visible degradation, but it's acceptable. However, it still does not fully sit well with me. It is a shame that even if you take great care to build enough redundant nodes into your environment, this one component will ALWAYS cause overall degradation of service if it goes down. The only way to work around it at this time is some sort of VMware-style node migration, or perhaps VRRP migration of the node's IP address to a standby box that will reject the TCP connections outright, etc.

Swift does so many other things well; it seems a shame that we need external help with this one component to ensure no service degradation can ever occur. Having the object server simply ignore a container node once its error count exceeds some threshold X would nicely take care of this, IMO.
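Purely as a sketch of the idea (none of these options exist today as far as I know; the names are made up), I am picturing something analogous to the proxy's error suppression, applied to the object server's container-update path:

    [app:object-server]
    # HYPOTHETICAL: after this many consecutive failed container updates to a
    # node, stop contacting it and write the update straight to async_pending
    container_update_error_limit = 10
    # HYPOTHETICAL: seconds to keep skipping that node before retrying it
    container_update_error_interval = 60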

Thanks, Pat S.