Object server does not ignore a down container server on writes leading to higher overall cluster latency

Bug #1434262 reported by Patrick Skerrett
This bug affects 2 people
Affects: OpenStack Object Storage (swift)
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

Environment Description -

Swift 2.2.2

1 Region, 3 Zones, 2 Replicas.
3 Object servers, 3 Account servers, 3 Container servers, 1 Proxy. I deliberately split the account and container services onto individual machines in order to narrow down the problem components; I would not normally do this in production.

$ vagrant status
Current machine states:
salt-master running
swift-storage-z01r1 192.168.3.14 running
swift-storage-z02r1 192.168.3.15 running
swift-storage-z03r1 192.168.3.16 running
swift-account-z01r1 192.168.3.17 running
swift-container-z01r1 running
swift-container-z02r1 running
swift-container-z03r1 running
swift-account-z02r1 running
swift-account-z03r1 running
swift-proxy-z01r1 running

Container Ring:
/srv/salt/swift-rings/container.builder, build version 3
4096 partitions, 2.000000 replicas, 1 regions, 3 zones, 3 devices, 0.02 balance, 0.00 dispersion
The minimum number of hours before a partition can be reassigned is 0
The overload factor is 0.00% (0.000000)
Devices:
id region zone ip address port replication ip replication port name weight partitions balance meta

0 1 1 192.168.3.21 6001 192.168.3.21 6001 sdb1 100.00 2731 0.01
1 1 3 192.168.3.23 6001 192.168.3.23 6001 sdb1 100.00 2730 -0.02
2 1 2 192.168.3.22 6001 192.168.3.22 6001 sdb1 100.00 2731 0.01

object-server.conf:
[DEFAULT]
workers = 1
mount_check = false
bind_ip = 192.168.3.X
bind_port = 6100
log_level = DEBUG

[pipeline:main]
pipeline = recon object-server

[app:object-server]
use = egg:swift#object
threads_per_disk = 4

[object-replicator]
concurrency = 4

[object-updater]

[object-auditor]

[filter:recon]
use = egg:swift#recon
recon_cache_path = /var/cache/swift-6100

Object server logs (this line appears on every write attempt when a replica of the target container lives on the down node):

Mar 19 19:16:21 vagrant-ubuntu-trusty-64 object-server: ERROR container update failed with 192.168.3.22:6001/sdb1 (saving for async update later): Host unreachable (txn: tx767cf9a94d74489e8f1a9-00550b2084)
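For context, an update that fails like this is not lost: the object server saves it to the device's async_pending directory (e.g. /srv/node/sdb1/async_pending/..., path shown purely for illustration) and the object-updater replays it once the container server is reachable again. The client-visible cost is only the time spent waiting for the connection attempt to time out.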

**************************************************************************
Problem Description -

When a container node goes down (the node & IP actually go offline, not just the service being disabled), overall cluster write speed is impacted for any write destined to a container that has a replica located on the down node. In that instance, the individual write's time is increased by the conn_timeout value set in the object server config (default 500ms).
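As a rough illustration (approximate numbers, not measurements): with 2 replicas spread over 3 container servers, any single container server holds a replica for roughly 2/3 of all containers, so while one container server is unreachable roughly 2 out of every 3 object writes end up waiting out the conn_timeout on their container update before falling back to an async update.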

Read latency does not appear to be affected by a container outage.

When an object server goes down, the proxy server seems to have the ability to ignore the node for X seconds, and does not attempt to reconnect to it on every write operation. This helps to stabilize overall latency during outages & is the desired behavior.
However, this does not appear to be implemented for container server outages & overall latency is affected during an outage event.

In my production environment, I am trying to sustain at least 3000 WRITES per second at peak times, and a single container server outage can effectively grind all my operations to a halt. It would be great if the object server had the ability to ignore a down container node, similar to how the proxy ignores a down object server for a time.

Tags: container
Revision history for this message
clayg (clay-gerrard) wrote :

I agree a downed container server will have an *impact* on object writes (because we endeavor to synchronously update the listing in the container) - but I feel the problem may benefit from additional classification here.

Aside: I think describing the issue as grinding to a "halt" - is potentially confusing. No offense :\ The timeout values are *meant* to protect from a destructive latency hit when a node is down - and I think in general they're doing what they are supposed to. But most importantly they're designed to be tunable so that an operator can decide exactly how much variance in latency they're willing to accept in an effort to avoid introducing temporary inconsistencies - 500ms may be too high for some people, but it's just a default!

I'm *really* not trying to split hairs as much as make sure we identify what exactly it is we want to fix. There's plenty of things that could be better ;)

The most obvious settings you should tune for this are:

1) the conn_timeout in the object server config (as you identified)

and

2) the post_quorum_timeout on the proxy

I'm not sure if you indicated that lowering the conn_timeout in the object server config was not acceptable? 150-250ms would be fine on many networks I'm sure.

The post_quorum_timeout also seems like a good setting for you - because the worst case scenario with a too-low setting here is that a container *listing* immediately following the 201 response of an object PUT may be temporarily inconsistent if you happen to hit the container replica that the responsible object server hasn't finished updating yet. But under a write heavy load you may not be *doing* many container listings. Also, if the final node beyond the majority success response is being slow specifically *because* of a downed container server - you're not going to get a stale listing from that node regardless :D

Try setting the proxies post_quorum_timeout down to 10-50ms and see if your result is more satisfactory.
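For reference, a minimal sketch of where these two settings live (the values are just the ranges suggested above, not recommendations - tune them for your own network):

object-server.conf:

[app:object-server]
use = egg:swift#object
# timeout for the connections the object server opens to the container
# servers for the synchronous listing update on PUT (default 0.5)
conn_timeout = 0.2

proxy-server.conf:

[app:proxy-server]
use = egg:swift#proxy
# how long the proxy keeps waiting on the remaining backend nodes once
# it already has a quorum of successful responses (default 0.5)
post_quorum_timeout = 0.02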

However, the reference to the error limiting/connection timeout tracking in the proxy for object servers seems to be suggesting something different. Specifically, that the proxy has the ability to track the container update timeouts and change the X-Container-* headers it sends along with object writes - in the same way it will skip error limited object servers and select handoffs to receive the write - but currently this is not possible. The gist seems to be more-or-less getting at the idea that rather than waiting for *any* kind of timeout somewhere, it would be better to use the existing error limiting information and avoid the connection attempts altogether - but we wouldn't be making nearly enough container requests under a write heavy load for the existing proxy error tracking on the container nodes to be sufficient for this purpose. If we want to leverage the existing error limiting, as a first step, the object servers would need to send back a message to indicate whether they went to async-pending or not. After that we could look into how to update the container error limit tables from object PUT responses, and *then* apply error limiting in get_container_info's get_nodes calls. But *that* assumes t...


Revision history for this message
Patrick Skerrett (pitrick) wrote :

Clay,

Thank you for the concise response. Let me start with a few clarifications & then I can present the results of this morning's tests.

First, let me clarify that I am using 3 nodes in this test environment to quickly & clearly see the results. In actual production I have > 20 object servers with the account/container services bundled in, so there are plenty of nodes available (A configuration that was recommended to us after a few days of on-site consulting with SwiftStack engineers).

Secondly, I know you're not splitting hairs, but I do want to stress that my workload is significantly write-heavy, with a target rate of 3000 object writes/sec. The default 500ms timeout is actually enough to bring us to a "halt" :) It's bad enough that if it were allowed to continue for long enough we'd have major problems catching back up and working through the massive backlog.

Last, I wanted to clear up one item. I believe the proxy->object error/outage handling is ideal in 2.2.2 at this point. For some reason, the behavior was different in 2.2.1 (which we were running in production up until this week). That is what originally set me down this path of enhanced redundancy testing. From my observations right now, an outage of an object server only appears to add additional latency for a few seconds until the error limit triggers, and then the affected machine does not appear to be retried again for a set number of seconds. In 2.2.1, I would continuously have the latency throughout the outage window.
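For reference, the proxy-side behavior described above is governed by the error limiting settings in proxy-server.conf; the values below are just the documented defaults:

[app:proxy-server]
# a backend node that accumulates error_suppression_limit errors is
# skipped for error_suppression_interval seconds before being retried
error_suppression_interval = 60
error_suppression_limit = 10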

In regards to your suggestions:

As expected, conn_timeout set on the object server does indeed cut the overall write latency down when a container node is offline. I originally had mixed feelings about lowering this, but after sleeping on it, I do not really see a downside to lowering this value. It certainly helps the overall situation, but with so many concurrent writes going on, a large number will inevitably be slower than normal. This is expected given the available configuration options that exist today.

Setting post_quorum_timeout on the proxy, across numerous different attempts, does not seem to have any effect on reducing the write latency when the cluster is experiencing a container node outage. I believe this behavior is expected as well, since we can pretty definitively pinpoint the cause of the slowdown to be the object server calling out to the down container nodes & hitting conn_timeout.

In closing, I feel better about setting conn_timeout lower today and that certainly does help to cut down the overall risk of a container outage. There is still a visible degradation, but it's acceptable. However it still does not fully sit well with me. It is a shame that even if you take great care to build enough redundant nodes into your environment, this one component will ALWAYS cause overall degradation of service if it goes down. The only way at this time to work around it is some sort of VMware type node migration, or perhaps VRRP migration of the node IP address to a standby box that will TCP-reject the connections, etc etc.

Swift does so many other things well, it seems a shame that we need external help in this one component to ens...


Revision history for this message
John Dickinson (notmyname) wrote :

This patch addresses this issue: https://review.openstack.org/#/c/189080/

clayg (clay-gerrard)
Changed in swift:
status: New → Fix Released