Object server does not ignore a down container server on writes, leading to higher overall cluster latency
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Object Storage (swift) | Fix Released | Undecided | Unassigned |
Bug Description
Environment Description -
Swift 2.2.2
1 Region, 3 Zones, 2 Replicas.
3 Object servers, 3 Account servers, 3 Container servers, 1 Proxy. I specifically split account/container onto individual machines in order to narrow down the problem components; I would not normally do this in production.
$ vagrant status
Current machine states:
salt-master running
swift-storage-z01r1 192.168.3.14 running
swift-storage-z02r1 192.168.3.15 running
swift-storage-z03r1 192.168.3.16 running
swift-account-z01r1 192.168.3.17 running
swift-container-z01r1 running
swift-container-z02r1 running
swift-container-z03r1 running
swift-account-z02r1 running
swift-account-z03r1 running
swift-proxy-z01r1 running
Container Ring:
/srv/salt/
4096 partitions, 2.000000 replicas, 1 regions, 3 zones, 3 devices, 0.02 balance, 0.00 dispersion
The minimum number of hours before a partition can be reassigned is 0
The overload factor is 0.00% (0.000000)
Devices:
id region zone ip address port replication ip replication port name weight partitions balance meta
0 1 1 192.168.3.21 6001 192.168.3.21 6001 sdb1 100.00 2731 0.01
1 1 3 192.168.3.23 6001 192.168.3.23 6001 sdb1 100.00 2730 -0.02
2 1 2 192.168.3.22 6001 192.168.3.22 6001 sdb1 100.00 2731 0.01
object-server.conf:
[DEFAULT]
workers = 1
mount_check = false
bind_ip = 192.168.3.X
bind_port = 6100
log_level = DEBUG
[pipeline:main]
pipeline = recon object-server
[app:object-server]
use = egg:swift#object
threads_per_disk = 4
[object-replicator]
concurrency = 4
[object-updater]
[object-auditor]
[filter:recon]
use = egg:swift#recon
recon_cache_path = /var/cache/
Object Server logs (line appears on every write attempt when the container replica exists on a down node)
Mar 19 19:16:21 vagrant-
*******
Problem Description -
When a container node goes down (node & IP actually go offline, not just the service disabled), overall cluster write speed is impacted for any write destined to a container that has a replica located on the down node. In that instance, each individual write's latency is increased by the conn_timeout value set in the object server config (default 500ms).
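For illustration, here is a minimal sketch (not Swift's actual code, just the shape of the problem) of where that conn_timeout gets spent: the object server synchronously notifies each container replica before answering the PUT, and an unreachable host burns the full connect timeout before the update falls back to an async pending.

# Simplified sketch, not Swift's implementation. The object server
# contacts each container replica named in the X-Container-* headers
# of the PUT before it responds to the proxy.
import socket

CONN_TIMEOUT = 0.5  # object-server conn_timeout default, in seconds

def update_container(ip, port, timeout=CONN_TIMEOUT):
    try:
        # connect() to an unreachable host blocks for the full timeout
        with socket.create_connection((ip, port), timeout=timeout):
            return True  # real code would send the update request here
    except OSError:
        return False  # caller queues an async_pending update for later

def finish_object_put(container_replicas):
    # Each unreachable replica adds ~CONN_TIMEOUT to the write latency.
    for ip, port in container_replicas:
        update_container(ip, port)
    # ... only now is the 201 sent back toward the proxy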
Read latency does not appear to be affected by a container outage.
When an object server goes down, the proxy server seems to have the ability to ignore the node for X seconds and does not attempt to reconnect to it on every write operation. This helps to stabilize overall latency during outages & is the desired behavior.
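(For reference, the proxy-side behavior described above is driven by its error-suppression settings in proxy-server.conf; a minimal excerpt with the stock defaults:)

[app:proxy-server]
use = egg:swift#proxy
# After error_suppression_limit errors, the proxy skips a node for
# error_suppression_interval seconds instead of retrying it on every request.
error_suppression_interval = 60
error_suppression_limit = 10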
However, this does not appear to be implemented for container server outages & overall latency is affected during an outage event.
In my production environment, I am trying to sustain at least 3000 WRITES per second at peak times, and a single container server outage has the ability to effectively grind all my operations to a halt. It would be great if the object server had the ability to ignore a down container node, similar to how the proxy server ignores a down object server for a time.
description: updated
description: updated
Changed in swift:
status: New → Fix Released
I agree a downed container server will have an *impact* on object writes (because we endeavor to synchronously update the listing in the container) - but I feel the problem may benefit from additional classification here.
Aside: I think describing the issue as grinding to a "halt" is potentially confusing. No offense :\ The timeout values are *meant* to protect from a destructive latency hit when a node is down, and I think in general they're doing what they are supposed to. But most importantly, they're designed to be tunable so that an operator can decide exactly how much variance in latency they're willing to accept in an effort to avoid introducing temporary inconsistencies. 500ms may be too high for some people, but it's just a default!
I'm *really* not trying to split hairs as much as make sure we identify what exactly it is we want to fix. There's plenty of things that could be better ;)
The most obvious settings you should tune for this are:
1) the conn_timeout in the object server config (as you identified)
and
2) the post_quorum_timeout on the proxy
I'm not sure if you indicated that lowering the conn_timeout in the object server config was not acceptable? 150-250ms would be fine on many networks I'm sure.
The post_quorum_timeout also seems like a good setting for you - because the worst-case scenario with a too-low setting here is that a container *listing* immediately following the 201 response of an object PUT may be temporarily inconsistent, if you happen to hit the container replica whose update the responsible object server hasn't finished yet. But under a write-heavy load you may not be *doing* many container listings. Also, if the final node beyond the majority success response is being slow specifically *because* of a downed container server - you're not going to get a stale listing from that node regardless :D
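To illustrate that tradeoff, a rough sketch (again, not the actual proxy code) of how a post-quorum timeout bounds the wait for straggler backends once a quorum of successes is in:

# Rough sketch of quorum-then-bounded-wait, not Swift's proxy code.
import concurrent.futures

def put_with_quorum(do_put, backends, quorum, post_quorum_timeout=0.5):
    pool = concurrent.futures.ThreadPoolExecutor(len(backends))
    futures = [pool.submit(do_put, b) for b in backends]
    successes = 0
    for f in concurrent.futures.as_completed(futures):
        if f.result():
            successes += 1
        if successes >= quorum:
            break
    # Bounded grace period for stragglers (e.g. a backend stuck on a
    # container update to a downed node). Lowering this shaves write
    # latency at the cost of an occasionally stale listing post-201.
    concurrent.futures.wait(futures, timeout=post_quorum_timeout)
    pool.shutdown(wait=False)
    return successes >= quorum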
Try setting the proxy's post_quorum_timeout down to 10-50ms and see if your result is more satisfactory.
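Concretely, the two knobs look like this (the values shown are illustrative, at the aggressive end of the ranges above):

object-server.conf:
[DEFAULT]
# connect timeout for container update attempts (seconds); default 0.5
conn_timeout = 0.2

proxy-server.conf:
[app:proxy-server]
# how long to wait for stragglers after quorum (seconds); default 0.5
post_quorum_timeout = 0.05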
However, the reference to the error limiting/connection timeout tracking in the proxy for object servers seems to be suggesting something different. Specifically, that the proxy has the ability to track the container update timeouts and change the X-Container-* headers it sends along with object writes - in the same way it will skip error-limited object servers and select handoffs to receive the write - but currently this would not be possible. The gist seems to be more-or-less getting at the idea that rather than waiting for *any* kind of timeout somewhere, it would be better to use the existing error limiting information and avoid the connection attempts - but we wouldn't be making nearly enough container requests in a write-heavy load for the existing proxy error tracking on the container nodes to be sufficient for this purpose. If we want to leverage the existing error limiting, as a first step, the object servers would need to send back a message to indicate if they went to async-pending or not. After that we could look into how to update the container error limit tables from object PUT responses, and *then* apply error limiting in get_container_info's get_nodes calls. But *that* assumes t...
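For illustration only, a purely hypothetical sketch of that first step and the proxy-side bookkeeping it would feed - none of these names or headers exist in Swift today:

# Purely hypothetical sketch of the proposal above; nothing here is
# real Swift code, and the response header name is invented.
import time

class ErrorLimiter:
    """Minimal error-limit table keyed by container node."""
    def __init__(self, limit=10, interval=60):
        self.limit, self.interval = limit, interval
        self.errors = {}  # node -> (count, window_start)

    def record(self, node):
        count, since = self.errors.get(node, (0, time.time()))
        self.errors[node] = (count + 1, since)

    def is_limited(self, node):
        count, since = self.errors.get(node, (0, 0.0))
        if time.time() - since > self.interval:
            self.errors.pop(node, None)  # window expired; forgive
            return False
        return count >= self.limit

def choose_container_nodes(primaries, handoffs, limiter):
    # Skip error-limited primaries and backfill from handoffs, the way
    # the proxy already does for object servers on writes.
    chosen = [n for n in primaries if not limiter.is_limited(n)]
    for h in handoffs:
        if len(chosen) >= len(primaries):
            break
        chosen.append(h)
    return chosen

# The proxy would feed the table from a hypothetical header on object
# PUT responses listing container updates that went to async pending:
#   for node in resp.headers.get('X-Backend-Async-Container-Updates', '').split(','):
#       if node:
#           limiter.record(node)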