swift 2.1.0 replication issue with rsync
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Object Storage (swift) | New | Undecided | Unassigned |
Bug Description
We have had a swift cluster running for about 4 months now. We mainly use it to store images that we archive from our Media Database. We have a 7-node cluster, and each node has a 2 TB volume attached to it for the files that we store in swift (24 GB RAM, 4 CPU virtual machines).
We initially started with a 3-node cluster, but as free space started to shrink there, we added 4 more nodes with the same specs. Our archive jobs have been running almost constantly since the day we started the initial cluster, and now 2 of the nodes have gotten full. Jobs are obviously failing, and we noticed that the issue is caused by the rsync replication.
On further analysis we noticed that swift replicates its files using rsync by appending container/partition paths to the command, but these paths have grown so large that the command reaches the POSIX limit of 4096 characters per command.
We looked into the code a bit and found this section in /usr/lib/
for suffix in suffixes:
    spath = join(job['path'], suffix)
    if os.path.exists(spath):
        args.append(spath)
We added a simple loop that iterated over the args and wrote them to a file (see attached).
As can be seen there, the command is very long and exceeds the 4096-character limit by a lot.
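If the argument list is the problem, one possible workaround is to split the suffix paths across several smaller rsync invocations instead of one huge one. The sketch below is ours, not Swift's code: the helper name `build_rsync_batches` is hypothetical, and the 4096-byte limit in the example is just for illustration.

```python
import os

def build_rsync_batches(base_args, paths, limit=None):
    """Split one oversized rsync invocation into several smaller ones.

    Hypothetical helper, not Swift code: each returned batch is
    base_args plus as many paths as fit within `limit` bytes of argv.
    """
    if limit is None:
        # SC_ARG_MAX covers argv plus the environment; keep headroom.
        limit = os.sysconf('SC_ARG_MAX') // 2
    base_len = sum(len(a) + 1 for a in base_args)
    batches, current, current_len = [], [], base_len
    for p in paths:
        # Flush the current batch before this path would push it over.
        if current and current_len + len(p) + 1 > limit:
            batches.append(base_args + current)
            current, current_len = [], base_len
        current.append(p)
        current_len += len(p) + 1
    if current:
        batches.append(base_args + current)
    return batches

# Example: 10,000 fake suffix paths against a deliberately small limit.
paths = ['/srv/node/d1/objects/123/%03x' % i for i in range(10000)]
batches = build_rsync_batches(['rsync', '--recursive'], paths, limit=4096)
print(len(batches))  # number of separate rsync invocations needed
```

Each batch stays under the given byte budget, so no single exec'd command line hits the limit; rsync's `--files-from` option is another way to avoid long argument lists entirely.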
For now we have switched to rsync_method = ssync and set the replication_
Is there maybe a solution to balance the nodes as they are in the current state?
The image below is a section from our New Relic dashboard showing CPU, disk I/O, memory, and used space:
https:/
On 21.01.16 13:51, Cosmin coroiu wrote:
> We initially started with a 3 node cluster, but as space started to
> shrink there, we added 4 more nodes with the same specs. Our archive
> jobs were running almost constantly since the day we started the initial
> cluster, and now 2 of the nodes, have gotten full. Jobs are obviously
> failing and what we noticed is that the issue is caused by the rsync
> replication.
I assume the jobs are failing due to missing space, right? And rsync
might be failing because some of the destination nodes are full?
I assume you already rebalanced the object ring, and need to move data
more quickly from the full nodes to the remaining ones?
> On further analysis we noticed that swift replicates it's files using
> rsync by appending container/partition paths to the command but these
> paths have gotten so big that they reach the POSIX limit of 4096
> characters per command.
What OS are you running? I quickly checked on rhel7 and ubuntu 14.04,
and the limit is much higher on both systems:
$ getconf ARG_MAX
2097152
Also, I created a long arg using the following command to see if it fails:
for i in `seq 1 5000`; do cmd+=" hello world" ; done ; echo $cmd > out
That worked on both systems. What error do you get? Or is this a
separate issue?
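For reference, the same limit can be queried from Python via `os.sysconf`. POSIX only guarantees a minimum of 4096 bytes (`_POSIX_ARG_MAX`), which may be where the 4096 figure comes from; actual systems usually report far more.

```python
import os

# ARG_MAX is the total space allowed for argv + environ at exec time.
# POSIX guarantees only 4096 bytes (_POSIX_ARG_MAX); Linux typically
# reports about 2 MiB, matching the `getconf ARG_MAX` output above.
print(os.sysconf('SC_ARG_MAX'))
```

Exceeding this limit when spawning a process fails with E2BIG ("Argument list too long").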
> Is there maybe a solution to balance the nodes as they are in the
> current state
You might want to have a look at the handoffs_first setting:
https://github.com/openstack/swift/blob/master/etc/object-server.conf-sample#L216-L224
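For reference, that setting lives in the `[object-replicator]` section of `object-server.conf`. A minimal sketch (the value here is illustrative; the sample file describes it as a temporary measure, so switch it back once the cluster has drained):

```ini
[object-replicator]
# Replicate handoff partitions first. Useful for moving data off
# full nodes after a ring rebalance; revert to false afterwards.
handoffs_first = True
```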