ssync replication failure

Bug #1900845 reported by Dmitry
This bug affects 2 people
Affects: OpenStack Object Storage (swift)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

OpenStack Train, CentOS 7.

I am trying to replicate a partition with ssync, but it fails with this error on the receiver:

Oct 21 12:26:53 obj21 object-server[13189]: 172.168.28.135/objects/32 EXCEPTION in ssync.Receiver:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/swift/obj/ssync_receiver.py", line 165, in __call__
    for data in self.missing_check():
  File "/usr/lib/python2.7/site-packages/swift/obj/ssync_receiver.py", line 350, in missing_check
    want = self._check_missing(line)
  File "/usr/lib/python2.7/site-packages/swift/obj/ssync_receiver.py", line 293, in _check_missing
    remote = decode_missing(line)
  File "/usr/lib/python2.7/site-packages/swift/obj/ssync_receiver.py", line 41, in decode_missing
    t_data = urllib.parse.unquote(parts[1])
IndexError: list index out of range

On the sender side there is this message:
object-replicator: 172.168.28.162:6000/objects/32 120.0 seconds: missing_check send line

Number of files in partition: 214945

object-server.conf:

[DEFAULT]
bind_ip = 172.168.28.135
bind_port = 6000
user = swift
swift_dir = /etc/swift
devices = /import
mount_check = true
log_level = DEBUG
node_timeout = 120
[pipeline:main]
pipeline = healthcheck recon object-server
[app:object-server]
use = egg:swift#object
[filter:healthcheck]
use = egg:swift#healthcheck
[filter:recon]
use = egg:swift#recon
recon_cache_path = /var/cache/swift
[object-replicator]
log_name = object-replicator
log_facility = LOG_LOCAL1
log_level = DEBUG
log_address = /dev/log
concurrency = 4
sync_method = ssync
rsync_timeout = 86400
rsync_bwlimit = 15m
http_timeout = 21600
lockup_timeout = 87000
handoffs_first = True
handoff_delete = 2

Revision history for this message
clayg (clay-gerrard) wrote :

I don't believe I've seen that error before.

Have you been able to use ssync successfully for a re-balance before? rsync?

Are all nodes in the cluster upgraded to the same version of swift?

Revision history for this message
Dmitry (kozlovdmtry) wrote :

ssync replication completes successfully if I manually run the object replicator (with the --once flag) on a small handoff partition. rsync works well since I increased the timeouts (rsync_timeout, http_timeout, lockup_timeout), but it is incredibly slow: it takes several hours to build the file list for one partition and start the actual data transfer.
I am in the middle of an upgrade from Train/CentOS 7 to Ussuri/CentOS 8, so I have added one object server (Ussuri) and am moving partitions to it. ssync replication works fine to this Ussuri server, but fails with the error above (IndexError: list index out of range) when replicating from Train to Train.

Revision history for this message
clayg (clay-gerrard) wrote :

So the traceback isn't helpful - the Receiver is probably reading an empty string from the socket because of a broken pipe or whatever and then trying to split it according to the ssync protocol. So one bug that could be fixed would be making the parsing more robust to network errors and timeouts.
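
A minimal sketch of that parsing step (illustrative only, not the actual decode_missing() from ssync_receiver.py), assuming the missing_check line format "<object_hash> <ts_data> [extras...]"; a truncated or too-short line from a timed-out or broken connection splits into fewer than two tokens, and parts[1] then raises the IndexError shown above:

import urllib.parse

def decode_missing_defensively(line):
    # Expected format: "<object_hash> <ts_data> [optional extras]"
    parts = line.split()
    if len(parts) < 2:
        # Hypothetical defensive handling: report a protocol/connection
        # problem explicitly instead of letting IndexError escape.
        raise ValueError('invalid missing_check line: %r' % line)
    return {'object_hash': urllib.parse.unquote(parts[0]),
            'ts_data': urllib.parse.unquote(parts[1])}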

But the more immediate issue in the cluster is probably the timeout/broken-pipe breaking replication.

Can you verify with iostat which disks are most busy - is it the sender's or the receiver's disks that are underwater on iops? The ideal situation is that all disks are a little busy - but without tuning of replication workers and concurrency we often get clusters that have a few disks that are TOO busy (better default tunings is actually something we're hoping to discuss at the virtual PTG next week! https://etherpad.opendev.org/p/swift-ptg-wallaby)
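
For example (assuming the sysstat package is installed), something like:

iostat -xm 5
# compare %util and await per device on both the sender and the receiver;
# a disk sitting near 100 %util is the one that is underwater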

Can you configure separate replication server processes [1] on a different port than the proxy/client traffic for better i/o shaping?

1. https://docs.openstack.org/swift/latest/replication_network.html
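
For reference, a sketch of how a per-device replication endpoint is declared in the ring; the region/zone, IP, ports, device name and weight below are placeholders to adapt to your builder, while the R<ip>:<port> suffix is the standard swift-ring-builder notation for the replication endpoint:

swift-ring-builder object.builder add \
    r2z4-172.168.28.135:6000R172.168.28.135:6010/objects 13000
swift-ring-builder object.builder rebalance

The replication object-server instance then binds to the replication port (6010 in this sketch) in its own config.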

Revision history for this message
Anton (a.porabkovich) wrote :

I managed to reduce the number of errors.
I reduced the number of processes per disk:
servers_per_port = 2
and set:
replication_concurrency = 0
replication_concurrency_per_device = 0
replication_lock_timeout = 1

I also reduced the number of proxy-server workers:
workers = 2
and:
[app:proxy-server]
node_timeout = 60

After applying these options there are far fewer errors, but errors still appear occasionally.
At the same time, the disks are busy at up to 100%.
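
For reference, a sketch of where those settings live (file paths assume the stock /etc/swift layout; the values are the ones mentioned above):

# /etc/swift/object-server.conf
[DEFAULT]
servers_per_port = 2

[app:object-server]
replication_concurrency = 0
replication_concurrency_per_device = 0
replication_lock_timeout = 1

# /etc/swift/proxy-server.conf
[DEFAULT]
workers = 2

[app:proxy-server]
node_timeout = 60

Note that servers_per_port only takes effect when each device has its own port in the ring, as in the per-device ring shown later in this thread.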

Revision history for this message
Dmitry (kozlovdmtry) wrote :

I have configured separate replication server processes and reduced concurrency, but there are still a lot of errors from ssync.receiver and ssync.sender. However, all requests from proxy servers are successful and rsync replication works well. The disks are not very busy on either the sender or the receiver.
Here is my config for the replication server and the replicator:

[DEFAULT]
bind_ip = 172.168.28.135
bind_port = 6010
workers = auto
devices = /import
client_timeout = 600
conn_timeout = 5

[pipeline:main]
pipeline = object-server

[app:object-server]
use = egg:swift#object
#replication_server = True
log_name = replication-server
log_facility = LOG_LOCAL4
log_level = DEBUG
log_address = /dev/log
replication_concurrency = 0
replication_concurrency_per_device = 0
replication_lock_timeout = 45

[object-replicator]
log_name = object-replicator
log_facility = LOG_LOCAL1
log_level = DEBUG
log_address = /dev/log
concurrency = 2
sync_method = ssync
rsync_timeout = 86400
rsync_bwlimit = 25m
node_timeout = 600
http_timeout = 31600
lockup_timeout = 87000
handoffs_first = True
handoff_delete = 2

BTW, if I set "replication_server = True", all PUT requests from replication fail with "405 Method not allowed".

Revision history for this message
Anton (a.porabkovich) wrote : Re: [Bug 1900845] Re: ssync replication failure

Hello
> BTW, if I set "replication_server = True" all PUT request from
> replication will fail with "405 Method not allowed"
Did you split the replication servers and the client traffic onto separate hardware?
If not, you don't need to set it at all; from the code it works out like this (with replication_server = True only the replication-flagged methods are allowed, which is why other requests get 405):
            if self.replication_server is True:
                for name, m in all_methods:
                    if (getattr(m, 'publicly_accessible', False) and
                            getattr(m, 'replication', False)):
                        self._allowed_methods.append(name)
            elif self.replication_server is False:
                for name, m in all_methods:
                    if (getattr(m, 'publicly_accessible', False) and not
                            getattr(m, 'replication', False)):
                        self._allowed_methods.append(name)
            elif self.replication_server is None:
                for name, m in all_methods:
                    if getattr(m, 'publicly_accessible', False):
                        self._allowed_methods.append(name)

My ring is configured per device:
Ring file /etc/swift/object-1.ring.gz is up-to-date
Devices: id region zone ip address:port replication ip:port name weight partitions balance flags meta
          0 1 1 10.0.1.11:6061 10.0.1.11:6061 cold0 100.00 4096 0.00
          1 1 1 10.0.1.11:6062 10.0.1.11:6062 cold1 100.00 4096 0.00
          2 1 1 10.0.1.11:6063 10.0.1.11:6063 cold2 100.00 4096 0.00
i.e. each disk listens on its own port,
and then
servers_per_port = 2
If I set servers_per_port to more than 2, the system load jumps above 80+, especially when deleting many objects with 100+ threads.

Revision history for this message
Dmitry (kozlovdmtry) wrote :

The replication server and the standard object server (for client traffic) listen on different sockets. My ring:
object.builder, build version 345, id f07e536c91ec40238834399d7532f5e2
1024 partitions, 2.000000 replicas, 2 regions, 2 zones, 5 devices, 1.56 balance, 0.00 dispersion
The minimum number of hours before a partition can be reassigned is 1 (0:00:00 remaining)
The overload factor is 0.00% (0.000000)
Ring file object.ring.gz is up-to-date
Devices: id region zone ip address:port replication ip:port name weight partitions balance flags meta
           20 1 5 172.168.28.161:6000 172.168.28.161:6010 objects 13000.00 512 0.00
           21 1 5 172.168.28.162:6000 172.168.28.162:6010 objects 13000.00 512 0.00
            1 2 4 172.168.28.131:6000 172.168.28.131:6010 objects 12700.00 500 -0.04
            0 2 4 172.168.28.134:6000 172.168.28.134:6010 objects 300.00 12 1.56
            3 2 4 172.168.28.135:6000 172.168.28.135:6010 objects 13000.00 512 0.00
