cyclic replication in one cluster

Bug #1365330 reported by Anton
This bug affects 1 person

Affects: OpenStack Object Storage (swift)
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

Hi, we have several clusters.
In one of them I have been seeing endless replication from 3 servers for the last 2-3 months.

root@str-20 /srv/node/sdp1/objects/255277/91c/f94b7a601fce1a35b92ef3c8e928b91c # grep .1398473456.33743.data.FG9inr /var/log/syslog -A1
Sep 4 06:26:06 str-20 object-replicator: <f+++++++++ 91c/f94b7a601fce1a35b92ef3c8e928b91c/.1398473456.33743.data.FG9inr
Sep 4 06:26:06 str-20 object-replicator: Successful rsync of /srv/node/sdp1/objects/255277/91c at 10.10.2.26::object/dev24/objects/255277 (2.696)
--
Sep 4 06:52:17 str-20 object-replicator: <f+++++++++ 91c/f94b7a601fce1a35b92ef3c8e928b91c/.1398473456.33743.data.FG9inr
Sep 4 06:52:17 str-20 object-replicator: Successful rsync of /srv/node/sdp1/objects/255277/91c at 10.10.2.26::object/dev24/objects/255277 (1.563)
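
To get a sense of how long this has been cycling, a simple count of the pushes in the logs helps (the syslog location and rotation naming here are assumptions for a Debian/Ubuntu-style setup):

grep -c '1398473456.33743.data.FG9inr' /var/log/syslog
zgrep -c '1398473456.33743.data.FG9inr' /var/log/syslog.*.gz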

and there are more such records in older syslog files.
OK, let's go to str-26:/srv/node/dev24/objects/255277/91c/f94b7a601fce1a35b92ef3c8e928b91c

root@str-26 /srv/node/dev24/objects/255277/91c/f94b7a601fce1a35b92ef3c8e928b91c # stat 1398473456.33743.data
  File: ‘1398473456.33743.data’
  Size: 467330599 Blocks: 912760 IO Block: 4096 regular file
Device: 8a1h/2209d Inode: 23489931 Links: 1
Access: (0600/-rw-------) Uid: ( 1001/ swift) Gid: ( 1001/ swift)
Access: 2014-05-22 08:24:20.637048253 +0000
Modify: 2014-05-22 08:24:41.892736780 +0000
Change: 2014-05-22 08:24:41.892736780 +0000
 Birth: -

root@str-26 /srv/node/dev24/objects/255277/91c/f94b7a601fce1a35b92ef3c8e928b91c # swift-object-info 1398473456.33743.data
....
....
Use your own device location of servers:
such as "export DEVICE=/srv/node"
ssh 10.10.2.26 "ls -lah ${DEVICE:-/srv/node*}/dev24/objects/255277/91c/f94b7a601fce1a35b92ef3c8e928b91c"
ssh 10.10.2.20 "ls -lah ${DEVICE:-/srv/node*}/sdp1/objects/255277/91c/f94b7a601fce1a35b92ef3c8e928b91c"
....

OK, go to the second node:

root@str-20 /srv/node/sdp1/objects/255277/91c/f94b7a601fce1a35b92ef3c8e928b91c # ls -la
total 68928
drwxr-xr-x 2 swift swift 42 Sep 1 01:44 .
drwxr-xr-x 3 swift swift 45 Sep 1 01:41 ..
-rw------- 1 swift swift 70582272 Sep 1 01:56 .1398473456.33743.data.FG9inr

Note that here the file name starts with a dot.

Revision history for this message
clayg (clay-gerrard) wrote :

I have no good explanation for why the file name on str-20 looks like that. Maybe you can get rid of it?

Is the md5 of the two files the same?

str-20:/srv/node/sdp1/objects/255277/91c/f94b7a601fce1a35b92ef3c8e928b91c/
  .1398473456.33743.data.FG9inr

and

str-26:/srv/node/dev24/objects/255277/91c/f94b7a601fce1a35b92ef3c8e928b91c/
  1398473456.33743.data

If so I'd probably start with a targeted audit on str-20:/srv/node/sdp1

swift-object-auditor /etc/swift/object-server.conf once verbose -d sdp1

I think once mode on the auditor may be running ZBF more than once these days after the parallel auditor change - but if possible, it might be useful to post any interesting log lines here.
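
To collect those, something like this on str-20 should do (assuming the object services log to syslog as in the snippets above):

grep object-auditor /var/log/syslog | tail -n 100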

If that doesn't clean it up, then I'd probably remove it manually and clear out the hashes.pkl for that partition on str-20:

rm /srv/node/sdp1/objects/255277/hashes.pkl

And then push from str-26:

swift-object-replicator /etc/swift/object-server.conf once verbose -d dev24 -p 255277
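
After the push completes, re-checking the suffix directory on str-20 (same path as above) should show only the plain 1398473456.33743.data file and no dot-prefixed leftovers:

ssh 10.10.2.20 "ls -la /srv/node/sdp1/objects/255277/91c/f94b7a601fce1a35b92ef3c8e928b91c"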

FWIW I couldn't duplicate the issue on my dev environment just by renaming the file. The auditor seems to skip the bogus file name, which is a little disappointing. But when replication pushed from a healthy node, the dot file would get cleaned up.
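
A rough sketch of that kind of reproduction attempt (the file name and random suffix here are made up): on one replica, rename a .data file to a dot-prefixed name, run the auditor in once mode on that node, then run the replicator in once mode on a healthy replica and watch whether the bogus name gets cleaned up:

mv 1398473456.33743.data .1398473456.33743.data.XXXXXX
swift-object-auditor /etc/swift/object-server.conf once verbose
swift-object-replicator /etc/swift/object-server.conf once verbose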

What version of swift are you running?

Revision history for this message
Anton (hettbox) wrote :

Hi, the latest version of swift.

It is not just this one file; there are a lot of them:

Sep 5 04:15:03 str-22 object-replicator: <f+++++++++ 4d0/4e2e7607e2000428f95fa3a4a6b7c4d0/.1395468846.90646.data.QnEDD6
Sep 5 04:15:03 str-22 object-replicator: Successful rsync of /srv/node/sdc1/objects/80057/4d0 at 10.10.2.26::object/dev02/objects/80057 (0.644)

root@str-22 /srv/node/sdc1/objects/80057/4d0/4e2e7607e2000428f95fa3a4a6b7c4d0 # md5sum .1395468846.90646.data.QnEDD6
b4935a224f6002f0a17178f052499885 .1395468846.90646.data.QnEDD6

root@str-26 /srv/node/dev02/objects/80057/4d0/4e2e7607e2000428f95fa3a4a6b7c4d0 # md5sum 1395468846.90646.data
13bfc0fb36aee76475fdb47c5cc5da35 1395468846.90646.data

root@str-22 /srv/node/sdc1/objects/80057/4d0/4e2e7607e2000428f95fa3a4a6b7c4d0 # stat .1395468846.90646.data.QnEDD6
  File: ‘.1395468846.90646.data.QnEDD6’
  Size: 27787264 Blocks: 54272 IO Block: 4096 regular file
Device: 821h/2081d Inode: 3284120671 Links: 1
Access: (0600/-rw-------) Uid: ( 105/ swift) Gid: ( 111/ swift)
Access: 2014-08-11 19:22:29.145600133 +0000
Modify: 2014-08-11 19:37:30.718214541 +0000
Change: 2014-08-11 19:37:30.718214541 +0000
 Birth: -

root@str-26 /srv/node/dev02/objects/80057/4d0/4e2e7607e2000428f95fa3a4a6b7c4d0 # stat 1395468846.90646.data
  File: ‘1395468846.90646.data’
  Size: 116520978 Blocks: 227584 IO Block: 4096 regular file
Device: 4191h/16785d Inode: 13068963 Links: 1
Access: (0600/-rw-------) Uid: ( 1001/ swift) Gid: ( 1001/ swift)
Access: 2014-05-28 04:47:01.913650641 +0000
Modify: 2014-05-28 04:47:03.517626254 +0000
Change: 2014-05-28 04:47:03.517626254 +0000
 Birth: -

Only the file on str-26 is correct.

Yesterday I ran a command on str-20 to delete all of the dot files:

find . -name "\.*\.data\.*" -not -name .lock -print -exec rm {} \;

but today the files exist again.

More info

root@str-22 /srv/node/sdc1 # find . -name "\.*\.data\.*"
./objects/193910/40a/bd5db408a6b9bbbe221da796abb9d40a/.1387319821.57697.data.MuiTu2
./objects/193910/40a/bd5db408a6b9bbbe221da796abb9d40a/.1387319821.57697.data.ZR7erw
./objects/193910/40a/bd5db408a6b9bbbe221da796abb9d40a/.1387319821.57697.data.5WiP1h
./objects/193910/40a/bd5db408a6b9bbbe221da796abb9d40a/.1387319821.57697.data.nrIlEU
./objects/193910/40a/bd5db408a6b9bbbe221da796abb9d40a/.1387319821.57697.data.GDNEhA
./objects/193910/40a/bd5db408a6b9bbbe221da796abb9d40a/.1387319821.57697.data.8D1Xc9
./objects/193910/40a/bd5db408a6b9bbbe221da796abb9d40a/.1387319821.57697.data.8u1n2X
./objects/193910/40a/bd5db408a6b9bbbe221da796abb9d40a/.1387319821.57697.data.pU0T4n
./objects/193910/40a/bd5db408a6b9bbbe221da796abb9d40a/.1387319821.57697.data.ag3F8z
./objects/193910/40a/bd5db408a6b9bbbe221da796abb9d40a/.1387319821.57697.data.lGQnaL
./objects/193910/40a/bd5db408a6b9bbbe221da796abb9d40a/.1387319821.57697.data.tvlN07
./objects/193910/40a/bd5db408a6b9bbbe221da796abb9d40a/.1387319821.57697.data.NA6dRm
./objects/193910/40a/bd5db408a6b9bbbe221da796abb9d40a/.1387319821.57697.data.G5pTDZ
./objects/193910/40a/bd5db408a6b9bbbe221da796abb9d40a/.1387319821.57697.data.WTzazU
./objects/193910/40a/bd5db408a6b9bbbe221da796abb9d40a/.1387319821.57697.data.F5JT74
./objects/193910/40a/...
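
A quick way to gauge how many of these leftovers exist per disk (same find pattern as above) is:

find /srv/node/sdc1/objects -name ".*.data.*" | wc -l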


Revision history for this message
Anton (hettbox) wrote :

More info

root@str-26 /etc/swift # cat /srv/node/dev02/objects/80057/hashes.pkl
▒}q(U4d0U e19449e97f546be2c4f3c99b53c74951U9a1U bae4e422dcbdb8401698f0b97be40e27Udc0U cb5fca1d91a468fbcc5686b840aefcffUc21qU 7ad956ffdbdfeb66ba47b9801c98810bqu.

root@str-22 /srv/node # cat /srv/node/sdc1/objects/80057/hashes.pkl
▒}qU4d0qU 2f56bbd223012fd517af5bef27a41c49s.
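
The raw pickle dumps above are hard to compare by eye; a quick way to print them in readable form (a sketch using the stock Python pickle module, run as a user that can read the file) is:

python -c 'import pickle, pprint, sys; pprint.pprint(pickle.load(open(sys.argv[1], "rb")))' /srv/node/dev02/objects/80057/hashes.pkl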

Revision history for this message
Anton (hettbox) wrote :

Hi, I found the problem:
on str-26 one network interface in the bond was not working correctly, which caused problems with replication from this server to the other servers.

Sep 17 01:43:25 str-26 object-replicator: Killing long-running rsync: ['rsync', '--recursive', '--whole-file', '--human-readable', '--xattrs', '--itemize-changes', '--ignore-existing', '--timeout=30', '--contimeout=30', '--bwlimit=0', '/srv/node/dev22/objects/75223/34e', '/srv/node/dev22/objects/75223/781', '/srv/node/dev22/objects/75223/edd', '/srv/node/dev22/objects/75223/871', '/srv/node/dev22/objects/75223/1e2', '/srv/node/dev22/objects/75223/054', '/srv/node/dev22/objects/75223/ba9', '/srv/node/dev22/objects/75223/3a5', '10.10.2.20::object/sdp1/objects/75223']
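
For anyone hitting the same symptom, counting those rsync kills in the replicator log and checking the bonding driver's view of its slave interfaces (the bond name bond0 here is an assumption) is a quick way to confirm a flapping link:

grep -c "Killing long-running rsync" /var/log/syslog
cat /proc/net/bonding/bond0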

Changed in swift:
status: New → Invalid
Revision history for this message
clayg (clay-gerrard) wrote :

OH! so those were temporary rsync files!?

... interesting.

Revision history for this message
Anton (hettbox) wrote :

It seems that this is so :)

Revision history for this message
Anton (hettbox) wrote :

But I do not understand: why does the replicator on str-20 replicate temporary rsync files?

Revision history for this message
Anton (hettbox) wrote :

Sep 18 05:50:05 str-20 object-replicator: Successful rsync of /srv/node/sdaf1/objects/12960/deb at 10.10.2.26::object/dev01/objects/12960 (0.914)
Sep 18 05:50:52 str-20 object-replicator: <f+++++++++ 330/b0b9b94792137dda5bbc34ebc6d3c330/.1387472627.58288.data.CHGDnB
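
That seems to answer the question above: the object-replicator rsyncs whole suffix directories rather than individual .data files, so once a leftover .TIMESTAMP.data.XXXXXX temp file is sitting in a suffix directory it gets pushed along with everything else. Abridged from the "Killing long-running rsync" line quoted earlier, the call has roughly this shape:

rsync --recursive --whole-file --ignore-existing ... /srv/node/sdaf1/objects/12960/deb 10.10.2.26::object/dev01/objects/12960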

Anton (hettbox)
Changed in swift:
status: Invalid → New
Revision history for this message
Thiago da Silva (thiagodasilva) wrote :
Changed in swift:
status: New → Fix Released