Auditor doesn't recognize missing files

Bug #971914 reported by Daniele Valeriani on 2012-04-02
This bug affects 8 people
Affects: OpenStack Object Storage (swift)
Status: Confirmed
Importance: Wishlist
Assigned to: Unassigned

Bug Description

Swift version: 1.4.8

If you change a file on a storage node, the auditor detects the different etag and replaces the object.
If instead you delete the file, the auditor doesn't recognize that it is missing and the object is never re-replicated.
Here is an example:

root@proxy01:~# swift-get-nodes /etc/swift/object.ring.gz MORRO_daniele test nani

Account MORRO_daniele
Container test
Object nani

Partition 164816
Hash a0f41e680094d8347105afac1492dda3

Server:Port Device 10.176.198.12:6000 sdd1
Server:Port Device 10.176.198.8:6000 sdb1
Server:Port Device 10.176.195.108:6000 sde1

curl -I -XHEAD "http://10.176.198.12:6000/sdd1/164816/MORRO_daniele/test/nani"
curl -I -XHEAD "http://10.176.198.8:6000/sdb1/164816/MORRO_daniele/test/nani"
curl -I -XHEAD "http://10.176.195.108:6000/sde1/164816/MORRO_daniele/test/nani"

ssh 10.176.198.12 "ls -lah /srv/node/sdd1/objects/164816/da3/a0f41e680094d8347105afac1492dda3/"
ssh 10.176.198.8 "ls -lah /srv/node/sdb1/objects/164816/da3/a0f41e680094d8347105afac1492dda3/"
ssh 10.176.195.108 "ls -lah /srv/node/sde1/objects/164816/da3/a0f41e680094d8347105afac1492dda3/"
root@proxy01:~#

Then on one of the storage nodes:

root@storage32:~# ls /srv/node/sdd1/objects/164816/da3/a0f41e680094d8347105afac1492dda3/1332623898.77785.data
/srv/node/sdd1/objects/164816/da3/a0f41e680094d8347105afac1492dda3/1332623898.77785.data
root@storage32:~# rm /srv/node/sdd1/objects/164816/da3/a0f41e680094d8347105afac1492dda3/1332623898.77785.data
root@storage32:~# ls /srv/node/sdd1/objects/164816/da3/a0f41e680094d8347105afac1492dda3/1332623898.77785.data
ls: cannot access /srv/node/sdd1/objects/164816/da3/a0f41e680094d8347105afac1492dda3/1332623898.77785.data: No such file or directory

Nothing appears in the logs. Instead:

root@storage11:~# echo "sugna" > /srv/node/sdd1/objects/164816/da3/a0f41e680094d8347105afac1492dda3/1332623898.77785.data

Triggers the normal replication: the etag can be calculated, it doesn't match the other replicas, and so replication is performed.
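To illustrate why the two cases behave differently, here is a hypothetical sketch (not Swift's actual code) of an audit pass: the auditor only visits files that still exist on disk, so a deleted .data file is never read, never fails an etag check, and is never quarantined.

```python
# Illustrative sketch only: the auditor walks the files present in each
# object hash directory and checks their content, so corruption is caught
# but absence is invisible to it.
import hashlib
import os


def audit_partition(object_dirs):
    """Walk (hash_dir, expected_etag) pairs; flag files whose content is bad."""
    quarantined = []
    for hash_dir, expected_etag in object_dirs:
        for name in os.listdir(hash_dir):       # a deleted file never shows up
            if not name.endswith(".data"):
                continue
            path = os.path.join(hash_dir, name)
            with open(path, "rb") as f:
                etag = hashlib.md5(f.read()).hexdigest()
            if etag != expected_etag:           # corruption is detected...
                quarantined.append(path)        # ...but a missing file is not
    return quarantined
```

Overwriting the .data file (as above) changes the computed etag and is caught; removing it simply shrinks the list the loop iterates over.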

Changed in swift:
status: New → In Progress
Changed in swift:
status: In Progress → Confirmed
Changed in swift:
assignee: nobody → You Yamagata (y-yamagata)
You Yamagata (y-yamagata) wrote :

To fix this problem I am developing an additional auditor, 'swift-object-auditor2' (still a work in progress).

https://github.com/yoyama/swift/blob/DEVEL_AUDITOR2/swift/obj/auditor2.py

This auditor manages a new pickle file, 'hashes_expire.pkl', which stores an expiration time for each suffix. If the auditor detects an expired suffix, it calls swift.obj.replicator.get_hashes() and updates hashes.pkl. If any object files were lost in the meantime, the suffix hash changes, so the object replicator will detect the difference and repair it.

I would appreciate any comments or advice.
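The core idea of the proposed auditor2 can be sketched as follows. This is a hypothetical illustration, not the actual code from the linked branch: the file name hashes_expire.pkl comes from the comment above, but the structure and the EXPIRY constant are assumptions.

```python
# Illustrative sketch of expiry-based suffix rehashing: each suffix hash
# carries a deadline; once it passes, the suffix is rehashed from disk, so
# a silently deleted .data file changes the hash and the replicator notices.
import os
import pickle
import time

EXPIRY = 7 * 24 * 3600  # assumed: rehash every suffix at least weekly


def expired_suffixes(partition_dir):
    """Return suffixes whose cached hash is older than its recorded expiry."""
    pkl = os.path.join(partition_dir, "hashes_expire.pkl")
    try:
        with open(pkl, "rb") as f:
            expires = pickle.load(f)  # assumed layout: {suffix: unix timestamp}
    except (OSError, EOFError):
        expires = {}
    now = time.time()
    return [suffix for suffix, deadline in expires.items() if deadline <= now]
```

Each expired suffix would then be passed to get_hashes() to force a recomputation, bounding how long a missing object can go undetected.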

Changed in swift:
status: Confirmed → In Progress
Ray Yen (grsyen) wrote :

Has this bug been fixed?

You Yamagata (y-yamagata) wrote :

I am working on it and have submitted a patch:

https://review.openstack.org/#/c/11452/

sirkonst (sirkonst) wrote :

What is the status of this?

clayg (clay-gerrard) wrote :

I'm not sure if that patch is still in progress?

Generally speaking it's not the auditor's *job* to fix missing files - that's the replicator's.

But it looks like the proposed change is more subtle? Is the strategy just for the auditor to recognize what looks like an empty object data directory and invalidate hashes.pkl, the same way it does on quarantine? Or is there something else going on or needed?

clayg (clay-gerrard) on 2014-03-20
Changed in swift:
status: In Progress → Incomplete
assignee: You Yamagata (y-yamagata) → nobody
importance: Undecided → Wishlist
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Object Storage (swift) because there has been no activity for 60 days.]

Changed in swift:
status: Incomplete → Expired
John Leach (johnleach) on 2018-11-09
Changed in swift:
status: Expired → Confirmed
John Leach (johnleach) wrote :

We've come across this too. The problem exists from 1.13 through to 2.17.0.

To reproduce: run the dispersion populate tool, delete one of the *.data object files off disk, and run the dispersion report to confirm the object is missing.

Run the object auditor on that server and notice it doesn't detect the problem.

Run the object replicator from other nodes and notice it doesn't sync the missing object.

If you delete the hashes.pkl file from the partition then the replicator will then sync the missing data file.

Confirmed that if you change the filesize of the .data file, the auditor notices, quarantines the file, and the replicator then syncs it.

And if you change the data in the .data file but keep the size the same, the auditor notices and quarantines it too.

But if the file is just missing, the auditor doesn't notice and apparently doesn't update the hashes file, so the replicator skips the partition when it runs.
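The manual workaround described above (deleting hashes.pkl to force a rehash) can be sketched like this. The path layout mirrors the one shown in the reproduction steps; the function name and return convention are illustrative, not Swift's actual invalidation code.

```python
# Illustrative sketch: remove the cached suffix hashes for a partition so
# the next replicator pass must rehash the on-disk contents, at which point
# the missing .data file changes the suffix hash and gets re-synced.
import os


def invalidate_partition(devices_root, device, partition):
    """Delete hashes.pkl for one partition; return whether it existed."""
    pkl = os.path.join(devices_root, device, "objects", str(partition),
                       "hashes.pkl")
    try:
        os.remove(pkl)
        return True
    except FileNotFoundError:
        return False
```

For the example earlier in this report, that would mean removing /srv/node/sdd1/objects/164816/hashes.pkl on the affected node.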
