auditor error when missing data file

Bug #1260132 reported by clayg
This bug affects 1 person
Affects: OpenStack Object Storage (swift)
Status: Confirmed
Importance: Low
Assigned to: Unassigned

Bug Description

so I have a directory that looks like this:

clayg@swift:~$ ls /srv/node1/sdb1/objects/736/ce8/b80cbd53563425737e12108c6b497ce8/
1386805231.06924.meta

The object-server just 500's:

Dec 11 16:46:21 swift object-server: ERROR __call__ error with HEAD /sdb4/736/AUTH_test/test/asdf :
Traceback (most recent call last):
  File "/mnt/workspace/swift/swift/obj/server.py", line 661, in __call__
    res = method(req)
  File "/mnt/workspace/swift/swift/common/utils.py", line 2012, in wrapped
    return func(*a, **kw)
  File "/mnt/workspace/swift/swift/common/utils.py", line 760, in _timing_stats
    resp = func(ctrl, *args, **kwargs)
  File "/mnt/workspace/swift/swift/obj/server.py", line 540, in HEAD
    metadata = disk_file.read_metadata()
  File "/mnt/workspace/swift/swift/obj/diskfile.py", line 1258, in read_metadata
    with self.open():
  File "/mnt/workspace/swift/swift/obj/diskfile.py", line 973, in open
    data_file, meta_file, ts_file = self._get_ondisk_file()
  File "/mnt/workspace/swift/swift/obj/diskfile.py", line 1088, in _get_ondisk_file
    " %s, meta_file: %s, ts_file: %s" % (data_file, meta_file, ts_file)
AssertionError: On-disk file search algorithm contract is broken: data_file: None, meta_file: /srv/node4/sdb4/objects/736/ce8/b80cbd53563425737e12108c6b497ce8/1386805246.55842.meta, ts_file: None (txn: tx5171b4f05f244420824ea-0052a9075d)
Dec 11 16:46:21 swift object-server: 127.0.0.1 - - [12/Dec/2013:00:46:21 +0000] "HEAD /sdb4/736/AUTH_test/test/asdf" 500 1046 "HEAD http://localhost:8080/v1/AUTH_test/test/asdf" "tx5171b4f05f244420824ea-0052a9075d" "proxy-server 1794" 0.0021
Dec 11 16:46:21 swift proxy-server: ERROR 500 From Object Server 127.0.0.1:6040/sdb4 (txn: tx5171b4f05f244420824ea-0052a9075d)

My auditor is also whining about AssertionError:

object-auditor: ERROR Trying to audit /srv/node1/sdb1/objects/736/ce8/b80cbd53563425737e12108c6b497ce8:
Traceback (most recent call last):
  File "/mnt/workspace/swift/swift/obj/auditor.py", line 155, in failsafe_object_audit
    self.object_audit(location)
  File "/mnt/workspace/swift/swift/obj/auditor.py", line 173, in object_audit
    with df.open():
  File "/mnt/workspace/swift/swift/obj/diskfile.py", line 973, in open
    data_file, meta_file, ts_file = self._get_ondisk_file()
  File "/mnt/workspace/swift/swift/obj/diskfile.py", line 1088, in _get_ondisk_file
    " %s, meta_file: %s, ts_file: %s" % (data_file, meta_file, ts_file)
AssertionError: On-disk file search algorithm contract is broken: data_file: None, meta_file: /srv/node1/sdb1/objects/736/ce8/b80cbd53563425737e12108c6b497ce8/1386805231.06924.meta, ts_file: None

ssync replicator too:

object-replicator: 127.0.0.1:6010/sdb1/736 Unexpected response: ":ERROR: 0 'On-disk file search algorithm contract is broken: data_file: None, meta_file: /srv/node1/sdb1/objects/736/ce8/b80cbd53563425737e12108c6b497ce8/1386805231.06924.meta, ts_file: None'"

Seems like in that condition someone could step up to the table and quarantine that bad boy, or push a data file in under it, or something. e.g. rsync replication seems to square it away, as long as it notices the hashes.pkl is out of date...
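
For reference, the contract that assertion enforces boils down to: a hash dir should hold nothing, a lone tombstone, or a .data (possibly with a newer .meta). A minimal sketch of that check, with illustrative names rather than the actual diskfile.py code:

    import os

    def classify_ondisk_files(hashdir):
        """Pick the newest .data/.meta/.ts in a hash dir (simplified)."""
        data_file = meta_file = ts_file = None
        for name in sorted(os.listdir(hashdir), reverse=True):
            path = os.path.join(hashdir, name)
            if name.endswith('.data') and data_file is None:
                data_file = path
            elif name.endswith('.ts') and ts_file is None:
                ts_file = path
            elif name.endswith('.meta') and meta_file is None:
                meta_file = path
        # Valid states: empty dir, lone tombstone, or a .data (optionally
        # with a newer .meta). A lone .meta -- the state shown above --
        # matches none of them, hence the AssertionError.
        assert (not data_file and not meta_file and not ts_file) or \
               (ts_file and not data_file and not meta_file) or \
               (data_file and not ts_file), \
            'On-disk file search algorithm contract is broken'
        return data_file, meta_file, ts_file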

Revision history for this message
Samuel Merritt (torgomatic) wrote :

Yeah, quarantine is probably the way to go here, then a 404 from the object server (or whatever it usually does on quarantine).

I guess you could get here if an object server is unlinking stuff and dies halfway through...?
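
Roughly what that might look like, as a sketch only (handle_orphaned_meta is a hypothetical helper, not the patch that eventually merged; it leans on the existing quarantine_renamer helper and DiskFileNotExist exception):

    from swift.common.exceptions import DiskFileNotExist
    from swift.obj.diskfile import quarantine_renamer

    def handle_orphaned_meta(device_path, meta_file):
        # Move the whole hash dir aside for later inspection ...
        quarantine_renamer(device_path, meta_file)
        # ... then act as if the object doesn't exist; the object server
        # already turns that into a 404 for the client.
        raise DiskFileNotExist()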

Revision history for this message
Peter Portante (peter-a-portante) wrote :

I think we have to find out how we got into that state. If we don't know, then we run the risk of other logic errors in the code quarantining objects out of existence.

Any time a .data gets removed in the system, a .ts file should be placed first, and then all the .meta and .data files can be removed.

The other problem we face is stale listdir() results. If we detect that condition, it may be worth re-reading the directory and trying again, since otherwise a quarantine operation could be performed on good data just because of a stale directory read.
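
As a sketch of that ordering (illustrative code, not Swift's actual DELETE path):

    import os

    def delete_object(hashdir, timestamp):
        # 1. Land the tombstone durably first; a crash after this point
        #    leaves a valid state (a .ts plus some superseded files).
        ts_path = os.path.join(hashdir, timestamp + '.ts')
        fd = os.open(ts_path, os.O_WRONLY | os.O_CREAT, 0o644)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
        # 2. Only then remove the older .data/.meta files; no crash window
        #    can leave a .meta with nothing underneath it.
        for name in os.listdir(hashdir):
            if name.endswith(('.data', '.meta')):
                os.unlink(os.path.join(hashdir, name))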

Revision history for this message
Peter Portante (peter-a-portante) wrote :

FWIW: added some tests to show current behavior of hash_cleanup_listdir(): https://gist.github.com/portante/7923353

Revision history for this message
gholt (gholt) wrote :

A system could get into that state just due to file system corruption (power fault, bit error, who knows?).

I don't think writing a .ts before deleting stuff and then deleting the .ts or whatever is truly the solution. You can still end up with corruption that leaves you with just a .meta, especially with all the alternate file systems folks are wanting to use. Best to handle the situation.

Quarantining seems the right thing to do, in my opinion. If this node is the only one with the .meta then yes there is a risk that the .meta was good and that the node just lost the .data. But that would mean one node lost the .data and the other nodes lost the .meta which seems pretty unlikely. More likely would be that the node with just the .meta brought it back to life due to some corruption and that it shouldn't be there. But, I guess the point is, the state is not knowable so quarantine makes sense (to me).

The hash_cleanup_listdir should never leave such a state, since it deletes superseded files only, or a .ts that stands alone and is older than the reclaim age. But again, with file system corruption, anything is possible, including directories where there should be files, bizarre names, perfect names in seemingly weird orders, etc.
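
For anyone following along, that rule amounts to roughly the following (a simplified sketch, not the real hash_cleanup_listdir()):

    import os
    import time

    def hash_cleanup_listdir(hsh_path, reclaim_age=604800):
        files = sorted(os.listdir(hsh_path), reverse=True)  # newest first
        # A lone tombstone is only reclaimed once it is older than
        # reclaim_age.
        if len(files) == 1 and files[0].endswith('.ts'):
            age = time.time() - float(files[0].rsplit('.ts', 1)[0])
            if age > reclaim_age:
                os.unlink(os.path.join(hsh_path, files[0]))
                return []
            return files
        keep = []
        for name in files:
            if name.endswith('.meta') and not keep:
                keep.append(name)        # a .meta newer than the .data stays
            elif name.endswith(('.data', '.ts')) and \
                    not any(f.endswith(('.data', '.ts')) for f in keep):
                keep.append(name)        # newest .data or .ts wins
            else:
                os.unlink(os.path.join(hsh_path, name))   # superseded
        return keep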

Revision history for this message
clayg (clay-gerrard) wrote :

FWIW this wasn't a logic error; I simulated file system corruption (or whatever you want to call it) by deleting the .data file.

I was testing the https://review.openstack.org/#/c/61006/ and just wanted to see what happened.

So no worry about quarantine hiding some other issue. Also quarantine just moves stuff out of the way for inspection yeah - so even if we quarantined all three copies we'd have some chance to look into what happened, or at least the state of things when we quarantined.

Revision history for this message
Peter Portante (peter-a-portante) wrote :

Okay, great to hear. I have a patch to try to address it. Just formulating some tests for it.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.openstack.org/61822
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=bc0807b3c58a44b81c54fe016497545c54971d92
Submitter: Jenkins
Branch: master

commit bc0807b3c58a44b81c54fe016497545c54971d92
Author: Peter Portante <email address hidden>
Date: Wed Dec 11 20:41:34 2013 -0500

    Refactor to share on-disk layout with hash cleanup

    Closes-Bug: 1260132
    Change-Id: Iaa367c686b8dc49dd53c55a7cca661d9611044f8

Changed in swift:
status: New → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (feature/ec)

Fix proposed to branch: feature/ec
Review: https://review.openstack.org/66462

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (feature/ec)

Reviewed: https://review.openstack.org/66462
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=3895441afd1f8ca49a09a483f402a961009a8661
Submitter: Jenkins
Branch: feature/ec

commit bad52f11218a11978d1efb0832f164a60a363cc2
Author: Clay Gerrard <email address hidden>
Date: Fri Jan 10 00:31:55 2014 -0800

    Allow programmatic reloading of Swift hash path config

    New utils function validate_hash_conf allows you to programmatically reload
    swift.conf and hash path global vars HASH_PATH_SUFFIX and HASH_PATH_PREFIX
    when they are invalid.

    When you load swift.common.utils before you have a swift.conf there's no good
    way to force a re-read of swift.conf and repopulate the hash path config
    options - short of restarting the process or reloading the module - both of
    which are hard to unittest. This should be no worse in general and in some
    cases easier.

    Change-Id: I1ff22c5647f127f65589762b3026f82c9f9401c1

commit 7b9c283203479cb9916951e1ce1f466f197dea36
Author: Samuel Merritt <email address hidden>
Date: Fri Jan 10 12:57:53 2014 -0800

    Add missing license header to test file

    All the other tests have license headers, so this one should too.

    I picked 2013 for the copyright year because that's when "git log"
    says it was first and last touched.

    Change-Id: Idd41a179322a3383f6992e72d8ba3ecaabd05c47

commit 47fcc5fca2c5020b69f3c2c7f0a8032f6c77354a
Author: Christian Schwede <email address hidden>
Date: Fri Jan 10 07:14:43 2014 +0000

    Update account quota doc

    A note was added stating that the same limitations apply to
    account quotas as for container quotas. An example on uploads
    without a content-length header was added.

    Related-Bug: 1267659
    Change-Id: Ic29b527cb71bf5903c2823844a1cf685ab6813dd

commit 6426f762d0d87063f9813630c620d880a4191046
Author: Peter Portante <email address hidden>
Date: Mon Dec 9 20:52:58 2013 -0500

    Raise diskfile.py module coverage to > 98%

    We attempt to get the code coverage (with branch coverage) to 100%,
    but fall short due to interactions between coverage.py and
    CPython's peephole optimizer. See:

        https://bitbucket.org/ned/coveragepy/issue/198/continue-marked-as-not-covered

    In the main diskfile module, we remove the check for a valid
    "self._tmppath" since it is only one of a number of fields that could
    be verified and it was not worth trying to get coverage for it. We
    also remove the try / except around the close() method call in the
    DiskFileReader's app_iter_ranges() method since it will never be
    called in a context that will raise a quarantine exception (by
    definition ranges can't generate a quarantine event).

    We also:

    * fix where quarantine messages are checked to ensure the
      generator is actually executed before the check
    * in new and modified tests:
      * use assertTrue in place of assert_
      * use assertEqual in place of assertEquals
    * fix references to the reserved word "object"

    Change-Id: I6379be04adfc5012cb0b91748fb3ba3f11200b48

commit 5196eae...

Changed in swift:
status: Fix Committed → In Progress
assignee: nobody → Peter Portante (peter-a-portante)
importance: Undecided → Low
Changed in swift:
assignee: Peter Portante (peter-a-portante) → nobody
Changed in swift:
status: In Progress → Confirmed
Revision history for this message
Matthew Oliver (matt-0) wrote :

Is this still a problem? I tried to recreate it (based on Clay's initial post), and the auditor and object-server didn't stack trace.

Revision history for this message
Thiago da Silva (thiagodasilva) wrote :

I just tried to recreate this bug, same as Matt and Clay, and I also no longer see a stack trace; the object server is returning a 404.

But the dangling .meta file is not being quarantined either; it is just left there until I start the replicator and the .data file is put back in place.

Should it be quarantined, or should we trust that the replicator will take care of it eventually?

Revision history for this message
Tim Burke (1-tim-z) wrote :

I think we want that to hang around; for all we know, that's the only copy of the .meta that got persisted.

*Maybe* we could clean it up if it's older than the reclaim age?
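
Something like this, perhaps (a hypothetical helper, assuming the usual one-week reclaim_age default; nothing along these lines has merged):

    import os
    import time

    def maybe_reclaim_dangling_meta(meta_path, reclaim_age=604800):
        # Keep the orphaned .meta around for the replicator to reunite with
        # its .data, and only give up on it after reclaim_age seconds.
        timestamp = float(os.path.basename(meta_path).rsplit('.meta', 1)[0])
        if time.time() - timestamp > reclaim_age:
            os.unlink(meta_path)
            return True
        return False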

Revision history for this message
clayg (clay-gerrard) wrote :

This looks *way* fixed?

https://review.openstack.org/#/c/61822/

What gives?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on swift (master)

Change abandoned by Peter Portante (<email address hidden>) on branch: master
Review: https://review.openstack.org/61876
