Successful DELETE of "ghost" object returns 404

Bug #2003941 reported by Matthew Vernon
This bug affects 1 person
Affects: OpenStack Object Storage (swift)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

I'm running swift version 2.26.0 (from Debian, package version 2.26.0-10).

For some reason (I've not got to the bottom of this yet), we have a number of "ghost" objects - they appear in container listings but if you GET/HEAD them you get 404.

If you DELETE one of these objects, though, swift still says 404, but does in fact successfully delete the object - it no longer appears in the container listing. I think swift SHOULD return 200/202/204 in this case, since the DELETE has in fact occurred successfully.

We found this out because the client we were using for some maintenance work errors out on a 404.

[one might quibble about whether 404 is the most correct response to GET/HEAD for these ghost objects, but that's a separate question]
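
For concreteness, a minimal sketch of the behaviour described above, using python-swiftclient (the endpoint, credentials, container and object names are placeholders):

 # Sketch of the observed behaviour; the auth URL, credentials and names are placeholders.
 from swiftclient.client import Connection
 from swiftclient.exceptions import ClientException

 conn = Connection(authurl='https://swift.example.com/auth/v1.0',
                   user='account:user', key='secret')

 # The "ghost" object shows up in the container listing...
 _, listing = conn.get_container('mycontainer')
 print('listed before DELETE:',
       any(o['name'] == 'ghost-object' for o in listing))

 # ...but HEAD and DELETE both return 404...
 for method in (conn.head_object, conn.delete_object):
     try:
         method('mycontainer', 'ghost-object')
     except ClientException as err:
         print(method.__name__, 'returned', err.http_status)  # 404 for a ghost

 # ...yet after the DELETE the listing entry is gone.
 _, listing = conn.get_container('mycontainer')
 print('listed after DELETE:',
       any(o['name'] == 'ghost-object' for o in listing))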

Revision history for this message
clayg (clay-gerrard) wrote :

One way you can get a ghost listing is you have an expired object (it includes x-delete-at metadata) that hasn't yet been reaped by the object-expirer daemon (maybe because it's running behind or is mis-configured ... do you monitor your .expiring_objects queue?). In this scenario you could use swift-get-nodes to discover the on-disk location and check whether there is still an object .data file or a tombstone (.ts) file.
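
If it helps, a rough sketch of peeking at that queue with Swift's InternalClient (the conf path and the exact naming of the queue entries are assumptions about a fairly standard deployment; treat this as read-only poking, not a supported interface):

 # Rough sketch: count and list entries still sitting in the expirer's queue account.
 # The internal-client conf path is an assumption; adjust for your deployment.
 from swift.common.internal_client import InternalClient

 client = InternalClient('/etc/swift/internal-client.conf',
                         'expiring-queue-check', 3)

 containers, objects = client.get_account_info('.expiring_objects')
 print('entries queued for expiry:', objects)

 # Queue entries are (roughly) named "<delete-at-timestamp>-<acct>/<container>/<obj>".
 for container in client.iter_containers('.expiring_objects'):
     for entry in client.iter_objects('.expiring_objects', container['name']):
         print(entry['name'])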

The other way you can get a ghost listing is you had a swift-container-server node with a container database that was isolated from the cluster for longer than the configured reclaim_age (the default is only 7 days). After a reclaim_age the connected database servers will "reclaim" any "tombstone rows" indicating that an object was deleted. If an isolated database with a record of a PUT at t1 rejoins the cluster and finds that all database records of the DELETE at t2 have been reclaimed, the "missing" PUT at t1 will be replicated to the connected database servers without any tombstone row to prevent it. In this scenario the tombstone files in the object-data layer are likely also reclaimed, so you won't find any object data. But you might be able to determine which databases were connected and which were isolated by examining the ROW_ID of the ghost rows.
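
On the database side, a hedged sketch of comparing the replicas directly (the .db paths are placeholders you'd take from swift-get-nodes against the container ring; the object table and its name/deleted columns are the standard container-DB schema, but this is read-only inspection, not a supported interface):

 # Compare undeleted rows across the container DB replicas (read-only).
 import sqlite3

 # Placeholder paths; take the real ones from swift-get-nodes output.
 DB_PATHS = [
     '/srv/node/disk1/containers/.../<hash>.db',
     '/srv/node/disk2/containers/.../<hash>.db',
     '/srv/node/disk3/containers/.../<hash>.db',
 ]

 replicas = {}
 for path in DB_PATHS:
     db = sqlite3.connect('file:%s?mode=ro' % path, uri=True)
     rows = db.execute(
         'SELECT ROWID, name FROM object WHERE deleted = 0').fetchall()
     db.close()
     replicas[path] = {name: rowid for rowid, name in rows}
     print(path, len(rows), 'undeleted rows')

 # Names missing from any replica are the "ghost" candidates; their ROWIDs
 # hint at which database was isolated when the rows were merged back in.
 in_all = set.intersection(*(set(rows) for rows in replicas.values()))
 for path, rows in replicas.items():
     for name in sorted(set(rows) - in_all):
         print('not in all replicas:', path, 'ROWID', rows[name], name)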

FWIW s3api always returns 2XX for DELETE even if the object doesn't exist - but the swift API has always returned the status code that seemed most appropriate from what it observed when writing down the tombstone for the delete (either there was a .data file => 2xx, or there was a tombstone or no data => 404). I agree that we could "quibble" about the status - but it wouldn't be obviously helpful, and there may be barriers to consider before changing the expected response of the swift v1 API. Clients can probably cope - it's not such a big problem for DELETE to return "not found" - but swiftclient at least may not be setting the best example here:

 (nvidia) clayg@banana:~/Workspace/nvidia$ swift list test
 test
 (nvidia) clayg@banana:~/Workspace/nvidia$ swift delete test foo
 Error Deleting: test/foo: Object DELETE failed: https://stg-swiftstack-maglev.ngc.nvidia.com/v1/AUTH_clayg/test/foo 404 Not Found [first 60 chars of response] b'<html><h1>Not Found</h1><p>The resource could not be found.<' (txn: txe9e3279a72184e69ba1f6-0063d2a9f5)
 (nvidia) clayg@banana:~/Workspace/nvidia$ echo $?
 1
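
For what it's worth, a client can already treat 404 from DELETE as "already gone" without any server-side change; a minimal sketch with python-swiftclient (credentials and names are placeholders):

 # Treat 404 on DELETE as success; re-raise anything else.
 from swiftclient.client import Connection
 from swiftclient.exceptions import ClientException

 def delete_if_present(conn, container, obj):
     try:
         conn.delete_object(container, obj)
     except ClientException as err:
         if err.http_status != 404:
             raise
         # 404: the object (or its stale listing row) is gone either way.

 conn = Connection(authurl='https://swift.example.com/auth/v1.0',
                   user='account:user', key='secret')
 delete_if_present(conn, 'test', 'foo')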

Revision history for this message
Matthew Vernon (matthew-debian) wrote :

Hi,

We don't run the object expirer, and I am reasonably sure our clients don't set the expiry header. Is there a (simple!) way to check if swift thinks there are objects it should be expiring?

It's certainly possible we have at some point(s) had container disks detached from the cluster for more than `reclaim_age` - this cluster has been running for years, and sometimes nodes fail in ways that cause us to leave them off until their hardware has been fixed.

If that were the case (and was the cause of these ghost listings), could that result in inconsistent container listings (i.e. sometimes you get the ghosts, sometimes you don't)? I'm seeing behaviour that suggests this might be going on. Is there a way to check the various replicas of a container database are all the same? [we have 3-times replication in our container ring]

rclone, like the swift CLI, treats a 404 on delete as a hard failure:

2023/01/27 10:32:37 ERROR : wikipedia-commons-local-public.98/9/98/Professor_Izaak_Mauritis_Kolthoff.jpg: Couldn't delete: Object Not Found
[return code 1]

Thanks,

Matthew

Revision history for this message
Matthew Vernon (matthew-debian) wrote :

Hi,

Sorry for the slow response, I've spent a while poking around this.

What we seem to have ended up with is that one of our container databases (out of three replicas) has a number of additional entries in it - i.e. its object table contains rows it considers undeleted that are not present in the other two replicas (which agree with each other on the number of undeleted objects).

Those rows in our divergent container database no longer correspond to anything on disk (i.e. the paths reported by swift-get-nodes don't exist at all - no .data, no .ts).

Also, some of them appear in every container listing, but the number of objects returned varies each time (e.g. doing swift list | wc -l gives you a different answer, and it's not a monotonic sequence); is all of this weirdness likely solely a consequence of having had a disconnected node at some point in the past?

[I take your point that clients may rely on the existing behaviour, but I still think 404 for a DELETE that succeeds is incorrect (AFAICT the erroneous row in the divergent database does go away, though I should probably check harder given the problems with inconsistent listings)]

Thanks,

Matthew
