Different reclaim ages for accounts and containers can result in un-reclaimable containers

Bug #1300850 reported by Donagh McCabe on 2014-04-01
Affects: OpenStack Object Storage (swift)
Importance: Undecided
Assigned to: Donagh McCabe

Bug Description

We have a system where the reclaim_age on accounts was lower than the reclaim_age on containers. We have found a number of container databases that have not been reclaimed. In all cases:

- There is a single copy of the container database (i.e., the other two copies have been reclaimed)
- There are no account databases (i.e., all reclaimed)
- The reported_delete_timestamp is 0, whereas status, deleted_timestamp and put_timestamp are as expected

It is thought you can get into this state as follows:
1/ Account is deleted
2/ The objects and then the containers are deleted. Everything is in the
   expected state -- specifically, the container's reported_delete_timestamp
   is the same as delete_timestamp.
3/ Container server A reaches reclaim age: deletes the container
   database.
4/ Server B or C (assuming 3 replicas) runs container-replicator which
   restores the database to server A
5/ The account database has already been reclaimed (see below).
6/ The container-updater on server A cannot push the container data to the
   account so the reported_delete_timestamp remains at 0
7/ The reclaim age on server B and C is reached so they delete their copies

The race in steps 3 and 4 does not always occur (because step 7 happens first), so not all containers are left behind. This race may happen a lot on a normal system (where "normal" means the same reclaim_age everywhere), but it goes unnoticed because the container-replicator updates reported_delete_timestamp within a few minutes. However, even on a normal system, reclaims of account databases happen within the same timeframe as containers, so the account reclaim and the race in steps 3 and 4 might happen together, leaving the container database in place.
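The sequence above can be sketched as a toy model. The names below are illustrative only (this is not Swift's actual code); the one real detail is that a restored database comes back with zeroed reported timestamps, as container.backend._newid does:

```python
RECLAIM_AGE = 7 * 86400  # container reclaim_age in seconds (one week)

class ContainerDB:
    """Toy stand-in for a container database's delete bookkeeping."""
    def __init__(self, delete_timestamp):
        self.delete_timestamp = delete_timestamp
        # After the container-updater has run, reported == actual.
        self.reported_delete_timestamp = delete_timestamp

def rsync_restore(db):
    """Step 4: replication pushes the db back to server A. As with
    container.backend._newid, the reported timestamps come back as zero."""
    restored = ContainerDB(db.delete_timestamp)
    restored.reported_delete_timestamp = 0
    return restored

def updater_push(db, account_exists):
    """Step 6: the updater can only report the delete while the account
    database still exists."""
    if account_exists:
        db.reported_delete_timestamp = db.delete_timestamp
        return True
    return False

db = ContainerDB(delete_timestamp=1000.0)    # steps 1-2: clean delete
db = rsync_restore(db)                       # steps 3-4: reclaim + restore
ok = updater_push(db, account_exists=False)  # steps 5-6: account already gone
print(ok, db.reported_delete_timestamp)      # False 0 -- stuck this way forever
```

With reported_delete_timestamp stuck at 0 and no account left to report to, nothing ever brings the database back into a reclaimable state.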

The effect of the un-reclaimable container database is twofold:
- It takes up disk space
- The account-server continues to get updates from the container-updater... and logs tracebacks because the database file does not exist.

We are debugging a proposed solution whereby we will reclaim after reclaim_age * 4, i.e., if the container-updater cannot push stats to the account after a month of trying, it's probably never going to work, so we might as well give up.
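As a rough sketch, the proposed give-up check in the container-replicator might look like this. The function name and signature are hypothetical, not the actual patch:

```python
import time

RECLAIM_AGE = 7 * 86400  # one week, Swift's default

def can_reclaim(deleted, delete_timestamp, reported_delete_timestamp,
                now=None, reclaim_age=RECLAIM_AGE):
    """Hypothetical reclaim check: reclaim normally once the delete has
    been reported to the account; if it never gets reported, give up
    after reclaim_age * 4 (roughly a month)."""
    now = time.time() if now is None else now
    if not deleted:
        return False
    age = now - float(delete_timestamp)
    if float(reported_delete_timestamp) >= float(delete_timestamp):
        return age > reclaim_age       # normal path: delete was reported
    return age > reclaim_age * 4       # never reported: give up eventually

# Deleted on day 1, never reported; checked on day 10: kept (9 days < 28)
print(can_reclaim(True, 86400, 0, now=10 * 86400))   # False
# Checked on day 30: reclaimed anyway (29 days > 28)
print(can_reclaim(True, 86400, 0, now=30 * 86400))   # True
```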

clayg (clay-gerrard) wrote :

I don't understand. Why aren't the container's reported_* timestamps getting updated during the reclaim period? The account server should respond 2XX to PUT container requests with x-account-override-deleted. Is the container-updater not able to complete a cycle within the account's reclaim age?

If we're trying to handle the container-updater hitting an account that has already been reclaimed, then the account server needs to catch the Exception coming out of account_broker.put_container and handle it. I think it should be the job of the container-updater to recognize when it is the right time to "give up" on updating the account. For example: perhaps when the majority of account servers respond 404 (or whatever response is returned from the exception handler for put_container) and broker.is_deleted and time() - delete_timestamp > reclaim_age.
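That heuristic might be sketched like this (hypothetical helper, not current Swift code; the inputs stand in for the account-server status codes and the container broker's state):

```python
import time

def should_give_up(responses, is_deleted, delete_timestamp,
                   reclaim_age, now=None):
    """Hypothetical version of the suggested heuristic: the
    container-updater stops retrying when a majority of account servers
    return 404, the container broker says it is deleted, and the delete
    is older than reclaim_age."""
    now = time.time() if now is None else now
    majority_404 = sum(1 for s in responses if s == 404) > len(responses) / 2
    return (majority_404 and is_deleted
            and now - float(delete_timestamp) > reclaim_age)

# Two of three account servers say 404, deleted over a week ago: give up
print(should_give_up([404, 404, 500], True, 0, 7 * 86400, now=8 * 86400))  # True
# Only one 404: keep trying
print(should_give_up([404, 500, 500], True, 0, 7 * 86400, now=8 * 86400))  # False
```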

Donagh McCabe (donagh-mccabe) wrote :

> Why aren't the container's reported_ timestamps getting updated during the reclaim period?

They are reported. Let's say that reclaim_age for accounts is 6 days and for containers is the usual 7 days. When the account is deleted, the container-updater has 6 days during which it can push the timestamps to the account. It almost certainly manages to do this.

On day 6, the account database is reclaimed.

On day 7, three container-replicators are due to reclaim/delete the database files. My supposition is that one (node A) is slightly ahead of the other two and deletes the database. Since the other servers (B and C) are slightly behind, they have not reached their reclaim_age so one of the container-replicators rsyncs the database back to A. As part of this process, the reported_*_timestamps are set to zero (see container.backend._newid). Minutes later, B and C reclaim their database files.

Normally, this race would resolve itself, because the container-updater on A would push the timestamps to the account and set the reported_*_timestamps. However, in this scenario the account was deleted the day before, so the container-updater never succeeds.

I don't have logs to prove all of this -- I'm using a mental exercise to work out how we got into this state.

> ...then the account server needs to catch the Exception coming out of account_broker.put_container and handle it...

I won't disagree with that. By not handling the exception, the account-server ends up reporting e500 instead of e404. However, I regarded that as a second-order problem. I would need to spend more time to decide whether 404 means something to other components before proposing to fix that. As it happens, the e500 was useful in finding this problem, because we noticed that some nodes had higher e500 counts than other nodes.

> ...it should be the job of the container-updater to recognize when it is the right time to "give up" on updating the account

This would stop the e500 noise in the system. However, we end up with an orphaned db file that's never deleted. By deleting the db in container-replicator, we stop container-updater stumbling across such files.

Note: I have a proposed fix here: https://review.openstack.org/#/c/84696/ I don't know why gerrit didn't update this bug (maybe because it's WIP).

Takashi Kajinami (kajinamit) wrote :

I don't think "giving up" on updating is a good idea, because it creates the danger that the account will never be reclaimed if all container servers fail to update and give up.

> As part of this process, the reported_*_timestamps are set to zero (see container.backend._newid).

I think this is the root of the problem. When the database is completely rsynced from container-server B to container-server A, we don't have to reset reported_*_timestamp on container-server A, because container-server B has already updated the account. (We only have to reset it after merging databases.)

I think removing that reset is a solution to this problem. It stops container-server A from attempting this unnecessary update after replication, and the replicated container will then be reclaimed normally.

Donagh McCabe (donagh-mccabe) wrote :

>it creates the danger that the account will never be reclaimed if all container servers fail to update and give up.

I'm proposing to delete the container databases, not to give up in the container-updater. However, I take your point: if we delete the container database, the account might not get reclaimed. However, to get into this state, all three copies of the account must be down for a month (or conversely, the container db must be offline for a month). The upshot either way is that *some* database file becomes un-reclaimable.

> As part of this process, the reported_*_timestamps are set to zero (see container.backend._newid).
>> I think this is the root of this problem.

I'm reluctant to change that -- there must be a good reason for the _newid() function. It is also used when you do a merge of records (vs. a complete file copy).

In addition, this would only prevent future un-reclaimable databases -- not clean out my existing database files.

Can I suggest that you make comments against the proposed change at https://review.openstack.org/#/c/84696 ? The Swift core developers tend to interact more in Gerrit than in Launchpad.

Changed in swift:
assignee: nobody → Donagh McCabe (donagh-mccabe)
status: New → In Progress
Matthew Oliver (matt-0) wrote :

Would something like having the container-replicator only replicate if the account exists (i.e. an account check) solve this very specific use case? Existing problem containers, or a storage node coming back online with older reported_*_timestamps, would still have the issue, however. Quarantining makes sense.

Juan J. Martínez (jjmartinez) wrote :

Would it be beneficial to set more conservative values for reclaim_age?

If the accounts' reclaim_age > the containers' reclaim_age > the objects' reclaim_age, that should get rid of the race condition, shouldn't it?
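Such an ordering might look like the following. The values are purely illustrative; reclaim_age is a real option of each replicator, but check your deployment's defaults before adopting anything like this:

```ini
# account-server.conf -- reclaimed last, so containers can always report
[account-replicator]
reclaim_age = 777600     ; 9 days

# container-server.conf
[container-replicator]
reclaim_age = 691200     ; 8 days

# object-server.conf
[object-replicator]
reclaim_age = 604800     ; 7 days (the default)
```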

Also, what would be the best way of getting rid of the un-reclaimed containers? Just delete them, or is there a better way to let the cluster do it?

clayg (clay-gerrard) wrote :

@Juan

Yes, at a *minimum* we should document best practices on the reclaim ages of objects, containers & accounts!

Yes, if traversing the container tree on all nodes is a viable option in your cluster, I would expect that removing the database (or moving it to a directory outside the datadir) would allow everything to process cleanly.

... although fixing the bug would be even better <grrr>

There's some concern that the entirety of the problem and its ramifications aren't fully understood - but if you recently encountered this situation in a production environment, you might be able to provide additional forensics: what settings you were running for the reclaim ages, how many containers you found in this state, whether all replicas of the containers reflected the same last-update values - anything else that seems relevant.

Juan J. Martínez (jjmartinez) wrote :

Sorry for the delayed response, apparently I didn't subscribe to this bug report.

@clayg (clay-gerrard): we had the defaults for those values and we have experienced the problem of un-reclaimable containers in one of our production clusters, that we've been running since 2011. I guess that being live longer increases the chances of the race condition happening, as more accounts have been deleted than in more recent clusters.

After a quick look I've counted 6 containers on that cluster, but I need to do a proper analysis of the logs to be 100% sure.

I agree that a proper fix would be better. We changed reclaim age as I explained in my previous comment, but it is hard to tell if it has solved the problem because we don't have that many accounts deleted and being a race condition it may or may not happen (even if the new configuration didn't solve the problem).

clayg (clay-gerrard) wrote :

@juan np, I'm sure that will be helpful info when someone dives in. Thanks.

If you're still digging in, feel free to jump into #openstack-swift on Freenode and ask any of the swift junkies for help - we're pretty easy to nerd-snipe.
