Comment 2 for bug 1300850

Donagh McCabe (donagh-mccabe) wrote :

> Why aren't the container's reported_ timestamps getting updated during the reclaim period?

They are reported. Let's say that reclaim_age is 6 days for accounts and the usual 7 days for containers. When the account is deleted, the container-updater has 6 days during which it can push the timestamps to the account. It almost certainly manages to do this.

On day 6, the account database is reclaimed.

On day 7, three container-replicators are due to reclaim/delete the database files. My supposition is that one (node A) is slightly ahead of the other two and deletes the database. Since the other servers (B and C) are slightly behind, they have not reached their reclaim_age so one of the container-replicators rsyncs the database back to A. As part of this process, the reported_*_timestamps are set to zero (see container.backend._newid). Minutes later, B and C reclaim their database files.

Normally, this race would resolve itself, because the container-updater on A would push the timestamps to the account and set reported_*_timestamps. However, in this scenario the account database was reclaimed the day before, so the container-updater never succeeds.
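
To make the sequence above concrete, here is a toy simulation of the race. It is only a sketch of my mental model, not Swift code: ContainerDB, can_reclaim, updater_pass and simulate are made-up names, the 5-minute clock lead on node A is an arbitrary way to model "slightly ahead", and it assumes a replicator only reclaims a db whose reported timestamp is up to date.

```python
from dataclasses import dataclass

DAY = 24 * 60 * 60  # seconds
ACCOUNT_RECLAIM_AGE = 6 * DAY    # values taken from the scenario above
CONTAINER_RECLAIM_AGE = 7 * DAY


@dataclass
class ContainerDB:
    """Toy stand-in for one node's copy of a deleted container db."""
    delete_timestamp: float
    reported_delete_timestamp: float  # 0.0 means "not reported to the account"


def can_reclaim(db, local_now):
    # Assumption: reclaim needs the reclaim age to have passed AND the
    # delete to have been reported to the account.
    aged_out = local_now - db.delete_timestamp > CONTAINER_RECLAIM_AGE
    reported = db.reported_delete_timestamp == db.delete_timestamp
    return aged_out and reported


def updater_pass(db, account_present):
    # The updater can only record a successful report if the account exists.
    if account_present:
        db.reported_delete_timestamp = db.delete_timestamp


def simulate(account_reclaim_age=ACCOUNT_RECLAIM_AGE):
    t0 = 1000.0  # account and container deleted at this moment
    offset = {"A": 300.0, "B": 0.0, "C": 0.0}  # node A's clock runs ahead
    # After 6 days of updater passes, every copy has been reported.
    nodes = {name: ContainerDB(t0, t0) for name in offset}

    def try_reclaim(name, now):
        db = nodes.get(name)
        if db is not None and can_reclaim(db, now + offset[name]):
            del nodes[name]

    # 1. Just before day 7 (real time), A already believes day 7 has passed.
    now = t0 + CONTAINER_RECLAIM_AGE - 120
    try_reclaim("A", now)

    # 2. A minute later B is not yet at reclaim_age, sees A's copy missing and
    #    copies its db over; the copy starts with a zeroed reported timestamp.
    now += 60
    try_reclaim("B", now)
    if "B" in nodes and "A" not in nodes:
        nodes["A"] = ContainerDB(nodes["B"].delete_timestamp, 0.0)

    # 3. Minutes later B and C reclaim their own copies.
    now = t0 + CONTAINER_RECLAIM_AGE + 600
    try_reclaim("B", now)
    try_reclaim("C", now)

    # 4. The updater on A tries to report, then A's replicator runs again.
    if "A" in nodes:
        updater_pass(nodes["A"], account_present=(now - t0 <= account_reclaim_age))
    try_reclaim("A", now)
    return sorted(nodes)  # nodes still holding a db file
```

With the default 6-day account reclaim_age, simulate() leaves the db stranded on A. Calling simulate(8 * DAY), where the account outlives the container reclaim period, lets the updater succeed and the race resolves itself.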

I don't have logs to prove all of this -- I'm using a mental exercise to work out how we got into this state.

> ...then the account server needs to catch the Exception coming out of account_broker.put_container and handle it...

I won't disagree with that. By not handling the exception, the account-server ends up reporting e500 instead of e404. However, I regarded that as a second-order problem. I would need to spend more time deciding whether a 404 means something to other components before proposing to fix that. As it happens, the e500 was useful in finding this problem, because we noticed that some nodes had higher e500 counts than other nodes.
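
For illustration only, the difference between the two behaviours looks roughly like this. Every name here (AccountGoneError, put_container, serve and the two handlers) is hypothetical and not the real account-server code; the sketch just shows an uncaught exception surfacing as a 500 versus being mapped to a 404. Whether a 404 is the right answer for the other components is still the open question.

```python
class AccountGoneError(Exception):
    """Hypothetical error: the account db has been deleted or reclaimed."""


def put_container(account_dbs, account, container):
    # Stand-in for a broker call that fails when the account db is gone.
    if account not in account_dbs:
        raise AccountGoneError(account)
    account_dbs[account].add(container)


def unhandled_update(account_dbs, account, container):
    # Roughly today's behaviour: the exception escapes the handler.
    put_container(account_dbs, account, container)
    return 201


def handled_update(account_dbs, account, container):
    # Suggested behaviour: translate "account is gone" into a 404.
    try:
        put_container(account_dbs, account, container)
    except AccountGoneError:
        return 404
    return 201


def serve(handler, *args):
    # Outermost layer of the toy server: any uncaught exception becomes a 500.
    try:
        return handler(*args)
    except Exception:
        return 500
```

Against an empty account_dbs dict, serve(unhandled_update, ...) returns 500 while serve(handled_update, ...) returns 404; when the account exists, both return 201.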

> should be the job of the container-updater to recognize when is the right time to "give up" on updating the account's

This would stop the e500 noise in the system. However, we end up with an orphaned db file that's never deleted. By deleting the db in container-replicator, we stop container-updater stumbling across such files.

Note: I have a proposed fix here: I don't know why gerrit didn't update this bug (maybe because it's WIP).