foreground auditor skips partitions
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
OpenStack Object Storage (swift) | New | Undecided | Unassigned | | |
Bug Description
SRE noticed that when they run the foreground auditor targeting a specific disk (because they want to drain it and the reconstructor is hitting I/O or disk errors), it won't DO anything.
At the beginning of a cycle the auditor workers write a status file per disk listing all the parts on that disk:
$ cat /srv/node4/
{"partitions": ["42", "24", "46", "54", "11", "13", "59", "36", "14", "15", "12", "2", "49", "10", "37", "0", "58", "22", "26", "38", "45", "33", "28", "16", "52", "63", "20", "25", "8", "30", "55", "19", "48", "17", "6", "51"]}
Every 60s the worker rewrites the file with the list of partitions it still has left to audit.
At the end of the cycle it deletes the file.
Each cycle (or on process startup) WAY down in diskfile the audit location generator will look for a pre-existing status file (left over from a restarted auditor) and ONLY yield out those parts to audit.
If you run a fg-auditor while a bg-auditor is mid-cycle it won't "go back" and audit the parts that had already been processed by the bg-auditor.
If the bg-auditor was "hung" for some reason, the stale auditor-status file will be cleaned up at the end of the fg-auditor cycle, and the next invocation should audit all locations again.
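To make the behaviour described above concrete, here is a minimal standalone sketch. It is not Swift's actual code: `write_status`, `partitions_to_audit`, and the `status.json` file name are hypothetical stand-ins, and the real logic lives in diskfile.

```python
import json
import os
import tempfile


def write_status(status_path, partitions):
    """Record the partitions still left to audit (hypothetical helper)."""
    with open(status_path, "w") as f:
        json.dump({"partitions": list(partitions)}, f)


def partitions_to_audit(all_partitions, status_path):
    """Return only the partitions a pre-existing status file lists.

    With no readable status file, fall back to every partition on the disk.
    """
    try:
        with open(status_path) as f:
            return list(json.load(f)["partitions"])
    except (OSError, ValueError, KeyError, TypeError):
        return list(all_partitions)


with tempfile.TemporaryDirectory() as tmp:
    status_path = os.path.join(tmp, "status.json")  # hypothetical file name
    on_disk = ["0", "1", "2", "3"]

    # No status file yet: a fresh cycle audits everything.
    fresh = partitions_to_audit(on_disk, status_path)

    # A bg-auditor is mid-cycle and still has "2" and "3" left.
    write_status(status_path, ["2", "3"])

    # A fg-auditor started now only sees the leftovers.
    resumed = partitions_to_audit(on_disk, status_path)

print(fresh)
print(resumed)
```

In this sketch the second call never sees partitions "0" and "1", which mirrors the fg-auditor not "going back" over parts the bg-auditor already processed.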
Currently the auditor only lets you target a specific device; we could perhaps add support to target specific parts, in which case the status file should probably be ignored and definitely should be removed at the end of the run.
the plumbing into the audit_location_
partitions = partitions or get_auidtor_
https:/
As a workaround SRE can delete the auditor status file before beginning a fg-auditor run.
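The workaround amounts to unlinking one file. A small sketch of doing that safely, assuming the caller already knows the status file's path (the path is not spelled out in this report, so `status_path` and the `status.json` name below are placeholders):

```python
import os
import tempfile


def remove_stale_status(status_path):
    """Delete the status file; return True if a file was actually removed."""
    try:
        os.unlink(status_path)
    except FileNotFoundError:
        # Nothing to clean up, e.g. the previous cycle finished normally.
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "status.json")  # placeholder name
    open(path, "w").close()
    first = remove_stale_status(path)   # file existed and was removed
    second = remove_stale_status(path)  # already gone
```

Note that if a bg-auditor is still running on the same disk it will rewrite the file within about 60s, so the removal only helps if the fg-auditor starts before that happens or the bg-auditor is stopped.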
An alternative "fix" might at least do the listdir and LOG that some partitions are being skipped; noting that the auditor status file will be removed when it finishes and you can re-run again to audit all partitions.
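A rough sketch of what that listdir-and-log step could look like. The function names, the log wording, and the assumption that partition directories are the all-digit entries directly under the data directory are illustrative only, not Swift's current code.

```python
import logging
import os
import tempfile


def find_skipped_partitions(datadir, status_partitions):
    """Return partitions present on disk but absent from the status file."""
    on_disk = {
        name for name in os.listdir(datadir)
        if name.isdigit() and os.path.isdir(os.path.join(datadir, name))
    }
    return sorted(on_disk - set(status_partitions), key=int)


def warn_about_skipped(datadir, status_partitions, logger=None):
    """Log a warning naming the partitions this run will not audit."""
    logger = logger or logging.getLogger(__name__)
    skipped = find_skipped_partitions(datadir, status_partitions)
    if skipped:
        logger.warning(
            "Skipping %d partition(s) in %s not listed in the auditor "
            "status file: %s. The status file is removed when this run "
            "finishes; re-run to audit all partitions.",
            len(skipped), datadir, ", ".join(skipped))
    return skipped


with tempfile.TemporaryDirectory() as datadir:
    for part in ("0", "1", "2", "10"):
        os.mkdir(os.path.join(datadir, part))
    # A non-partition entry, which should be ignored.
    open(os.path.join(datadir, "status.json"), "w").close()
    skipped = warn_about_skipped(datadir, ["2", "10"])
```

The extra listdir is one directory read per device per run, so the cost should be small next to the audit itself.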