If a drive gets hosed, replicator can continue launching "unlimited" rsyncs
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Object Storage (swift) | New | Undecided | Unassigned |
Bug Description
In a scenario where something happens and the OS hoses a disk:
```
sd 0:0:16:0: [sdq] Unhandled sense code
sd 0:0:16:0: [sdq] Result: hostbyte=DID_OK driverbyte=
sd 0:0:16:0: [sdq] Sense Key : Medium Error [current]
Info fld=0xaed41780
sd 0:0:16:0: [sdq] Add. Sense: Unrecovered read error
sd 0:0:16:0: [sdq] CDB: Read(10): 28 00 ae d4 17 80 00 00 10 00
XFS (sdq): metadata I/O error: block 0xaed41780 ("xfs_trans_
XFS (sdq): xfs_imap_to_bp: xfs_trans_
sd 0:0:16:0: [sdq] Unhandled sense code
sd 0:0:16:0: [sdq] Result: hostbyte=DID_OK driverbyte=
sd 0:0:16:0: [sdq] Sense Key : Medium Error [current]
Info fld=0xaed41780
sd 0:0:16:0: [sdq] Add. Sense: Unrecovered read error
sd 0:0:16:0: [sdq] CDB: Read(10): 28 00 ae d4 17 80 00 00 10 00
XFS (sdq): metadata I/O error: block 0xaed41780 ("xfs_trans_
XFS (sdq): xfs_imap_to_bp: xfs_trans_
```
Now the replicator will spawn rsyncs from this disk, but they very quickly zombify and stay around forever (process table snippet):
```
swift 18168 1.4 0.0 107656 1148 ? R Dec02 15:24 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-e
swift 18178 1.7 0.0 107620 1028 ? R Dec02 21:27 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-e
swift 18194 0.0 0.0 0 0 ? Z Dec02 0:00 [rsync] <defunct>
swift 18195 0.0 0.0 0 0 ? Z 05:11 0:00 [rsync] <defunct>
swift 18196 0.0 0.0 0 0 ? Z 09:28 0:00 [rsync] <defunct>
swift 18211 2.1 0.0 107620 1032 ? R Dec02 32:11 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-e
swift 18215 1.8 0.0 107624 1040 ? R Dec02 23:13 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-e
swift 18220 1.4 0.0 107624 1036 ? R Dec02 16:49 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-e
swift 18232 0.0 0.0 0 0 ? Z 02:17 0:00 [rsync] <defunct>
swift 18241 0.0 0.0 0 0 ? Z Dec02 0:00 [rsync] <defunct>
swift 18246 0.0 0.0 0 0 ? Z 08:02 0:00 [rsync] <defunct>
swift 18257 1.3 0.0 107624 1044 ? R Dec02 12:59 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-e
swift 18267 0.0 0.0 0 0 ? Z Dec02 0:00 [rsync] <defunct>
swift 18274 0.0 0.0 0 0 ? Z 06:39 0:00 [rsync] <defunct>
swift 18289 1.6 0.0 107620 1032 ? R Dec02 22:25 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-e
swift 18294 1.6 0.0 107620 1036 ? R Dec02 24:11 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-e
swift 18305 0.0 0.0 0 0 ? Z 03:45 0:00 [rsync] <defunct>
swift 18311 1.8 0.0 107620 1028 ? R Dec02 23:08 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-e
swift 18315 0.0 0.0 0 0 ? Z Dec02 0:00 [rsync] <defunct>
```
We rely on the timeout argument to rsync to clean itself up, but in these cases it never will. We probably ought to have some kind of check for the maximum number of rsyncs running at the same time, so we don't keep launching new ones when there are already X running.
The kernel log above is what it looks like when a drive goes bad and we keep spinning up rsyncs.