If a drive gets hosed, replicator can continue launching "unlimited" rsyncs
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Object Storage (swift) | New | Undecided | Unassigned |
Bug Description
In a scenario where something happens and the OS hoses a disk:
```
sd 0:0:16:0: [sdq] Unhandled sense code
sd 0:0:16:0: [sdq] Result: hostbyte=DID_OK driverbyte=
sd 0:0:16:0: [sdq] Sense Key : Medium Error [current]
Info fld=0xaed41780
sd 0:0:16:0: [sdq] Add. Sense: Unrecovered read error
sd 0:0:16:0: [sdq] CDB: Read(10): 28 00 ae d4 17 80 00 00 10 00
XFS (sdq): metadata I/O error: block 0xaed41780 ("xfs_trans_
XFS (sdq): xfs_imap_to_bp: xfs_trans_
sd 0:0:16:0: [sdq] Unhandled sense code
sd 0:0:16:0: [sdq] Result: hostbyte=DID_OK driverbyte=
sd 0:0:16:0: [sdq] Sense Key : Medium Error [current]
Info fld=0xaed41780
sd 0:0:16:0: [sdq] Add. Sense: Unrecovered read error
sd 0:0:16:0: [sdq] CDB: Read(10): 28 00 ae d4 17 80 00 00 10 00
XFS (sdq): metadata I/O error: block 0xaed41780 ("xfs_trans_
XFS (sdq): xfs_imap_to_bp: xfs_trans_
```
Now the replicator will spawn rsyncs from this disk, but they very quickly zombify and stay around forever (process table snippet):
```
swift 18168 1.4 0.0 107656 1148 ? R Dec02 15:24 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-e
swift 18178 1.7 0.0 107620 1028 ? R Dec02 21:27 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-e
swift 18194 0.0 0.0 0 0 ? Z Dec02 0:00 [rsync] <defunct>
swift 18195 0.0 0.0 0 0 ? Z 05:11 0:00 [rsync] <defunct>
swift 18196 0.0 0.0 0 0 ? Z 09:28 0:00 [rsync] <defunct>
swift 18211 2.1 0.0 107620 1032 ? R Dec02 32:11 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-e
swift 18215 1.8 0.0 107624 1040 ? R Dec02 23:13 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-e
swift 18220 1.4 0.0 107624 1036 ? R Dec02 16:49 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-e
swift 18232 0.0 0.0 0 0 ? Z 02:17 0:00 [rsync] <defunct>
swift 18241 0.0 0.0 0 0 ? Z Dec02 0:00 [rsync] <defunct>
swift 18246 0.0 0.0 0 0 ? Z 08:02 0:00 [rsync] <defunct>
swift 18257 1.3 0.0 107624 1044 ? R Dec02 12:59 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-e
swift 18267 0.0 0.0 0 0 ? Z Dec02 0:00 [rsync] <defunct>
swift 18274 0.0 0.0 0 0 ? Z 06:39 0:00 [rsync] <defunct>
swift 18289 1.6 0.0 107620 1032 ? R Dec02 22:25 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-e
swift 18294 1.6 0.0 107620 1036 ? R Dec02 24:11 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-e
swift 18305 0.0 0.0 0 0 ? Z 03:45 0:00 [rsync] <defunct>
swift 18311 1.8 0.0 107620 1028 ? R Dec02 23:08 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-e
swift 18315 0.0 0.0 0 0 ? Z Dec02 0:00 [rsync] <defunct>
```
We rely on the timeout argument to rsync to clean itself up, but in these cases it never will. We probably ought to have some kind of check for the maximum number of rsyncs running at the same time, so we don't keep launching new ones when there are already X running.
The kernel log above is what it looks like when a drive goes bad and we keep spinning up rsyncs.