If a drive gets hosed, replicator can continue launching "unlimited" rsyncs

Bug #1398962 reported by Caleb Tennis
This bug affects 1 person
Affects: OpenStack Object Storage (swift)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

In a scenario where something happens and the OS hoses a disk:

sd 0:0:16:0: [sdq] Unhandled sense code
sd 0:0:16:0: [sdq] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:16:0: [sdq] Sense Key : Medium Error [current]
Info fld=0xaed41780
sd 0:0:16:0: [sdq] Add. Sense: Unrecovered read error
sd 0:0:16:0: [sdq] CDB: Read(10): 28 00 ae d4 17 80 00 00 10 00
XFS (sdq): metadata I/O error: block 0xaed41780 ("xfs_trans_read_buf") error 121 buf count 8192
XFS (sdq): xfs_imap_to_bp: xfs_trans_read_buf() returned error 121.
sd 0:0:16:0: [sdq] Unhandled sense code
sd 0:0:16:0: [sdq] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:16:0: [sdq] Sense Key : Medium Error [current]
Info fld=0xaed41780
sd 0:0:16:0: [sdq] Add. Sense: Unrecovered read error
sd 0:0:16:0: [sdq] CDB: Read(10): 28 00 ae d4 17 80 00 00 10 00
XFS (sdq): metadata I/O error: block 0xaed41780 ("xfs_trans_read_buf") error 121 buf count 8192
XFS (sdq): xfs_imap_to_bp: xfs_trans_read_buf() returned error 121.

Now the replicator will spawn rsyncs from this disk, but they very quickly zombify and stay around forever (process table snippet):

swift 18168 1.4 0.0 107656 1148 ? R Dec02 15:24 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-e
swift 18178 1.7 0.0 107620 1028 ? R Dec02 21:27 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-e
swift 18194 0.0 0.0 0 0 ? Z Dec02 0:00 [rsync] <defunct>
swift 18195 0.0 0.0 0 0 ? Z 05:11 0:00 [rsync] <defunct>
swift 18196 0.0 0.0 0 0 ? Z 09:28 0:00 [rsync] <defunct>
swift 18211 2.1 0.0 107620 1032 ? R Dec02 32:11 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-e
swift 18215 1.8 0.0 107624 1040 ? R Dec02 23:13 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-e
swift 18220 1.4 0.0 107624 1036 ? R Dec02 16:49 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-e
swift 18232 0.0 0.0 0 0 ? Z 02:17 0:00 [rsync] <defunct>
swift 18241 0.0 0.0 0 0 ? Z Dec02 0:00 [rsync] <defunct>
swift 18246 0.0 0.0 0 0 ? Z 08:02 0:00 [rsync] <defunct>
swift 18257 1.3 0.0 107624 1044 ? R Dec02 12:59 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-e
swift 18267 0.0 0.0 0 0 ? Z Dec02 0:00 [rsync] <defunct>
swift 18274 0.0 0.0 0 0 ? Z 06:39 0:00 [rsync] <defunct>
swift 18289 1.6 0.0 107620 1032 ? R Dec02 22:25 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-e
swift 18294 1.6 0.0 107620 1036 ? R Dec02 24:11 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-e
swift 18305 0.0 0.0 0 0 ? Z 03:45 0:00 [rsync] <defunct>
swift 18311 1.8 0.0 107620 1028 ? R Dec02 23:08 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-e
swift 18315 0.0 0.0 0 0 ? Z Dec02 0:00 [rsync] <defunct>

We rely on the timeout argument to rsync to clean itself up, but in these cases it never cleans up. There probably ought to be some kind of check for the maximum number of rsyncs running at the same time, so we don't continue launching them if there are already X running.
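
A minimal sketch of the kind of guard suggested above. This is not Swift's actual replicator code; the names (MAX_CONCURRENT_RSYNCS, launch_rsync) are illustrative, and it assumes Python 3's subprocess.Popen.wait(timeout=...) for a parent-side timeout and reap, complementing the --timeout already passed to rsync itself.

    import subprocess
    import threading

    # Illustrative cap on concurrent rsync children; not a real Swift option.
    MAX_CONCURRENT_RSYNCS = 8
    _rsync_slots = threading.BoundedSemaphore(MAX_CONCURRENT_RSYNCS)

    def launch_rsync(args, timeout=900):
        """Spawn rsync only if fewer than MAX_CONCURRENT_RSYNCS are running.

        Returns the rsync exit code, or None if the cap was hit or the
        child had to be killed after the timeout.
        """
        # Skip this sync rather than piling on when the node already has
        # the maximum number of (possibly stuck) rsyncs running.
        if not _rsync_slots.acquire(blocking=False):
            return None
        try:
            proc = subprocess.Popen(['rsync'] + list(args))
            try:
                return proc.wait(timeout=timeout)
            except subprocess.TimeoutExpired:
                # Kill and reap the child so it cannot linger as a zombie.
                proc.kill()
                proc.wait()
                return None
        finally:
            _rsync_slots.release()

The key point is that the parent always waits on (reaps) the child and never exceeds the cap, even when the underlying disk makes rsync hang indefinitely.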

Revision history for this message
Caleb Tennis (ctennis) wrote:

Here's what a kernel log looks like when a drive goes bad and we keep spinning up rsyncs.
