Comment 8 for bug 1691570

clayg (clay-gerrard) wrote :

yes! that! I think if the "watcher" were moved into a separate OS thread it'd have a better chance of detecting the lockup. But to do anything about it we'd need to be able to respawn a *process* - so the control needs to be moved *way* up. The "workers" design [1] recently added to the reconstructor is ripe to be ported/extended/unified with the replicator and could help with this problem in two ways:

1) the workers option is designed to isolate processes to a subset of disks - a dead/hung worker will only prevent progress on a smaller subset of devices (the bad one(s)!)
2) the workers are isolated processes, and there's a controller process that is able to detect when they die and restart them - if a watchdog decides one isn't making progress any more it could just shoot it in the head and the framework would automatically start up a new one in its place!
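
To make the two points concrete, here's a rough sketch of the shape I have in mind. It is NOT the reconstructor's actual code: `partition_devices`, `is_stalled`, `worker`, `Controller` and `run_forever` are all hypothetical names, the heartbeat is just a shared timestamp, and the timeout is an assumed knob. It only uses the standard library `multiprocessing` module.

```python
import multiprocessing as mp
import time


def partition_devices(devices, workers):
    """Split devices round-robin into at most `workers` non-empty subsets."""
    if not devices:
        return []
    count = max(1, min(workers, len(devices)))
    return [devices[i::count] for i in range(count)]


def is_stalled(last_heartbeat, now, timeout):
    """Watchdog rule (assumed): no heartbeat for more than `timeout` seconds."""
    return now - last_heartbeat > timeout


def worker(devices, heartbeat):
    """Hypothetical worker loop: handle each device, then report progress."""
    while True:
        for device in devices:
            time.sleep(0.1)  # stand-in for the real per-device work
            heartbeat.value = time.time()


class Controller:
    """Parent process that owns the workers and replaces dead or hung ones."""

    def __init__(self, devices, workers, timeout):
        self.subsets = partition_devices(devices, workers)
        self.timeout = timeout
        self.procs = {}  # subset index -> (process, heartbeat)

    def spawn(self, index):
        # A shared double holds the worker's last-progress timestamp.
        heartbeat = mp.Value("d", time.time())
        proc = mp.Process(
            target=worker, args=(self.subsets[index], heartbeat), daemon=True
        )
        proc.start()
        self.procs[index] = (proc, heartbeat)

    def check_once(self):
        """Kill stalled workers and respawn any that are not alive."""
        restarted = []
        for index, (proc, heartbeat) in list(self.procs.items()):
            if proc.is_alive():
                if not is_stalled(heartbeat.value, time.time(), self.timeout):
                    continue  # healthy worker, leave it alone
                proc.kill()  # the "shoot it in the head" step
            proc.join(timeout=5)  # bounded wait so the controller never blocks
            self.spawn(index)
            restarted.append(index)
        return restarted


def run_forever(controller, interval=1.0):
    """Start one worker per device subset, then watch them indefinitely."""
    for index in range(len(controller.subsets)):
        controller.spawn(index)
    while True:
        controller.check_once()
        time.sleep(interval)
```

You'd start it with something like `run_forever(Controller(["sda", "sdb", "sdc", "sdd"], workers=2, timeout=30.0))` under an `if __name__ == "__main__":` guard. The point is that the watchdog lives in the parent, so a wedged worker can't take the watchdog down with it, and a hang only stalls the subset of devices that worker owned.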