yes! that! I think if the "watcher" was moved into a separate OS thread it'd have a better chance of detecting the lockup. But to do anything about it we'd need to be able to respawn a *process* - so the control needs to be moved *way* up. The "workers" design [1] recently added to the reconstructor is ripe to be ported/extended/unified with the replicator and could help with this problem in two ways:
1) the workers option is designed to isolate processes to a subset of disks - a dead/hung worker will only prevent progress on a smaller subset of devices (the bad one(s)!)
2) the workers are isolated processes, and there's a controller process that is able to detect when they die and restart them - if a watchdog decides one isn't making progress anymore it could just shoot it in the head and the framework would automatically start up a new one in its place!
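
To make the idea concrete, here's a rough sketch of the controller/watchdog pattern - this is NOT the reconstructor's actual code, just a toy under a few assumptions: every name in it (`Controller`, `worker`, `stall_timeout`, `hang_on`) is made up for illustration, it relies on the POSIX-only "fork" start method, and a shared timestamp stands in for whatever "making progress" signal the real daemon would use.

```python
import multiprocessing as mp
import time

# Assumption: POSIX host, so the "fork" start method is available.
_ctx = mp.get_context("fork")


def worker(devices, heartbeat, hang_on=None):
    """Hypothetical worker: 'replicates' its devices and records progress."""
    while True:
        for dev in devices:
            if dev == hang_on:
                time.sleep(3600)  # simulate a lockup on a bad disk
            heartbeat.value = time.time()  # progress marker read by the controller
            time.sleep(0.05)  # stand-in for real replication work


class Controller:
    """Hypothetical controller: one worker process per group of devices."""

    def __init__(self, device_groups, stall_timeout=1.0, hang_on=None):
        self.device_groups = device_groups
        self.stall_timeout = stall_timeout
        self.hang_on = hang_on
        self.workers = {}  # group index -> (process, heartbeat)

    def spawn(self, idx):
        heartbeat = _ctx.Value("d", time.time())
        proc = _ctx.Process(
            target=worker,
            args=(self.device_groups[idx], heartbeat, self.hang_on),
            daemon=True,
        )
        proc.start()
        self.workers[idx] = (proc, heartbeat)

    def start(self):
        for idx in range(len(self.device_groups)):
            self.spawn(idx)

    def check(self):
        """Kill stalled workers, replace dead ones; return restarted indexes."""
        restarted = []
        for idx, (proc, heartbeat) in list(self.workers.items()):
            stalled = time.time() - heartbeat.value > self.stall_timeout
            if stalled and proc.is_alive():
                proc.kill()  # the watchdog "shoots it in the head"
            if stalled or not proc.is_alive():
                proc.join()  # reap the old process before replacing it
                self.spawn(idx)
                restarted.append(idx)
        return restarted

    def stop(self):
        for proc, _ in self.workers.values():
            proc.kill()
            proc.join()
        self.workers.clear()
```

With something like `Controller([["sda", "sdc"], ["sdb"]], hang_on="sdb")`, only the worker that owns the bad disk stalls and gets replaced by `check()`; the worker for the other devices keeps going. In this toy the replacement hangs on the same disk again, of course - the point is just that the damage stays contained and the restart is automatic.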