The intent there was not to force a recalculation of un-dirtied suffixes every 11th replication cycle.
That hook was only meant to do a *single* listdir IO to check whether there are any suffixes on disk that are not in hashes.pkl. If a suffix is in hashes.pkl (and not dirtied), the current understanding from the systems we've observed in operation is that its hash will be correct [1].
Honestly, the whole endeavor seemed suspect. I'm not sure anyone really observed production systems syncing around suffixes and failing to invalidate them with a post-REPLICATE request. But it was only *one* syscall/IO, and the idea of a deterministic way to do it periodically was sort of cute, so I guess someone decided it was worth the complexity.
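To make the scope of that hook concrete, here is a minimal sketch of the listdir check described above. It is not the actual replicator code: the function name `find_unhashed_suffixes` is hypothetical, `hashes` stands in for the already-loaded contents of hashes.pkl, and treating any three-character directory entry as a suffix is an assumption made for illustration.

```python
import os


def find_unhashed_suffixes(partition_dir, hashes):
    """Return names on disk that look like suffixes but are not in ``hashes``.

    ``hashes`` is assumed to be a dict mapping suffix -> hash, standing in
    for the unpickled hashes.pkl. Nothing is rehashed here; the only IO is
    one os.listdir() call on the partition directory.
    """
    # Assumption: suffix directories are identified by a 3-character name.
    on_disk = {name for name in os.listdir(partition_dir) if len(name) == 3}
    return sorted(on_disk - set(hashes))
```

Suffixes already present in `hashes` are never touched by this check, which is why its cost stays at a single listdir no matter how many suffixes the partition holds.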
[1] Understood to be correct, barring a bug. Operators (myself among them) have in the past observed mis-hashed suffixes (i.e. the un-dirty hash value for a suffix in hashes.pkl doesn't match the hash recalculated from the files in that suffix). Those have all been explainable by bugs introduced with the hashes.invalid change, all of which have since been fixed, and I haven't yet observed the issue on clusters deployed since we fixed those bugs. But because such a class of bugs is known to exist, there is broad support/interest in having an operator tunable that sets a *timer* to force a recalc on any un-dirtied suffix after some... weeks. Doing it every 10 cycles, though, would burn an unacceptable amount of IO on stable systems.
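For contrast, a timer-based recalc along the lines of that tunable might look roughly like the sketch below. Everything here is hypothetical: no such option exists in the text above beyond the idea, and the per-suffix `hashed_at` timestamps, the `max_age_seconds` parameter, and the function name are assumptions invented for illustration.

```python
import time


def suffixes_due_for_recalc(hashed_at, dirty, max_age_seconds, now=None):
    """Return un-dirtied suffixes whose stored hash is older than the limit.

    ``hashed_at`` is a hypothetical dict of suffix -> unix timestamp of the
    last hash calculation; ``dirty`` is the set of suffixes that are already
    invalidated and will be rehashed anyway.
    """
    now = time.time() if now is None else now
    return sorted(
        suffix
        for suffix, stamp in hashed_at.items()
        if suffix not in dirty and now - stamp >= max_age_seconds
    )
```

With the limit set to weeks rather than a handful of cycles, a stable system would only pay the rehash IO for a suffix when its stored hash has gone that long without being recalculated.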