Comment 1 for bug 1809472

Revision history for this message
clayg (clay-gerrard) wrote :

The intent there was not to force a recalculate on un-dirtied suffixes every 11th replication cycle.

That hook was only to do a *single* listdir IO to check if there's any suffixes on disk that are not in the hashes.pkl - if the suffix is in the hashes.pkl (and not dirtied) the current understanding of observed operating systems is that the hash would be correct [1].

Honestly the whole endeavor seemed suspect, I'm not sure anyone really observed production systems syncing around suffixes and failing to invalidate them with a post-REPLICATE request, but it was only *one* syscall/IO and the idea of a deterministic way to do it periodically was sort of cute so I guess someone decided it was worth the complexity.

1. understood to be correct, barring a bug. Operators (myself among them) have in the past observed mis-hashed suffixes (i.e. un-dirty hashed suffix value in hashes.pkl doesn't match recalculated hash on files in the suffix) - but those have all been explainable by bugs introduced with the hashes.invalid change that have all since been fixed and I haven't yet observed the issue on clusters deployed since we fixed those bugs. But because such a class of bugs is known to exist, there is broad support/interest in having an operator tunable that sets a *timer* to force a recalc on any un-dirtied suffix after some... weeks. But doing it every 10 cycles would burn an unacceptable amount of IO on stable systems.