Comment 1 for bug 1695911

Mike Rylander (mrylander) wrote :

This isn't the first time we've run into this sort of issue. For another example, see: mfr. We truncate there (transparently, through a view and query rewriting), FWIW.

The reason we have a unique index over (sort_value, value) rather than just (value) or (sort_value) alone is because of non-filing prefixes. Two strings may have different value, er, values, but if they come from fields that allows and use non-filing characters, they may have colliding sort_value values. And, likewise, you may have to identical value values but one defines a non-filing character count and so should sort differently. So, both the raw value and the sort_value must be considered together when defining "unique".

Moving to a calculated value (MD5, SHA1, etc) would be non-good, since the index is used for searching and sorting the data.

Truncation seems reasonable, and, IMO, favoring truncation of the value column is probably preferable up to some minimum length. For instance, perhaps we truncate the sort_value column to no shorter than 2000, and give the balance of the ~2700 bytes to the value column. That allows the non-filing part of the value column to continue acting as a tie breaker, which is how things work today.