Authors column: Odd behaviour with digits in alphabetical partitioning

Bug #1948560 reported by ownedbycats
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
calibre
Fix Released
Undecided
Charles Haley

Bug Description

Calibre 5.30 (x64)

I have a book by the author "10052020" (for context, it's a fanfic). For some reason, when using alphabetical partitioning, it displays the full number rather than just the first digit. I combine letters, but it also happens when I uncombine them.

I suppose this could be an intentional behaviour, but it has a rather unexpected result in this case.

Thank you!

Revision history for this message
ownedbycats (ownedbycats) wrote :
description: updated
Revision history for this message
Kovid Goyal (kovid) wrote : Re: calibre bug 1948560

Changing the component for this bug.

 assignee cbhaley
 status triaged

Changed in calibre:
assignee: nobody → Charles Haley (cbhaley)
status: New → Triaged
Revision history for this message
Charles Haley (cbhaley) wrote :

@kovid: I need advice. The problem is caused by the tweak 'numeric_collation'. When used, if an item (e.g. author name) begins with numbers the icu method collation_order() returns the entire number as the first letter.

I see three possibilities:

1) Make the tag browser use a special implementation of collation_order() that ignores the tweak. I tried this and it works. Problem: collation_order() is used in several other places for what appears to be the same reason, to partition the list, and those places will behave differently.

2) Change the behavior of collation_order() so it always ignores the tweak. This makes the behavior consistent across calibre, but it might break something. I looked at all the collation_order() use sites and I think that changing the behavior is correct, but I might be wrong.

3) Do nothing, insisting on the comment in the tweak that says "note that doing so will cause problems with text that starts with numbers and is a little slower."

Which of the three would you prefer I do?

Revision history for this message
Charles Haley (cbhaley) wrote :

The one place where changing the behavior of collation_order() might be wrong is db.categories, where the method is used to generate the sort key when sorting categories by name within first_letter. I *think* it should use the altered method so that the sort order of the category list will match the expected by-first-letter order.

Revision history for this message
Kovid Goyal (kovid) wrote :

I am OK with (2) as far as I can see collation_order() is only used for
tag browser partitioning (in various contexts), and using numeric
collation for partitioning doesn't really make sense anyway.

However, I think perhaps the function should be renamed to
collation_order_for_partitioning() to make it clear what its intended
use is.

Revision history for this message
Charles Haley (cbhaley) wrote :

This attachment shows one problem. If I have an author named '10' and an author named '2' then get_categories() sorts 2 before 10. This results in the first letter partitioning shown in the attachment. Because 2 is less than 10 it appears first, which I think is wrong when dealing with first letters but what do I know? :)

Revision history for this message
Kovid Goyal (kovid) wrote : Re: [Bug 1948560] Re: Authors column: Odd behaviour with digits in alphabetical partitioning

On Sun, Oct 24, 2021 at 09:53:21AM -0000, Charles Haley wrote:
> The one place where changing the behavior of collation_order() might be
> wrong is db.categories, where the method is used to generate the sort
> key when sorting categories by name within first_letter. I *think* it
> should use the altered method so that the sort order of the category
> list will match the expected by-first-letter order.

I agree, the same function should be used for this as well.

Revision history for this message
Charles Haley (cbhaley) wrote :

Renaming the method requires changing icu.c, which I can't do. Do you want to do that?

Revision history for this message
Charles Haley (cbhaley) wrote :

I think it would also require making beta releases. True?

Revision history for this message
Kovid Goyal (kovid) wrote :

I dont think there is a canonically right answer here. Seem by itself you would want the list to be sorted with numeric collation. You definitely would expect numbers in the middle of the word to be sorted numerically.

Revision history for this message
Kovid Goyal (kovid) wrote :

No it doesn't need native code changes. I have pushed a commit to add a new function that ignores the tweak. That should be used instead of collation_order()

Revision history for this message
Charles Haley (cbhaley) wrote :

I think we have arrived at option 3: do nothing. With this option the behavior might be weird but it is explainable and consistent. Numbers sort in numeric order no matter where they appear in the item. First letters follow the sort order. '2' is less than '10' so it shows first.

If we change the sort for the categories then the order in the categories won't match the order in the books list, which isn't good. If we try heuristics such as don't use numeric_sort in names then things could go inconsistent quickly with strange sorts in different cases.

The question "What is the sort order for these items?" supports option 3.
- 10 Angry Men
- 2 Angry Men
- The 10 Angry Men
- The 2 Angry Men

With numeric_sort in play we would see in the book list
- 2 Angry Men
- 10 Angry Men
- The 2 Angry Men
- The 10 Angry Men

Without it we would see in the book list:
- 10 Angry Men
- 2 Angry Men
- The 10 Angry Men
- The 2 Angry Men

Which is right is in the eye of the beholder. Regardless, I think that the tag browser should show items in the same order as the book list.

Revision history for this message
Charles Haley (cbhaley) wrote :

The delay in getting notifications from this bug system is annoying.

I used your push in get_categories() and the tag browser. The screen capture shows the results -- the item order in the tag browser and the booklist are different. It effectively means that the tweak is never used in the tag browser. If you are OK with this then I will also change the srv.metadata() where collation_order() is used for partitioning after calling get_categories(). I will also add a test for the new method to icu_test.

The catalog generator uses collation_order() but doesn't use get_categories instead sorting with sort_key(). I won't touch it because I don't understand what it is doing and I don't want the behavior to change.

OK to make the changes?

Revision history for this message
Charles Haley (cbhaley) wrote (last edit ):

Your new function collation_order_for_partitioning() needs changing. See the patch.

Revision history for this message
Charles Haley (cbhaley) wrote :
Revision history for this message
Kovid Goyal (kovid) wrote :

How about the following option 4:

When partitioning by first letter use str.isdigit() to check the string returned by ICU and if it has length > 1 and only digits, use only the first digit to partition. That addresses the problem raised in the original bug report without changing sorting behavior.

This would imply creating a separate function for partitioning by first letter that is used for that purpose through out the code base.

And thanks, I applied your patch.

Revision history for this message
Charles Haley (cbhaley) wrote :

That doesn't work, at least for me. Consider the attached screen capture, which is ordered and partitioned according to option 3 (no change). How would this change under your proposal? If I understand correctly then then the first letter list would be 5 (3 items), 1 (1 item), 5 (1 item), 2 (1 item). This doesn't feel like an improvement.

However, you might be suggesting that we sort the first letter list using only one digit of any item beginning with a number, leaving the items within each partition in their original order. That would produce 1, 2, 5 with the items within. In strict terms the sort order no longer matches the book list but it does match what one expects in a first letter list. I will look at this.

Revision history for this message
Charles Haley (cbhaley) wrote :
Revision history for this message
Charles Haley (cbhaley) wrote :

This is what I mean:

Revision history for this message
Kovid Goyal (kovid) wrote :

Yes your second para and the second screenshot is what I am suggesting.

Revision history for this message
Charles Haley (cbhaley) wrote :

I am reasonably confident that these changes are OK. I tried them with and without the numeric_collation tweak, and with and without partition by first letter. I also tried the content server.

Changed in calibre:
status: Triaged → Fix Committed
Revision history for this message
Kovid Goyal (kovid) wrote : Fixed in master

Fixed in branch master. The fix will be in the next release. calibre is usually released every alternate Friday.

 status fixreleased

Changed in calibre:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.