COLLATE "en_US.UTF-8" sorting takes 30x longer on newer builds

Bug #1648641 reported by nicholas wilson
This bug affects 1 person
Affects: glibc (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

(Please bear with my incomplete understanding of these packages.)

I was having issues sorting with COLLATE "en_US.UTF-8" on Ubuntu 16.04 and was told it was related to glibc.

On Ubuntu 14.04 (with eglibc 2.19) I could sort a file of 2 million lines of international text (fewer than 40 characters per line) in 20 seconds. On 16.04 (with glibc 2.23), sorting the same file with the same COLLATE took 10+ minutes. My only theory is that glibc 2.22 added a new Unicode 7.0 library (?), but I don't have a real grasp of what's going on here.

I came upon this issue when trying to index a database table of over 400M rows. What should have taken 4 hours ran for over 24 hours and never finished. I then created a subset of that table to test sorting.

I'm not sure how to replicate it easily; I tried creating subsets that show the issue, without success. Instead, I put 5000 lines into a pastebin that you can try sorting yourself on 14.04 vs 16.04:
http://pastebin.com/r47uD690

If you put that into a file and run the following you can see the discrepancy between 14.04 and 16.04:
LC_COLLATE="en_US.UTF-8" sort /path/to/file > /dev/null
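
For anyone reproducing this, the timing can be captured with a small helper. This is only a sketch under some assumptions: `time_sort` is a made-up name, `date +%s` (whole seconds, available on GNU and BSD systems) is the clock, and `LC_ALL` is set instead of `LC_COLLATE` because an `LC_ALL` already present in the environment overrides `LC_COLLATE`.

```shell
#!/bin/sh
# time_sort FILE LOCALE   (hypothetical helper, not part of any package)
# Prints the whole seconds of wall-clock time that sort(1) needs to sort
# FILE under the given locale. LC_ALL is used because it takes precedence
# over LC_COLLATE when both are set.
time_sort() {
    file=$1
    loc=$2
    start=$(date +%s)
    LC_ALL="$loc" sort "$file" > /dev/null
    end=$(date +%s)
    echo $((end - start))
}

# Example usage (the path is a placeholder):
#   echo "C:           $(time_sort /path/to/file C)s"
#   echo "en_US.UTF-8: $(time_sort /path/to/file en_US.UTF-8)s"
```

If the requested locale is not installed, the run may quietly behave like the C locale, so it is worth confirming that `locale -a` lists it before comparing numbers.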

LC_COLLATE="C" has no problems (it should be much faster anyway, but the difference between 14.04 and 16.04 is not noticeable).

On a fresh 14.04 build it takes < 1 second; on 16.04 it takes 8+ seconds. It's a small example, but the gap appeared to get even worse the larger the file (e.g. the earlier example of 20 seconds vs 10 minutes).
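
One way to check how the time grows with file size is to sort increasingly long prefixes of the same file. A rough sketch, assuming a hypothetical helper named `time_prefixes` and whole-second resolution from `date +%s` (too coarse for very small inputs):

```shell
#!/bin/sh
# time_prefixes FILE LOCALE N...   (hypothetical helper)
# For each line count N, sorts the first N lines of FILE under LOCALE and
# prints one line of the form "N SECONDS".
time_prefixes() {
    file=$1
    loc=$2
    shift 2
    for n in "$@"; do
        start=$(date +%s)
        head -n "$n" "$file" | LC_ALL="$loc" sort > /dev/null
        end=$(date +%s)
        echo "$n $((end - start))"
    done
}

# Example usage (the path is a placeholder):
#   time_prefixes /path/to/file en_US.UTF-8 1000 2000 4000
```

Running it once with `C` and once with `en_US.UTF-8` on both releases would show whether the slowdown is a constant factor or grows with the number of lines.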

That's all the info I have at the moment. If you need more information, just ask. I am not very familiar with many of the packages involved; I'm only posting here because I was directed to glibc as a potential cause of slow sorting under different COLLATE settings.
