coreutils: printf formatting bug for nb_NO and nn_NO locales

Bug #2058775 reported by Thomas Dreibholz
This bug affects 1 person
Affects             Status   Importance  Assigned to  Milestone
GLibC               Invalid  Medium
coreutils (Ubuntu)  New      Undecided   Unassigned

Bug Description

I just discovered a printf bug for at least the nb_NO and nn_NO locales when printing numbers with a thousands separator. To reproduce:

#!/bin/bash
for l in de_DE en_US nb_NO nn_NO ; do
   echo "LC_NUMERIC=$l.UTF-8"
   for n in 1 100 1000 10000 100000 1000000 10000000 ; do
      LC_NUMERIC=$l.UTF-8 /usr/bin/printf "<%'10d>\n" "$n"
   done
done

The expected output of "%'10d" is a right-aligned number string that is 10 characters wide.

The output of the test script is fine for e.g. LC_NUMERIC=de_DE.UTF-8 and LC_NUMERIC=en_US.UTF-8:

LC_NUMERIC=de_DE.UTF-8
< 1>
< 100>
< 1.000>
< 10.000>
< 100.000>
< 1.000.000>
<10.000.000>
LC_NUMERIC=en_US.UTF-8
< 1>
< 100>
< 1,000>
< 10,000>
< 100,000>
< 1,000,000>
<10,000,000>

However, for LC_NUMERIC=nb_NO.UTF-8 and LC_NUMERIC=nn_NO.UTF-8, the formatting is wrong:

LC_NUMERIC=nb_NO.UTF-8
< 1>
< 100>
< 1 000>
< 10 000>
< 100 000>
<1 000 000>
<10 000 000>
LC_NUMERIC=nn_NO.UTF-8
< 1>
< 100>
< 1 000>
< 10 000>
< 100 000>
<1 000 000>
<10 000 000>

I reproduced the issue with coreutils-8.32-4.1ubuntu1.1 (Ubuntu 22.04) as well as coreutils-9.3-5.fc39.x86_64 (Fedora 39).

ProblemType: Bug
DistroRelease: Ubuntu 22.04
Package: coreutils 8.32-4.1ubuntu1.1
ProcVersionSignature: Ubuntu 6.5.0-26.26~22.04.1-generic 6.5.13
Uname: Linux 6.5.0-26-generic x86_64
ApportVersion: 2.20.11-0ubuntu82.5
Architecture: amd64
CasperMD5CheckResult: pass
CurrentDesktop: KDE
Date: Fri Mar 22 21:33:13 2024
InstallationDate: Installed on 2022-11-29 (479 days ago)
InstallationMedia: Kubuntu 22.04.1 LTS "Jammy Jellyfish" - Release amd64 (20220809.1)
SourcePackage: coreutils
UpgradeStatus: No upgrade log present (probably fresh install)

Thomas Dreibholz (dreibh) wrote :

Launchpad suppresses the spaces, it seems. I attached a screenshot of the terminal output to display the formatting issue.

Thomas Dreibholz (dreibh) wrote :

A hexdump shows that printf emits 3 bytes for each thousands separator:

#!/bin/sh
for l in de_DE en_US nb_NO nn_NO ; do
   echo "LC_NUMERIC=$l.UTF-8"
   for n in 1 100 1000 10000 100000 1000000 10000000 ; do
      LC_NUMERIC=$l.UTF-8 /usr/bin/printf "<%'8d>" $n | hexdump -C
   done
done

Output (nb_NO and nn_NO):

LC_NUMERIC=nb_NO.UTF-8
00000000 3c 20 20 20 20 20 20 20 31 3e |< 1>|
0000000a
00000000 3c 20 20 20 20 20 31 30 30 3e |< 100>|
0000000a
00000000 3c 20 31 e2 80 af 30 30 30 3e |< 1...000>|
0000000a
00000000 3c 31 30 e2 80 af 30 30 30 3e |<10...000>|
0000000a
00000000 3c 31 30 30 e2 80 af 30 30 30 3e |<100...000>|
0000000b
00000000 3c 31 e2 80 af 30 30 30 e2 80 af 30 30 30 3e |<1...000...000>|
0000000f
00000000 3c 31 30 e2 80 af 30 30 30 e2 80 af 30 30 30 3e |<10...000...000>|
00000010
LC_NUMERIC=nn_NO.UTF-8
00000000 3c 20 20 20 20 20 20 20 31 3e |< 1>|
0000000a
00000000 3c 20 20 20 20 20 31 30 30 3e |< 100>|
0000000a
00000000 3c 20 31 e2 80 af 30 30 30 3e |< 1...000>|
0000000a
00000000 3c 31 30 e2 80 af 30 30 30 3e |<10...000>|
0000000a
00000000 3c 31 30 30 e2 80 af 30 30 30 3e |<100...000>|
0000000b
00000000 3c 31 e2 80 af 30 30 30 e2 80 af 30 30 30 3e |<1...000...000>|
0000000f
00000000 3c 31 30 e2 80 af 30 30 30 e2 80 af 30 30 30 3e |<10...000...000>|
00000010

The issue occurs in both Konsole and XTerm. So the bytes "0xe2 0x80 0xaf" that printf inserts for the thousands separator seem to be the problem? "0xe2 0x80 0xaf" is the UTF-8 encoding of U+202F NARROW NO-BREAK SPACE -> https://www.fileformat.info/info/unicode/char/202f/index.htm .
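The byte/character mismatch can be checked without any Norwegian locale installed. The following is only a sketch: byte_len and char_len are hypothetical helper names, not part of coreutils or glibc, and char_len assumes its input is valid UTF-8.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Number of bytes in the argument.
byte_len() { printf '%s' "$1" | wc -c | tr -d ' '; }

# Number of UTF-8 code points: delete continuation bytes (0x80-0xBF)
# and count what is left. Assumes the input is valid UTF-8.
char_len() { printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' '; }

sep=$(printf '\342\200\257')    # bytes e2 80 af, written as octal escapes
num="1${sep}000"

echo "separator: $(byte_len "$sep") bytes, $(char_len "$sep") character"
echo "number:    $(byte_len "$num") bytes, $(char_len "$num") characters"
```

The separator counts as 3 bytes but 1 character, so "1 000" is 7 bytes and 5 characters. That matches the hexdump rows above, where each separator contributes e2 80 af to the output.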

Thomas Dreibholz (dreibh) wrote :

Loading the test script output as a text file in LibreOffice shows the same issue (screenshot attached), so it is not a bug in Konsole or XTerm. The 3-byte UTF-8 thousands separator character of the locale is probably not a good choice. Maybe it should be a simple space, or a regular UTF-8 no-break space (0xc2 0xa0, HTML "&nbsp;")?
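To put the candidate separators side by side, this small sketch prints the UTF-8 byte length of each one. The variable and helper names are made up for illustration. Note that U+00A0 is still two bytes in UTF-8, so if the padding is derived from a byte count, it would presumably still come out one column short per separator; only the ASCII space has the same byte and character count.

```shell
#!/bin/sh
# Hypothetical helper: number of bytes in the argument.
byte_len() { printf '%s' "$1" | wc -c | tr -d ' '; }

space=' '                        # U+0020 SPACE
nbsp=$(printf '\302\240')        # U+00A0 NO-BREAK SPACE (c2 a0)
nnbsp=$(printf '\342\200\257')   # U+202F NARROW NO-BREAK SPACE (e2 80 af)

for name in space nbsp nnbsp; do
    eval "value=\$$name"         # look up the variable named by $name
    printf '%-6s %s byte(s)\n' "$name" "$(byte_len "$value")"
done
```

This reports 1, 2, and 3 bytes respectively.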

In , Thomas Dreibholz (dreibh) wrote :

There is a formatting bug for integers in printf() when a locale is set and numbers are formatted with a thousands separator.

Test program printfbug.c:
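#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int main(int argc, char** argv)
{
   /* Use the locale from the environment (LC_ALL, LC_NUMERIC, ...) */
   setlocale(LC_ALL, "");

   /* Show the thousands separator of the active locale */
   struct lconv* loc = localeconv();
   printf("Thousands Separator: <%s>\n", loc->thousands_sep);

   /* Print each argument as double and as int, both grouped, width 10 */
   for(int i = 1; i < argc; i++) {
      int n = atoi(argv[i]);
      double f = atof(argv[i]);
      printf("double <%'10.0f>\tint <%'10d>\n", f, n);
   }
   return 0;
}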

Test run:
for l in en_US de_DE nb_NO nn_NO ; do
   echo "$l:" ; LC_ALL=$l.UTF-8 ./printfbug 1 10 100 1000 10000 100000 1000000 10000000
done

Output:
en_US:
Thousands Separator: <,>
double <         1>	int <         1>
double <        10>	int <        10>
double <       100>	int <       100>
double <     1,000>	int <     1,000>
double <    10,000>	int <    10,000>
double <   100,000>	int <   100,000>
double < 1,000,000>	int < 1,000,000>
double <10,000,000>	int <10,000,000>
de_DE:
Thousands Separator: <.>
double <         1>	int <         1>
double <        10>	int <        10>
double <       100>	int <       100>
double <     1.000>	int <     1.000>
double <    10.000>	int <    10.000>
double <   100.000>	int <   100.000>
double < 1.000.000>	int < 1.000.000>
double <10.000.000>	int <10.000.000>
nb_NO:
Thousands Separator: < >
double <         1>	int <         1>
double <        10>	int <        10>
double <       100>	int <       100>
double <     1 000>	int <   1 000>
double <    10 000>	int <  10 000>
double <   100 000>	int < 100 000>
double < 1 000 000>	int <1 000 000>
double <10 000 000>	int <10 000 000>
nn_NO:
Thousands Separator: < >
double <         1>	int <         1>
double <        10>	int <        10>
double <       100>	int <       100>
double <     1 000>	int <   1 000>
double <    10 000>	int <  10 000>
double <   100 000>	int < 100 000>
double < 1 000 000>	int <1 000 000>
double <10 000 000>	int <10 000 000>

That is, en_US and de_DE are fine (they use ',' and '.' as thousands separator). But nb_NO and nn_NO produce the wrong output for integers (%'10d). Floating-point output (%'10.0f), however, is correct for these locales as well.

For nb_NO and nn_NO, the separator is the 3-byte UTF-8 sequence 0xe2 0x80 0xaf, which encodes U+202F NARROW NO-BREAK SPACE -> https://www.fileformat.info/info/unicode/char/202f/index.htm . It seems that for integer formatting, the field width is computed from the number of bytes instead of the number of characters. For float formatting, the number of characters is counted correctly.
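The suspected difference can be imitated in plain shell, independent of glibc and of the installed locales. This is a sketch of the hypothesis only, not of glibc's actual implementation; pad_with, byte_len and char_len are hypothetical names, and char_len assumes valid UTF-8.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.
byte_len() { printf '%s' "$1" | wc -c | tr -d ' '; }
char_len() { printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' '; }

# Right-align text $2 in a field of width $1, measuring its length
# with the function named in $3 (byte_len or char_len).
pad_with() {
    width=$1 text=$2 measure=$3
    pad=$(( width - $($measure "$text") ))
    [ "$pad" -gt 0 ] || pad=0
    printf "%${pad}s%s" '' "$text"
}

sep=$(printf '\342\200\257')    # U+202F as bytes e2 80 af
num="1${sep}000"

printf 'by bytes: <%s>\n' "$(pad_with 10 "$num" byte_len)"   # 3 spaces of padding
printf 'by chars: <%s>\n' "$(pad_with 10 "$num" char_len)"   # 5 spaces of padding
```

Padding by bytes yields 3 leading spaces, as in the "int" column of the hexdump below; padding by characters yields 5, as in the "double" column.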

That is:
$ LC_ALL=nb_NO.UTF-8 ./printfbug 1000 | hexdump -C
00000000 54 68 6f 75 73 61 6e 64 73 20 53 65 70 61 72 61 |Thousands Separa|
00000010 74 6f 72 3a 20 3c e2 80 af 3e 0a 64 6f 75 62 6c |tor: <...>.doubl|
00000020 65 20 3c 20 20 20 20 20 31 e2 80 af 30 30 30 3e |e < 1...000>|
00000030 09 69 6e 74 20 3c 20 20 20 31 e2 80 af 30 30 30 |.int < 1...000|
00000040 3e 0a |>.|
00000042

I can reproduce the issue under Ubuntu 22.04, Ubuntu 24.04 (devel...


Changed in coreutils:
importance: Unknown → Medium
status: Unknown → New
affects: coreutils → glibc
In , Andreas Schwab (schwab-linux-m68k) wrote :

dup

*** This bug has been marked as a duplicate of bug 28943 ***

Changed in glibc:
status: New → Invalid