Ubuntu
coreutils package

coreutils: printf formatting bug for nb_NO and nn_NO locales

Bug #2058775 reported by Thomas Dreibholz on 2024-03-22

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	GLibC	Invalid	Medium	sourceware-bugs #31542
	coreutils (Ubuntu)	New	Undecided	Unassigned

Bug Description

I just discovered a printf bug for at least the nb_NO and nn_NO locales when printing numbers with a thousands separator. To reproduce:

#!/bin/bash
# Print each number grouped per locale, right-aligned in a 10-character field.
for l in de_DE en_US nb_NO nn_NO ; do
   echo "LC_NUMERIC=$l.UTF-8"
   for n in 1 100 1000 10000 100000 1000000 10000000 ; do
      LC_NUMERIC=$l.UTF-8 /usr/bin/printf "<%'10d>\n" "$n"
   done
done

The expected output of "%'10d" is a right-aligned number string that is 10 characters wide.

The output of the test script is fine for e.g. LC_NUMERIC=de_DE.UTF-8 and LC_NUMERIC=en_US.UTF-8:

LC_NUMERIC=de_DE.UTF-8
<         1>
<       100>
<     1.000>
<    10.000>
<   100.000>
< 1.000.000>
<10.000.000>
LC_NUMERIC=en_US.UTF-8
<         1>
<       100>
<     1,000>
<    10,000>
<   100,000>
< 1,000,000>
<10,000,000>

However, for LC_NUMERIC=nb_NO.UTF-8 and LC_NUMERIC=nn_NO.UTF-8, the formatting is wrong:

LC_NUMERIC=nb_NO.UTF-8
<         1>
<       100>
<   1 000>
<  10 000>
< 100 000>
<1 000 000>
<10 000 000>
LC_NUMERIC=nn_NO.UTF-8
<         1>
<       100>
<   1 000>
<  10 000>
< 100 000>
<1 000 000>
<10 000 000>

I reproduced the issue with coreutils-8.32-4.1ubuntu1.1 (Ubuntu 22.04) as well as coreutils-9.3-5.fc39.x86_64 (Fedora 39).

ProblemType: Bug
DistroRelease: Ubuntu 22.04
Package: coreutils 8.32-4.1ubuntu1.1
ProcVersionSignature: Ubuntu 6.5.0-26.26~22.04.1-generic 6.5.13
Uname: Linux 6.5.0-26-generic x86_64
ApportVersion: 2.20.11-0ubuntu82.5
Architecture: amd64
CasperMD5CheckResult: pass
CurrentDesktop: KDE
Date: Fri Mar 22 21:33:13 2024
InstallationDate: Installed on 2022-11-29 (479 days ago)
InstallationMedia: Kubuntu 22.04.1 LTS "Jammy Jellyfish" - Release amd64 (20220809.1)
SourcePackage: coreutils
UpgradeStatus: No upgrade log present (probably fresh install)

Thomas Dreibholz (dreibh) wrote on 2024-03-22:

Dependencies.txt (278 bytes, text/plain; charset="utf-8")
ProcCpuinfoMinimal.txt (1.5 KiB, text/plain; charset="utf-8")
ProcEnviron.txt (140 bytes, text/plain; charset="utf-8")

description: updated

Thomas Dreibholz (dreibh) wrote on 2024-03-22:

Screenshot of the test script output (258.7 KiB, image/png)

Launchpad seems to collapse repeated spaces when rendering the description, so the alignment problem may not be visible there. I attached a screenshot of the terminal output that shows the formatting issue.

Thomas Dreibholz (dreibh) wrote on 2024-03-22 (last edit on 2024-03-22):

In a hexdump, printf turns out to emit 3 bytes for each thousands separator:

#!/bin/sh
for l in de_DE en_US nb_NO nn_NO ; do
   echo "LC_NUMERIC=$l.UTF-8"
   for n in 1 100 1000 10000 100000 1000000 10000000 ; do
      LC_NUMERIC=$l.UTF-8 /usr/bin/printf "<%'8d>" $n | hexdump -C
   done
done

Output:

LC_NUMERIC=nb_NO.UTF-8
00000000  3c 20 20 20 20 20 20 20  31 3e                    |<       1>|
0000000a
00000000  3c 20 20 20 20 20 31 30  30 3e                    |<     100>|
0000000a
00000000  3c 20 31 e2 80 af 30 30  30 3e                    |< 1...000>|
0000000a
00000000  3c 31 30 e2 80 af 30 30  30 3e                    |<10...000>|
0000000a
00000000  3c 31 30 30 e2 80 af 30  30 30 3e                 |<100...000>|
0000000b
00000000  3c 31 e2 80 af 30 30 30  e2 80 af 30 30 30 3e     |<1...000...000>|
0000000f
00000000  3c 31 30 e2 80 af 30 30  30 e2 80 af 30 30 30 3e  |<10...000...000>|
00000010
LC_NUMERIC=nn_NO.UTF-8
00000000  3c 20 20 20 20 20 20 20  31 3e                    |<       1>|
0000000a
00000000  3c 20 20 20 20 20 31 30  30 3e                    |<     100>|
0000000a
00000000  3c 20 31 e2 80 af 30 30  30 3e                    |< 1...000>|
0000000a
00000000  3c 31 30 e2 80 af 30 30  30 3e                    |<10...000>|
0000000a
00000000  3c 31 30 30 e2 80 af 30  30 30 3e                 |<100...000>|
0000000b
00000000  3c 31 e2 80 af 30 30 30  e2 80 af 30 30 30 3e     |<1...000...000>|
0000000f
00000000  3c 31 30 e2 80 af 30 30  30 e2 80 af 30 30 30 3e  |<10...000...000>|
00000010

The issue occurs in both Konsole and XTerm, so it is not specific to one terminal emulator. Are the bytes "0xe2 0x80 0xaf" that printf inserts for the thousands separator incorrect? "0xe2 0x80 0xaf" is the UTF-8 encoding of U+202F NARROW NO-BREAK SPACE -> https://www.fileformat.info/info/unicode/char/202f/index.htm .
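
For reference, a small sketch of how the byte count and the character count of that separator differ. It is plain POSIX shell and hard-codes the three bytes with octal escapes, so it does not need the nb_NO locale to be installed. The helper names utf8_bytes and utf8_chars are made up for this illustration, and the character count assumes valid UTF-8 input:

```shell
#!/bin/sh
# Hypothetical helper: count the raw bytes in $1.
utf8_bytes() {
    printf '%s' "$1" | wc -c | tr -d ' '
}

# Hypothetical helper: count UTF-8 code points in $1 by deleting
# continuation bytes (0x80-0xBF) and counting the bytes that remain.
utf8_chars() {
    printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' '
}

# U+202F NARROW NO-BREAK SPACE, written as its three UTF-8 bytes.
sep=$(printf '\342\200\257')

echo "bytes: $(utf8_bytes "$sep")"
echo "chars: $(utf8_chars "$sep")"
```

This prints "bytes: 3" and "chars: 1". A field width applied to bytes therefore comes out two columns short for every separator in the number.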

Thomas Dreibholz (dreibh) wrote on 2024-03-22:

Screenshot of the test script output loaded in LibreOffice (383.0 KiB, image/png)

Loading the test script output as a text file in LibreOffice shows the same issue (screenshot attached). So it is not a bug in Konsole or XTerm. Probably the 3-byte UTF-8 thousands separator character of the locale is not suitable. Maybe it should be a simple space, or a normal UTF-8 no-break space (0xc2 0xa0, HTML "&nbsp;")?
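
To compare these candidate separators, the sketch below computes how many display columns would be lost per separator if the field width is applied to bytes rather than characters. That premise is an assumption based on the hexdumps in this report, not a statement about the printf implementation, and the function name column_shortfall is purely illustrative:

```shell
#!/bin/sh
# Hypothetical helper: a separator displays as one column, so every byte
# beyond the first is a column that byte-based padding fails to add.
column_shortfall() {
    bytes=$(printf '%s' "$1" | wc -c | tr -d ' ')
    echo $((bytes - 1))
}

space=' '                        # U+0020 SPACE, 1 byte
nbsp=$(printf '\302\240')        # U+00A0 NO-BREAK SPACE, 2 bytes
nnbsp=$(printf '\342\200\257')   # U+202F NARROW NO-BREAK SPACE, 3 bytes

echo "U+0020: $(column_shortfall "$space")"
echo "U+00A0: $(column_shortfall "$nbsp")"
echo "U+202F: $(column_shortfall "$nnbsp")"
```

The shortfalls are 0, 1 and 2 columns. Under this assumption only the plain space would align; the 2-byte no-break space would still leave the output one column short per separator.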

In Sourceware.org Bugzilla #31542, Thomas Dreibholz (dreibh) wrote on 2024-03-24:

There is a formatting bug for integers in printf() when using locale settings and formatting with a thousands separator.

Test program printfbug.c:
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int main(int argc, char** argv)
{
   /* Use the locale from the environment (LC_ALL, LC_NUMERIC, ...). */
   setlocale (LC_ALL, "");

   struct lconv* loc = localeconv();
   printf("Thousands Separator: <%s>\n", loc->thousands_sep);

   /* Print each argument as double and as int, grouped, in a 10-wide field. */
   for(int i = 1; i < argc; i++) {
      int    n = atoi(argv[i]);
      double f = atof(argv[i]);
      printf("double <%'10.0f>\tint <%'10d>\n", f, n);
   }
   return 0;
}

Test run:
for l in en_US de_DE nb_NO nn_NO ; do
   echo "$l:" ; LC_ALL=$l.UTF-8 ./printfbug 1 10 100 1000 10000 100000 1000000 10000000
done

Output:
en_US:
Thousands Separator: <,>
double <         1>     int <         1>
double <        10>     int <        10>
double <       100>     int <       100>
double <     1,000>     int <     1,000>
double <    10,000>     int <    10,000>
double <   100,000>     int <   100,000>
double < 1,000,000>     int < 1,000,000>
double <10,000,000>     int <10,000,000>
de_DE:
Thousands Separator: <.>
double <         1>     int <         1>
double <        10>     int <        10>
double <       100>     int <       100>
double <     1.000>     int <     1.000>
double <    10.000>     int <    10.000>
double <   100.000>     int <   100.000>
double < 1.000.000>     int < 1.000.000>
double <10.000.000>     int <10.000.000>
nb_NO:
Thousands Separator: < >
double <         1>     int <         1>
double <        10>     int <        10>
double <       100>     int <       100>
double <     1 000>     int <   1 000>
double <    10 000>     int <  10 000>
double <   100 000>     int < 100 000>
double < 1 000 000>     int <1 000 000>
double <10 000 000>     int <10 000 000>
nn_NO:
Thousands Separator: < >
double <         1>     int <         1>
double <        10>     int <        10>
double <       100>     int <       100>
double <     1 000>     int <   1 000>
double <    10 000>     int <  10 000>
double <   100 000>     int < 100 000>
double < 1 000 000>     int <1 000 000>
double <10 000 000>     int <10 000 000>

That is, en_US and de_DE are fine (they use ',' and '.' as the thousands separator). But nb_NO and nn_NO produce the wrong output when formatting integers (%'10d). Float formatting (%'10.0f), however, is fine.

For nb_NO and nn_NO, the separator is the 3-byte UTF-8 sequence 0xe2 0x80 0xaf, which encodes U+202F NARROW NO-BREAK SPACE -> https://www.fileformat.info/info/unicode/char/202f/index.htm . It seems that integer formatting counts the number of bytes instead of the actual characters, while float formatting counts the characters correctly.

That is:
$ LC_ALL=nb_NO.UTF-8 ./printfbug 1000 | hexdump -C
00000000  54 68 6f 75 73 61 6e 64  73 20 53 65 70 61 72 61  |Thousands Separa|
00000010  74 6f 72 3a 20 3c e2 80  af 3e 0a 64 6f 75 62 6c  |tor: <...>.doubl|
00000020  65 20 3c 20 20 20 20 20  31 e2 80 af 30 30 30 3e  |e <     1...000>|
00000030  09 69 6e 74 20 3c 20 20  20 31 e2 80 af 30 30 30  |.int <   1...000|
00000040  3e 0a                                             |>.|
00000042

I can reproduce the issue under Ubuntu 22.04, Ubuntu 24.04 (development version), and Fedora 39.
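
As a possible shell-level workaround for the misalignment described above, printf can be left to do only the digit grouping, with the padding applied afterwards by character count. The following is only a sketch: utf8_chars and pad_left are hypothetical helper names, the example number is hard-coded with octal escapes so that no Norwegian locale is required, and the character count assumes valid UTF-8 without combining or double-width characters:

```shell
#!/bin/sh
# Hypothetical helper: count UTF-8 code points in $1 by deleting
# continuation bytes (0x80-0xBF) and counting the bytes that remain.
utf8_chars() {
    printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' '
}

# Hypothetical helper: right-align $1 in a field of $2 characters,
# padding with spaces on the left. Longer strings are left unchanged.
pad_left() {
    s=$1
    n=$(utf8_chars "$s")
    while [ "$n" -lt "$2" ]; do
        s=" $s"
        n=$((n + 1))
    done
    printf '%s' "$s"
}

# "1", U+202F NARROW NO-BREAK SPACE, "000": 5 characters, 7 bytes.
grouped=$(printf '1\342\200\257000')
printf '<%s>\n' "$(pad_left "$grouped" 10)"
```

The padded field is 10 characters (12 bytes) wide, which matches the width that the float conversion produces in the output above. With the nb_NO.UTF-8 locale installed, the same helper could wrap the original call, for example pad_left "$(LC_NUMERIC=nb_NO.UTF-8 /usr/bin/printf "%'d" 1000)" 10.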

Bug Watch Updater (bug-watch-updater) on 2024-03-24

Changed in coreutils:
importance:	Unknown → Medium
status:	Unknown → New

Thomas Dreibholz (dreibh) on 2024-03-24

affects: coreutils → glibc

In Sourceware.org Bugzilla #31542, Andreas Schwab (schwab-linux-m68k) wrote on 2024-03-25:

dup

*** This bug has been marked as a duplicate of bug 28943 ***

Bug Watch Updater (bug-watch-updater) on 2024-03-26

Changed in glibc:
status:	New → Invalid

Remote bug watches

sourceware-bugs #31542 [RESOLVED DUPLICATE]
