Commit messages on Windows are interpreted as cp1252 even if they are UTF-8

Bug #610229 reported by Max Kanat-Alexander
This bug affects 2 people
Affects   Status      Importance   Assigned to   Milestone
Bazaar    Confirmed   Medium       Unassigned

Bug Description

A developer on the Bugzilla Project recently checked in two commits with garbled characters--UTF-8 that got interpreted as windows-1252 (cp1252) instead. You can see here that he attempted to insert U+2013 (an en-dash) and it got garbled:

  http://bzr.mozilla.org/bugzilla/trunk/revision/7401

He says that he was using the same editor that he uses to edit his localization for Bugzilla, which is definitely in UTF-8.

My suspicion is that bzr is converting to UTF-8 using the terminal encoding, but on Windows the terminal encoding will nearly always be cp1252 or something that isn't UTF-8, even if people are writing in UTF-8.
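
The garbling described above can be reproduced in a few lines of Python. This is a minimal sketch, assuming the message bytes were UTF-8 and were decoded as cp1252 before being stored; the variable names are illustrative only and nothing here is taken from bzr's code.

```python
# Reproduce the reported garbling: UTF-8 bytes decoded as cp1252.
en_dash = "\u2013"               # the character the developer typed
raw = en_dash.encode("utf-8")    # what a UTF-8 editor writes to disk
garbled = raw.decode("cp1252")   # what a cp1252 reader makes of those bytes

print(raw)            # b'\xe2\x80\x93'
print(len(garbled))   # 3: one character became three

# The damage is reversible as long as every byte mapped to a cp1252 character.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == en_dash)   # True
```

The round trip at the end only works because all three bytes happen to be defined in cp1252; bytes that cp1252 leaves undefined would not survive it.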

Martin Packman (gz) wrote :

Bazaar is using the user encoding, which will generally be the right option, but does cause problems like this.

Couple of options. Could add yet another config option for the editor encoding. Using something like the notepad heuristic, which auto-detects unicode encodings and falls back to the windows codepage, would mostly work but leaves the problem of what encoding to *write* to the temporary file. Both still have the potential to mangle the encoding.

Finally, might be worth looking at changing the default to UTF-8 (perhaps with BOM), which most things really should support these days. I still have one editor installed that doesn't support unicode, but everything I use regularly does.
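
A rough sketch of the detection heuristic mentioned above, under stated assumptions: `decode_message` is a hypothetical helper (not a bzr function), the fallback code page is passed in by the caller, and the extra step of trying BOM-less UTF-8 goes beyond what the comment describes. It only covers reading; it does not answer what encoding to write to the temporary file.

```python
import codecs

# BOMs checked in order. "utf-8-sig" strips the BOM itself; "utf-16"
# consumes the BOM to pick the byte order.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def decode_message(raw: bytes, fallback: str = "cp1252") -> str:
    """Hypothetical helper: guess the encoding of an edited message file."""
    for bom, codec in _BOMS:
        if raw.startswith(bom):
            return raw.decode(codec)
    try:
        # Assumption beyond the comment above: also accept BOM-less UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so fall back to the Windows code page.
        return raw.decode(fallback, errors="replace")
```

As the comment notes, a heuristic like this can still guess wrong: text written in the fallback code page that happens to form valid UTF-8 would be misread.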

Changed in bzr:
importance: Undecided → Medium
status: New → Confirmed
Alexander Belchenko (bialix) wrote :

We need an --encoding command-line option.

@Max: as a workaround, your developer can use the qcommit dialog from qbzr, which does the right thing with encodings.

Max Kanat-Alexander (mkanat) wrote :

Okay, thanks. The developer is Marc Schumann, BTW, who's also subscribed to this bug now.

I think making UTF-8 the default would possibly be a reasonable choice, too. I think almost everybody who would be typing international comments into commit messages on Windows nowadays would be writing them in UTF-8, although Marc would have a better idea than I would.

John A Meinel (jameinel) wrote : Re: [Bug 610229] Re: Commit messages on Windows are interpreted as cp1252 even if they are UTF-8

Max Kanat-Alexander wrote:
> Okay, thanks. The developer is Marc Schumann, BTW, who's also subscribed
> to this bug now.
>
> I think making UTF-8 the default would possibly be a reasonable choice,
> too. I think almost everybody who would be typing international comments
> into commit messages on Windows nowadays would be writing them in UTF-8,
> although Marc would have a better idea than I would.
>

To be fair his *system* says that he should be writing in cp1252, or
whatever, which is what we are respecting.

I think it would be reasonable to have a bazaar configuration setting
(possibly also settable at runtime with a -Ooption flag) that would set
the file-encoding as per user request, instead of querying
locale.getpreferredencoding().

John
=:->
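
The lookup order proposed in this message could look roughly like the following. This is only a sketch: `message_encoding` and the option name `commit_message_encoding` are made up for illustration and are not existing bzr names, and the `override` argument stands in for whatever a runtime flag would supply.

```python
import codecs
import locale
from typing import Mapping, Optional


def message_encoding(config: Mapping[str, str],
                     override: Optional[str] = None) -> str:
    """Pick the encoding for the commit message file.

    Precedence: runtime override, then config value, then the locale default.
    "commit_message_encoding" is a hypothetical option name.
    """
    name = override or config.get("commit_message_encoding")
    if name is None:
        # Nothing requested by the user: keep the locale-based behavior.
        return locale.getpreferredencoding(False)
    # codecs.lookup raises LookupError for unknown names, so typos fail early.
    return codecs.lookup(name).name
```

For example, `message_encoding({"commit_message_encoding": "cp1252"}, override="utf-8")` returns `"utf-8"`, since the runtime value wins over the stored setting.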

Max Kanat-Alexander (mkanat) wrote :

@John His system only says that the terminal he's writing in encodes its input in cp1252--however, on Windows, editors are not *in* the terminal. So that doesn't seem right, does it?

Alexander Belchenko (bialix) wrote :

Max Kanat-Alexander wrote:
> @John His system only says that the terminal he's writing in encodes its
> input in cp1252--however, on Windows, editors are not *in* the terminal.
> So that doesn't seem right, does it?

This is not correct. Windows actually has two encodings: ANSI (the default
for GUI/non-Unicode applications, including editors) and OEM (the default
for the terminal/console).

CP1252 is the ANSI encoding, not the terminal encoding. The terminal
encoding would be CP850 (IIUC). So Bazaar prefers the ANSI encoding for
commit messages from editors, because that used to be the default behavior
on Windows.
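
To illustrate why the ANSI/OEM distinction matters, here is a small sketch that decodes the same byte under each code page. The byte value is chosen arbitrarily; cp437 is included as another common OEM code page, which is an addition beyond the comment above.

```python
# One byte, three readings, depending on which code page is assumed.
raw = b"\xe9"

as_ansi = raw.decode("cp1252")     # ANSI code page discussed above
as_oem_850 = raw.decode("cp850")   # OEM code page named in the comment
as_oem_437 = raw.decode("cp437")   # another common OEM code page

for label, text in (("cp1252", as_ansi),
                    ("cp850", as_oem_850),
                    ("cp437", as_oem_437)):
    # ascii() keeps the output printable whatever the console encoding is.
    print(label, ascii(text))
```

Each line of output shows a different character, so a tool has to know which of the code pages a given file or stream was actually written in.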

Marc Schumann (wurblzap) wrote : Re: [Bug 610229] Re: Commit messages on Windows are interpreted as cp1252 even if they are UTF-8

John,

2010/7/29 John A Meinel <email address hidden>:
> To be fair his *system* says that he should be writing in cp1252, or
> whatever, which is what we are respecting.

all right, so this is a registry setting? Saying what encoding I'm
usually writing in? Then I should be able to change this. It's
reasonable that I tell the system my preferred encoding. Can you
please point me to the place where I can change this?

   Marc

John A Meinel (jameinel) wrote : Re: [Bug 610229] Re: Commit messages on Windows are interpreted as cp1252 even if they are UTF-8

Max Kanat-Alexander wrote:
> @John His system only says that the terminal he's writing in encodes its
> input in cp1252--however, on Windows, editors are not *in* the terminal.
> So that doesn't seem right, does it?
>

Actually, the terminal encoding is cp437. The Windows system page is set
to cp1252.

John
=:->

Max Kanat-Alexander (mkanat) wrote :

@John That may well be, but nitpicking the details of Windows encodings isn't going to help people writing Unicode commit messages on Windows using bzr. What bzr gets for InputCP (I'm guessing via Python's standard locale mechanisms?) just doesn't seem that relevant if somebody is using a GUI editor which ignores it (which all GUI editors do, given that they are writing files and not printing output to a terminal or anything else where InputCP or OutputCP would matter).

Jelmer Vernooij (jelmer)
tags: added: encoding win32
Jelmer Vernooij (jelmer)
tags: added: check-for-breezy