be smart at guessing encoding

Bug #57985 reported by David Allouche
Affects: Bazaar GTK+ Frontends
Status: Confirmed
Importance: Medium
Assigned to: Unassigned

Bug Description

Bug 44677 is fixed by decode with errors=replace. But the gui could be smarter and more helpful when dealing with non-utf8 encodings.

Tags: diff encoding
Revision history for this message
David Allouche (ddaa) wrote :

As discussed with LartiQ in #bzr, I think bzr-gtk should be smarter at guessing how to decode arbitrary file contents. Generally, the logic should look like:

1. Look for a BOM. If we find one, we can be confident that the encoding is utf-something. BOMs are normally found in utf-16 and utf-32 files, but LartiQ reports that one is sometimes used in utf-8 documents as well (although it makes no sense there, since utf-8 has a fixed byte order).

2. Try decoding with utf-8. I do not know of any other encoding/language that normally (in non-pathological documents) produces data that is valid utf-8.

3. Optionally, more heuristics. Some text editors look for patterns in the document to guess the encoding. I believe emacs has some magic of that sort.

4. Try the locale encoding, as provided by locale.getpreferredencoding (per j-a-meinel)

5. If that still does not work, decode('ascii', 'replace').

FINALLY: always display a control that shows the encoding, and provides direct user control to override the automatic detection. The choices should at least include utf-8, the locale encoding, and explicit input of any arbitrary encoding supported by Python. Optionally, the choices could include a list of user-configurable favourite encodings.
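
A minimal sketch of steps 1, 2, 4 and 5 in Python 3, to make the proposed order concrete. The function name guess_decode and the locale_encoding parameter are made up for illustration and are not bzr-gtk code; step 3 (content heuristics) is left out. The returned encoding name is what the override control would show as its initial value.

```python
import codecs
import locale

# UTF-32 marks are tested before UTF-16 because BOM_UTF32_LE begins with
# the same two bytes as BOM_UTF16_LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_decode(data, locale_encoding=None):
    """Return (text, encoding_name) for a byte string (hypothetical helper)."""
    # Step 1: a BOM tells us which utf-something codec to use.
    for bom, name in _BOMS:
        if data.startswith(bom):
            try:
                return data.decode(name), name
            except UnicodeDecodeError:
                break  # looked like a BOM, but the rest is not valid
    # Step 2: strict utf-8.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # Step 3 (pattern heuristics) is intentionally omitted in this sketch.
    # Step 4: the locale encoding.
    if locale_encoding is None:
        locale_encoding = locale.getpreferredencoding(False)
    try:
        return data.decode(locale_encoding), locale_encoding
    except (UnicodeDecodeError, LookupError):
        pass
    # Step 5: last resort, never fails.
    return data.decode("ascii", "replace"), "ascii"
```

Note that single-byte locale encodings such as latin-1 accept any byte sequence, so with such a locale step 5 is never reached.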

Jelmer Vernooij (jelmer) wrote :

I think behaviour should be predictable by the user and trying to be smarter than the user is hard. If it is user-specifiable, with the initial encoding guessed by bzr-gtk, I'm all for it.

David Allouche (ddaa)
Changed in bzr-gtk:
importance: Untriaged → Medium
status: Unconfirmed → Confirmed
Changed in bzrk:
importance: Untriaged → Medium
status: Unconfirmed → Confirmed
Alexander Belchenko (bialix) wrote :

Is it possible for bzr-gtk to read the encoding from some sort of config file or from an environment variable (like $ENCODING)?

Jelmer Vernooij (jelmer) wrote :

Yes, though the problem is that the encoding might be different for different branches or even for different files in a branch.
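
To illustrate how a configured encoding could still vary per file, here is a rough Python sketch. Everything in it is hypothetical: bzr-gtk has no such rules list, the function name configured_encoding is invented, and $ENCODING is only the variable name suggested above. Rules are (glob pattern, encoding) pairs checked in order, with the environment variable as a branch-wide fallback.

```python
import fnmatch
import os


def configured_encoding(path, rules, environ=None):
    """Return the encoding configured for path, or None if nothing applies.

    rules is a list of (glob_pattern, encoding) pairs; the first match wins.
    """
    environ = os.environ if environ is None else environ
    for pattern, encoding in rules:
        # fnmatchcase keeps matching case-sensitive on every platform.
        if fnmatch.fnmatchcase(path, pattern):
            return encoding
    # Hypothetical fallback: the $ENCODING variable suggested in this thread.
    return environ.get("ENCODING")


# Example rules for a branch that mixes encodings (made-up paths).
rules = [("docs/ru/*.txt", "koi8-r"), ("*.po", "utf-8")]
```

A None result would mean "nothing configured", leaving the choice to automatic guessing.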

David Allouche (ddaa) wrote :

Adding to Jelmer's comment in reply to bialix:
Also, the encoding of files in a bzr branch may bear no relation to the user's locale setting. The user locale is good for interpreting user input, and for encoding unicode text to the user's preferred encoding.

But decoding of source files is a different problem than the one addressed by the user's locale. Source file data is not user input (it's rather like untrusted data off the internet) and it's definitely not unicode (otherwise we would not be worrying about decoding it).

Alexander Belchenko (bialix) wrote :

Well, I understand your pain with guessing encodings in the general sense. But that is too big a problem. Make the problem smaller.

If I work on a branch and I know precisely what encoding the files are in, how can I tell bzr-gtk this? I think support for supplying this info could be a small but important first step towards encoding support.

Right now bzr-gtk is simply unusable with my Russian text files.

Jelmer Vernooij (jelmer) wrote : Re: [Bug 57985] Re: be smart at guessing encoding

On Thu, 2006-10-05 at 15:05 +0000, bialix wrote:
> Well, I understand your pain in wide sense of guessing encoding. But
> it's too big problem. Make problem smaller.
>
> If I work on branch and I precisely know what encoding files are -- how
> I can tell to bzr-gtk this info? I think support for giving this info
> could be first small but important step towards encodings support.
>
> Right now bzr-gtk simply unusable with my russian-text files.
That's why I think the best way to solve this is as ddaa proposed:

- try to guess an encoding (maybe always UTF-8?)
- allow the user (in the diff window) to change the encoding from the
default

Cheers,

Jelmer
--
Jelmer Vernooij <email address hidden> - http://samba.org/~jelmer/

David Allouche (ddaa) wrote :

That is steps 2, 5 and FINALLY in the bug description.
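
A rough Python 3 sketch of that reduced version, for illustration only. The name decode_for_display and the user_encoding parameter are invented here; user_encoding stands for whatever the user picked in the encoding control, and is validated with codecs.lookup so that an unknown name fails early.

```python
import codecs


def decode_for_display(data, user_encoding=None):
    """Decode bytes for a viewer; return (text, encoding_name_used)."""
    if user_encoding is not None:
        # FINALLY: the user's explicit choice always wins.
        # codecs.lookup raises LookupError for names Python does not know.
        name = codecs.lookup(user_encoding).name
        return data.decode(name, "replace"), name
    try:
        # Step 2: strict utf-8.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Step 5: last resort, never fails.
        return data.decode("ascii", "replace"), "ascii"
```

With the override, undecodable bytes are replaced rather than raising, so the window can always show something while the user tries another encoding.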

Alexander Belchenko (bialix) wrote :

Jelmer Vernooij wrote:
> On Thu, 2006-10-05 at 15:05 +0000, bialix wrote:
>> Well, I understand your pain in wide sense of guessing encoding. But
>> it's too big problem. Make problem smaller.
>>
>> If I work on branch and I precisely know what encoding files are -- how
>> I can tell to bzr-gtk this info? I think support for giving this info
>> could be first small but important step towards encodings support.
>>
>> Right now bzr-gtk simply unusable with my russian-text files.
> That's why I think the best way to solve this is as ddaa proposed:
>
> - try to guess an encoding (maybe always UTF-8?)
> - allow the user (in the diff window) to change the encoding from the
> default

It's not only in the diff window.
gannotate also works badly with non-utf-8 text files.

--
Alexander

Jelmer Vernooij (jelmer)
tags: added: diff encoding