Cross-platform Unicode issues

Bug #1666829 reported by RJVB on 2017-02-22
This bug affects 1 person
Affects: Qarte
Importance: Undecided
Assigned to: Unassigned

Bug Description

A priori there is nothing platform-specific in the qarte code, and indeed it works on Mac and MS Windows without significant patching.

There is a difference in the way Python/Mac handles Unicode strings, though, that leads to runtime failure. Attached is a patch that handles the few remaining expressions where this is a blocking problem in basic usage. From the looks of it these are all cases where a filename is retrieved from the server that is UTF8-encoded.

If there is no way to make the `open()` function treat its filename argument as a Unicode string, then there are probably other locations where equivalent operations are carried out (e.g. in the scheduled loader implementation). Alternatively, it should be possible to override or replace the `open` function with one that calls `codecs.open`.

Edit: the patch is safe to use on Linux.

RJVB (rjvbertin) wrote :
description: updated
VinsS (vincent-vandevyvre) wrote :

I don't understand: the filename is always a Unicode string.

There is no difference between open() and codecs.open() as far as the filename argument is concerned.

Do you have a traceback of the error, so that I can get more information about the problem?

RJVB (rjvbertin) wrote :

I don't get tracebacks (not sure why), just errors like this on the terminal:

UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 368: ordinal not in range(128)

I think this bit from the `open` documentation is relevant:

"In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding."

I'm not aware of a `locale.setpreferredencoding()` function, so there may not be a robust way to set an application-wide default encoding.
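
To see which encoding a given installation actually falls back on, the values can be printed directly. This is only a minimal sketch using standard-library calls; the helper name `report_encodings` is made up for illustration and is not part of Qarte.

```python
import locale
import sys


def report_encodings():
    """Return the encodings Python falls back on when none is given."""
    return {
        # Used by open() in text mode when no encoding argument is passed.
        "preferred": locale.getpreferredencoding(False),
        # Used to encode and decode file names.
        "filesystem": sys.getfilesystemencoding(),
        # Default for str.encode() and bytes.decode() on Python 3.
        "default": sys.getdefaultencoding(),
    }


for name, value in report_encodings().items():
    print("{}: {}".format(name, value))
```

If "preferred" is reported as an ASCII variant, any text-mode `open()` without an explicit encoding will reject characters such as '\xe9'.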

VinsS (vincent-vandevyvre) wrote :

UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 368: ordinal not in range(128)

That's clearly not a question of filename encoding. Have you perhaps modified the code?

Without a traceback I can't say anything.

Changed in qarte:
status: New → Incomplete
RJVB (rjvbertin) wrote :

No, that's with the original code.

It has been so long since I figured out how to fix the errors that I forgot the details. It is indeed not the filename that is the problem, but the file contents. The encoding argument to codecs.open sets the encoding for the entire opened file. With that knowledge refreshed, I do recall that the character at position 368 in the summary had a value of 0xe9.

How can I get the traceback you want if python doesn't generate one by itself?

RJVB (rjvbertin) wrote :

Ok, I'm not getting tracebacks because most places where the encoding error occurs are protected with a try/except construct.

Not so in arteconcert.py, so I added a line `lgg.info(cnt)` just after the "Concert list updated" print-out:

18:29:19: INFO - arteconcert Concert list updated
18:29:19: INFO - arteconcert {
    "classic": [
        {
            "date": "2017-01-27 18:00",
            "duration": "7216",
            "expires": "1498600740",
            "id": "44049",
            "imgurl": "http://concert.arte.tv/sites/default/files/atoms/image/opa/069077-006-A_1949605.jpg",
            "jsonurl": "http://concert.arte.tv/fr/player/62876",
            "summary": "Invit\xe9e au Musiikkitalo Helsinki, la Maison de la musique de la capitale finlandaise, Patricia Kopatchinskaja interpr\xe8te le concerto pour violon de Gy\xf6rgy Ligeti. Cette \u0153uvre a \xe9t\xe9 d\xe9dicac\xe9e par Ligeti au violoniste germano-bulgare Saschko Gawriloff qui l\u2019a interpr\xe9t\xe9 pour la premi\xe8re fois en 1993 avec l\u2019Ensemble intercontemporain, sous la direction de Pierre Boulez.\n\nEgalement au\xa0programme de ce concert : l'ouverture Leonore et la symphonie n\xb07 de Beethoven.\n\nLe Finnish Radio Symphony Orchestra, FRSO, est plac\xe9 sous la direction de\xa0Jukka-Pekka Saraste.\n\n\xa0\n\nPhoto\xa0\xa9 Felix Broede\n",
            "teaser": "Invit\xe9e au Musiikkitalo Helsinki, la Maison de la musique de la capitale finlandaise, Patricia Kopatchinskaja interpr\xe8te le concerto pour violon de...",
            "title": "Patricia Kopatchinskaja et le FRSO sous la direction de Jukka-Pekka Saraste",
            "url": "http://concert.arte.tv/fr/patricia-kopatchinskaja-et-le-frso-sous-la-direction-de-jukka-pekka-saraste"
        },
<SNIP>
Traceback (most recent call last):
  File "/opt/local/share/qarte/core.py", line 227, in set_videos_list
    self.artelive.config_parser()
  File "/opt/local/share/qarte/arteconcert.py", line 79, in config_parser
    self.update_concerts()
  File "/opt/local/share/qarte/arteconcert.py", line 90, in update_concerts
    outf.write(cnt)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 368: ordinal not in range(128)

Position 368 can hardly be anything else but the first \xe9 in the json dump above.
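
The failure in the traceback above can be reproduced in isolation. The sketch below is not Qarte code: it passes `encoding="ascii"` explicitly as a stand-in for the locale-derived default on the affected machine (an assumption), and the helper name `try_write` is hypothetical.

```python
import os
import tempfile

# Same leading characters as the "summary" field in the dump above.
text = "Invit\xe9e au Musiikkitalo Helsinki"


def try_write(encoding):
    """Write `text` to a temporary file; return None on success, else the error."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding) as outf:
            outf.write(text)
        return None
    except UnicodeEncodeError as err:
        return err
    finally:
        os.remove(path)


# "ascii" is assumed here to mimic what open() picks under the C locale.
ascii_error = try_write("ascii")
utf8_error = try_write("utf-8")

print(ascii_error)  # the UnicodeEncodeError raised for the ascii case
print(utf8_error)   # None when the utf-8 write succeeds
```

The position reported in the error is an index into the string passed to `write()`, which is consistent with position 368 pointing at a character inside the JSON dump.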

RJVB (rjvbertin) wrote :

An additional thought: if my understanding of the documentation is correct, the default encoding used for the operations that fail for me is based on the user's current locale settings. That could mean they can also occur on Linux for users who use a more or less exotic locale.

In other words, if it is certain that Arte only serves UTF8 text data it might be best to figure out how to make Qarte use UTF8 encoding throughout explicitly, or generalise my patch by replacing the standard open(name,mode) function by codecs.open(name,mode,'utf8').
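
One way the generalisation suggested above might look on Python 3 is a thin wrapper around the built-in `open()` that pins the encoding. This is a sketch under the assumption that all affected files are text files holding UTF-8 data; the names `open_utf8` and `roundtrip` are illustrative and do not exist in Qarte.

```python
import functools
import os
import tempfile

# Hypothetical drop-in for text-mode open(): same arguments, encoding fixed.
# It must not be used with binary modes ("rb", "wb"), which reject `encoding`.
open_utf8 = functools.partial(open, encoding="utf-8")


def roundtrip(text):
    """Write `text` with open_utf8, read it back, and return what was read."""
    fd, path = tempfile.mkstemp(suffix=".json")
    os.close(fd)
    try:
        with open_utf8(path, "w") as outf:
            outf.write(text)
        with open_utf8(path, "r") as inf:
            return inf.read()
    finally:
        os.remove(path)


sample = "d\xe9j\xe0 vu, \u0153uvre"
print(roundtrip(sample) == sample)
```

Because the encoding is fixed at the call site, the result no longer depends on the LC_* environment variables of whoever launches the program.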

VinsS (vincent-vandevyvre) wrote :

No problem with Python 3

---------------------------------
Python 3.4.3 (default, Nov 17 2016, 01:08:31)
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> f = open('test', 'w')
>>> f.write('déjàvu')
6
>>> f.close()
>>> f = open('test', 'r')
>>> l = f.read()
>>> l
'déjàvu'
---------------------------------

When you launch Qarte with the argument -d the log must begin with:
---------------------------------
vincent@djoliba:~$ qarte -d
11:29:01: INFO - qarte Qarte-3.6.1
11:29:01: INFO - qarte Python 3.4.3 on Linux-3.13.0-110-generic-x86_64-with-Ubuntu-14.04-trusty
11:29:01: INFO - qarte File system encoding: utf-8
11:29:01: INFO - qarte System encoding: utf-8
11:29:01: INFO - qarte Locale encoding: ('fr_BE', 'UTF-8')
----------------------------------

If it does not, this is a configuration problem.

RJVB (rjvbertin) wrote :

As I thought, this is a locale issue; I'm getting

11:56:38: INFO - qarte Locale encoding: (None, None)

QLocale.system().name() returns 'en_GB' for me but for the rest I'm indeed running the default C locale.

This just underlines what I argued earlier: you cannot be certain of the locale and default encoding the user is using. Even if it were good practice to oblige the user to call Qarte with tweaked LC env. variables you still cannot assume that this will work the same way on all platforms.
Which takes me back to my position that, without a `locale.setpreferredencoding()` function, the only proper solution is to make sure the proper encoding is used for each of the operations concerned, especially when that requires such a minimal change.

Or maybe `locale.setlocale(locale.LC_ALL, (QLocale.system().name(), "utf-8"))` would do the trick, executed at the right place?
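
A guarded version of that `setlocale` call could look like the sketch below. It avoids the PyQt dependency by taking the language name as a plain argument; "en_GB" merely stands in for the value `QLocale.system().name()` returned in this report, and the function name `try_set_utf8_locale` is hypothetical. Whether the call succeeds depends on which locales the platform provides.

```python
import locale


def try_set_utf8_locale(language):
    """Try to switch LC_ALL to `language` with UTF-8; return True on success."""
    try:
        locale.setlocale(locale.LC_ALL, (language, "utf-8"))
        return True
    except locale.Error:
        # The requested locale is not installed or not supported here.
        return False


# "en_GB" is an assumed example value, not something read from the system.
ok = try_set_utf8_locale("en_GB")
print("locale set:", ok)
print("preferred encoding:", locale.getpreferredencoding(False))
```

Since the call can fail, code relying on it would still need an explicit encoding on each file operation as a fallback.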

RJVB (rjvbertin) wrote :

This patch selects UTF-8 encoding at the application level.

On both my systems the returned locale is different between `locale.getlocale()` and `QLocale.system().name()`. PyQt5 will probably use the latter for GUI translations so it seems appropriate to standardise on that locale.

RJVB (rjvbertin) wrote :

Annoyingly it appears to be impossible to use `locale.setlocale()` on MS Windows: it raises an error for all locales, including `None` which ought to be supported.
The result is that getdefaultlocale() returns ('None', 'None') just like it does on my Mac set-up.
