Declared and actual XML encoding should match, and the encoding should be XML valid

Bug #394943 reported by Renato Silva on 2009-07-02
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
bzr-xmloutput
High
Guillermo Gonzalez

Bug Description

Windows XP, console encoding cp850, Bazaar 1.16.1. The command bzr xmlstatus > file is saving the file as latin-1 with the following content:

<?xml version="1.0" encoding="UTF-8"?><status workingtree_root="C:/Programa‡Æo/"></status>

‡Æ should be çã (it is, when I do not send the output to a file, or when I run chcp 1252 before the command). I tried to look into the source code of xmloutput and bzr but I could not figure out what's wrong.

Vincent Ladeuil (vila) on 2009-07-03
affects: bzr → bzr-xmloutput
Renato Silva (renatosilva) wrote :

I think it's possible to be a bzr issue itself.

I suspect this is a problem with console character encoding in bzr

The follwoing command send ஠to the file:

C:\Python26\python.exe -c "print (u'çã'.encode('cp850'))" > file

It seems that cp850 is being guessed for sys.stdout when it is a file. That is, when sys.stdout.encoding is None, the unicode objects are encoded as cp850. However this encoding is used for terminal only afaik.

I think that bzr/xmloutput should use UTF-8 encoding, not the one from terminal, when sending output to a file.

Alexander Belchenko (bialix) wrote :

as claimed by directive <?xml version="1.0" encoding="UTF-8"?> all strings should be actually utf-8.

xmloutput plugin should use exact encoding for its output and explicitly encode all unicode strings to utf-8.

Alexander Belchenko (bialix) wrote :

I suspect this problem occurs because Linux developers used to have utf-8 as encoding everywhere. This is obviously wrong.

Renato Silva (renatosilva) wrote :

It seems that because to_file.encoding is 'cp850', the to_file.write method is decoding str objects to encode them again as cp850. If you send u'maçã'.encode('utf-8') to the method, it tries to convert it to cp850, instead of just printing it out.

I think to_file.write should not decode str objects, as they are supposed to be already decoded.

Renato Silva (renatosilva) wrote :

Mzz from IRC wrote the attached patch. It needs to be verified as it breaks the tests.

Alexander Belchenko (bialix) wrote :

The fix seems right for me, but it should be applied to all xml-* commands.

I'd suggest to look at _setup_outf method in bzrlib.commands.Command class and/or overload it.

Renato Silva (renatosilva) wrote :

The problem is that XML encoding header is not matching the actual data encoding. If developers decide to let multiple encodings (user option, get from terminal etc.) it is important to avoid invalid encodings in XML header: http://www.w3.org/TR/2008/REC-xml-20081126/#NT-EncodingDecl. For example, it seems that "cp850" is an invalid value for that.

Maybe it would be better to just write out everything as UTF-8, and keep the XML header static with a well-known valid entry.

summary: - Wrong chars in bzr xmloutput > file
+ Declared and actual XML encoding do not match

Renato,
Thanks for linking the branches

Changed in bzr-xmloutput:
assignee: nobody → Guillermo Gonzalez (verterok)
importance: Undecided → High
status: New → Confirmed
Changed in bzr-xmloutput:
milestone: none → 0.8.6
Changed in bzr-xmloutput:
status: Confirmed → Fix Committed
Changed in bzr-xmloutput:
status: Fix Committed → Fix Released
Renato Silva (renatosilva) wrote :

This bug is still reproducible in 0.8.6.

Changed in bzr-xmloutput:
status: Fix Released → Confirmed
Guillermo Gonzalez (verterok) wrote :

Hi Renato,

I thought this was fixed. sorry for marking it released :/

Is this happening in all commands or just in xmlstatus?

Changed in bzr-xmloutput:
milestone: 0.8.6 → 0.8.7
Renato Silva (renatosilva) wrote :

The idea is that all commands need to be ensured to have the declared encoding matching the actual data. Also, care must be taken to not use invalid encodings like cp850. As I said above, a possible solution is to use UTF-8 always, and there's an experimental patch for that attached in this bug. I think reviewing the comments may help to clarify the issue.

summary: - Declared and actual XML encoding do not match
+ Declared and actual XML encoding should match, and the encoding should
+ be XML valid
Guillermo Gonzalez (verterok) wrote :

Ok, I see the patch, thanks!

I'm not sure that's going to work for the terminal use case, but I need to check how bzr handle the text internally.
If it's unicode we are ok, but if it's bytes we need to encode/decode using the right encoding before encoding it using utf-8

Renato Silva (renatosilva) wrote :

I believe there's no 'terminal use case'. Why one would want to read the XML output in a terminal? Not to get the bzr output for sure, otherwise he could just use the regular version of the command. Afaik xmloutput is targeted at structuring the output of bzr for being handled programatically.

Hence I don't see much or any sense in adapting to the terminal encoding. As I said above, if you choose to adapt, you have to remember that some terminal encodings are XML invalid and you should handle that situation.

Renato Silva (renatosilva) wrote :

So I think the encoding should be static, UTF-8 by default. In the case the user really needs to read the output in a non-UTF-8 terminal, the plugin could have some argument like --use-terminal-encoding or --encoding=<value>. That encoding would be declared in the XML preamble and used for the actual data encoding.

Guillermo Gonzalez (verterok) wrote :

That sounds reasonable.

Changed in bzr-xmloutput:
milestone: 0.8.7 → later
Changed in bzr-xmloutput:
milestone: 0.8.8 → none
milestone: none → 0.8.8
status: Confirmed → Triaged
description: updated
Changed in bzr-xmloutput:
milestone: 0.8.8 → none
Changed in bzr-xmloutput:
milestone: none → 0.9
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers