Cannot write xml file when certain characters appear in the path

Bug #757673 reported by Andreas Preikschat on 2011-04-11
Affects Status Importance Assigned to Milestone
lxml
Undecided
Unassigned

Bug Description

Hello,

Background: Recently a bug was filed in our project (Bug #744337) in which somebody had a problem exporting some songs.
I tried to reproduce it and finally succeeded:

It seems that lxml cannot save an XML file to a directory whose path contains characters like "ĉûüë" (see test script). However, opening the file myself and passing the file object to my_tree.write() works.

Traceback (most recent call last):
  File "Z:\test.py", line 31, in <module>
    save_to_file(xml)
  File "Z:\test.py", line 26, in save_to_file
    encoding=u'utf-8', xml_declaration=True, pretty_print=True)
  File "lxml.etree.pyx", line 1850, in lxml.etree._ElementTree.write (src/lxml/lxml.etree.c:44526)
  File "serializer.pxi", line 455, in lxml.etree._tofilelike (src/lxml/lxml.etree.c:90047)
IOError: [Errno 2] No such file or directory

I was not able to reproduce this on my Linux box, only on Windows.

I'll attach a small script to reproduce this.
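
For reference, the workaround mentioned above (passing an open file object instead of a path) could look roughly like the sketch below. It is not the attached test script. The function name save_via_file_object, the element names and the temporary directory are made up for illustration, and the fallback import of xml.etree.ElementTree is only there so the sketch also runs where lxml is not installed.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

try:
    from lxml import etree
except ImportError:
    # Fallback so the sketch still runs without lxml installed.
    import xml.etree.ElementTree as etree


def save_via_file_object(tree, path):
    """Write `tree` through an already opened binary file object.

    Python opens the file itself, so the XML library never has to
    encode the path (hypothetical helper, not part of lxml).
    """
    with io.open(path, "wb") as handle:
        tree.write(handle, encoding="utf-8", xml_declaration=True)


# Build a tiny document (made-up content).
root = etree.Element("song")
etree.SubElement(root, "title").text = u"Example"
tree = etree.ElementTree(root)

# Create a directory whose name contains the problematic characters.
target_dir = os.path.join(tempfile.mkdtemp(), u"ĉûüë")
os.makedirs(target_dir)
target = os.path.join(target_dir, u"song.xml")

save_via_file_object(tree, target)
```

With lxml, pretty_print=True can be passed to write() as in the traceback above; it is left out here because the fallback module does not accept it.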

The requested information (windows XP):
Python : sys.version_info(major=2, minor=7, micro=0, releaselevel='final', serial=0)
lxml.etree : (2, 3, -99, 0)
libxml used : (2, 7, 7)
libxml compiled : (2, 7, 7)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

Cheers

(If you need further information, please ask!)

Andreas Preikschat (googol) wrote :

Note: This also occurs when opening a file with such a path. (If you want me to open another bug report, please let me know).

from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
parsed_file = etree.parse(file_path, parser)

scoder (scoder) wrote :

Hi,

thanks for the report. lxml needs to byte encode filenames in order to pass them to libxml2, and it uses the encoding given by sys.getfilesystemencoding() for that ("mbcs" on Windows, as per spec). It looks like libxml2 has a heuristic on Windows that converts UTF-8 encoded file names back to UCS2, so it might be enough to always set the file system encoding to UTF-8 on that platform. I don't have Windows available, could you try this override hack on your side? (You need Cython 0.14.1 for the source build.)

"""
diff -r b8b1e13760eb src/lxml/lxml.etree.pyx
--- a/src/lxml/lxml.etree.pyx Tue Apr 12 21:04:11 2011 +0200
+++ b/src/lxml/lxml.etree.pyx Fri Apr 15 07:54:07 2011 +0200
@@ -121,6 +121,7 @@
     _FILENAME_ENCODING = b'ascii'
 else:
     _FILENAME_ENCODING = _FILENAME_ENCODING.encode(u"UTF-8")
+_FILENAME_ENCODING = b'UTF-8'
 cdef char* _C_FILENAME_ENCODING
 _C_FILENAME_ENCODING = _cstr(_FILENAME_ENCODING)

 """

If this doesn't work for you, you can look at _encodeFilename() in apihelpers.pxi. That's where the filename encoding happens. There's also a heuristic in there that tries to recognise file system paths (as opposed to URLs). Maybe you can experiment a bit with that to see if it actually works as expected in your case.

Stefan
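
To illustrate the encoding step described above, here is a minimal sketch of what byte-encoding such a file name can do. It assumes that cp1252 is a reasonable stand-in for the Windows ANSI code page behind "mbcs" on a Western European system. The real "mbcs" codec may map unencodable characters differently, so this shows the general effect only, not the exact bytes lxml produces.

```python
# -*- coding: utf-8 -*-
name = u"ĉûüë"

# Assumption: cp1252 stands in for the Windows ANSI code page ("mbcs").
# "replace" substitutes "?" for characters the code page cannot represent.
ansi_bytes = name.encode("cp1252", "replace")
utf8_bytes = name.encode("utf-8")

# Decode again to see which encoding preserves the original name.
ansi_roundtrip = ansi_bytes.decode("cp1252")
utf8_roundtrip = utf8_bytes.decode("utf-8")

print(repr(ansi_roundtrip))
print(repr(utf8_roundtrip))
print(ansi_roundtrip == name, utf8_roundtrip == name)
```

In this sketch the "ĉ" does not survive the code page round trip while the UTF-8 round trip is lossless, which may explain why a path containing it is reported as missing.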

Andreas Preikschat (googol) wrote :

Hello Stefan,

Can you give me a step-by-step description of what to do? That would help me a lot.
Cheers

scoder (scoder) wrote :

> Can you give me a step by step description what to do?

Not on the bug tracker. Does this help?

http://lxml.de/build.html

Stefan

Andreas Preikschat (googol) wrote :

Hello,

I just tried to compile lxml with this change, but the build failed with this error message:
   unable to find vcvarsall.bat

Sorry, I prefer to work with Linux :-)

So I'd be very glad if somebody who has a Windows environment could test this!
