Cannot write xml file when certain characters appear in the path

Bug #757673 reported by Andreas Preikschat
This bug affects 1 person
Affects  Status   Importance  Assigned to  Milestone
lxml     Invalid  Undecided   Unassigned

Bug Description

Hello,

Background: Recently a bug was filed in our project (Bug #744337), where somebody had a problem exporting some songs.
I tried to reproduce this, which I finally could:

It seems that lxml cannot save an XML file to a directory whose path contains characters like "ĉûüë" (see test script). However, opening the file myself and passing the file object to my_tree.write() works.
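
A minimal sketch of the workaround described above, not the attached test script. The helper name `save_via_file_object`, the directory name, and the file name are hypothetical, and the import falls back to the standard library's ElementTree only so that the sketch runs where lxml is not installed.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

try:
    from lxml import etree
except ImportError:  # fallback only so the sketch runs without lxml
    import xml.etree.ElementTree as etree


def save_via_file_object(tree, file_path):
    """Open the target ourselves and pass the file object, not the path."""
    with io.open(file_path, "wb") as handle:
        tree.write(handle, encoding="utf-8", xml_declaration=True)


# Hypothetical directory containing the characters from the report.
target_dir = os.path.join(tempfile.mkdtemp(), u"ĉûüë")
os.mkdir(target_dir)
target_file = os.path.join(target_dir, u"song.xml")

tree = etree.ElementTree(etree.Element("song"))
save_via_file_object(tree, target_file)
```

Python opens the path here, so the library never has to encode the filename itself.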

Traceback (most recent call last):
  File "Z:\test.py", line 31, in <module>
    save_to_file(xml)
  File "Z:\test.py", line 26, in save_to_file
    encoding=u'utf-8', xml_declaration=True, pretty_print=True)
  File "lxml.etree.pyx", line 1850, in lxml.etree._ElementTree.write (src/lxml/lxml.etree.c:44526)
  File "serializer.pxi", line 455, in lxml.etree._tofilelike (src/lxml/lxml.etree.c:90047)
IOError: [Errno 2] No such file or directory

I could reproduce this only on Windows, not on my Linux box.

I'll attach a small script to reproduce this.

The requested information (Windows XP):
Python : sys.version_info(major=2, minor=7, micro=0, releaselevel='final', serial=0)
lxml.etree : (2, 3, -99, 0)
libxml used : (2, 7, 7)
libxml compiled : (2, 7, 7)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

Cheers

(If you need further information, please ask!)

Andreas Preikschat (googol-deactivatedaccount) wrote :
description: updated
Andreas Preikschat (googol-deactivatedaccount) wrote :

Note: This also occurs when opening a file with such a path. (If you want me to open another bug report, please let me know).

from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
parsed_file = etree.parse(file_path, parser)
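
By analogy with the write workaround, reading might also be possible by opening the file in Python and passing the file object to `etree.parse()`. This is an untested assumption for this bug, shown only as a sketch. The helper name `parse_via_file_object` and the sample file are hypothetical, and the standard library fallback exists only so the sketch runs without lxml.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

try:
    from lxml import etree
except ImportError:  # fallback only so the sketch runs without lxml
    import xml.etree.ElementTree as etree


def parse_via_file_object(file_path, parser=None):
    """Let Python open the path, then parse from the open file object."""
    with io.open(file_path, "rb") as handle:
        return etree.parse(handle, parser)


# Hypothetical sample file inside a directory with non-ASCII characters.
sample_dir = os.path.join(tempfile.mkdtemp(), u"ĉûüë")
os.mkdir(sample_dir)
sample_file = os.path.join(sample_dir, u"song.xml")
with io.open(sample_file, "wb") as handle:
    handle.write(b"<song><title>Test</title></song>")

parsed_file = parse_via_file_object(sample_file)
root_tag = parsed_file.getroot().tag
```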

scoder (scoder) wrote :

Hi,

thanks for the report. lxml needs to byte encode filenames in order to pass them to libxml2, and it uses the encoding given by sys.getfilesystemencoding() for that ("mbcs" on Windows, as per spec). It looks like libxml2 has a heuristic on Windows that converts UTF-8 encoded file names back to UCS2, so it might be enough to always set the file system encoding to UTF-8 on that platform. I don't have Windows available, could you try this override hack on your side? (You need Cython 0.14.1 for the source build.)

"""
diff -r b8b1e13760eb src/lxml/lxml.etree.pyx
--- a/src/lxml/lxml.etree.pyx Tue Apr 12 21:04:11 2011 +0200
+++ b/src/lxml/lxml.etree.pyx Fri Apr 15 07:54:07 2011 +0200
@@ -121,6 +121,7 @@
     _FILENAME_ENCODING = b'ascii'
 else:
     _FILENAME_ENCODING = _FILENAME_ENCODING.encode(u"UTF-8")
+_FILENAME_ENCODING = b'UTF-8'
 cdef char* _C_FILENAME_ENCODING
 _C_FILENAME_ENCODING = _cstr(_FILENAME_ENCODING)

 """

If this doesn't work for you, you can look at _encodeFilename() in apihelpers.pxi. That's where the filename encoding happens. There's also a heuristic in there that tries to recognise file system paths (as opposed to URLs). Maybe you can experiment a bit with that to see if it actually works as expected in your case.
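
To illustrate why the choice of filename encoding matters, here is a minimal sketch. It is not lxml's `_encodeFilename()`. The helper name `encode_filename` and the use of `cp1252` as a stand-in for the Windows ANSI code page behind "mbcs" are assumptions made for illustration.

```python
# -*- coding: utf-8 -*-
import sys


def encode_filename(path, encoding=None):
    """Hypothetical helper: turn a text path into bytes for a C API."""
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    return path.encode(encoding)


path = u"ĉûüë/song.xml"

# UTF-8 can represent every character, so the name round-trips losslessly.
utf8_name = encode_filename(path, "utf-8")
round_trips = utf8_name.decode("utf-8") == path

# Assumption: cp1252 stands in for the ANSI code page behind "mbcs".
# It has no "ĉ", so a strict encode cannot produce a usable byte name.
try:
    ansi_name = encode_filename(path, "cp1252")
except UnicodeEncodeError:
    ansi_name = None
```

If the bytes handed to libxml2 cannot name the real directory, a "No such file or directory" error like the one in the traceback is the expected result.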

Stefan

Andreas Preikschat (googol-deactivatedaccount) wrote :

Hello Stefan,

Can you give me a step-by-step description of what to do? That would help me a lot.
Cheers

scoder (scoder) wrote : Re: [Bug 757673] Re: Cannot write xml file when certain characters appear in the path

> Can you give me a step by step description what to do?

Not on the bug tracker. Does this help?

http://lxml.de/build.html

Stefan

Andreas Preikschat (googol-deactivatedaccount) wrote :

Hello,

I just tried to compile lxml with this change, but failed with this error message:
   unable to find vcvarsall.bat

Sorry, I prefer to work with Linux :-)

So I'd be very glad if somebody with a Windows environment could test this!

scoder (scoder) wrote :

Closing as outdated. A lot has changed since this bug was reported.

Changed in lxml:
status: New → Invalid