preserve white space outside root element

Bug #526799 reported by fantasai on 2010-02-24
This bug affects 2 people
Affects: lxml
Importance: Wishlist
Assigned to: Unassigned

Bug Description

Overview:

  lxml doesn't preserve white space outside the document element. This removes the trailing newline at the end of the file, interferes with diffing, and makes the output harder to read.

Steps to reproduce:

    import html5lib
    from lxml import etree

    # `source` is the path (or file object) of the XML document to parse.
    parser = etree.XMLParser(no_network=True,
                             remove_comments=False,
                             strip_cdata=False,
                             resolve_entities=False)
    tree = etree.parse(source, parser=parser)
    o = html5lib.serializer.serialize(tree, tree='lxml',
                                      format='html',
                                      quote_attr_values=True)

  Using lxml.etree.tostring(tree) also demonstrates this problem, except it prints a line break after the doctype line.

Actual results:
  White space outside the document element is stripped.

Expected results:
  White space is preserved in the tree so that it can be serialized back out in its original state, including the white space between doctype declarations, PIs, comments, etc.

Other information:
  I don't actually know what version of lxml this is, or how to get that information. :(

scoder (scoder) on 2011-08-12
Changed in lxml:
importance: Undecided → Wishlist
status: New → Triaged

I too would like to see this improvement. I often need to make widespread changes to XML files that are stored in a version control system (e.g. Visual Studio project files, sample data for unit tests, etc.). It is often helpful to write a script that processes the XML content in a structured way rather than doing a dumb search-replace. If every file touched by the XML parser is rewritten with all of its white space changed, it is difficult at best to use common diff tools to see which parts were "actually" changed. The problem gets worse as the number of modified files increases (e.g. hundreds or thousands of files are very difficult to analyse and compare).

I suspect the reason this library works as it does (like most, if not all, other XML processing libraries I've looked at) is that it parses the source file and stores its content in some internal structure for efficient XML processing operations. After the modifications are complete, it has likely lost the context of how that information was originally laid out in the source XML.

However, this doesn't negate or minimize the importance of the use case I have just described. I suspect my use case is probably just one of many where people could have the need to preserve the format and style of the original content. If this or any other library was able to satisfy this requirement I suspect it would open a whole new market for that tool.

NOTE: I should also mention that personally I would increase the importance of this enhancement request. Surely there are many other people out there that have similar needs. Just doing a quick troll of the online forums and seeing many other people experiencing similar problems convinces me of that.

Also, a more trivial example of what kind of expected behavior I would like to see would look something like this:

from lxml import etree

# Read as bytes, because etree.tostring() returns bytes by default.
with open("somefile.xml", "rb") as f:
    original_content = f.read()

tree = etree.parse("somefile.xml")
root = tree.getroot()
parsed_content = etree.tostring(root)

# here, original_content should == parsed_content

scoder (scoder) wrote :

> original_content should == parsed_content

You shouldn't expect that. XML has places where whitespace is irrelevant (e.g. before the root element), and the parser is free to discard them. Only character content inside of the root element is considered relevant and is guaranteed to be preserved by the parser (unless requested otherwise).

For example, you could do this:

    <?xml version= "1.0" encoding = "utf8" ?><root/>

For the parser, the additional whitespace is completely useless and it will just skip over it. Trying to preserve things like these would be hopeless and just lead to lots of wasted memory during processing.

BTW, in your code above, you are explicitly asking the library to serialise only the root element (you only pass "root" into the tostring() function), not the whole document (which would be "tree").
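
The distinction between character content inside the root element and whitespace outside it can be sketched roughly as follows. This is only an illustration under assumptions: it uses the standard library's xml.etree.ElementTree as a stand-in (not lxml itself, whose serialised output may differ in detail), and the sample document is made up.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in; lxml.etree is not used here

# Hypothetical sample: indentation and text inside the root element,
# plus a trailing newline after it.
source = b'<root>\n  <item>a  b</item>\n</root>\n'

root = ET.fromstring(source)
serialized = ET.tostring(root)

# Character content inside the root element is kept in the tree.
inside_preserved = (
    root.text == "\n  "
    and root[0].text == "a  b"
    and root[0].tail == "\n"
)

# The newline after </root> was never stored in the tree,
# so it cannot be written back out.
trailing_newline_lost = source.endswith(b"\n") and not serialized.endswith(b"\n")
```

Here both `inside_preserved` and `trailing_newline_lost` come out true: the indentation and text inside `<root>` round-trip, while the final newline of the file does not.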

I would suggest you pass all your documents through lxml once to get them in a well-defined (sort-of "normalised") format. Maybe even pretty print them, or make it strip all ignorable whitespace. After that, further input-output processing shouldn't change them anymore.
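
A minimal sketch of that "normalise once" idea, again assuming the standard library's xml.etree.ElementTree as a stand-in for lxml. The `normalise` helper and the sample input are hypothetical, and the exact normalised form depends on the serialiser used.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree


def normalise(data: bytes) -> bytes:
    """Parse and re-serialise a document once (hypothetical helper)."""
    return ET.tostring(ET.fromstring(data))


# Hypothetical input with redundant whitespace in the XML declaration,
# inside a start tag, and after the root element.
original = b'<?xml version= "1.0" ?>\n<root  a = "1" >\n  <child/>\n</root>\n'

first_pass = normalise(original)
second_pass = normalise(first_pass)

# The first pass rewrites the ignorable whitespace ...
changed_once = first_pass != original
# ... but a further pass leaves the normalised output untouched.
stable_afterwards = second_pass == first_pass
```

Once every file has been through one such pass and committed, later scripted edits should show up in a diff as only the parts that were actually changed.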
