Serializing ElementTree duplicates ns

Bug #1965070 reported by Jens Troeger
This bug affects 1 person
Affects  Status   Importance  Assigned to  Milestone
lxml     Invalid  Undecided   Unassigned

Bug Description

Happens on both versions:

Python : sys.version_info(major=3, minor=10, micro=2, releaselevel='final', serial=0)
lxml.etree : (4, 8, 0, 0)
Python : sys.version_info(major=3, minor=9, micro=10, releaselevel='final', serial=0)
lxml.etree : (4, 7, 1, 0)

Host libraries:

libxml used : (2, 9, 13)
libxml compiled : (2, 9, 13)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)

The following test code is, I think, self-explanatory. First, the working example:

>>> import lxml.etree, lxml.html
>>> b = b'<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"></html>'
>>> lxml.etree.tostring(lxml.html.fromstring(b))
b'<html xmlns="http://www.w3.org/1999/xhtml"/>'

Then we add a DOCTYPE and the serialized tree contains the same xmlns twice:

>>> b = b'<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"></html>'
>>> lxml.etree.tostring(lxml.html.fromstring(b))
b'<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml"></html>'

Which is invalid XML and can’t be read: lxml.etree.XMLSyntaxError: Attribute xmlns redefined, line 1, column 80
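
For readers without lxml at hand, the well-formedness problem can be checked with the standard library alone. This is only a sketch: it uses xml.etree.ElementTree (expat) rather than lxml/libxml2, so the error message differs from the one quoted above, and the helper name is_well_formed is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The serialized output from the report, with xmlns appearing twice.
DUPLICATED = (
    b'<html xmlns="http://www.w3.org/1999/xhtml" '
    b'xmlns="http://www.w3.org/1999/xhtml"></html>'
)

# The output from the working example, with a single xmlns.
SINGLE = b'<html xmlns="http://www.w3.org/1999/xhtml"/>'


def is_well_formed(data: bytes) -> bool:
    """Hypothetical helper: True if an XML parser accepts the bytes."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        # A repeated attribute name on one element is a well-formedness error.
        return False
    return True


print(is_well_formed(DUPLICATED))
print(is_well_formed(SINGLE))
```

Any conforming XML parser should reject the first document, so the failure is not specific to lxml's reader.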

Cheers,
Jens

Jens Troeger (jens.troeger) wrote :

I think it depends on the DOCTYPE itself:

>>> b = b'<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"></html>'
>>> lxml.etree.tostring(lxml.html.fromstring(b))
b'<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml"></html>'

and

>>> b = b'<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC -//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html xmlns="http://www.w3.org/1999/xhtml"></html>'
>>> lxml.etree.tostring(lxml.html.fromstring(b))
b'<html xmlns="http://www.w3.org/1999/xhtml"/>'

scoder (scoder) wrote :

I think it rather depends on the parser. HTML does not have namespaces, so

    <html xmlns="http://www.w3.org/1999/xhtml">

is parsed as a tag "html" with an attribute "xmlns", not as a tag with a namespace declaration. This seems correct.
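
The attribute-versus-declaration distinction can be illustrated with a small sketch. It deliberately uses Python's standard-library parsers (html.parser and xml.etree.ElementTree) as stand-ins, not lxml's HTML parser, so it shows the general idea rather than what libxml2 does internally. The class name StartTagCollector is made up for this example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

SOURCE = '<html xmlns="http://www.w3.org/1999/xhtml"></html>'


class StartTagCollector(HTMLParser):
    """Hypothetical helper that records each start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, dict(attrs)))


# HTML view: xmlns is just another attribute on the element.
collector = StartTagCollector()
collector.feed(SOURCE)
collector.close()
html_view = collector.start_tags[0]

# XML view: xmlns is consumed as a namespace declaration and becomes
# part of the qualified tag name; no attribute is left on the element.
root = ET.fromstring(SOURCE)
xml_view = (root.tag, dict(root.attrib))

print(html_view)
print(xml_view)
```

An element parsed the HTML way and then serialized by namespace-aware XML code can therefore carry "xmlns" in a form the serializer does not treat as a declaration.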

Changed in lxml:
status: New → Invalid

Jens Troeger (jens.troeger) wrote :

That answer doesn’t seem to address why lxml.etree.tostring() produces invalid XML in which an element has two attributes with the same name.

scoder (scoder) wrote :

My guess is that this is due to the doctype. It's not two attributes – it's one attribute called "xmlns", and a default namespace declaration.

Seems a case of "Doctor, doctor! It hurts when I do this ... ! – Then don't do it!"

Jens Troeger (jens.troeger) wrote :

I agree that the two different DOCTYPEs seem to be responsible for the two different behaviors. Is that managed by lxml, or by the underlying native libraries?

> […] it's one attribute called "xmlns", and a default namespace declaration.

That statement seems to disagree with the actual error

    lxml.etree.XMLSyntaxError: Attribute xmlns redefined, line 1, column 80

which treats both as attributes.

> Seems a case of "Doctor, doctor! It hurts when I do this ... ! – Then don't do it!"

Alas, I have no control over the XML that other people/frameworks produce.

Personally, I prefer to find and address the root of a problem instead of fudging input. In this case, the two DOCTYPEs lead to different serialized XML: one invalid (DOCTYPE for XHTML 1.0 Transitional) and one valid (DOCTYPE for XHTML 1.1). Which brings me back to my question above: is the DOCTYPE parsing/management handled by lxml?
