When an XML file has multiple aliases for a single namespace URI, the last alias encountered is the only one used

Bug #1915583 reported by Leonard Richardson
Affects: Beautiful Soup
Status: Confirmed
Importance: Low
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Original mailing list thread:

https://groups.google.com/g/beautifulsoup/c/4j0dMYJ48pw

Consider markup like this:

<package xmlns:opf="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="id">

The "http://www.idpf.org/2007/opf" namespace URI is given two different aliases: "opf" and "". Reading a file in and writing it out can modify tags in that namespace: if they came in with the first alias, they'll come out with the second. That's because Beautiful Soup keeps a mapping of namespace URIs to aliases, and the last one seen takes precedence.

This isn't invalid -- the XML document means the same thing as it did before -- but many processing tools rely on looking for specific namespace aliases rather than URIs. lxml is able to preserve the aliases, so it may be possible to do the same when Beautiful Soup uses the lxml parser, assuming the user doesn't mess with the aliases after parsing the document.

This probably requires tagging every Tag object with the alias it came in with, not just the namespace URI it came in with -- hopefully lxml makes this possible.
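
To make the description above concrete, here is a minimal sketch of the two approaches. It is not Beautiful Soup's actual code: `SketchTag`, `prefix_for_uri`, `name_from_table`, and `name_from_tag` are hypothetical names invented for illustration. It assumes only what the description says, that a single URI-to-alias mapping lets the last alias win, and that recording the alias on each tag would avoid that.

```python
from dataclasses import dataclass

OPF = "http://www.idpf.org/2007/opf"

# Declarations in the order a parser might report them: (prefix, URI).
declarations = [("opf", OPF), ("", OPF)]

# Approach 1: one global URI -> prefix table.
# A later declaration for the same URI overwrites the earlier one.
prefix_for_uri = {}
for prefix, uri in declarations:
    prefix_for_uri[uri] = prefix


@dataclass
class SketchTag:
    """Hypothetical stand-in for a parsed element; not bs4's Tag class."""
    name: str       # local name
    namespace: str  # namespace URI
    prefix: str     # prefix as written in the source ("" = default namespace)


def join(prefix, name):
    """Build a qualified name, omitting the colon when there is no prefix."""
    return f"{prefix}:{name}" if prefix else name


def name_from_table(tag):
    # Ignores the prefix the tag came in with and consults the global table.
    return join(prefix_for_uri[tag.namespace], tag.name)


def name_from_tag(tag):
    # Approach 2: use the prefix recorded on the tag itself.
    return join(tag.prefix, tag.name)


tag = SketchTag("meta", OPF, "opf")
print(name_from_table(tag))  # meta
print(name_from_tag(tag))    # opf:meta
```

The table-based lookup rewrites `opf:meta` as `meta` because the default-namespace declaration was recorded last; the per-tag version round-trips the name unchanged.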

Changed in beautifulsoup:
status: New → Confirmed
Leonard Richardson (leonardr) wrote :

We also have a smaller but much more serious problem that happens when a namespace's prefix is the empty string, as opposed to None. Attributes for that tag are output as ":foo" rather than "foo". This is not a problem for tag names, only attribute names.

This is fixed in revision 595. I'm leaving this issue open because it describes a real, though much less serious, problem.
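
The empty-string case can be illustrated with a small sketch. This is an assumption about how such a bug can arise, not the code changed in revision 595: `attr_name_buggy` and `attr_name_fixed` are hypothetical helpers, and the `is None` check is a guess at the kind of test that would treat "" as a real prefix.

```python
def attr_name_buggy(prefix, name):
    # Only None counts as "no prefix", so an empty-string prefix
    # still gets a colon prepended.
    if prefix is None:
        return name
    return prefix + ":" + name


def attr_name_fixed(prefix, name):
    # Treat both None and "" as "no prefix".
    if not prefix:
        return name
    return prefix + ":" + name


print(attr_name_buggy("", "foo"))    # :foo
print(attr_name_fixed("", "foo"))    # foo
print(attr_name_fixed("opf", "foo")) # opf:foo
```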

Leonard Richardson (leonardr) wrote :

Here's a script that demonstrates the problem:

---
from io import BytesIO

from bs4 import BeautifulSoup
from lxml import etree

markup = """<?xml version='1.0' encoding='UTF-8'?>

<package xmlns:opf="http://www.idpf.org/2007/opf" xmlns:opf2="http://www.idpf.org/2007/opf" xmlns="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/">
<metadata>
<dc:identifier opf:scheme="URI">http://www.gutenberg.org/2600</dc:identifier>
<dc:identifier opf2:scheme="URI">http://www.gutenberg.org/2600</dc:identifier>
<dc:identifier scheme="URI">http://www.gutenberg.org/2600</dc:identifier>
</metadata>
</package>
"""

# Beautiful Soup using the lxml XML parser.
print(BeautifulSoup(markup, 'xml'))
print("-" * 80)

# lxml on its own, for comparison.
parser = etree.XMLParser(remove_blank_text=True)
# Parse from bytes, since the markup carries an encoding declaration.
root = etree.parse(BytesIO(markup.encode("utf-8")), parser)
print(etree.tostring(root, encoding="unicode", pretty_print=True))
---

Output:

---
<?xml version="1.0" encoding="utf-8"?>
<opf2:package xmlns="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf" xmlns:opf2="http://www.idpf.org/2007/opf">
<opf2:metadata>
<dc:identifier opf2:scheme="URI">http://www.gutenberg.org/2600</dc:identifier>
<dc:identifier opf2:scheme="URI">http://www.gutenberg.org/2600</dc:identifier>
<dc:identifier scheme="URI">http://www.gutenberg.org/2600</dc:identifier>
</opf2:metadata>
</opf2:package>
--------------------------------------------------------------------------------
<package xmlns:opf="http://www.idpf.org/2007/opf" xmlns:opf2="http://www.idpf.org/2007/opf" xmlns="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <metadata>
    <dc:identifier opf:scheme="URI">http://www.gutenberg.org/2600</dc:identifier>
    <dc:identifier opf2:scheme="URI">http://www.gutenberg.org/2600</dc:identifier>
    <dc:identifier scheme="URI">http://www.gutenberg.org/2600</dc:identifier>
  </metadata>
</package>
---

lxml keeps opf, opf2, and the default namespace straight, even though they're all mapped to the same namespace URI.

Beautiful Soup replaces opf with opf2, though it keeps the default namespace straight.

Leonard Richardson (leonardr) wrote :

Fixing this would require a change to lxml's _SaxParserTarget interface. I filed https://bugs.launchpad.net/lxml/+bug/1915613 to track that request, but I suspect this is not a high priority and it won't happen anytime soon.

Changed in beautifulsoup:
importance: Undecided → Low
Leonard Richardson (leonardr) wrote :

It sounds like https://bugs.launchpad.net/lxml/+bug/1915613 won't be fixed, but I'll leave this ticket open until it's officially closed.
