When multiple prefixes are defined for a single namespace URI, a _SaxParserTarget can't know which prefix was originally used for a given element

Bug #1915613 reported by Leonard Richardson
This bug affects 1 person
Affects: lxml
Status: Triaged
Importance: Low
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

This report comes from a bug reported against my project, Beautiful Soup: https://bugs.launchpad.net/beautifulsoup/+bug/1915583

Here's the output of running the attached script:

===
Python : sys.version_info(major=3, minor=8, micro=5, releaselevel='final', serial=0)
lxml.etree : (4, 6, 2, 0)
libxml used : (2, 9, 10)
libxml compiled : (2, 9, 10)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)

lxml
<package xmlns:opf="http://www.idpf.org/2007/opf" xmlns:opf2="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <metadata>
    <dc:identifier opf:scheme="URI">a</dc:identifier>
    <dc:identifier opf2:scheme="URI">b</dc:identifier>
  </metadata>
</package>

--------------------------------------------------------------------------------
Beautiful Soup
<?xml version="1.0" encoding="utf-8"?>
<package xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf" xmlns:opf2="http://www.idpf.org/2007/opf">
<metadata>
<dc:identifier opf2:scheme="URI">a</dc:identifier>
<dc:identifier opf2:scheme="URI">b</dc:identifier>
</metadata>
</package>
===

The markup in the attached script defines two different prefixes for the namespace URI "http://www.idpf.org/2007/opf": "opf" and "opf2". When the markup is parsed through lxml, the original prefix of each tag and attribute is preserved. When the markup is parsed through Beautiful Soup, using lxml's _SaxParserTarget interface, this isn't always possible.

The "_SaxParserTarget._handleSaxStart" method is passed a namespace-qualified tag name, a dictionary mapping namespace-qualified attribute names to their values, and a dictionary mapping namespace prefixes to the corresponding URIs. Given this information, when multiple prefixes are defined for a given URI, it's not possible to determine which prefix was used in the original markup for a given tag or attribute name.
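
To make the loss concrete, here is a minimal sketch of what a parser target sees. It uses the standard library's xml.etree.ElementTree target interface as a stand-in, on the assumption that it reports qualified names the same way lxml's target interface does; RecordingTarget is a hypothetical name, not part of lxml or Beautiful Soup.

```python
import xml.etree.ElementTree as ET

# The markup from this report, with two prefixes bound to the same URI.
MARKUP = (
    '<package xmlns:opf="http://www.idpf.org/2007/opf" '
    'xmlns:opf2="http://www.idpf.org/2007/opf" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<metadata>'
    '<dc:identifier opf:scheme="URI">a</dc:identifier>'
    '<dc:identifier opf2:scheme="URI">b</dc:identifier>'
    '</metadata>'
    '</package>'
)


class RecordingTarget:
    """Hypothetical target that records every attribute name it is given."""

    def __init__(self):
        self.attribute_names = []

    def start(self, tag, attrib):
        # Names arrive already resolved to the '{uri}local' form.
        self.attribute_names.extend(attrib.keys())

    def end(self, tag):
        pass

    def data(self, text):
        pass

    def close(self):
        return self.attribute_names


parser = ET.XMLParser(target=RecordingTarget())
parser.feed(MARKUP)
seen = parser.close()  # returns whatever the target's close() returns
print(seen)
```

Both attributes come through as '{http://www.idpf.org/2007/opf}scheme', so nothing in the callback distinguishes the one written as "opf:scheme" from the one written as "opf2:scheme".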

In this case, lxml gives me the attribute name '{http://www.idpf.org/2007/opf}scheme' and I have no way of knowing whether the original prefix was "opf" or "opf2"; I just have to pick one.
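
As a rough illustration of "picking one", the sketch below inverts a prefix-to-URI mapping and uses it to rebuild a prefixed name. This is not Beautiful Soup's actual code; invert_nsmap and prefixed_name are hypothetical helpers, and the mapping is written out by hand rather than taken from the parser.

```python
# Prefix-to-URI mapping as declared on the <package> element.
NSMAP = {
    "opf": "http://www.idpf.org/2007/opf",
    "opf2": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def invert_nsmap(nsmap):
    """Map each URI to one prefix; a later prefix overwrites an earlier one."""
    return {uri: prefix for prefix, uri in nsmap.items()}


def prefixed_name(qualified_name, uri_to_prefix):
    """Turn '{uri}local' into 'prefix:local' using the inverted mapping."""
    if not qualified_name.startswith("{"):
        return qualified_name
    uri, local = qualified_name[1:].split("}", 1)
    prefix = uri_to_prefix.get(uri)
    return f"{prefix}:{local}" if prefix else local


uri_to_prefix = invert_nsmap(NSMAP)
name = prefixed_name("{http://www.idpf.org/2007/opf}scheme", uri_to_prefix)
print(name)  # opf2:scheme
```

With this particular inversion the last prefix declared for a URI wins, so every attribute in that namespace comes back as "opf2:scheme", which matches the Beautiful Soup output shown above. Any other tie-breaking rule would be equally arbitrary.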

I don't consider this a serious problem, but I wanted to bring it to your attention; maybe I've missed something in the _SaxParserTarget interface that would make it an easy fix.

Leonard Richardson (leonardr) wrote :
scoder (scoder) wrote :

Thanks for the report. I don't think there is an easy way to improve this. The interface uses ElementTree's qualified tag names. Resolving the prefixes here is intentional, in order to make it _easier_ for users to deal with namespaces. Doing so obviously discards some data that the parser has available, but since prefixes are not part of the XML information set, no document information is lost. Only round-trips suffer from this issue.

So, I do admit that it's an issue. But it seems a very rare issue and it isn't easy to do something about.
Sounds like a "won't fix".

Changed in lxml:
importance: Undecided → Low
status: New → Triaged