lxml

When multiple prefixes are defined for a single namespace URI, a _SaxParserTarget can't know which prefix was originally used for a given element

Bug #1915613 reported by Leonard Richardson on 2021-02-13

This bug affects 1 person

Affects    Status     Importance    Assigned to    Milestone
lxml       Triaged    Low           Unassigned

Bug Description

This report comes from a bug reported against my project, Beautiful Soup: https://bugs.launchpad.net/beautifulsoup/+bug/1915583

Here's the output of running the attached script:

===
Python : sys.version_info(major=3, minor=8, micro=5, releaselevel='final', serial=0)
lxml.etree : (4, 6, 2, 0)
libxml used : (2, 9, 10)
libxml compiled : (2, 9, 10)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)

lxml
<package xmlns:opf="http://www.idpf.org/2007/opf" xmlns:opf2="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <metadata>
    <dc:identifier opf:scheme="URI">a</dc:identifier>
    <dc:identifier opf2:scheme="URI">b</dc:identifier>
  </metadata>
</package>

--------------------------------------------------------------------------------
Beautiful Soup
<?xml version="1.0" encoding="utf-8"?>
<package xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf" xmlns:opf2="http://www.idpf.org/2007/opf">
<metadata>
<dc:identifier opf2:scheme="URI">a</dc:identifier>
<dc:identifier opf2:scheme="URI">b</dc:identifier>
</metadata>
</package>
===

The markup in the attached script defines two different prefixes for the namespace URI "http://www.idpf.org/2007/opf": "opf" and "opf2". When the markup is parsed through lxml, the original prefix of each tag and attribute is preserved. When the markup is parsed through Beautiful Soup, using lxml's _SaxParserTarget interface, this isn't always possible.

The "_SaxParserTarget._handleSaxStart" method is passed a namespace-qualified tag name, a dictionary mapping namespace-qualified attribute names to their values, and a dictionary mapping namespace prefixes to the corresponding URIs. Given this information, when multiple prefixes are defined for a given URI, it's not possible to determine which prefix was used in the original markup for a given tag or attribute name.
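
The information loss described above can be reproduced with a small parser target. This is only a sketch: it uses the standard library's xml.etree.ElementTree (Python 3.8 or later) as a stand-in for lxml, since both hand the target namespace-qualified names, and the RecordingTarget class is a hypothetical name introduced here for illustration. It is not the attached testparse.py.

```python
import xml.etree.ElementTree as ET

# Same shape as the markup in the report: two prefixes, one namespace URI.
MARKUP = (
    '<package xmlns:opf="http://www.idpf.org/2007/opf" '
    'xmlns:opf2="http://www.idpf.org/2007/opf" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<metadata>'
    '<dc:identifier opf:scheme="URI">a</dc:identifier>'
    '<dc:identifier opf2:scheme="URI">b</dc:identifier>'
    '</metadata>'
    '</package>'
)


class RecordingTarget:
    """Hypothetical target that records what the parser reports."""

    def __init__(self):
        self.prefixes_by_uri = {}   # namespace URI -> prefixes declared for it
        self.attribute_names = []   # attribute names exactly as delivered

    def start_ns(self, prefix, uri):
        # Called once per namespace declaration, before the element's start().
        self.prefixes_by_uri.setdefault(uri, []).append(prefix)

    def start(self, tag, attrib):
        # Names arrive as "{uri}local"; the original prefix is not included.
        self.attribute_names.extend(attrib)

    def end(self, tag):
        pass

    def data(self, text):
        pass

    def close(self):
        return self


parser = ET.XMLParser(target=RecordingTarget())
parser.feed(MARKUP)
result = parser.close()

print(result.attribute_names)
print(result.prefixes_by_uri)
```

Both "scheme" attributes reach the target under the same qualified name, while the namespace declarations show two candidate prefixes for that URI, so the target cannot tell which one each attribute was written with.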

In this case, lxml gives me the attribute name '{http://www.idpf.org/2007/opf}scheme' and I have no way of knowing whether the original prefix was "opf" or "opf2"; I just have to pick one.
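
To make "pick one" concrete, here is a minimal sketch of mapping a qualified name back to a prefixed name using only a prefix-to-URI dictionary. The helper names split_qualified_name and restore_prefix are hypothetical, and the last-declaration-wins policy is an assumption chosen for illustration; it is not taken from Beautiful Soup's source.

```python
def split_qualified_name(name):
    """Split '{uri}local' into (uri, local); names without a namespace give (None, name)."""
    if name.startswith("{"):
        uri, _, local = name[1:].partition("}")
        return uri, local
    return None, name


def restore_prefix(name, nsmap):
    """Turn '{uri}local' into 'prefix:local' using a prefix -> URI mapping.

    Inverting the mapping keeps one prefix per URI, so when several
    prefixes share a URI the one declared last silently wins.
    """
    uri, local = split_qualified_name(name)
    if uri is None:
        return name
    prefix_by_uri = {u: p for p, u in nsmap.items()}
    prefix = prefix_by_uri.get(uri)
    return f"{prefix}:{local}" if prefix else local


nsmap = {
    "opf": "http://www.idpf.org/2007/opf",
    "opf2": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

restored = restore_prefix("{http://www.idpf.org/2007/opf}scheme", nsmap)
print(restored)  # opf2:scheme
```

Under this policy every attribute in the shared namespace comes back as "opf2:scheme", whichever prefix the markup actually used.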

I don't consider this a serious problem, but I wanted to bring it to your attention; maybe I've missed something in the _SaxParserTarget interface that would make it an easy fix.

Leonard Richardson (leonardr) wrote on 2021-02-13:

Attachment: testparse.py (1.1 KiB, text/x-python)

scoder (scoder) wrote on 2021-02-18:

Thanks for the report. I don't think there is an easy way to improve this. The interface uses ElementTree's qualified tag names. Resolving the prefixes here is intentional, in order to make it _easier_ for users to deal with namespaces. Doing this obviously discards some data that the parser has available, but since prefixes are not part of the XML information set, no document information is lost. Only round-trips suffer from this issue.

So, I do admit that it's an issue. But it seems a very rare issue and it isn't easy to do something about.
Sounds like a "won't fix".
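
The point about prefixes not carrying document information can be illustrated with a short sketch. It uses the standard library's xml.etree.ElementTree, whose qualified-name convention is the one referred to above, rather than lxml itself; the two one-element documents are made up for the example.

```python
import xml.etree.ElementTree as ET

# Two documents that differ only in the prefix bound to the namespace URI.
with_opf = ET.fromstring(
    '<id xmlns:opf="http://www.idpf.org/2007/opf" opf:scheme="URI">a</id>'
)
with_opf2 = ET.fromstring(
    '<id xmlns:opf2="http://www.idpf.org/2007/opf" opf2:scheme="URI">a</id>'
)

# Once prefixes are resolved to "{uri}local" names, the two are indistinguishable.
same = (
    with_opf.tag == with_opf2.tag
    and with_opf.attrib == with_opf2.attrib
    and with_opf.text == with_opf2.text
)
print(same)
print(with_opf.attrib)
```

The parsed elements compare equal, which is why the difference only shows up when the document is serialized again and a prefix has to be chosen.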

Changed in lxml:
importance:	Undecided → Low
status:	New → Triaged
