`find` (and friends) handles `namespaces` badly

Bug #1318554 reported by Dieter Maurer
This bug affects 1 person
Affects: lxml
Status: Fix Released
Importance: Medium
Assigned to: scoder
Milestone: 3.1

Bug Description

LXML 2.3.6.0

I am working on an application where a tar archive containing XML files using different schemas must be processed (and the results written to a database). In this application, a call `elem.find('N:tag', namespaces=ns)` returns `None` even though the first child of `elem` matches (and `elem.xpath('N:tag', namespaces=ns)` returns this element).

The analysis revealed that `find` (and friends) uses the cache `lxml._elementpath._cache`, indexed by `path` only (ignoring `namespaces`). This cache gives the wrong `select` function, when the namespace bound to the `"N"` prefix has changed. The cache should not only take the path but also the `namespaces` into account.
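
To make the failure mode concrete, here is a minimal sketch of a path cache keyed by `path` alone versus one keyed by `path` plus `namespaces`. It is a toy model, not lxml's actual code: `_compile`, `compile_buggy`, `compile_fixed` and both cache dictionaries are hypothetical names, and the "compiled" result is just the path expanded to `{uri}tag` form.

```python
# Toy model only: these names are hypothetical and do not mirror
# lxml._elementpath internals.
_cache_by_path = {}
_cache_by_path_and_ns = {}


def _compile(path, namespaces):
    """Expand each 'prefix:tag' step into '{uri}tag' form."""
    def expand(step):
        if ":" in step:
            prefix, tag = step.split(":", 1)
            return "{%s}%s" % (namespaces[prefix], tag)
        return step
    return "/".join(expand(step) for step in path.split("/"))


def compile_buggy(path, namespaces):
    # The key ignores `namespaces`, so the first binding of a prefix
    # is reused for every later call with the same path.
    if path not in _cache_by_path:
        _cache_by_path[path] = _compile(path, namespaces)
    return _cache_by_path[path]


def compile_fixed(path, namespaces):
    # The key includes a hashable snapshot of the mapping
    # (assumes string prefixes, so the items can be sorted).
    key = (path, tuple(sorted(namespaces.items())))
    if key not in _cache_by_path_and_ns:
        _cache_by_path_and_ns[key] = _compile(path, namespaces)
    return _cache_by_path_and_ns[key]


buggy_first = compile_buggy("N:a", {"N": "x"})
buggy_second = compile_buggy("N:a", {"N": "ns"})  # stale entry is reused
fixed_first = compile_fixed("N:a", {"N": "x"})
fixed_second = compile_fixed("N:a", {"N": "ns"})
```

In this sketch `buggy_second` still refers to namespace `x`, which matches the symptom described above, while `fixed_second` picks up the new binding.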

The following small script shows the problem:
>>> from lxml.etree import XML
>>> e = XML("<root xmlns='ns'><a /></root>")
>>> e.find("N:a", namespaces=dict(N="x"))
>>> e.find("N:a", namespaces=dict(N="ns")) # wrongly returns `None`
>>> from lxml._elementpath import _cache; _cache.clear()
>>> e.find("N:a", namespaces=dict(N="ns")) # now the result is correct
<Element {ns}a at 0xb74f3e8c>

scoder (scoder) wrote :
Changed in lxml:
status: New → Invalid
Dieter Maurer (d.maurer) wrote :

I changed the status of this bug report to "new" again.

"find" should work correctly with its "namespaces" parameter, even though some people may prefer to avoid namespace prefixes. In my application, I use both "xpath" and "find" extensively, and it is really convenient (to say the least) to have consistent path expressions (rather than prefixes for "xpath" and "{...}" for "find").

Not working correctly with its `namespaces` parameter is a bug; a corresponding report is not invalid.

Changed in lxml:
status: Invalid → New
scoder (scoder) wrote :

Ah, thanks for the clarification. I actually consider the "namespaces" argument a bit of a quirk, as it doesn't really fit with the rest of the ElementPath API. I'd say it's mostly there for compatibility with ElementTree. Normal code should be using the nicer ElementPath syntax with its self-contained path expressions.

That being said, this bug has been fixed in lxml 3.1.

Changed in lxml:
assignee: nobody → scoder (scoder)
importance: Undecided → Medium
milestone: none → 3.1
status: New → Fix Released
Dieter Maurer (d.maurer) wrote :

scoder wrote:
> I actually consider the "namespaces" argument a bit of a quirk, as it doesn't really fit with the rest of the ElementPath API.

It fits well with `xpath` (which is also part of the "lxml.etree" API).

And the "namespaces" parameter is really handy, if you process XML documents with namespaces and more complex paths. Consider
`e.find("a/b/c", namespaces={None:URL})`; without `namespaces`, this would become
`e.find("{URL}a/{URL}b/{URL}c")` (which is far more difficult to read and understand).
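
As a hedged illustration of the readability point, the sketch below runs both spellings against the standard library's `xml.etree.ElementTree` rather than lxml, so it does not depend on any particular lxml version. The namespace URI, the sample document and the prefix `u` are placeholders; an explicit prefix is used instead of the `None` key.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI and document, for illustration only.
URL = "http://example.com/ns"
root = ET.fromstring(
    '<root xmlns="%s"><a><b><c>text</c></b></a></root>' % URL
)

# Prefixed path, with the prefix resolved through `namespaces`.
with_prefix = root.find("u:a/u:b/u:c", namespaces={"u": URL})

# The same path with every step written out as {uri}tag.
with_clark = root.find("{%s}a/{%s}b/{%s}c" % (URL, URL, URL))
```

Both calls select the same `c` element; the difference is only in how much of the namespace URI has to be repeated inside the path string.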

In my case, I work with complex documents governed by a modular DTD. I have
   ns = dict(
     N = "...", # top level namespace
     ce = "...", # common elements
     m = "...", # mathml
     b = "...", # bibliography
    ...
    )
and then use paths of the form "e.find[text]("N:head/ce:author/ce:name", namespaces=ns)".
The elementary "{...}" would render those expressions unreadable -- I would need
something like `"%(N)shead/%(ce)sauthor/%(ce)sname" % ns` which is a bit better but
still far away from the namespace prefixes.

scoder (scoder) wrote :

I don't really see how this is so much better than, say:

  ns = dict(
     N = "{...}", # top level namespace
     ce = "{...}", # common elements
     m = "{...}", # mathml
     b = "{...}", # bibliography
    ...
    )

  e.find("{N}head/{ce}author/{ce}name".format(**ns))

Moving lengthy paths into a constant isn't completely unprecedented either.
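
A runnable sketch of this formatting approach, again using the standard library's `xml.etree.ElementTree` so that it is self-contained. The `urn:example:...` URIs and the sample document are placeholders for the elided namespaces. Note that `str.format` fills named fields from keyword arguments, so the mapping is unpacked with `**ns`.

```python
import xml.etree.ElementTree as ET

# Placeholder namespaces, already wrapped in braces.
ns = {
    "N": "{urn:example:top}",      # top level namespace
    "ce": "{urn:example:common}",  # common elements
}

doc = ET.fromstring(
    '<N:root xmlns:N="urn:example:top" xmlns:ce="urn:example:common">'
    "<N:head><ce:author><ce:name>Ada</ce:name></ce:author></N:head>"
    "</N:root>"
)

# Build a self-contained path once; it can be stored as a constant.
NAME_PATH = "{N}head/{ce}author/{ce}name".format(**ns)

name = doc.find(NAME_PATH)
```

The resulting `NAME_PATH` carries its namespaces with it, so no `namespaces` argument is needed at the call site.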

In any case, find*() and xpath() support similar but different languages. That's by design. ElementPath gives you self-contained expressions that you can store away and reuse as you like. In XPath, you always need to remember what the actual meaning of each of the names was, in addition to how they are used in the expression. That leads to bugs like the one in this ticket, as well as confusion on the user side about how to get at the prefixes that their XPath expressions can use (people actually ask that). Just another unnecessary and confusing level of indirection.

scoder (scoder) wrote :

Oh, and note that `e.find("a/b/c", namespaces={None:URL})` doesn't actually work. There is no such thing as a default namespace in XPath, nor is there one in find*().
