lxml

`find` (and friends) handles `namespaces` badly

Bug #1318554 reported by Dieter Maurer on 2014-05-12

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
lxml	Fix Released	Medium	scoder	lxml 3.1

Bug Description

LXML 2.3.6.0

I am working on an application that must process a tar archive containing XML files that use different schemas (and write the results to a database). In this application, a call `elem.find('N:tag', namespaces=ns)` returns `None` even though the first child of `elem` matches (and `elem.xpath('N:tag', namespaces=ns)` returns this element).

The analysis revealed that `find` (and friends) uses the cache `lxml._elementpath._cache`, indexed by `path` only (ignoring `namespaces`). This cache gives the wrong `select` function, when the namespace bound to the `"N"` prefix has changed. The cache should not only take the path but also the `namespaces` into account.
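The proposed fix can be sketched in plain Python: the cache key covers the prefix bindings as well as the path. This is only an illustration; the names `_select_cache`, `cache_key`, `get_selector` and `compile_path` are hypothetical and are not lxml's actual internals.

```python
# Hypothetical sketch: key a selector cache on both the path and the
# prefix-to-URI bindings, so rebinding a prefix cannot reuse a stale entry.
_select_cache = {}


def cache_key(path, namespaces=None):
    """Build a hashable key from the path and the namespace mapping."""
    if not namespaces:
        return (path, None)
    # Sort the items so that equal mappings give equal keys; sort by repr
    # because a prefix may be None as well as a string.
    return (path, tuple(sorted(namespaces.items(), key=repr)))


def get_selector(path, namespaces, compile_path):
    """Return the cached selector, compiling it on a cache miss.

    `compile_path` stands in for whatever turns a path into a selector.
    """
    key = cache_key(path, namespaces)
    if key not in _select_cache:
        _select_cache[key] = compile_path(path, namespaces)
    return _select_cache[key]
```

With such a key, `"N:a"` with `N` bound to `"x"` and `"N:a"` with `N` bound to `"ns"` occupy separate cache slots.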

The following small script shows the problem:
>>> from lxml.etree import XML
>>> e = XML("<root xmlns='ns'><a /></root>")
>>> e.find("N:a", namespaces=dict(N="x"))
>>> e.find("N:a", namespaces=dict(N="ns")) # wrongly returns `None`
>>> from lxml._elementpath import _cache; _cache.clear()
>>> e.find("N:a", namespaces=dict(N="ns")) # now the result is correct
<Element {ns}a at 0xb74f3e8c>
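The same sequence can be turned into a small regression check. This is a sketch only: the helper name `find_respects_rebinding` is hypothetical, and it takes the parser function as an argument so that it can be run against `lxml.etree.XML` or, as assumed here so that it runs without lxml installed, the standard library's `xml.etree.ElementTree.XML`.

```python
from xml.etree import ElementTree


def find_respects_rebinding(parse_xml):
    """Return True if rebinding prefix "N" changes what find() returns."""
    root = parse_xml("<root xmlns='ns'><a /></root>")
    # First bind "N" to a namespace that matches nothing in the document.
    miss = root.find("N:a", namespaces={"N": "x"})
    # Then bind the same prefix to the namespace the document uses.
    hit = root.find("N:a", namespaces={"N": "ns"})
    return miss is None and hit is not None and hit.tag == "{ns}a"


# Assumption: the standard library parser is used here instead of lxml.
print(find_respects_rebinding(ElementTree.XML))
```

Passing `lxml.etree.XML` instead exercises lxml's own `find`; on an affected version the second lookup would be expected to return `None`, as in the session above.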

scoder (scoder) wrote on 2014-05-12:

http://lxml.de/tutorial.html#namespaces

Changed in lxml:
status:	New → Invalid

Dieter Maurer (d.maurer) wrote on 2014-05-12:

I changed the status of this bug report to "new" again.

"find" should work correctly with its "namespaces" parameter -- even though some people may prefer to avoid namespace prefixes. In my application, I use both "xpath" and "find" extensively, and it is really convenient (to say the least) to have consistent path expressions (rather than prefixes for "xpath" and "{...}" for "find").

Not working correctly with its `namespaces` parameter is a bug; a corresponding report is not invalid.

Changed in lxml:
status:	Invalid → New

scoder (scoder) wrote on 2014-05-12:

Ah, thanks for the clarification. I actually consider the "namespaces" argument a bit of a quirk, as it doesn't really fit with the rest of the ElementPath API. I'd say it's mostly there for compatibility with ElementTree. Normal code should be using the nicer ElementPath syntax with its self-contained path expressions.

That being said, this bug has been fixed in lxml 3.1.

Changed in lxml:
assignee:	nobody → scoder (scoder)
importance:	Undecided → Medium
milestone:	none → 3.1
status:	New → Fix Released

Dieter Maurer (d.maurer) wrote on 2014-05-12:

scoder wrote:
> I actually consider the "namespaces" argument a bit of a quirk, as it doesn't really fit with the rest of the ElementPath API.

It fits well with `xpath` (which is also part of the "lxml.etree" API).

And the "namespaces" parameter is really handy, if you process XML documents with namespaces and more complex paths. Consider
`e.find("a/b/c", namespaces={None:URL})`; without `namespaces`, this would become
`e.find("{URL}a/{URL}b/{URL}c")` (which is far more difficult to read and understand).

In my case, I work with complex documents governed by a modular DTD. I have
   ns = dict(
     N = "...", # top level namespace
     ce = "...", # "common elements"
     m = "...", # mathml
     b = "...", # bibliography
     ...
   )
and then use paths of the form "e.find[text]("N:head/ce:author/ce:name", namespaces=ns)".
The elementary "{...}" would render those expressions unreadable -- I would need
something like `"%(N)shead/%(ce)sauthor/%(ce)sname" % ns` which is a bit better but
still far away from the namespace prefixes.
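One way to keep prefixed paths readable while handing `find` a self-contained expression is to expand the prefixes up front. The helper below is a rough sketch: `expand_prefixes` is a hypothetical name, the `urn:example:...` URIs are placeholders, and the regular expression only handles simple `prefix:tag` steps (no predicates, no paths that already contain `{uri}` parts). The standard library's ElementTree is used so the example runs without lxml.

```python
import re
from xml.etree import ElementTree

# Matches a simple prefixed step such as "ce:author".
_PREFIXED = re.compile(r"([A-Za-z_][\w.-]*):([A-Za-z_][\w.-]*)")


def expand_prefixes(path, namespaces):
    """Rewrite "prefix:tag" steps as "{uri}tag" using the given bindings."""
    def replace(match):
        prefix, tag = match.groups()
        return "{%s}%s" % (namespaces[prefix], tag)
    return _PREFIXED.sub(replace, path)


# Placeholder namespace URIs, chosen for illustration only.
ns = {"N": "urn:example:top", "ce": "urn:example:common"}
path = expand_prefixes("N:head/ce:author/ce:name", ns)

doc = ElementTree.XML(
    '<N:root xmlns:N="urn:example:top" xmlns:ce="urn:example:common">'
    "<N:head><ce:author><ce:name>x</ce:name></ce:author></N:head>"
    "</N:root>"
)
print(path)
print(doc.findtext(path))
```

The expanded path no longer depends on any prefix mapping, so it can be stored and reused without passing `namespaces`.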

scoder (scoder) wrote on 2014-05-12:

I don't really see how this is so much better than, say:

  ns = dict(
     N = "{...}", # top level namespace
     ce = "{...}", # "common elements"
     m = "{...}", # mathml
     b = "{...}", # bibliography
     ...
  )

e.find("{N}head/{ce}author/{ce}name".format(**ns))

Moving lengthy paths into a constant isn't completely unprecedented either.
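A runnable sketch of that approach, with assumptions labeled: the `urn:example:...` URIs are placeholders, `NS` and `AUTHOR_NAME` are hypothetical names, and the standard library's ElementTree stands in for lxml. The mapping is unpacked with `**` because `str.format` fills named fields from keyword arguments.

```python
from xml.etree import ElementTree

# Placeholder URIs (assumptions); each value already carries its braces.
NS = {
    "N": "{urn:example:top}",
    "ce": "{urn:example:common}",
}

# The path is expanded once and kept as a module-level constant.
AUTHOR_NAME = "{N}head/{ce}author/{ce}name".format(**NS)

doc = ElementTree.XML(
    '<N:root xmlns:N="urn:example:top" xmlns:ce="urn:example:common">'
    "<N:head><ce:author><ce:name>Ada</ce:name></ce:author></N:head>"
    "</N:root>"
)
name = doc.findtext(AUTHOR_NAME)
print(name)
```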

In any case, find*() and xpath() support similar but different languages. That's by design. ElementPath gives you self-contained expressions that you can store away and reuse as you like. In XPath, you always need to remember what the actual meaning of each of the names was, in addition to how they are used in the expression. That leads to bugs like the one in this ticket, as well as confusion on the user side about how to get at the prefixes that their XPath expressions can use (people actually ask that). Just another unnecessary and confusing level of indirection.

scoder (scoder) wrote on 2014-05-12:

Oh, and note that `e.find("a/b/c", namespaces={None:URL})` doesn't actually work. There is no such thing as a default namespace in XPath, nor is there one in find*().
