find_all fails on namespaced documents when using the lxml parser

Bug #1723783 reported by Staffan Malmgren
This bug affects 2 people
Affects: Beautiful Soup
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

It seems the optimization introduced after https://bugs.launchpad.net/beautifulsoup/+bug/1655332 made find_all() fail when searching for namespaced elements, while find() still works.

A minimal testcase:

    from bs4 import BeautifulSoup

    doc = """<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
      <w:body>
        <w:p>hello</w:p>
      </w:body>
    </w:document>
    """

    soup = BeautifulSoup(doc, "lxml")
    assert soup.find("w:p")                   # returns the expected node
    assert soup.find_all("w:p", limit=99999)  # workaround: returns a list with the single node
    assert soup.find_all("w:p")               # fails: returns an empty list

As shown above, a workaround is to pass a limit argument to find_all(), which avoids the optimized branch in the code.
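
Since the document is really WordprocessingML, i.e. XML rather than HTML, another option may be to parse it with Beautiful Soup's XML parser, which is namespace-aware. A minimal sketch, assuming lxml is installed and that the prefixed-name search behaves the same way under the XML parser (the expected output is an assumption, not captured output):

    from bs4 import BeautifulSoup

    doc = """<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
      <w:body>
        <w:p>hello</w:p>
      </w:body>
    </w:document>
    """

    # "xml" selects lxml's XML parser, which tracks namespace declarations.
    soup = BeautifulSoup(doc, "xml")
    print(soup.find_all("w:p"))  # expected: [<w:p>hello</w:p>]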

Leonard Richardson (leonardr) wrote:

Thanks for filing this bug and pinpointing the failure. The optimization in revision 442 assumes that all namespaced tags belong to XML documents, and your example document was parsed with lxml's HTML parser. HTML documents in Beautiful Soup are not namespace-aware, so you ended up with an HTML document with funny-looking tag names. The optimization shouldn't be applied in that case.
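
To make the distinction concrete, here is a short sketch of how the two parsers represent the same tag; the values in the comments are expectations, not output captured for this report:

    from bs4 import BeautifulSoup

    doc = """<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
      <w:body><w:p>hello</w:p></w:body>
    </w:document>
    """

    # lxml's HTML parser: no namespace handling, so "w:p" is a literal tag name.
    html_tag = BeautifulSoup(doc, "lxml").find("w:p")
    print(html_tag.name, html_tag.prefix)  # expected: w:p None

    # lxml's XML parser: the w prefix is resolved via the xmlns declaration.
    xml_tag = BeautifulSoup(doc, "xml").find("w:p")
    print(xml_tag.name, xml_tag.prefix)    # expected: p w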

The fix is in revision 466.

Changed in beautifulsoup:
status: New → Confirmed
status: Confirmed → Fix Committed
Changed in beautifulsoup:
status: Fix Committed → Fix Released