find_all fails on namespaced documents when using the lxml parser
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Fix Released | Undecided | Unassigned |
Bug Description
It seems this is caused by the optimization introduced after https:/

A minimal testcase:
from bs4 import BeautifulSoup
doc = """<w:document xmlns:w="http://
<w:body>
</w:body>
</w:document>
"""
soup = BeautifulSoup(doc, "lxml")
assert(
assert(
assert(
As indicated, a workaround is to pass a limit parameter to find_all, to avoid the optimized branch in the code.
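Since the assertions in the report are truncated, here is a hedged reconstruction of the testcase as a runnable sketch. The namespace URI and the exact assertions are assumptions (the originals are cut off above), but they follow the workaround described: passing a `limit` argument to `find_all` avoids the optimized branch, so on an affected version the two calls below would disagree, while on a fixed version they match.

```python
# Hedged reconstruction of the truncated testcase; the namespace URI and
# the assertions are assumptions, not the original report's exact values.
from bs4 import BeautifulSoup

doc = """<w:document xmlns:w="http://example.com/ns">
<w:body>
</w:body>
</w:document>
"""

soup = BeautifulSoup(doc, "lxml")

# The workaround branch: an explicit limit bypasses the optimization.
with_limit = soup.find_all("w:body", limit=1)

# The optimized branch: before the fix, this returned [] for
# namespaced tag names in an lxml-parsed HTML document.
without_limit = soup.find_all("w:body")

# On a version with the fix released, both branches agree.
assert with_limit == without_limit
assert len(without_limit) == 1
```

The tag name "w:body" is matched literally here because, as explained below in the maintainer's reply, HTML parsing in Beautiful Soup is not namespace-aware.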
Changed in beautifulsoup:
status: New → Confirmed
status: Confirmed → Fix Committed
status: Fix Committed → Fix Released
Thanks for filing this bug and pinpointing the point of failure. The optimization in revision 442 assumes that all namespaced tags belong to XML documents, and your example document was parsed with lxml's HTML parser. HTML documents in Beautiful Soup are not namespace-aware, so you ended up with an HTML document with funny-looking tag names. The optimization shouldn't be applied in that case.
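The distinction described here can be illustrated with a small sketch (the namespace URI is a placeholder, not the one from the original report): parsing the same markup with lxml's HTML parser yields a literal, "funny-looking" tag name, while Beautiful Soup's XML parsing (also via lxml) is namespace-aware and records the prefix separately.

```python
# Sketch contrasting HTML and XML parsing of namespaced markup in
# Beautiful Soup; the namespace URI is a placeholder assumption.
from bs4 import BeautifulSoup

doc = '<w:document xmlns:w="http://example.com/ns"><w:body></w:body></w:document>'

# HTML parsing is not namespace-aware: "w:body" is simply the literal
# tag name, and that is what you must search for.
html_tag = BeautifulSoup(doc, "lxml").find("w:body")

# XML parsing is namespace-aware: the tag's name is "body", and "w" is
# stored separately as its prefix.
xml_tag = BeautifulSoup(doc, "xml").find("body")
```

This is why the optimization's assumption (namespaced tag name implies an XML document) broke on the example document: it was parsed as HTML, where the prefix is just part of the name.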
The fix is in revision 466.