find_all fails on namespaced documents when using the lxml parser
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Fix Released | Undecided | Unassigned |
Bug Description
It seems this is caused by the optimization introduced after https:/

A minimal testcase:
from bs4 import BeautifulSoup
doc = """<w:document xmlns:w="http://
<w:body>
</w:body>
</w:document>
"""
soup = BeautifulSoup(doc, "lxml")
assert(
assert(
assert(
As indicated, a workaround is to pass a limit parameter to find_all, to avoid the optimized branch in the code.
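Since the assertions in the report are truncated, here is a hedged reconstruction of the testcase as a runnable sketch. The namespace URI and the exact assertions are assumptions (the originals are cut off above), but they follow the workaround described: passing a `limit` argument to `find_all` avoids the optimized branch, so on an affected version the two calls below would disagree, while on a fixed version they match.

```python
# Hedged reconstruction of the truncated testcase; the namespace URI and
# the assertions are assumptions, not the original report's exact values.
from bs4 import BeautifulSoup

doc = """<w:document xmlns:w="http://example.com/ns">
<w:body>
</w:body>
</w:document>
"""

soup = BeautifulSoup(doc, "lxml")

# The workaround branch: an explicit limit bypasses the optimization.
with_limit = soup.find_all("w:body", limit=1)

# The optimized branch: before the fix, this returned [] for
# namespaced tag names in an lxml-parsed HTML document.
without_limit = soup.find_all("w:body")

# On a version with the fix released, both branches agree.
assert with_limit == without_limit
assert len(without_limit) == 1
```

The tag name "w:body" is matched literally here because, as explained below in the maintainer's reply, HTML parsing in Beautiful Soup is not namespace-aware.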
Changed in beautifulsoup:
status: New → Confirmed
status: Confirmed → Fix Committed
status: Fix Committed → Fix Released
Thanks for filing this bug and pinpointing the point of failure. The optimization in revision 442 assumes that all namespaced tags belong to XML documents, and your example document was parsed with lxml's HTML parser. HTML documents in Beautiful Soup are not namespace-aware, so you ended up with an HTML document with funny-looking tag names. The optimization shouldn't be applied in that case.
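The distinction described here can be illustrated with a small sketch (the namespace URI is a placeholder, not the one from the original report): parsing the same markup with lxml's HTML parser yields a literal, "funny-looking" tag name, while Beautiful Soup's XML parsing (also via lxml) is namespace-aware and records the prefix separately.

```python
# Sketch contrasting HTML and XML parsing of namespaced markup in
# Beautiful Soup; the namespace URI is a placeholder assumption.
from bs4 import BeautifulSoup

doc = '<w:document xmlns:w="http://example.com/ns"><w:body></w:body></w:document>'

# HTML parsing is not namespace-aware: "w:body" is simply the literal
# tag name, and that is what you must search for.
html_tag = BeautifulSoup(doc, "lxml").find("w:body")

# XML parsing is namespace-aware: the tag's name is "body", and "w" is
# stored separately as its prefix.
xml_tag = BeautifulSoup(doc, "xml").find("body")
```

This is why the optimization's assumption (namespaced tag name implies an XML document) broke on the example document: it was parsed as HTML, where the prefix is just part of the name.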
The fix is in revision 466.