Superfluous data from a get_text
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Beautiful Soup | Invalid | Undecided | Unassigned | | |
Bug Description
I'm simply listing all the Payees in an XML file. The data looks like this:
<PAYEE matchingenabled="0" email="" name="Rodwells" reference="" id="P000727">
<ADDRESS street="" telephone="" state="" city="" postcode=""/>
</PAYEE>
This code works fine:
-------
#!/usr/bin/python
from bs4 import BeautifulSoup
with open('testxml.xml', 'r') as f:
    file = f.read()
soup = BeautifulSoup(file, 'xml')
for tag in soup.find_
print(
-------------
BUT every time I run the script it produces 2 superfluous lines at the end:
=========
PAYEE: <bound method PageElement.
PAYEE: <bound method PageElement.
===========
yet those 2 payees have already been displayed/listed. The 2 payee records are not near the end of the XML file either, as the XML file has 1051 Payee records in it.
Thanks for taking the time to file this bug.
There are two problems here. The first is that get_text is a method, not a property, so you're not getting any text out of it. But that's probably an issue with the code you wrote to file this bug, and not the underlying issue.
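To illustrate the first point, here is a minimal sketch of the difference between referencing get_text and calling it. The sample markup, including the text inside the tag, is made up for illustration. It uses the built-in html.parser so that it runs without lxml; that parser lowercases tag names, whereas the 'xml' parser used in the report would keep PAYEE in upper case.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data, not taken from the reporter's file.
doc = '<payee id="P000727" name="Rodwells">Rodwells Ltd</payee>'
soup = BeautifulSoup(doc, "html.parser")
tag = soup.find("payee")

# Without parentheses this is the method object itself, which prints
# as "<bound method ...>" instead of any text from the document.
method_ref = tag.get_text

# With parentheses the method is called and returns a string.
text = tag.get_text()

# Attribute values such as name="..." are read by subscripting the tag.
payee_name = tag["name"]

print("reference:", str(method_ref)[:13])
print("call:", text)
print("attribute:", payee_name)
```

Note that in the sample from the report the payee name lives in the name attribute rather than in the tag's text, so subscripting the tag may be closer to what was intended.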
The second problem is the one you reported, the duplicate tags at the end of the document. Repeating data (to avoid the risk of losing any) is a common parser technique when handling invalid markup. I associate this technique mainly with the html5lib parser, but lxml-xml might do it too.
Without seeing the original document I can't go further, but the solution might involve passing recursive=False into the find_all() call. The first <PAYEE id="P000242"> tag could be inside another PAYEE tag that wasn't closed properly, and passing in recursive=False would skip it. Another possibility would be writing your code defensively to track which PAYEE ids have been seen and skip duplicates.
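Both suggestions can be sketched roughly as follows. The document below is invented to mimic a mis-nested and a repeated PAYEE tag, and the payees wrapper element and the unique_payees helper are hypothetical names, not part of Beautiful Soup or of the reporter's file. The built-in html.parser is used so the sketch runs without lxml, which is why the tag names are lower case.

```python
from bs4 import BeautifulSoup

# Hypothetical document: one payee nested inside another, and one
# payee (P000001) that appears twice.
doc = (
    '<payees>'
    '<payee id="P000001" name="Alpha">'
    '<payee id="P000002" name="Beta"></payee>'
    '</payee>'
    '<payee id="P000003" name="Gamma"></payee>'
    '<payee id="P000001" name="Alpha"></payee>'
    '</payees>'
)
soup = BeautifulSoup(doc, "html.parser")
root = soup.find("payees")

# recursive=False only looks at direct children of root, so the
# payee nested inside another payee is skipped.
direct = root.find_all("payee", recursive=False)
direct_ids = [tag["id"] for tag in direct]


def unique_payees(tags):
    """Yield each tag once, keyed on its id attribute (hypothetical helper)."""
    seen = set()
    for tag in tags:
        payee_id = tag.get("id")
        if payee_id in seen:
            continue  # already listed, skip the repeat
        seen.add(payee_id)
        yield tag


# Defensive variant: search recursively, then drop repeated ids.
unique_ids = [tag["id"] for tag in unique_payees(root.find_all("payee"))]

print(direct_ids)
print(unique_ids)
```

Which of the two fits depends on how the real file is malformed: recursive=False hides nested tags but keeps repeats at the same level, while tracking ids keeps every distinct payee regardless of nesting.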
Whatever it is, the issue probably can't be fixed by changing the Beautiful Soup codebase, but if you can share the document I can take a look.