Superfluous data from a get_text

Bug #2025944 reported by Peter R
This bug affects 1 person
Affects: Beautiful Soup
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

I'm simply listing all the payees in an XML file. The data looks like this:

<PAYEE matchingenabled="0" email="" name="Rodwells" reference="" id="P000727">
   <ADDRESS street="" telephone="" state="" city="" postcode=""/>
  </PAYEE>

This code works fine:

-------------------------
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('testxml.xml', 'r') as f:
    file = f.read()

soup = BeautifulSoup(file, 'xml')

for tag in soup.find_all('PAYEE'):
    print(f'{tag.name}: {tag.get_text}')

-------------

But every time I run the script, it produces two superfluous lines at the end:

=========

PAYEE: <bound method PageElement.get_text of <PAYEE id="P000242"/>>
PAYEE: <bound method PageElement.get_text of <PAYEE id="P000344"/>>

===========

Yet those two payees have already been listed earlier in the output. The two payee records are not near the end of the XML file either; the file has 1051 payee records in it.

Leonard Richardson (leonardr) wrote :

Thanks for taking the time to file this bug.

There are two problems here. The first is that get_text is a method, not a property, so you're not getting any text out of it. But that's probably an issue with the code you wrote to file this bug, and not the underlying issue.
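To illustrate the first point, here is a minimal sketch of the difference between referencing get_text and calling it. The sample document is made up for the illustration (the NOTE element is hypothetical, added only so there is some text to extract), and it assumes the lxml-backed 'xml' parser used in the report is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data; the NOTE child is invented for this illustration.
xml = '<ROOT><PAYEE id="P000727" name="Rodwells"><NOTE>hello</NOTE></PAYEE></ROOT>'
soup = BeautifulSoup(xml, 'xml')
tag = soup.find('PAYEE')

# Without parentheses you get the bound method object itself, which is
# what shows up as "<bound method ...>" when formatted into a string.
unbound = tag.get_text

# With parentheses the method is called and returns the element's text.
text = tag.get_text()

# The payee details in the report live in attributes, not in text,
# so attribute access is likely what is wanted for the name.
name = tag.get('name')

print(f'{tag.name}: {text!r} (name attribute: {name})')
```

Note that a PAYEE element like the one in the report has no text content of its own, so get_text() would return only whitespace for it; the name is stored in the name attribute.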

The second problem is the one you reported, the duplicate tags at the end of the document. Repeating data (to avoid the risk of losing any) is a common parser technique when handling invalid markup. I associate this technique mainly with the html5lib parser, but lxml-xml might do it too.

Without seeing the original document I can't go further, but the solution might involve passing recursive=False into the find_all() call. The first <PAYEE id="P000242"> tag could be inside another PAYEE tag that wasn't closed properly, and passing in recursive=False would skip it. Another possibility would be writing your code defensively to track which PAYEE ids have been seen and skip duplicates.
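Both suggestions can be sketched roughly as follows. This is only an illustration under assumptions: the ROOT wrapper and the nesting are invented, since the original document was not shared. Note that recursive=False only looks at direct children of the element it is called on, so it has to be called on the containing element rather than on the soup object itself.

```python
from bs4 import BeautifulSoup

# Hypothetical document: one PAYEE is nested inside another, and the
# same id also appears again as a direct child of the root.
xml = ('<ROOT>'
       '<PAYEE id="P000727"><PAYEE id="P000242"/></PAYEE>'
       '<PAYEE id="P000242"/>'
       '</ROOT>')
soup = BeautifulSoup(xml, 'xml')
root = soup.find('ROOT')

# Default (recursive) search finds every PAYEE at any depth.
all_ids = [t['id'] for t in soup.find_all('PAYEE')]

# Option 1: only direct children of the chosen parent.
top_level_ids = [t['id'] for t in root.find_all('PAYEE', recursive=False)]

# Option 2: defensive de-duplication by id, keeping first occurrences.
seen = set()
unique_ids = []
for t in soup.find_all('PAYEE'):
    if t['id'] in seen:
        continue
    seen.add(t['id'])
    unique_ids.append(t['id'])

print(all_ids)
print(top_level_ids)
print(unique_ids)
```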

Whatever it is, the issue probably can't be fixed by changing the Beautiful Soup codebase, but if you can share the document I can take a look.

Peter R (forums-oygle) wrote :

Thanks for your reply. I cut down the XML file to include only the PAYEE elements and the problem disappeared. I could see no incomplete tags, so I checked the data in a hex editor. I then found one of the payees' data right near the end of the XML file: the <PAYEE id="P000242"/> entry.

It was a 'child' of a different parent, so I will need to read up and learn some more about how to exclude other parents, and force the find to only do the 'find_all' through the 'parent' PAYEE, not any children or siblings.
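One possible way to restrict the search to a single parent is sketched below. The element names PAYEES and OTHER are placeholders, since the real structure of the file is not shown in this report; the idea is to locate the intended parent first and then search only its direct children.

```python
from bs4 import BeautifulSoup

# Hypothetical layout: PAYEES and OTHER are placeholder element names.
xml = """<ROOT>
  <PAYEES>
    <PAYEE id="P000242" name="First"/>
    <PAYEE id="P000344" name="Second"/>
  </PAYEES>
  <OTHER>
    <PAYEE id="P000242"/>
  </OTHER>
</ROOT>"""
soup = BeautifulSoup(xml, 'xml')

# Searching the whole document also picks up the PAYEE under OTHER.
everywhere = [t['id'] for t in soup.find_all('PAYEE')]

# Find the intended parent, then look only at its direct children.
container = soup.find('PAYEES')
scoped = [t['id'] for t in container.find_all('PAYEE', recursive=False)]

# Alternative: search everywhere but keep only tags with the right parent.
by_parent = [t['id'] for t in soup.find_all('PAYEE')
             if t.parent.name == 'PAYEES']

print(everywhere)
print(scoped)
print(by_parent)
```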

Thanks for your help. It was a problem with my code, which did not cater for the data actually in the file.

Peter R (forums-oygle) wrote :

Can you please close the bug, as it's not a bug.

Leonard Richardson (leonardr) wrote :

Great, glad I could help.

Changed in beautifulsoup:
status: New → Invalid