Comment 1 for bug 2025944

Leonard Richardson (leonardr) wrote :

Thanks for taking the time to file this bug.

There are two problems here. The first is that get_text is a method, not a property, so unless you call it as get_text() you get the method object back rather than any text. But that's probably an issue with the code you wrote to file this bug, and not the underlying issue.
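To illustrate the first point, here is a minimal sketch of the difference. The markup and variable names are made up for the example, and it uses the built-in html.parser rather than the lxml-xml parser from the report, so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only for illustration.
soup = BeautifulSoup("<p>Hello, <b>world</b></p>", "html.parser")
tag = soup.p

# Without parentheses you get the bound method itself, not a string.
not_text = tag.get_text

# Calling the method returns the text of the tag and its descendants.
text = tag.get_text()

# The .text property is a shortcut that calls get_text() for you.
same_text = tag.text

print(callable(not_text))
print(text)
```

If the attribute form is preferred, .text is the property; get_text() is the method and accepts arguments such as a separator.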

The second problem is the one you reported, the duplicate tags at the end of the document. Repeating data (to avoid the risk of losing any) is a common parser technique when handling invalid markup. I associate this technique mainly with the html5lib parser, but lxml-xml might do it too.

Without seeing the original document I can't go further, but the solution might involve passing recursive=False into the find_all() call. The first <PAYEE id="P000242"> tag could be inside another PAYEE tag that wasn't closed properly, and passing in recursive=False would skip it. Another possibility would be writing your code defensively to track which PAYEE ids have been seen and skip duplicates.
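Both suggestions can be sketched roughly as follows. This is only an illustration under assumptions: the document below is invented (the original was not shared), the payees wrapper element and the unique_by_id helper are hypothetical names, and the example uses html.parser, which lowercases tag names, so the search is for "payee" rather than "PAYEE".

```python
from bs4 import BeautifulSoup

# Invented document: one PAYEE ends up nested inside another, as it might
# if a tag was not closed properly, and id P000242 appears twice.
doc = """
<payees>
  <payee id="P000241"><payee id="P000242"></payee></payee>
  <payee id="P000242"></payee>
</payees>
"""
soup = BeautifulSoup(doc, "html.parser")
root = soup.find("payees")

# Default search is recursive, so the nested tag is found as well.
all_ids = [t.get("id") for t in root.find_all("payee")]

# recursive=False only looks at direct children of root,
# which skips the nested tag.
direct_ids = [t.get("id") for t in root.find_all("payee", recursive=False)]


def unique_by_id(tags):
    """Yield each tag whose id has not been seen yet (hypothetical helper)."""
    seen = set()
    for tag in tags:
        tag_id = tag.get("id")
        if tag_id in seen:
            continue  # duplicate id: skip it
        seen.add(tag_id)
        yield tag


# Defensive alternative: keep the recursive search but drop repeated ids.
unique_ids = [t.get("id") for t in unique_by_id(root.find_all("payee"))]

print(all_ids)
print(direct_ids)
print(unique_ids)
```

Which of the two fits better depends on the real document: recursive=False only helps if the wanted tags are all direct children of the element being searched, while tracking seen ids works regardless of nesting but keeps whichever copy comes first.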

Whatever it is, the issue probably can't be fixed by changing the Beautiful Soup codebase, but if you can share the document I can take a look.