Inconsistent Results During Nested Find Operation

Bug #1520000 reported by Meawoppl
This bug affects 1 person
Affects: Beautiful Soup
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Here is a reproduction:
http://paste.ubuntu.com/13508506/

In short, when I use Beautiful Soup to parse an SVG file (inlined for convenience), it occasionally drops some of the symbols that I am trying to extract. The function always runs correctly the first time it is called, but subsequent invocations can produce strange results. The script linked above runs the same function six times. The function contains a pair of nested find_all calls and their corresponding loops. The first three calls are identical, but they return different results:

(For reference, a "strike" here corresponds 1:1 to a symbol/character in the SVG file being "struck" onto the surface.)

----
32 unique symbols
44 strikes
----
32 unique symbols
38 strikes
----
32 unique symbols
0 strikes

WEIRD!!!!
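
For readers without the paste at hand, here is a minimal sketch of the kind of function described above: a pair of nested find_all loops, called repeatedly on the same input. The SVG snippet, the function name count_strikes, and the symbol/use structure are all assumptions made for illustration; they are not taken from the original script. The parser defaults to "html.parser" so the sketch runs with bs4 alone; pass "lxml" (requires lxml) to exercise the builder discussed in this report.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the inlined SVG: two symbols, three uses.
SVG = """<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:xlink="http://www.w3.org/1999/xlink">
  <defs>
    <symbol id="glyph-a"><path d="M0 0L1 1"/></symbol>
    <symbol id="glyph-b"><path d="M0 0L2 2"/></symbol>
  </defs>
  <g><use xlink:href="#glyph-a"/><use xlink:href="#glyph-b"/></g>
  <g><use xlink:href="#glyph-a"/></g>
</svg>"""


def count_strikes(markup, parser="html.parser"):
    """Return (unique symbols, strikes) using nested find_all loops."""
    soup = BeautifulSoup(markup, parser)
    symbols = {sym["id"] for sym in soup.find_all("symbol")}
    strikes = 0
    for group in soup.find_all("g"):        # outer find_all
        for use in group.find_all("use"):   # inner find_all
            if use.get("xlink:href", "").lstrip("#") in symbols:
                strikes += 1
    return len(symbols), strikes


# Identical calls on identical input should give identical results.
results = [count_strikes(SVG) for _ in range(3)]
for n_symbols, n_strikes in results:
    print("----")
    print(n_symbols, "unique symbols")
    print(n_strikes, "strikes")
```

Comparing the tuples in results against each other is a quick way to spot the run-to-run drift shown above.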

Anyway, I discovered through various fidgeting that this doesn't happen if I first convert the markup into a bytes object. After that, the call always produces the same results.

Bytesified
----
32 unique symbols
44 strikes
----
32 unique symbols
44 strikes
----
32 unique symbols
44 strikes
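
The bytes workaround could look roughly like the sketch below. The helper name parse_as_bytes and the sample markup are hypothetical, and UTF-8 is an assumed encoding; the report does not say how the conversion was done. The parser again defaults to "html.parser" so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup


def parse_as_bytes(markup, parser="html.parser"):
    """Hand Beautiful Soup bytes instead of str (hypothetical helper)."""
    if isinstance(markup, str):
        # Assumption: UTF-8 is a safe encoding for this markup.
        data = markup.encode("utf-8")
        return BeautifulSoup(data, parser, from_encoding="utf-8")
    # Already bytes: let Beautiful Soup detect the encoding itself.
    return BeautifulSoup(markup, parser)


svg_text = '<svg><g><use xlink:href="#glyph-a"/></g></svg>'
soup_from_str = parse_as_bytes(svg_text)
soup_from_bytes = parse_as_bytes(svg_text.encode("utf-8"))
print(len(soup_from_str.find_all("use")), len(soup_from_bytes.find_all("use")))
```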

The XML seems legit, and passes the simple checks I have run it through. Even if it didn't, I would expect stable output.
I have no idea where to dig on this one. This is the newest version of bs4 on 64-bit Ubuntu Linux via Anaconda:

In [2]: bs4.__version__
Out[2]: '4.4.1'

Any thoughts/leads appreciated!

Meawoppl (meawoppl) wrote :

Also, happy thanksgiving!

Leonard Richardson (leonardr) wrote :

Unfortunately I can't duplicate this. When I run your script I always get 44 strikes. This is with Python 3.4.3 and lxml 3.4.4.

If the problem only happens with the lxml tree builder, it could be a problem with lxml. lxml is a C extension, so there are many more chances for something mysterious to happen. I've run into problems with lxml's HTML parser handling Unicode data in the past.

Since SVG is XML, not HTML, you might try using the 'xml' tree builder rather than the 'lxml' tree builder (which uses lxml's HTML parser) and see if you get different results.
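
Switching tree builders is a one-argument change, as in the sketch below. The SVG snippet and the helper name parse_svg are made up for illustration. The 'xml' builder needs lxml installed; the fallback to "html.parser" is there only so the sketch still runs without lxml, and is not part of the suggestion above.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical SVG input: two symbols referenced by three <use> elements.
SVG = """<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:xlink="http://www.w3.org/1999/xlink">
  <defs>
    <symbol id="glyph-a"><path d="M0 0L1 1"/></symbol>
    <symbol id="glyph-b"><path d="M0 0L2 2"/></symbol>
  </defs>
  <g><use xlink:href="#glyph-a"/><use xlink:href="#glyph-b"/></g>
  <g><use xlink:href="#glyph-a"/></g>
</svg>"""


def parse_svg(markup):
    """Prefer the 'xml' tree builder (lxml's XML parser) for SVG input."""
    try:
        return BeautifulSoup(markup, "xml")
    except FeatureNotFound:
        # lxml is not installed; fall back so the sketch still runs.
        return BeautifulSoup(markup, "html.parser")


soup = parse_svg(SVG)
hrefs = [use["xlink:href"] for use in soup.find_all("use")]
print(soup.builder.NAME, hrefs)
```

Printing soup.builder.NAME shows which builder actually handled the document, which is worth recording when comparing results across machines.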

Changed in beautifulsoup:
status: New → Incomplete
Leonard Richardson (leonardr) wrote :

I'm closing this issue without providing a fix, for two reasons. First, lxml parses HTML using HTML 4 rules. SVG is XML, not HTML, and there are no rules for embedding it in HTML 4. So there's no reason to expect that the case under consideration should work correctly.

Of course, even if this doesn't work correctly, it should work consistently. The inconsistent behavior here strongly indicates a problem with lxml, not with Beautiful Soup.

Changed in beautifulsoup:
status: Incomplete → Won't Fix