lxml parser behaves incorrectly in Windows

Bug #1417011 reported by wlnirvana
This bug affects 2 people
Affects: Beautiful Soup
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned

Bug Description

The following script works on Linux but fails on Windows. I'm pasting my code and its output directly here, since I want to reserve the attachment for the HTML file.

By the way, I found that if I use 'html.parser' as the backend, Beautiful Soup parses the document correctly. Perhaps this is an lxml-specific problem?

parse.py:
-----------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup

with open('foobar.html', 'r') as f:
    body = f.read()

soup = BeautifulSoup(body, 'lxml')
print len(soup.find_all('div', class_='order'))

Linux output (part):
-----------------------------------------------
benben@debian:desktop$ python parse.py
3

benben@debian:desktop$ uname -a
Linux debian 3.2.0-4-amd64 #1 SMP Debian 3.2.60-1+deb7u3 x86_64 GNU/Linux

benben@debian:desktop$ python -V
Python 2.7.3

benben@debian:desktop$ pip list
beautifulsoup4 (4.3.2)
lxml (3.4.1)

Windows 7 32-bit output (part):
-----------------------------------------------
C:\Users\benben\Desktop>python parse.py
1 --> this line should be 3

C:\Users\benben\Desktop>python -V
Python 2.7.9

C:\Users\benben\Desktop>pip freeze
beautifulsoup4==4.3.2
lxml==3.4.1

wlnirvana (weilin1990)
description: updated

Leonard Richardson (leonardr) wrote :

This certainly looks like an lxml-specific problem, and I've definitely seen cross-platform issues with lxml before. Unfortunately, since I don't have a Windows computer, I can't test it. If you can reproduce this problem using only lxml code (no Beautiful Soup), you can file a bug against lxml.

It would also be helpful if you could whittle down your HTML file to the minimal file that evinces the problem. All sorts of things can go wrong in a 231 kilobyte file.
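
One way to whittle a large file down mechanically is a greedy chunk-removal loop. The following is only a rough sketch, not a tool used in this thread: `reduce_lines` and the `still_fails` callback are hypothetical names. In practice `still_fails` would join the lines, run the parse, and return True while the count is still wrong. Dropping lines can leave malformed HTML, which is usually acceptable for a lenient parser but may change the behaviour being tested.

```python
def reduce_lines(lines, still_fails):
    """Greedily drop chunks of lines while still_fails(lines) stays True.

    `still_fails` is a caller-supplied (hypothetical) predicate that returns
    True when the candidate input still shows the problem.
    """
    chunk = max(len(lines) // 2, 1)
    while chunk >= 1:
        i = 0
        while i < len(lines):
            candidate = lines[:i] + lines[i + chunk:]
            if candidate and still_fails(candidate):
                lines = candidate   # the chunk was irrelevant; keep it removed
            else:
                i += chunk          # the chunk is needed; move past it
        chunk //= 2                 # retry with smaller chunks
    return lines


if __name__ == "__main__":
    sample = list("abXcYd")
    # Pretend the problem needs both "X" and "Y" to be present.
    print(reduce_lines(sample, lambda ls: "X" in ls and "Y" in ls))
```

The result is a local minimum with respect to removing single lines, not necessarily the smallest possible file.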

Changed in beautifulsoup:
status: New → Incomplete

wlnirvana (weilin1990) wrote :

I just tested lxml on its own, and it seems to behave consistently across platforms. I also tried to simplify the document, but the smallest version that still shows the problem is very large: 97KB, as attached. Below are my script and the output.

parse.py:
-----------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import lxml.html as lh
from bs4 import BeautifulSoup

with open('foobar.html', 'r') as f:
    body = f.read()

ele = lh.fromstring(body).xpath(
    '//div[contains(concat(" ", normalize-space(@class), " "), " order ")]')
print 'lxml=', len(ele)

soup = BeautifulSoup(body, 'lxml')
print 'bs4=', len(soup.find_all('div', class_='order'))

Windows 7 64-bit output:
-----------------------------------------------
C:\Users\feifei\Desktop\dev\bs4_bug>python parse.py
lxml= 3
bs4= 1
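
For readers unfamiliar with the XPath predicate in the script above: it matches "order" as a whole class token rather than as a substring, which is also how `class_='order'` is matched in Beautiful Soup. The sketch below mirrors the predicate step by step in plain Python; `has_class_token` is a hypothetical helper written for illustration, not part of lxml or Beautiful Soup.

```python
import re


def has_class_token(class_attr, token):
    """Mimic contains(concat(" ", normalize-space(@class), " "), " token ")."""
    # normalize-space(): collapse runs of XPath whitespace, then trim the ends
    normalized = re.sub(r"[ \t\r\n]+", " ", class_attr).strip(" ")
    # concat(" ", ..., " "): pad with spaces so every token is space-delimited
    padded = " " + normalized + " "
    # contains(..., " token "): look for the token surrounded by spaces
    return (" " + token + " ") in padded


if __name__ == "__main__":
    for value in ["order", "  order\tfirst ", "preorder", "order-item"]:
        print(repr(value), has_class_token(value, "order"))
```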

Sean Hunt (seandhunt7) wrote :

Another thing I hate about the XML parser in bs4 is that you can't tell it to use Python's built-in XML parser. There is xml.dom.minidom. I would like it to fall back on that built-in XML package, which is part of the standard library in both Python 2 and 3, instead of relying on lxml when users can't install it.

This would help users for whom the lxml install always fails because they are on Windows.

I use aiohttp to get the response data, and if it is not JSON I have it read through beautifulsoup4. The problem is the XML parsing: I can't pass "xml.dom.minidom" as the second argument to make it use that instead of lxml.

information type: Public → Public Security

Leonard Richardson (leonardr) wrote :

I've filed the request for an XML tree builder based on the Python standard library as bug 1651251.
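
Independently of that request, the standard library parser mentioned above can be called directly when lxml cannot be installed. The following is a minimal sketch under that assumption. It bypasses Beautiful Soup entirely and is not a Beautiful Soup feature; `tag_texts` is a hypothetical helper name.

```python
from xml.dom import minidom


def tag_texts(xml_text, tag_name):
    """Return the text of every <tag_name> element, using only the stdlib."""
    doc = minidom.parseString(xml_text)
    results = []
    for node in doc.getElementsByTagName(tag_name):
        # Collect only the direct text children of each element.
        parts = [child.data for child in node.childNodes
                 if child.nodeType == child.TEXT_NODE]
        results.append("".join(parts))
    return results


if __name__ == "__main__":
    print(tag_texts("<root><item>a</item><item>b</item></root>", "item"))
```

Unlike the lenient HTML parsers, minidom requires well-formed XML and raises an exception on malformed input.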

information type: Public Security → Public

Isaac Muse (facelessuser) wrote :

I have a Windows machine, so I tried to replicate this issue, but I was unable to:

C:\Users\facelessuser\Desktop>py -2 parse.py
lxml= 3
bs4= 3

This was on Python 2.7.11. As this doesn't seem to be reproducible, and the issue is years old, I'd personally close it and see if it resurfaces. It doesn't appear to be an issue with Beautiful Soup's logic, as it is not reproducible on the latest version. It may have been an lxml bug that has since been fixed, but regardless, I cannot reproduce it.

Leonard Richardson (leonardr) wrote :

Thanks for looking into this, Isaac.

Changed in beautifulsoup:
status: Incomplete → Won't Fix