Beautiful Soup

lxml parser behaves incorrectly in Windows

Bug #1417011 reported by wlnirvana on 2015-02-02

This bug affects 2 people

Affects:	Beautiful Soup
Status:	Won't Fix
Importance:	Undecided
Assigned to:	Unassigned

Bug Description

The following script works on Linux but gives the wrong result on Windows. I've pasted my code and its output directly here and used the attachment for the HTML file.

By the way, I found that if I use 'html.parser' as the backend, Beautiful Soup parses the document correctly. Perhaps this is an lxml-specific problem?

parse.py:
-----------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup

with open('foobar.html', 'r') as f:
    body = f.read()

soup = BeautifulSoup(body, 'lxml')
print len(soup.find_all('div', class_='order'))

Linux output (part):
-----------------------------------------------
benben@debian:desktop$ python parse.py
3

benben@debian:desktop$ uname -a
Linux debian 3.2.0-4-amd64 #1 SMP Debian 3.2.60-1+deb7u3 x86_64 GNU/Linux

benben@debian:desktop$ python -V
Python 2.7.3

benben@debian:desktop$ pip list
beautifulsoup4 (4.3.2)
lxml (3.4.1)

Windows 7 32-bit output (part):
-----------------------------------------------
C:\Users\benben\Desktop>python parse.py
1 --> this line should be 3

C:\Users\benben\Desktop>python -V
Python 2.7.9

C:\Users\benben\Desktop>pip freeze
beautifulsoup4==4.3.2
lxml==3.4.1

wlnirvana (weilin1990) wrote on 2015-02-02:

the html file that bs fails to parse (231.6 KiB, text/html)

description:	updated

wlnirvana (weilin1990) on 2015-02-03

description:	updated

Leonard Richardson (leonardr) wrote on 2015-06-24:

This certainly looks like an lxml-specific problem, and I've definitely seen cross-platform issues with lxml before. Unfortunately, since I don't have a Windows computer, I can't test it. If you can duplicate this problem using lxml code alone (no Beautiful Soup), you can file a bug against lxml.

It would also be helpful if you could whittle down your HTML file to the minimal file that evinces the problem. All sorts of things can go wrong in a 231 kilobyte file.
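One way to whittle a large file down is to drop chunks of lines repeatedly and keep each removal only if the problem still shows up. The following is a minimal sketch of that idea, not a tool from Beautiful Soup or lxml. The names `shrink_lines`, `still_fails`, and `toy_still_fails` are hypothetical, and the toy predicate merely stands in for a real check that would parse the candidate text with both lxml and Beautiful Soup and report whether the counts differ.

```python
def shrink_lines(lines, still_fails):
    """Greedily drop chunks of lines while still_fails(lines) stays true."""
    chunk = max(len(lines) // 2, 1)
    while chunk >= 1:
        i = 0
        while i < len(lines):
            candidate = lines[:i] + lines[i + chunk:]
            if candidate and still_fails(candidate):
                lines = candidate  # chunk was irrelevant; keep it removed
            else:
                i += chunk  # chunk is needed; move past it
        chunk //= 2  # retry with smaller chunks
    return lines


# Toy predicate for demonstration: the "bug" needs both marker lines present.
# A real predicate would join the lines, parse the result with lxml and with
# Beautiful Soup, and return True when the two counts disagree.
def toy_still_fails(lines):
    return "<div class='order'>" in lines and "</table>" in lines


sample = ["<html>", "<table>", "<div class='order'>", "<p>x</p>",
          "</table>", "</html>"]
print(shrink_lines(sample, toy_still_fails))
```

Removing whole lines can unbalance tags, so the reduced file may not be valid HTML. That is usually acceptable here, since the goal is only the smallest input on which the two results still differ.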

Leonard Richardson (leonardr) on 2015-06-25

Changed in beautifulsoup:
status:	New → Incomplete

wlnirvana (weilin1990) wrote on 2015-07-19:

the problematic html file (96.0 KiB, text/html)

I just tested lxml on its own, and it seems to behave consistently across platforms. I also tried to simplify the document, but the smallest version that still shows the problem is still very large (97KB, attached). Below are my script and its output.

parse.py:
-----------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import lxml.html as lh
from bs4 import BeautifulSoup

with open('foobar.html', 'r') as f:
    body = f.read()

ele = lh.fromstring(body).xpath(
    '//div[contains(concat(" ", normalize-space(@class), " "), " order ")]')
print 'lxml=', len(ele)

soup = BeautifulSoup(body, 'lxml')
print 'bs4=', len(soup.find_all('div', class_='order'))

Windows 7 64-bit output:
-----------------------------------------------
C:\Users\feifei\Desktop\dev\bs4_bug>python parse.py
lxml= 3
bs4= 1

Sean Hunt (seandhunt7) wrote on 2016-12-17:

Another thing I hate about the XML parser in bs4 is that you can't tell it to use Python's built-in XML parser. There is xml.dom.minidom. I would like it to fall back on that built-in xml package, which is part of the Python 2 and 3 standard library, instead of relying on lxml when users can't install it.

This would help users for whom the lxml install always fails because they are on Windows.

I use aiohttp to get the response data, and if it is not JSON I pass it through beautifulsoup4. The problem is the XML parsing: I can't pass "xml.dom.minidom" as the second argument to make it use that instead of lxml.
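For reference, well-formed XML can be parsed with the standard library alone, outside Beautiful Soup. The sketch below is not a Beautiful Soup feature; it only illustrates the kind of stdlib fallback described above, using `xml.dom.minidom`. The sample document and the `count_elements` helper are made up for illustration.

```python
from xml.dom import minidom

# Hypothetical sample document, for illustration only.
SAMPLE_XML = """<orders>
  <order id="1"/>
  <order id="2"/>
  <item/>
</orders>"""


def count_elements(xml_text, tag_name):
    """Count elements named tag_name using only the standard library."""
    # parseString raises xml.parsers.expat.ExpatError on malformed input;
    # minidom does not try to repair broken markup.
    document = minidom.parseString(xml_text)
    return len(document.getElementsByTagName(tag_name))


print(count_elements(SAMPLE_XML, "order"))
```

Because minidom is strict, this approach only suits input that is already well-formed XML. Messy real-world HTML would still need a lenient parser such as 'html.parser'.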

information type:	Public → Public Security

Leonard Richardson (leonardr) wrote on 2016-12-19:

I've filed the request for an XML tree builder based on the Python standard library as bug 1651251.

information type:	Public Security → Public

Isaac Muse (facelessuser) wrote on 2019-01-01:

I have a Windows machine, so I tried to replicate this issue, but I was unable to:

C:\Users\facelessuser\Desktop>py -2 parse.py
lxml= 3
bs4= 3

This was on Python 2.7.11. Since this doesn't seem to be reproducible and the issue is years old, I'd personally close it and see if it resurfaces. It doesn't appear to be an issue with BeautifulSoup logic, as it is not reproducible on the latest version. It may have been an lxml bug that has since been fixed, but regardless, I cannot reproduce it.

Leonard Richardson (leonardr) wrote on 2019-01-01:

Thanks for looking into this, Isaac.

Changed in beautifulsoup:
status:	Incomplete → Won't Fix
