lxml parser behaves incorrectly in Windows
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Beautiful Soup | Won't Fix | Undecided | Unassigned | |
Bug Description
The following script works on Linux but fails on Windows. I am pasting the code and its output inline so that the attachment can be used for the HTML file.
By the way, I found that if I use 'html.parser' as the backend, Beautiful Soup parses the document correctly. Perhaps this is an lxml-specific problem?
parse.py:
-------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
with open('foobar.html', 'r') as f:
    body = f.read()
soup = BeautifulSoup(body, 'lxml')
print len(soup.
Linux output (part):
-------
benben@
3
benben@
Linux debian 3.2.0-4-amd64 #1 SMP Debian 3.2.60-1+deb7u3 x86_64 GNU/Linux
benben@
Python 2.7.3
benben@
beautifulsoup4 (4.3.2)
lxml (3.4.1)
Windows 7 32-bit output (part):
-------
C:\Users\
1 --> this line should be 3
C:\Users\
Python 2.7.9
C:\Users\
beautifulsoup4=
lxml==3.4.1
description: updated
Changed in beautifulsoup:
status: New → Incomplete
This certainly looks like an lxml-specific problem, and I've definitely seen cross-platform issues with lxml before. Unfortunately, since I don't have a Windows computer, I can't test it. If you can duplicate this problem using only lxml code (no Beautiful Soup), you can file a bug against lxml.
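One way to attempt a reproduction without Beautiful Soup is sketched below. It is only an illustration: the helper names `count_elements` and `count_in_file` are made up for this example, and the tag being counted is a placeholder, because the expression in the original script is cut off. The file is read in binary mode so that platform-specific text-mode handling is ruled out as a factor.

```python
# Hypothetical lxml-only reproduction; names and the counted tag are assumptions.
import io


def count_elements(html_bytes, tag):
    """Parse raw bytes with lxml's HTML parser and count <tag> elements."""
    # Imported here so this module can be loaded even where lxml is missing.
    from lxml import etree

    parser = etree.HTMLParser()
    root = etree.fromstring(html_bytes, parser)
    return sum(1 for _ in root.iter(tag))


def count_in_file(path, tag):
    """Read the file as bytes and count <tag> elements in it."""
    # Binary mode hands lxml the exact bytes on disk on every platform.
    with io.open(path, 'rb') as f:
        return count_elements(f.read(), tag)
```

Running something like `count_in_file('foobar.html', 'table')` on both machines (with `'table'` replaced by whatever the original script counted) would show whether the discrepancy survives without Beautiful Soup in the picture.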
It would also be helpful if you could whittle down your HTML file to the minimal file that evinces the problem. All sorts of things can go wrong in a 231-kilobyte file.
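The whittling-down can be partly automated. The sketch below is a simple greedy reduction, not a full delta-debugging implementation: it drops chunks of lines as long as a caller-supplied check still reports the problem. Both `minimize` and `still_fails` are hypothetical names; `still_fails` would need to be written to join the candidate lines, parse them on the affected machine, and return True when the count is still wrong.

```python
# Greedy line-based reduction; `still_fails` is a caller-supplied assumption.
def minimize(lines, still_fails):
    """Return a smaller list of lines for which still_fails() is still True."""
    chunk = max(len(lines) // 2, 1)
    while True:
        i = 0
        shrunk = False
        while i < len(lines):
            # Try the input with one chunk of lines removed.
            candidate = lines[:i] + lines[i + chunk:]
            if candidate and still_fails(candidate):
                lines = candidate  # chunk was irrelevant; retry at same offset
                shrunk = True
            else:
                i += chunk         # chunk is needed; move past it
        if not shrunk:
            if chunk == 1:
                return lines       # no single line can be removed any more
            chunk = max(chunk // 2, 1)
```

The result is not guaranteed to be the smallest possible file, only one from which no single line can be removed without the check passing again. HTML is not line-oriented, so the reduced file may need some manual cleanup afterwards.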