Beautiful Soup

Activity log for bug #1417011

Date	Who	What changed	Old value	New value	Message
2015-02-02 08:54:37	wlnirvana	bug			added bug
2015-02-02 08:54:37	wlnirvana	attachment added		the html file that bs fails to parse https://bugs.launchpad.net/bugs/1417011/+attachment/4310360/+files/foobar.html
2015-02-02 08:56:59	wlnirvana	description	The attached script works well in Linux, but fails in Windows. I'll directly write my code and the output here since I want to leave the attachment for the HTML file. By the way, I found that if I use 'html.parser' as the backend, beautifully would correctly parse the document. Perhaps this is a lxml specific problem? parse.py: ----------------------------------------------- #!/usr/bin/env python # -- coding: utf-8 -- from bs4 import BeautifulSoup with open('foobar.html', 'r') as f: body = f.read() soup = BeautifulSoup(body, 'lxml') print len(soup.find_all('div', class_='order')) Linux output (part): ----------------------------------------------- benben@debian:desktop$ python parse.py 3 benben@debian:desktop$ uname -a Linux debian 3.2.0-4-amd64 #1 SMP Debian 3.2.60-1+deb7u3 x86_64 GNU/Linux benben@debian:desktop$ python -V Python 2.7.3 benben@debian:desktop$ pip list beautifulsoup4 (4.3.2) lxml (3.4.1) Windows 7 32-bit output (part) -: ----------------------------------------------- C:\Users\benben\Desktop>python parse.py 1 --> this line should be 3 C:\Users\benben\Desktop>python -V Python 2.7.9 C:\Users\benben\Desktop>pip freeze beautifulsoup4==4.3.2 lxml==3.4.1	The following script works well in Linux, but fails in Windows. I'll directly write my code and the output here since I want to leave the attachment for the HTML file. By the way, I found that if I use 'html.parser' as the backend, beautifully would correctly parse the document. Perhaps this is a lxml specific problem? 
parse.py: ----------------------------------------------- #!/usr/bin/env python # -- coding: utf-8 -- from bs4 import BeautifulSoup with open('foobar.html', 'r') as f: body = f.read() soup = BeautifulSoup(body, 'lxml') print len(soup.find_all('div', class_='order')) Linux output (part): ----------------------------------------------- benben@debian:desktop$ python parse.py 3 benben@debian:desktop$ uname -a Linux debian 3.2.0-4-amd64 #1 SMP Debian 3.2.60-1+deb7u3 x86_64 GNU/Linux benben@debian:desktop$ python -V Python 2.7.3 benben@debian:desktop$ pip list beautifulsoup4 (4.3.2) lxml (3.4.1) Windows 7 32-bit output (part) -: ----------------------------------------------- C:\Users\benben\Desktop>python parse.py 1 --> this line should be 3 C:\Users\benben\Desktop>python -V Python 2.7.9 C:\Users\benben\Desktop>pip freeze beautifulsoup4==4.3.2 lxml==3.4.1
2015-02-03 01:57:29	wlnirvana	description	The following script works well in Linux, but fails in Windows. I'll directly write my code and the output here since I want to leave the attachment for the HTML file. By the way, I found that if I use 'html.parser' as the backend, beautifully would correctly parse the document. Perhaps this is a lxml specific problem? parse.py: ----------------------------------------------- #!/usr/bin/env python # -- coding: utf-8 -- from bs4 import BeautifulSoup with open('foobar.html', 'r') as f: body = f.read() soup = BeautifulSoup(body, 'lxml') print len(soup.find_all('div', class_='order')) Linux output (part): ----------------------------------------------- benben@debian:desktop$ python parse.py 3 benben@debian:desktop$ uname -a Linux debian 3.2.0-4-amd64 #1 SMP Debian 3.2.60-1+deb7u3 x86_64 GNU/Linux benben@debian:desktop$ python -V Python 2.7.3 benben@debian:desktop$ pip list beautifulsoup4 (4.3.2) lxml (3.4.1) Windows 7 32-bit output (part) -: ----------------------------------------------- C:\Users\benben\Desktop>python parse.py 1 --> this line should be 3 C:\Users\benben\Desktop>python -V Python 2.7.9 C:\Users\benben\Desktop>pip freeze beautifulsoup4==4.3.2 lxml==3.4.1	The following script works well in Linux, but fails in Windows. I'll directly write my code and the output here since I want to leave the attachment for the HTML file. By the way, I found that if I use 'html.parser' as the backend, beautifulsoup would correctly parse the document. Perhaps this is a lxml specific problem? 
parse.py: ----------------------------------------------- #!/usr/bin/env python # -- coding: utf-8 -- from bs4 import BeautifulSoup with open('foobar.html', 'r') as f: body = f.read() soup = BeautifulSoup(body, 'lxml') print len(soup.find_all('div', class_='order')) Linux output (part): ----------------------------------------------- benben@debian:desktop$ python parse.py 3 benben@debian:desktop$ uname -a Linux debian 3.2.0-4-amd64 #1 SMP Debian 3.2.60-1+deb7u3 x86_64 GNU/Linux benben@debian:desktop$ python -V Python 2.7.3 benben@debian:desktop$ pip list beautifulsoup4 (4.3.2) lxml (3.4.1) Windows 7 32-bit output (part) -: ----------------------------------------------- C:\Users\benben\Desktop>python parse.py 1 --> this line should be 3 C:\Users\benben\Desktop>python -V Python 2.7.9 C:\Users\benben\Desktop>pip freeze beautifulsoup4==4.3.2 lxml==3.4.1
2015-06-25 10:41:50	Leonard Richardson	beautifulsoup: status	New	Incomplete
2015-07-19 07:18:34	wlnirvana	attachment added		the problematic html file https://bugs.launchpad.net/beautifulsoup/+bug/1417011/+attachment/4431122/+files/foobar.html
2016-12-17 16:28:22	Sean Hunt	information type	Public	Public Security
2016-12-19 21:48:18	Leonard Richardson	information type	Public Security	Public
2019-01-01 22:25:54	Leonard Richardson	beautifulsoup: status	Incomplete	Won't Fix
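
The log records no root cause for the differing counts. One platform-dependent detail in the reported script is that it opens foobar.html in text mode ('r'). Under Python 2 on Windows, text mode translates line endings and treats a Ctrl-Z byte (0x1A) as end of file. Whether that applies to this file is an assumption, not something the log confirms. The sketch below shows one way to inspect the raw bytes for such features. The helper names `describe_bytes` and `describe_file` are hypothetical and are not part of Beautiful Soup or lxml.

```python
def describe_bytes(data):
    """Summarise byte-level features that a text-mode read can alter."""
    crlf = data.count(b"\r\n")
    return {
        "size": len(data),                    # total bytes as stored
        "crlf": crlf,                         # Windows-style line endings
        "bare_cr": data.count(b"\r") - crlf,  # carriage returns not followed by \n
        "ctrl_z_offset": data.find(b"\x1a"),  # -1 when no Ctrl-Z byte is present
    }


def describe_file(path):
    # Binary mode returns the bytes exactly as stored, on any platform.
    with open(path, "rb") as f:
        return describe_bytes(f.read())


if __name__ == "__main__":
    # Inline sample data, made up for illustration; it is not the attachment.
    sample = b"<div class='order'>a</div>\r\n\x1a<div class='order'>b</div>\n"
    print(describe_bytes(sample))
```

If `ctrl_z_offset` is not -1, or `size` differs from the length of what a text-mode read returns, it may be worth reading the file with `open('foobar.html', 'rb')` and passing those bytes to `BeautifulSoup`, which accepts byte strings as well as text.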