Activity log for bug #1417011

Date Who What changed Old value New value Message
2015-02-02 08:54:37 wlnirvana bug added bug
2015-02-02 08:54:37 wlnirvana attachment added the html file that bs fails to parse https://bugs.launchpad.net/bugs/1417011/+attachment/4310360/+files/foobar.html
2015-02-02 08:56:59 wlnirvana description

Old value: same text as the new value below, except that the first sentence read "The attached script works well in Linux, but fails in Windows."

New value:
The following script works well in Linux, but fails in Windows. I'll directly write my code and the output here since I want to leave the attachment for the HTML file.

By the way, I found that if I use 'html.parser' as the backend, beautifully would correctly parse the document. Perhaps this is a lxml specific problem?

parse.py:
-----------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup

with open('foobar.html', 'r') as f:
    body = f.read()

soup = BeautifulSoup(body, 'lxml')
print len(soup.find_all('div', class_='order'))

Linux output (part):
-----------------------------------------------
benben@debian:desktop$ python parse.py
3
benben@debian:desktop$ uname -a
Linux debian 3.2.0-4-amd64 #1 SMP Debian 3.2.60-1+deb7u3 x86_64 GNU/Linux
benben@debian:desktop$ python -V
Python 2.7.3
benben@debian:desktop$ pip list
beautifulsoup4 (4.3.2)
lxml (3.4.1)

Windows 7 32-bit output (part) -:
-----------------------------------------------
C:\Users\benben\Desktop>python parse.py
1 --> this line should be 3
C:\Users\benben\Desktop>python -V
Python 2.7.9
C:\Users\benben\Desktop>pip freeze
beautifulsoup4==4.3.2
lxml==3.4.1
2015-02-03 01:57:29 wlnirvana description

Old value: same text as the new value below, except that "beautifulsoup would correctly parse the document" read "beautifully would correctly parse the document".

New value:
The following script works well in Linux, but fails in Windows. I'll directly write my code and the output here since I want to leave the attachment for the HTML file.

By the way, I found that if I use 'html.parser' as the backend, beautifulsoup would correctly parse the document. Perhaps this is a lxml specific problem?

parse.py:
-----------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup

with open('foobar.html', 'r') as f:
    body = f.read()

soup = BeautifulSoup(body, 'lxml')
print len(soup.find_all('div', class_='order'))

Linux output (part):
-----------------------------------------------
benben@debian:desktop$ python parse.py
3
benben@debian:desktop$ uname -a
Linux debian 3.2.0-4-amd64 #1 SMP Debian 3.2.60-1+deb7u3 x86_64 GNU/Linux
benben@debian:desktop$ python -V
Python 2.7.3
benben@debian:desktop$ pip list
beautifulsoup4 (4.3.2)
lxml (3.4.1)

Windows 7 32-bit output (part) -:
-----------------------------------------------
C:\Users\benben\Desktop>python parse.py
1 --> this line should be 3
C:\Users\benben\Desktop>python -V
Python 2.7.9
C:\Users\benben\Desktop>pip freeze
beautifulsoup4==4.3.2
lxml==3.4.1
2015-06-25 10:41:50 Leonard Richardson beautifulsoup: status New Incomplete
2015-07-19 07:18:34 wlnirvana attachment added the problematic html file https://bugs.launchpad.net/beautifulsoup/+bug/1417011/+attachment/4431122/+files/foobar.html
2016-12-17 16:28:22 Sean Hunt information type Public Public Security
2016-12-19 21:48:18 Leonard Richardson information type Public Security Public
2019-01-01 22:25:54 Leonard Richardson beautifulsoup: status Incomplete Won't Fix