Wrong feed result for same text for large file
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | New | Undecided | Unassigned |
Bug Description
I use BeautifulSoup to parse HTML. I think the problem may be caused by lxml's feed interface, so I am reporting the bug against the lxml project.
Python : sys.version_
lxml.etree : (4, 7, 0, 0)
libxml used : (2, 9, 12)
libxml compiled : (2, 9, 12)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)
Python version: 2.7.16
BeautifulSoup: 4.10.0
lxml: 4.6.0 4.8.0
When the input to BeautifulSoup is unicode, two occurrences of the same text produce different parse results with lxml 4.8.0. Why are the results the same with version 4.6.0?
These two lines both contain the text 'Lancôme Color Design Sensational Effects Eye Shadow', but the parse results differ:
<b> Lancôme Color Design Sensational Effects Eye Shadow </b>
<mark> Lancôme Color Design Sensational Effects Eye Shadow </mark>
But if the input file is smaller, the results are the same:
<b> Lancôme Color Design Sensational Effects Eye Shadow </b>
<mark> Lancôme Color Design Sensational Effects Eye Shadow </mark>
While debugging BeautifulSoup, I also found that the content coming from lxml changes when it reaches the second text line.
The first value is u' Lanc\xf4me Color Design Sensational Effects Eye Shadow '
The second value is u' Lanc\xc3\xb4me Color Design Sensational Effects Eye Shadow '
> bs4/builder/
193
11> 194 def data(self, content):
195 self.soup.
ipdb> name
*** NameError: name 'name' is not defined
ipdb> content
u' Lanc\xc3\xb4me Color Design Sensational Effects Eye Shadow '
ipdb> type(content)
<type 'unicode'>
ipdb> print(content)
Lancôme Color Design Sensational Effects Eye Shadow
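The two values seen in the debugger differ in a specific way: the second one is what you get when the UTF-8 bytes of the first are decoded as Latin-1. The sketch below only demonstrates that relationship between the two strings. It does not call lxml or BeautifulSoup and does not show where the conversion happens. The helper name `looks_double_decoded` is hypothetical and is only a rough heuristic.

```python
# -*- coding: utf-8 -*-
# Sketch only: relates the two debugger values; it does not involve lxml.

good = u' Lanc\xf4me Color Design Sensational Effects Eye Shadow '
bad = u' Lanc\xc3\xb4me Color Design Sensational Effects Eye Shadow '

# Encoding the correct text as UTF-8 and decoding those bytes as Latin-1
# reproduces the second value exactly.
mojibake = good.encode('utf-8').decode('latin-1')
print(mojibake == bad)

# The reverse round trip recovers the correct text from the second value.
repaired = bad.encode('latin-1').decode('utf-8')
print(repaired == good)


def looks_double_decoded(text):
    """Hypothetical heuristic: True if text looks like UTF-8 read as Latin-1."""
    try:
        candidate = text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable as Latin-1, or the bytes are not valid UTF-8.
        return False
    # Pure ASCII text round-trips unchanged, so it is not flagged.
    return candidate != text
```

A check like this could be applied to each `content` value in the debugger to find the point in a large file where the output first goes wrong.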
Reproduce:
python tt.py test_lxml2.html
For the small-file case, delete lines 7:1000 in test_lxml2.html.
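The smaller file can also be produced with a short script rather than by hand. This is a minimal sketch that assumes "7:1000" means the 1-based, inclusive range of lines 7 through 1000, and that the file is UTF-8 encoded. The function names and the output filename `test_small.html` are hypothetical.

```python
import io


def drop_line_range(lines, first, last):
    """Return lines without the 1-based, inclusive range first..last."""
    return lines[:first - 1] + lines[last:]


def write_trimmed_copy(src, dst, first=7, last=1000):
    """Copy src to dst, leaving out lines first..last (assumed UTF-8)."""
    # Read and write as text so non-ASCII characters such as 'ô' survive.
    with io.open(src, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    with io.open(dst, 'w', encoding='utf-8') as f:
        f.writelines(drop_line_range(lines, first, last))
```

For example, `write_trimmed_copy('test_lxml2.html', 'test_small.html')` followed by `python tt.py test_small.html` would run the same script against the reduced input.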
tt.py:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
from bs4 import BeautifulSoup
fname = sys.argv[1]
html = open(fname, 'r').read(
print(type(html))
soup = BeautifulSoup(html, 'lxml')
for tag in soup.find_
    if tag.name in ('b', 'mark'):
        print(tag)