Wrong feed result for same text for large file

Bug #1969813 reported by jinlian
This bug affects 1 person
Affects  Status  Importance  Assigned to  Milestone
lxml     New     Undecided   Unassigned

Bug Description

I use BeautifulSoup to parse HTML. I think the problem may be caused by lxml's feed interface, so I am reporting the bug against the lxml project.

Python : sys.version_info(major=2, minor=7, micro=16, releaselevel='final', serial=0)
lxml.etree : (4, 7, 0, 0)
libxml used : (2, 9, 12)
libxml compiled : (2, 9, 12)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)

Python version: 2.7.16
BeautifulSoup: 4.10.0
lxml: 4.6.0 and 4.8.0

When the input to BeautifulSoup is unicode, lxml 4.8.0 gives different parse results for the same text, whereas with lxml 4.6.0 the results are the same.

These two lines both contain the text 'Lancôme Color Design Sensational Effects Eye Shadow', but their parse results differ.

<b> Lancôme Color Design Sensational Effects Eye Shadow </b>
<mark> Lancôme Color Design Sensational Effects Eye Shadow </mark>

But if the input file is smaller, the two results are the same:
<b> Lancôme Color Design Sensational Effects Eye Shadow </b>
<mark> Lancôme Color Design Sensational Effects Eye Shadow </mark>

I also debugged BeautifulSoup and found that the content passed in from lxml has changed by the time it reaches the second text line.

The first value is u' Lanc\xf4me Color Design Sensational Effects Eye Shadow '
The second value is u' Lanc\xc3\xb4me Color Design Sensational Effects Eye Shadow '

> bs4/builder/_lxml.py(194)data()
    193
11> 194 def data(self, content):
    195 self.soup.handle_data(content)

ipdb> name
*** NameError: name 'name' is not defined
ipdb> content
u' Lanc\xc3\xb4me Color Design Sensational Effects Eye Shadow '
ipdb> type(content)
<type 'unicode'>
ipdb> print(content)
 Lancôme Color Design Sensational Effects Eye Shadow
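
The two values quoted above differ in a recognisable way: the second looks like the UTF-8 bytes of the first decoded as Latin-1. The sketch below only checks that relationship between the two strings. It does not show where in lxml or libxml2 the extra decoding happens, and the helper names are made up for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# The two values reported from the data() callback.
first = u' Lanc\xf4me Color Design Sensational Effects Eye Shadow '
second = u' Lanc\xc3\xb4me Color Design Sensational Effects Eye Shadow '


def misdecode_utf8_as_latin1(text):
    """Encode text as UTF-8, then decode those bytes as Latin-1 (hypothetical helper)."""
    return text.encode('utf-8').decode('latin-1')


def undo_misdecoding(text):
    """Reverse the step above: back to bytes via Latin-1, then decode as UTF-8."""
    return text.encode('latin-1').decode('utf-8')


# u'\xf4' is UTF-8 encoded as the bytes C3 B4, which Latin-1 reads as u'\xc3\xb4'.
print(misdecode_utf8_as_latin1(first) == second)  # True
print(undo_misdecoding(second) == first)          # True
```

If that is what happens, the second text node was decoded once more than the first, even though both come from the same unicode input.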

To reproduce:
python tt.py test_lxml2.html

For the small-file case, delete lines 7 to 1000 of test_lxml2.html.
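
A minimal sketch of one way to produce the small file from the large one. It assumes that "7:1000" means lines 7 through 1000, 1-indexed and inclusive. The function name is made up, and the demo runs on generated text rather than on the attached file.

```python
from __future__ import print_function


def drop_line_range(text, first, last):
    """Return text without lines first..last (1-indexed, inclusive; an assumption)."""
    lines = text.splitlines(True)  # True keeps the line endings
    return ''.join(lines[:first - 1] + lines[last:])


# Demo on generated text standing in for the large HTML file.
large = ''.join('line %d\n' % i for i in range(1, 1201))
small = drop_line_range(large, 7, 1000)
print(len(small.splitlines()))
```

For the real file, the same function could be applied to the decoded contents of test_lxml2.html, and the result written to a second file that is then passed to tt.py.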

tt.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
from bs4 import BeautifulSoup

fname = sys.argv[1]
# Read raw bytes and decode explicitly, so BeautifulSoup receives unicode.
with open(fname, 'rb') as f:
    html = f.read().decode('utf-8')
print(type(html))
soup = BeautifulSoup(html, 'lxml')
# Print only the two tags whose text should be identical.
for tag in soup.find_all(['b', 'mark']):
    print(tag)

Tags: feed