lxml

Wrong feed result for same text for large file

Bug #1969813 reported by jinlian on 2022-04-21


Affects    Status    Importance    Assigned to    Milestone
lxml       New       Undecided     Unassigned

Bug Description

I use BeautifulSoup to parse HTML. I think the problem may be caused by lxml's feed interface, so I am reporting it against the lxml project.

Python : sys.version_info(major=2, minor=7, micro=16, releaselevel='final', serial=0)
lxml.etree : (4, 7, 0, 0)
libxml used : (2, 9, 12)
libxml compiled : (2, 9, 12)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)

Python version: 2.7.16
BeautifulSoup: 4.10.0
lxml: 4.6.0 and 4.8.0

When the input to BeautifulSoup is unicode, the same text is parsed differently with lxml 4.8.0. With version 4.6.0 the results are the same. Why?

These two lines both contain the text 'Lancôme Color Design Sensational Effects Eye Shadow', but the parse results differ:

Lancôme Color Design Sensational Effects Eye Shadow 
 LancÃ´me Color Design Sensational Effects Eye Shadow

But if the input file is smaller, both results are the same:
 Lancôme Color Design Sensational Effects Eye Shadow 
 Lancôme Color Design Sensational Effects Eye Shadow

I also debugged BeautifulSoup and found that the content coming from lxml has already changed by the time it reaches the second text line.

The first value is u' Lanc\xf4me Color Design Sensational Effects Eye Shadow '
The second value is u' Lanc\xc3\xb4me Color Design Sensational Effects Eye Shadow '
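
The second value looks like the UTF-8 bytes of the first value decoded as Latin-1. The following minimal sketch shows only that relationship in plain Python. It does not call lxml and does not claim to be what lxml does internally; the variable names are mine.

```python
# -*- coding: utf-8 -*-
# Sketch only: reproduces the symptom in pure Python, not the lxml code path.
correct = u' Lanc\xf4me Color Design Sensational Effects Eye Shadow '

# Encode to UTF-8 bytes, then (wrongly) decode those bytes as Latin-1.
garbled = correct.encode('utf-8').decode('latin-1')
print(repr(garbled))

# Reversing the mistake recovers the original text.
repaired = garbled.encode('latin-1').decode('utf-8')
print(repaired == correct)
```

If this is the cause, the second value is the same text with one extra bytes-to-text step applied using the wrong encoding.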

> bs4/builder/_lxml.py(194)data()
    193
11> 194     def data(self, content):
    195         self.soup.handle_data(content)

ipdb> name
*** NameError: name 'name' is not defined
ipdb> content
u' Lanc\xc3\xb4me Color Design Sensational Effects Eye Shadow '
ipdb> type(content)
<type 'unicode'>
ipdb> print(content)
LancÃ´me Color Design Sensational Effects Eye Shadow

To reproduce:
python tt.py test_lxml2.html

To test with a small file, delete lines 7 to 1000 of test_lxml2.html.

tt.py:

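#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 script: parse an HTML file with BeautifulSoup + lxml
# and print every <b> and <mark> tag.
import sys
from bs4 import BeautifulSoup

fname = sys.argv[1]
# Read the file as bytes and decode to unicode before parsing.
html = open(fname, 'r').read().decode('utf-8')
print(type(html))
soup = BeautifulSoup(html, 'lxml')
for tag in soup.find_all(True):
    if tag.name in ('b', 'mark'):
        print(tag)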
