“lxml” parser breaks long strings into characters

Bug #1762514 reported by Dmitry Zinoviev
This bug affects 1 person
Affects: Beautiful Soup
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned
Milestone: (none set)

Bug Description

The lxml parser incorrectly parses strings longer than 2^14 (16,384) characters, although it correctly parses the same content when it is passed as bytes. Everything after the 16,384th character is treated as individual characters rather than as words and tags.

Minimal example:

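from bs4 import BeautifulSoup
import urllib.request  # "import urllib" alone does not reliably expose urllib.request

url = "http://shakespeare.mit.edu/othello/full.html"
html_raw = urllib.request.urlopen(url).read()  # page content as bytes
html_str = html_raw.decode("iso-8859-1")  # same content as str, without a second download
type(html_raw), len(html_raw)  # (<class 'bytes'>, 304769)
type(html_str), len(html_str)  # (<class 'str'>, 304769)
# Both inputs hold the same document, so the two parses should be equal, but they are not:
repr(BeautifulSoup(html_raw, "lxml")) == repr(BeautifulSoup(html_str, "lxml"))  # False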
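
Since the byte input is parsed correctly, one possible workaround is to encode a long str back to bytes before handing it to Beautiful Soup. The following is only a sketch under that assumption: the helper names make_long_html and parse_via_bytes are hypothetical and are not part of bs4 or lxml, and the synthetic document merely stands in for any page longer than 2^14 characters.

```python
def make_long_html(n_paragraphs: int) -> str:
    """Build a synthetic HTML document (hypothetical test input, not from the report).

    With 2000 paragraphs the result is well over 2**14 characters long.
    """
    body = "".join(f"<p>line {i}</p>" for i in range(n_paragraphs))
    return f"<html><body>{body}</body></html>"


def parse_via_bytes(html_str: str, encoding: str = "utf-8"):
    """Workaround sketch: encode the str so the parser receives bytes.

    Assumes, as the report states, that bytes input is parsed correctly.
    """
    # Imported lazily so make_long_html stays usable without bs4 installed.
    from bs4 import BeautifulSoup

    data = html_str.encode(encoding)
    # from_encoding tells Beautiful Soup which encoding was used for the bytes.
    return BeautifulSoup(data, "lxml", from_encoding=encoding)
```

Calling parse_via_bytes(make_long_html(2000)) and counting the result of find_all("p") is one way to check that the paragraphs past the 16,384-character mark survive as tags. The encoding only needs to be able to represent every character in the string, so UTF-8 is a safe default here.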

Package information:

Python 3.6.5
BeautifulSoup 4.6.0
lxml 4.2.1

Tags: lxml
Leonard Richardson (leonardr) wrote:

I reproduced this using only lxml code, so it's a bug in lxml. I filed the issue against lxml here: https://bugs.launchpad.net/lxml/+bug/1781797

Changed in beautifulsoup:
status: New → Won't Fix