Beautiful Soup

“lxml” parser breaks long strings into characters

Bug #1762514 reported by Dmitry Zinoviev on 2018-04-09

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Beautiful Soup	Won't Fix	Undecided	Unassigned

Bug Description

The lxml parser incorrectly parses strings longer than 2^14 (16,384) characters, but correctly parses the same content when it is encoded to bytes. Everything after the 16,384th character is treated as individual characters rather than as words and tags.

Minimal example:

from bs4 import BeautifulSoup
import urllib.request  # plain "import urllib" does not reliably expose urllib.request

url = "http://shakespeare.mit.edu/othello/full.html"
# Fetch once, then decode the same bytes so both inputs carry identical content.
html_raw = urllib.request.urlopen(url).read()
html_str = html_raw.decode("iso-8859-1")
type(html_raw), len(html_raw) #(<class 'bytes'>, 304769)
type(html_str), len(html_str) #(<class 'str'>, 304769)
# Both inputs should yield the same tree, but the comparison fails.
repr(BeautifulSoup(html_raw, "lxml")) == repr(BeautifulSoup(html_str, "lxml")) # False
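
Since the bytes input parses correctly, one possible workaround is to encode any str markup to bytes before handing it to the parser. The sketch below is only an illustration of that idea: the helper names to_parser_input and make_soup are hypothetical (they are not part of Beautiful Soup or lxml), and UTF-8 is an assumed default encoding. It relies on the from_encoding argument of the BeautifulSoup constructor to tell the parser which encoding was used.

```python
def to_parser_input(markup, encoding="utf-8"):
    """Hypothetical helper: return bytes for str input, pass bytes through."""
    if isinstance(markup, str):
        return markup.encode(encoding)
    return markup


def make_soup(markup, encoding="utf-8"):
    """Hypothetical helper: parse with lxml, always feeding it bytes.

    Assumes bs4 and lxml are installed; the import is local so that
    to_parser_input stays usable without them.
    """
    from bs4 import BeautifulSoup

    if isinstance(markup, str):
        # We chose the encoding ourselves, so state it explicitly.
        data = to_parser_input(markup, encoding)
        return BeautifulSoup(data, "lxml", from_encoding=encoding)
    # Bytes input: leave encoding detection to Beautiful Soup.
    return BeautifulSoup(markup, "lxml")
```

With the example above, make_soup(html_str) would take the bytes route instead of passing the long str directly. Whether this avoids the problem on a given lxml version is something to verify locally.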

Package information:

Python 3.6.5
BeautifulSoup 4.6.0
lxml 4.2.1

Leonard Richardson (leonardr) wrote on 2018-07-15:

I reproduced this using only lxml code, so it's a bug in lxml. I filed the issue against lxml here: https://bugs.launchpad.net/lxml/+bug/1781797

Changed in beautifulsoup:
status:	New → Won't Fix
