“lxml” parser breaks long strings into characters
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Won't Fix | Undecided | Unassigned |
Bug Description
The lxml parser incorrectly parses strings longer than 2^14 (16,384) characters, but correctly parses the same content when it is encoded to bytes. Everything after the 16,384th character is treated as individual characters rather than as words and tags.
Minimal example:
from bs4 import BeautifulSoup
import urllib
url = "http://
html_raw = urllib.
html_str = urllib.
type(html_raw), len(html_raw) #(<class 'bytes'>, 304769)
type(html_str), len(html_str) #(<class 'str'>, 304769)
repr(BeautifulS
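Since the report says the same content parses correctly as bytes, one possible workaround is to encode long strings before handing them to the parser. The following is only a sketch under that assumption, not a confirmed fix: the helper names `as_bytes_for_lxml`, `build_long_document`, and `parse_with_workaround` are hypothetical, and the synthetic document stands in for the page fetched above.

```python
LXML_STR_LIMIT = 2 ** 14  # 16,384 characters, the threshold described in this report


def as_bytes_for_lxml(markup, encoding="utf-8"):
    """Return markup as bytes, the input form reported to parse correctly."""
    if isinstance(markup, str):
        return markup.encode(encoding)
    return markup  # already bytes, pass through unchanged


def build_long_document(n_paragraphs):
    """Build a synthetic HTML string (hypothetical test data, not the original page)."""
    body = "".join("<p>paragraph %d</p>" % i for i in range(n_paragraphs))
    return "<html><body>" + body + "</body></html>"


def parse_with_workaround(markup):
    """Parse with the lxml tree builder, always passing bytes."""
    # Imported here so the helpers above can be used without bs4 installed.
    from bs4 import BeautifulSoup
    # from_encoding tells Beautiful Soup which encoding the bytes use,
    # so it does not have to guess.
    return BeautifulSoup(as_bytes_for_lxml(markup), "lxml", from_encoding="utf-8")


html_str = build_long_document(2000)
html_bytes = as_bytes_for_lxml(html_str)
```

With the affected versions installed, comparing `len(BeautifulSoup(html_str, "lxml").find_all("p"))` against `len(parse_with_workaround(html_str).find_all("p"))` should show whether the string path loses tags past the threshold.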
Package information:
Python 3.6.5
BeautifulSoup 4.6.0
lxml 4.2.1
I reproduced this using only lxml code, so it's a bug in lxml. I filed the issue against lxml here: https://bugs.launchpad.net/lxml/+bug/1781797