bytes-like regex failed on string-like markup
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Fix Released | Undecided | Unassigned |
Bug Description
# Script Start
from bs4 import BeautifulSoup
markup = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://新2网址(www.
<html xmlns="http://
<head>
<meta content="text/html; charset=utf-8" http-equiv=
<title>
<meta content=
<meta content="时时彩娱乐官网✅✅ 是全网最诚信,
<title>
</head>
<body>
<h1><a href="http://
</body>
</html>
"""
# Raises TypeError: cannot use a bytes pattern on a string-like object
soup = BeautifulSoup(markup, "lxml")
# Encoding the markup to UTF-8 first avoids the TypeError...
soup = BeautifulSoup(markup.encode("utf-8"), "lxml")
# ...but the parse comes back empty; this prints an empty string
print(str(soup))
# Script End
The HTML markup above is a small portion of a large HTML file.
System information
Uname Result: 5.0.0-23-generic #24~18.04.1-Ubuntu SMP Mon Jul 29 16:12:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Python Version: 3.6.8
Libraries
beautifulsoup4=
lxml==4.3.3
description: updated
Changed in beautifulsoup:
status: Fix Committed → Fix Released
It looks like there are three problems here.
1. The TypeError. This is in Beautiful Soup code and easy to fix.
2. The lxml parser doesn't deal well with Unicode documents. It's rejecting your markup, with this exception:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 79: unexpected end of data
But you don't get any visibility into that exception. I fixed this by propagating the exception upwards so you can see it.
3. By encoding the data as UTF-8, you can get lxml to accept the markup without raising an exception. But whatever problem lxml is having with this particular document doesn't go away: it ignores the entire thing, apparently because of whatever problem it perceives in the DOCTYPE, and you're left with an empty BeautifulSoup object.
The fixes for problems 1 and 2 are in revision 526. To actually parse the document, I recommend using html5lib as the parser instead of lxml.
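For illustration, here is a minimal sketch of the html5lib workaround. The markup below is a stand-in with the same shape as the reporter's document (XHTML DOCTYPE, UTF-8 meta tag, non-ASCII text); the original document's URLs and attribute values are truncated above, so they are not reproduced. This assumes html5lib is installed (`pip install html5lib`):

```python
from bs4 import BeautifulSoup

# Stand-in markup resembling the reporter's document: an XHTML
# transitional DOCTYPE plus non-ASCII (Chinese) content.
markup = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>时时彩娱乐官网</title>
</head>
<body>
<h1><a href="http://example.com/">时时彩</a></h1>
</body>
</html>
"""

# html5lib builds the tree the way a browser would, so it tolerates
# markup that makes lxml reject the whole document.
soup = BeautifulSoup(markup, "html5lib")
print(soup.h1.get_text())
```

Unlike the lxml attempt above, this yields a populated tree rather than an empty BeautifulSoup object.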