bytes-like regex failed on string-like markup

Bug #1838877 reported by Kamil Mahmood on 2019-08-04
This bug affects 1 person
Affects: Beautiful Soup
Importance: Undecided
Assigned to: Unassigned

Bug Description

# Script Start
from bs4 import BeautifulSoup

markup = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://新2网址(www.ydsjyj.com)-时时彩平台,(www.xinyushishicai.com)-澳门赌场(www.amdc999.com)">
<html xmlns="http://www.w3.org/1999/xhtml">

<head>
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
    <title>时时彩娱乐-首页</title>
    <meta content="时时彩娱乐,时时彩娱乐网址,时时彩娱乐平台,时时彩娱乐官网" name="keywords" />
    <meta content="时时彩娱乐官网✅✅ 是全网最诚信,口碑最好的彩票平台!提款速度最快,赔率高达9.999 极力为您提供注册、登陆、下载、测速等服务.时时彩娱乐祝您玩的愉快开心。" name="description" />
    <title>时时彩娱乐-首页</title>
</head>

<body>
    <h1><a href="http://4b2s.com/">时时彩娱乐</a></h1>
</body>
</html>
"""

# Raises TypeError: cannot use a bytes pattern on a string-like object
soup = BeautifulSoup(markup, features="lxml")

soup = BeautifulSoup(markup.encode("utf-8"), features="lxml", from_encoding="utf-8")
# Prints an empty string
print(str(soup))

# Script End

The HTML markup above is a small portion of a large HTML file.

System information
Uname Result: 5.0.0-23-generic #24~18.04.1-Ubuntu SMP Mon Jul 29 16:12:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Python Version: 3.6.8

Libraries
beautifulsoup4==4.7.1
lxml==4.3.3

Leonard Richardson (leonardr) wrote :

It looks like there are three problems here.

1. The TypeError. This is in Beautiful Soup code and easy to fix.

2. The lxml parser doesn't deal well with Unicode documents. It's rejecting your markup, with this exception:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 79: unexpected end of data

But you don't get any visibility into that exception. I fixed this by propagating the exception upwards so you can see it.

3. By encoding the data as UTF-8, you can get lxml to accept the markup without raising an exception. But whatever problem lxml is having with this particular document doesn't go away, and lxml still can't handle it: it ignores the entire document, apparently because of the problem it perceives in the DOCTYPE, and you're left with an empty BeautifulSoup object.
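As an aside, the "unexpected end of data" message in problem 2 is the error CPython raises when a multi-byte UTF-8 sequence is cut off mid-character. A plausible (unconfirmed) explanation is that the document is being decoded in pieces and a boundary falls inside one of the CJK characters; the snippet below only demonstrates how that exact error message arises, not where lxml splits the data:

```python
# One CJK character from the markup encodes to three bytes in UTF-8.
text = "时"
data = text.encode("utf-8")   # b'\xe6\x97\xb6'

# Decoding a slice that ends after the lead byte 0xe6 reproduces the
# "unexpected end of data" reason seen in the lxml traceback.
try:
    data[:1].decode("utf-8")
except UnicodeDecodeError as e:
    print(e.reason)
```

Running this prints `unexpected end of data`, matching the reason in the reported UnicodeDecodeError.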

The fixes to 1 and 2 are in revision 526. To actually parse the document I recommend using html5lib as the parser instead of lxml.
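A minimal sketch of that workaround, using a shortened version of the reported markup (html5lib must be installed separately, e.g. `pip install html5lib`; html5lib tolerates the malformed DOCTYPE instead of discarding the document):

```python
from bs4 import BeautifulSoup

# Abbreviated version of the reported document, including a DOCTYPE
# with a non-standard system identifier and CJK text.
markup = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://4b2s.com/非标准">
<html><head><title>时时彩娱乐-首页</title></head>
<body><h1><a href="http://4b2s.com/">时时彩娱乐</a></h1></body></html>
"""

# html5lib parses the document instead of returning an empty tree.
soup = BeautifulSoup(markup, features="html5lib")
print(soup.title.string)
print(soup.h1.a["href"])
```

html5lib is slower than lxml, but it implements the HTML5 error-recovery rules browsers use, which is why it copes with markup that makes lxml give up.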

Changed in beautifulsoup:
status: New → Fix Committed
Kamil Mahmood (kamilmahmood) wrote :

Thanks for fixing the bug.
