BeautifulSoup Selector Doesn't Support Non-ASCII characters

Bug #1455778 reported by Lumit
This bug affects 2 people
Affects: Beautiful Soup
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned
Milestone: none listed

Bug Description

Platform: Win8.1 64bit, Spyder IDE with Anaconda distribution

Reproduce the bug:

1. Open the IPython pane in the Spyder IDE.

2. Type the following code:

>>> import requests, bs4
>>> res = requests.get('http://nostarch.com')
>>> res.raise_for_status()
>>> noStarchSoup = bs4.BeautifulSoup(res.text)
>>> noStarchSoup.select('div')
...(lots of traceback )
...UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2746: ordinal not in range(128)

Leonard Richardson (leonardr) wrote :

Thanks for your bug report. This is potentially a very serious bug. Unfortunately there's not enough information here for me to solve the problem.

1. I need the actual markup that caused the problem. Websites change all the time. I can't duplicate your bug and I don't know if it's because http://nostarch.com has changed or if there's some other problem.

I realize that the instructions for filing bugs against Beautiful Soup said "at least mention the URL to the web page", so this is my fault. As of this bug I've changed those instructions to insist on the actual HTML.

If you still encounter this issue on http://nostarch.com, or you ever encounter it again, please upload an attachment containing the actual HTML you're feeding to Beautiful Soup.

2. I need to know which version of Beautiful Soup you're using (bs4.__version__). Maybe Anaconda has an old version of Beautiful Soup packaged and you're encountering a problem I've already fixed? I don't know.

3. Since you edited out the traceback, I have no idea where in Beautiful Soup the problem happened, so I can't go through the code to try and spot an obvious problem.
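
A minimal sketch of how that information could be gathered before attaching it to a report. The helper names `save_markup` and `environment_report` are hypothetical and not part of Beautiful Soup; the sketch only assumes that the markup is available as a text string (for example `res.text` from the session above).

```python
import io
import platform
import sys


def save_markup(markup, path, encoding="utf-8"):
    """Write the exact markup that was fed to Beautiful Soup to a file."""
    # newline="" keeps line endings exactly as they appear in the markup.
    with io.open(path, "w", encoding=encoding, newline="") as handle:
        handle.write(markup)
    return path


def environment_report():
    """Collect interpreter details; the bs4 version is added if importable."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "default_encoding": sys.getdefaultencoding(),
    }
    try:
        import bs4  # only available when Beautiful Soup is installed
        report["bs4"] = bs4.__version__
    except ImportError:
        report["bs4"] = "not installed"
    return report
```

Calling `save_markup(res.text, "page.html")` and pasting the result of `environment_report()` next to the full, unedited traceback would cover the three points above.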

Thanks again for filing this bug.

Changed in beautifulsoup:
status: New → Incomplete
Florian (florianpilz) wrote :

Hi, I just found the same misbehaviour. The problem is the use of `shlex` in older Python versions. More specifically, in my case `select` breaks on an HTML DOM with Python 2.6.9. I found the issue when working on `zope.testbrowser` (the test https://github.com/zopefoundation/zope.testbrowser/blob/master/src/zope/testbrowser/tests/test_browser.py#L931).

Even though the test breaks on my machine due to the issue, it succeeds on Travis (https://travis-ci.org/zopefoundation/zope.testbrowser/builds/140734841).

Florian (florianpilz) wrote :

To give more information: `html.select(u'label[for="foo-field"]')` raises a `ValueError`, since `html` is a BeautifulSoup instance and `select` calls `shlex.split(u'label[for="foo-field"]')` internally. But `shlex` does not support Unicode in this Python version. It works fine without the leading `u` prefix, that is, with a plain byte string.
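
A minimal sketch of the workaround this implies: convert the selector to the interpreter's native `str` type before passing it to `select`. The helper name `native_selector` and the UTF-8 default are assumptions for illustration, not part of Beautiful Soup.

```python
import sys


def native_selector(selector, encoding="utf-8"):
    """Return the selector as the interpreter's native str type.

    On Python 2 a unicode selector is encoded to a byte string, which is
    the form reported to work above. On Python 3 the selector is returned
    unchanged, because str is already the text type there.
    """
    if sys.version_info[0] == 2 and not isinstance(selector, str):
        return selector.encode(encoding)
    return selector
```

With this in place the failing call would become `html.select(native_selector(u'label[for="foo-field"]'))`.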

Florian (florianpilz) wrote :

I saw that you dropped support for Python 2.6 in 4.5.0. When I pin beautifulsoup to version 4.4.1 everything works as expected. :)

Leonard Richardson (leonardr) wrote :

Closing this issue as fixed because it only shows up in an unsupported version of Python.

Changed in beautifulsoup:
status: Incomplete → Won't Fix