Some tests fail with python3

Bug #1681115 reported by Aloysius
Affects: Beautiful Soup
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: none listed

Bug Description

bs4 4.5.3
It seems to be unicode-related, but I'm not sure how it relates to prior bugs.

See attachment

Aloysius (aloysius-w) wrote :
Leonard Richardson (leonardr) wrote :

I can't duplicate this on Python 3.6.1 or Python 3.5.2, and I can't make sense of the test failures.

The fact that SNOWMAN is rendering as "☃" in your terminal tells me that you may have your default output encoding set to Windows-1252, or that you otherwise have a Windows-like system. I can't square that with the UNIX paths in the tracebacks, but I think the root of the problem is that the test environment occasionally reaches for an encoding it thinks is UTF-8, and it gets Windows-1252 instead.

However, even assuming that's the case, I can't make sense of the test_simple_html_substitution failure, which is the simplest one. Since \u2200 is not being converted to &forall;, there's a problem with the regular expression, which is generated from the code points listed in html.entities.codepoint2name.

The regular expression should be matching \u2200, but it isn't. This tells me that the regular expression was probably created out of garbled data--maybe Unicode strings encoded as UTF-8 and then decoded as Windows-1252. But I don't see a way that could happen. The Unicode strings are created with chr(), which takes a Unicode code point. Then they're immediately compiled into a regular expression.
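A minimal sketch of the mechanism described above, using only the standard library. It is not Beautiful Soup's actual code: the names `lookup`, `pattern`, and `substitute_html` are illustrative, and details such as special handling of the double quote are left out.

```python
import re
from html.entities import codepoint2name

# Map each character that has a named HTML entity to that entity's name.
# chr() takes a Unicode code point, so no encoding or decoding is involved.
lookup = {chr(codepoint): name for codepoint, name in codepoint2name.items()}

# One character class that matches any character with a named entity.
pattern = re.compile("[%s]" % "".join(re.escape(c) for c in lookup))


def substitute_html(s):
    """Replace every character that has a named entity with &name;."""
    return pattern.sub(lambda m: "&%s;" % lookup[m.group(0)], s)


# The string the test intends to build: the escapes become real characters.
good = "foo\u2200\N{SNOWMAN}\u00f5bar"
# The string seen in the failing run: a literal backslash followed by "u2200".
bad = "foo\\u2200\N{SNOWMAN}\\u00f5bar"

# ascii() keeps the output readable whatever the terminal encoding is.
print(ascii(substitute_html(good)))
print(ascii(substitute_html(bad)))
```

With the first input, the two characters that have named entities are replaced. With the second input nothing is replaced, because a backslash, the letter u, and digits have no named entities, so the regular expression has nothing to match.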

I am not denying your experience, but I can't duplicate or diagnose this issue, so I'm closing it.

Some things that might be useful:

* The tests that are failing here are very old. Try running the tests on earlier versions of Beautiful Soup and see if there's a point where the test failures start.
* Duplicate the problem with standalone Python code that uses EntitySubstitution.substitute_html.
* Duplicate the problem with Python 2 code on the same machine that gives you the problem with Python 3.

Changed in beautifulsoup:
status: New → Won't Fix
status: Won't Fix → Incomplete
Aloysius (aloysius-w) wrote :

Thanks for your reply.

It is indeed a Linux system. Concerning Unicode, I couldn't find anything strange among the shell variables; LANG is set to en_US.UTF-8.
Could this encoding problem possibly be coming from another module? If so, which one?

The problem with testing older versions is that with anything earlier than 4.5.0 I get an "ERROR: Failure: AttributeError ('module' object has no attribute '_base')" when launching nosetests.

And in regard to bullet points #2 and #3, I'm afraid my python-fu is not good and I would require some handholding.

Regards

Leonard Richardson (leonardr) wrote :

Forget about the Windows-1252 thing. My browser was rendering the logfile as Windows-1252. When I download the log and look at it in a terminal I see it as UTF-8 and it looks more like what I'd expect. I'll paste one of the test failures into this window so it will show up as part of a UTF-8 web page and we'll see the same thing when we look at it:

FAIL: test_simple_html_substitution (tests.test_soup.TestEntitySubstitution)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/abuild/rpmbuild/BUILDROOT/python-beautifulsoup4-4.5.3-0.x86_64/usr/lib/python3.4/site-packages/bs4/tests/test_soup.py", line 168, in test_simple_html_substitution
    "foo&forall;\N{SNOWMAN}&otilde;bar")
AssertionError: 'foo\\u2200☃\\u00f5bar' != 'foo&forall;☃&otilde;bar'
- foo\u2200☃\u00f5bar
+ foo&forall;☃&otilde;bar

The regular expression isn't matching because the input is wrong. This code is supposed to set up the nine-character string "foo∀☃õbar":

s = "foo\u2200\N{SNOWMAN}\u00f5bar"

Instead, on your system, it sets up the nineteen-character string whose repr is 'foo\\u2200☃\\u00f5bar' (the repr doubles each backslash). SNOWMAN is converted to ☃, but \u2200 is not converted to ∀. The string literal is being read as a Unicode string (not a byte string -- otherwise you would get 'foo\\u2200\\N{SNOWMAN}\\u00f5bar') but the \u escape sequences are not being respected.
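The difference between the two strings can be checked directly. This is a small sketch; `expected` and `observed` are illustrative names, and `observed` is written with explicit doubled backslashes to reproduce the string from the failure output.

```python
# The intended string: the escapes are processed into single characters.
expected = "foo\u2200\N{SNOWMAN}\u00f5bar"

# The string from the failing run: each "\\u" here is a literal backslash
# followed by "u", while \N{SNOWMAN} is still processed normally.
observed = "foo\\u2200\N{SNOWMAN}\\u00f5bar"

print(len(expected))            # 9
print(len(observed))            # 19
print(len(repr(observed)) - 2)  # 21, the repr without its surrounding quotes
print(expected == observed)     # False
```

len() reports 19 for the second string because each doubled backslash in the repr stands for a single character; counting the characters of the repr between its quotes gives 21.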

The other three test failures look like the same problem.

To start with, try running this code. The output should be: ∀

print("\u2200")

That will probably work, but if it doesn't, there's a problem with your Python installation and I have no clue what's going on.

Then, run the script I've attached to this comment. This script includes only the code necessary to do what test_simple_html_substitution tests. On my computer the output is "foo&forall;☃&otilde;bar", but on your computer I would expect the output to be 'foo\\u2200☃\\u00f5bar'. The script is only 50 lines of code, so it should be possible for you to use it as a tool to figure out what's going on on your machine.

Aloysius (aloysius-w) wrote :

Hi, sorry for the late reply, but I wanted to check things in different ways to be sure.

Both tests succeed, but the script succeeds only when LANG=en_US.UTF-8 is exported or prepended to the command.

Since I'm not running the test interactively and I'm apparently the only one seeing this problem, I'm starting to wonder if it's not the package build system (OBS) that's somehow introducing it.

Are the three failing tests in the log the only ones that could be affected by Unicode issues? Can you think of a way of altering them so as to verify that the relevant environment variables are set as they should be?

Leonard Richardson (leonardr) wrote :

The only other test in the system that uses the \u construct is Test.test_repr, and that is a conditional test which doesn't use the \u construct under Python 3. So something in your system doesn't like the \u construct.

Is there a specific value other than en_US.UTF-8 that you have to set LANG to in order to make the script fail? Or does any other value make the script fail?

Have you checked on what the value of LANG is during the test? Maybe the build system changes or unsets it. That said, I just ran the tests with LANG unset, with LANG=C, and with LANG=en_IN, and there was no problem on Python 2 or Python 3.
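One way to check what the build system hands to the test run is to record the locale variables and the encodings Python derives from them at the start of the tests. The sketch below uses only the standard library; the function name `describe_encoding_environment` is made up for illustration and is not part of Beautiful Soup or nose.

```python
import locale
import os
import sys


def describe_encoding_environment():
    """Collect the locale variables and the encodings Python is using."""
    return {
        # Environment variables that influence the locale; None if unset.
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "LC_CTYPE": os.environ.get("LC_CTYPE"),
        # Encoding used by default when opening text files.
        "preferred_encoding": locale.getpreferredencoding(False),
        # Encoding of the output stream (may be None if stdout is replaced).
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        # Encoding used for file names.
        "filesystem_encoding": sys.getfilesystemencoding(),
        # Default str/bytes conversion encoding; always utf-8 on Python 3.
        "default_encoding": sys.getdefaultencoding(),
    }


for key, value in describe_encoding_environment().items():
    print("%s = %r" % (key, value))
```

Running this both interactively and inside the build environment, then comparing the two outputs, would show whether the build system changes or unsets any of these values.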

Aloysius (aloysius-w) wrote :

I created the attached silly patch and its output, even when nosetests is run through OBS, is en_US.UTF-8.

Am I missing something?

Aloysius (aloysius-w) wrote :

Removing a manual invocation of 2to3 did the trick. For some reason it was corrupting unicode test strings.

Sorry for wasting your time.

Regards
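For anyone hitting something similar, a converted source file can be scanned for string constants that still contain a literal backslash-u sequence. The thread does not say how 2to3 altered the strings, so the doubled backslash in `corrupted_source` below is only a hypothetical illustration consistent with the failure output. The sketch assumes Python 3.8 or newer (for `ast.Constant`), and `find_unprocessed_escapes` is an illustrative name.

```python
import ast
import re

# A literal backslash, the letter u, and four hex digits inside a string value.
LITERAL_ESCAPE = re.compile(r"\\u[0-9a-fA-F]{4}")


def find_unprocessed_escapes(source):
    """Return (line, value) pairs for string constants whose value still
    contains a literal backslash-u sequence after parsing."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if LITERAL_ESCAPE.search(node.value):
                hits.append((node.lineno, node.value))
    return hits


# Raw strings keep the sample source text exactly as it would appear on disk.
healthy_source = r's = "foo\u2200bar"'
# Hypothetical corruption: the backslash has been doubled in the source file.
corrupted_source = r's = "foo\\u2200bar"'

print(find_unprocessed_escapes(healthy_source))
print(len(find_unprocessed_escapes(corrupted_source)))
```

The healthy source produces no hits because the parser turns the escape into a single character. The hypothetical corrupted source produces one hit on line 1. A legitimate string that is meant to contain a backslash-u sequence would also be reported, so any hits need a manual look.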

Changed in beautifulsoup:
status: Incomplete → Invalid