Bug #1681115 “Some tests fail with python3” : Bugs : Beautiful Soup

Aloysius (aloysius-w) wrote on 2017-04-08:

#1

bs4-python3_test.log (4.6 KiB, text/html)

Leonard Richardson (leonardr) wrote on 2017-05-06:

#2

I can't duplicate this on Python 3.6.1 or Python 3.5.2, and I can't make sense of the test failures.

The fact that SNOWMAN is rendering as "â˜ƒ" in your terminal tells me that you may have your default output encoding set to Windows-1252, or that you otherwise have a Windows-like system. I can't square that with the UNIX paths in the tracebacks, but I think the root of the problem is that the test environment occasionally reaches for an encoding it thinks is UTF-8, and it gets Windows-1252 instead.
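The "â˜ƒ" rendering is what you get when the UTF-8 bytes for SNOWMAN are decoded as Windows-1252. A minimal sketch of that round trip, using only the standard library (the variable names are just for illustration):

```python
# SNOWMAN (U+2603) is three bytes in UTF-8.
snowman = "\N{SNOWMAN}"
utf8_bytes = snowman.encode("utf-8")

# Decoding those bytes with the wrong codec yields three unrelated characters.
mojibake = utf8_bytes.decode("cp1252")

print(utf8_bytes)       # b'\xe2\x98\x83'
print(ascii(mojibake))  # '\xe2\u02dc\u0192', which displays as the garbled text
```

ascii() is used for the second print so the output stays readable whatever the terminal encoding is.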

However, even assuming that's the case, I can't make sense of the test_simple_html_substitution failure, which is the simplest one. Since \u2200 is not being converted to &forall;, there's a problem with the regular expression, which is generated from the code points listed in html.entities.codepoint2name.

The regular expression should be matching \u2200, but it isn't. This tells me that the regular expression was probably created out of garbled data--maybe Unicode strings encoded as UTF-8 and then decoded as Windows-1252. But I don't see a way that could happen. The Unicode strings are created with chr(), which takes a Unicode code point. Then they're immediately compiled into a regular expression.
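To make that reasoning concrete, here is a rough sketch of how such a regular expression can be built from html.entities.codepoint2name with chr(). This is not Beautiful Soup's actual code; CHAR_TO_NAME, ENTITY_RE and substitute_named_entities are made-up names for illustration only.

```python
import re
from html.entities import codepoint2name

# Build each character from its code point with chr().
CHAR_TO_NAME = {chr(codepoint): name for codepoint, name in codepoint2name.items()}

# One character class covering every character that has a named entity.
ENTITY_RE = re.compile("[" + "".join(re.escape(c) for c in CHAR_TO_NAME) + "]")


def substitute_named_entities(text):
    """Replace each character that has a named entity with &name;."""
    return ENTITY_RE.sub(lambda m: "&" + CHAR_TO_NAME[m.group(0)] + ";", text)


result = substitute_named_entities("foo\u2200\N{SNOWMAN}\u00f5bar")
print(ascii(result))  # 'foo&forall;\u2603&otilde;bar'
```

SNOWMAN has no named entity, so it passes through unchanged. If the input contained a literal backslash followed by "u2200" rather than the single character U+2200, nothing in the character class would match it.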

I am not denying your experience, but I can't duplicate or diagnose this issue, so I'm closing it.

Some things that might be useful:

* The tests that are failing here are very old. Try running the tests on earlier versions of Beautiful Soup and see if there's a point where the test failures start.
* Duplicate the problem with standalone Python code that uses EntitySubstitution.substitute_html.
* Duplicate the problem with Python 2 code on the same machine that gives you the problem with Python 3.
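For the second suggestion, a standalone check might look something like the sketch below. It assumes Beautiful Soup is installed so that bs4.dammit.EntitySubstitution can be imported; run_check is a hypothetical helper name, and the import is guarded so the script still runs if the package is missing.

```python
try:
    from bs4.dammit import EntitySubstitution
except ImportError:
    EntitySubstitution = None  # Beautiful Soup is not installed here.


def run_check():
    """Feed the test's input string to substitute_html and return the result."""
    source = "foo\u2200\N{SNOWMAN}\u00f5bar"
    if EntitySubstitution is None:
        return None
    return EntitySubstitution.substitute_html(source)


output = run_check()
# ascii() avoids depending on the terminal's output encoding.
print(ascii(output))
```

Running the same file under the system Python and under the build environment should show whether the two disagree.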

Changed in beautifulsoup:
status:	New → Won't Fix
status:	Won't Fix → Incomplete

Aloysius (aloysius-w) wrote on 2017-05-07:

#3

Thanks for your reply.

It is indeed a Linux system. Concerning Unicode, I couldn't find anything strange among the shell variables; LANG is set to en_US.UTF-8.
Could this encoding problem possibly be coming from another module? If so, which one?

The problem with testing older versions is that with anything earlier than 4.5.0 I get an "ERROR: Failure: AttributeError ('module' object has no attribute '_base')" when launching nosetests.

And in regard to bullet points #2 and #3, I'm afraid my python-fu is not good and I would require some handholding.

Regards

Leonard Richardson (leonardr) wrote on 2017-05-07:

#4

Minimal test_simple_html_substitution script (1.9 KiB, text/x-python)

Forget about the Windows-1252 thing. My browser was rendering the logfile as Windows-1252. When I download the log and look at it in a terminal I see it as UTF-8 and it looks more like what I'd expect. I'll paste one of the test failures into this window so it will show up as part of a UTF-8 web page and we'll see the same thing when we look at it:

FAIL: test_simple_html_substitution (tests.test_soup.TestEntitySubstitution)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/abuild/rpmbuild/BUILDROOT/python-beautifulsoup4-4.5.3-0.x86_64/usr/lib/python3.4/site-packages/bs4/tests/test_soup.py", line 168, in test_simple_html_substitution
    "foo&forall;\N{SNOWMAN}&otilde;bar")
AssertionError: 'foo\\u2200☃\\u00f5bar' != 'foo&forall;☃&otilde;bar'
- foo\u2200☃\u00f5bar
+ foo&forall;☃&otilde;bar

The regular expression isn't matching because the input is wrong. This code is supposed to set up the nine-character string "foo∀☃õbar":

s = "foo\u2200\N{SNOWMAN}\u00f5bar"

Instead, on your system, it sets up a nineteen-character string, which the traceback displays as 'foo\\u2200☃\\u00f5bar' because repr doubles each backslash. SNOWMAN is converted to ☃, but \u2200 is not converted to ∀. The Python string is being converted to a Unicode string (not a byte string -- otherwise you would get 'foo\\u2200\\N{SNOWMAN}\\u00f5bar') but the \u escape sequences are not being respected.
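The difference between the two strings can be reproduced on a healthy interpreter. In this sketch a raw string stands in for "the \u escapes were not respected"; that is only a way to imitate the symptom, not a claim about what happened on the affected machine.

```python
# Escapes honoured: nine characters, no backslashes.
honoured = "foo\u2200\N{SNOWMAN}\u00f5bar"

# Raw strings keep \u2200 and \u00f5 as literal backslash sequences,
# while the snowman is still a single character.
not_honoured = r"foo\u2200" + "\N{SNOWMAN}" + r"\u00f5bar"

print(len(honoured))                # 9
print(len(not_honoured))            # 19
print(len(repr(not_honoured)) - 2)  # 21: repr doubles each backslash
print(ascii(not_honoured))          # 'foo\\u2200\u2603\\u00f5bar'
```

The quoted form in the assertion message is a repr, which is why it shows two backslashes where the string itself holds one.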

The other three test failures look like the same problem.

To start with, try running this code. The output should be: ∀

print("\u2200")

That will probably work, but if it doesn't, there's a problem with your Python installation and I have no clue what's going on.

Then, run the script I've attached to this comment. This script includes only the code necessary to do what test_simple_html_substitution tests. On my computer the output is "foo&forall;☃&otilde;bar", but on your computer I would expect the output to be 'foo\\u2200☃\\u00f5bar'. The script is only 50 lines of code, so it should be possible for you to use it as a tool to figure out what's going on on your machine.

Aloysius (aloysius-w) wrote on 2017-05-12:

#5

Hi, sorry for the late reply, but I wanted to check things in different ways to be sure.

Both tests succeed, although the script succeeds only when LANG=en_US.UTF-8 is exported or prepended to the command.

Since I'm not running the test interactively and I'm apparently the only one seeing this problem, I'm starting to wonder if it's not the package build system (OBS) that's somehow introducing it.

Are the three failing tests in the log the only ones that could be affected by Unicode issues? Can you think of a way of altering them so as to verify that the relevant environment variables are set as they should be?

Leonard Richardson (leonardr) wrote on 2017-05-12:

#6

The only other test in the system that uses the \u construct is Test.test_repr, and that is a conditional test which doesn't use the \u construct under Python 3. So something in your system doesn't like the \u construct.

Is there a specific value other than en_US.UTF-8 that LANG has to be set to for the script to fail? Or does any other value make the script fail?

Have you checked what the value of LANG is during the test? Maybe the build system changes or unsets it. That said, I just ran the tests with LANG unset, with LANG=C, and with LANG=en_IN, and there was no problem on Python 2 or Python 3.
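One way to look at the environment from inside the test run is to print the locale-related settings Python actually sees. A small sketch, standard library only; describe_encoding_environment is a hypothetical helper name, and the choice of variables to report is an assumption about what is worth checking.

```python
import locale
import os
import sys


def describe_encoding_environment():
    """Collect the locale and encoding settings visible to this process."""
    return {
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "LC_CTYPE": os.environ.get("LC_CTYPE"),
        "preferred_encoding": locale.getpreferredencoding(False),
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "default_encoding": sys.getdefaultencoding(),
    }


for key, value in describe_encoding_environment().items():
    print(key, "=", value)
```

Calling this from a test module's setup, or from a throwaway test, would show whether the build system runs the tests with different values than an interactive shell does.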

Aloysius (aloysius-w) wrote on 2017-05-16:

#7

testenv.patch (495 bytes, text/plain)

I created the attached silly patch, and its output, even when nosetests is run through OBS, is en_US.UTF-8.

Am I missing something?

Aloysius (aloysius-w) wrote on 2017-05-22:

#8

Removing a manual invocation of 2to3 did the trick. For some reason it was corrupting Unicode test strings.
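The thread does not say exactly how the strings were altered. Assuming the damage shows up as a doubled backslash in front of a \u escape in the converted source (which would match the failing assertion), a quick scan like the following could flag affected lines. DOUBLE_ESCAPED and find_double_escaped are hypothetical names for this sketch.

```python
import re

# Two literal backslashes followed by "u" and four hex digits, as source text.
DOUBLE_ESCAPED = re.compile(r"\\\\u[0-9a-fA-F]{4}")


def find_double_escaped(source_text):
    """Return (line_number, line) pairs that contain a doubled-backslash \\u escape."""
    hits = []
    for number, line in enumerate(source_text.splitlines(), start=1):
        if DOUBLE_ESCAPED.search(line):
            hits.append((number, line))
    return hits


# Raw strings are used here so the samples hold source text exactly as written.
good = r's = "foo\u2200\N{SNOWMAN}\u00f5bar"'
bad = r's = "foo\\u2200\N{SNOWMAN}\\u00f5bar"'

print(find_double_escaped(good))       # []
print(len(find_double_escaped(bad)))   # 1
```

Running it over the installed test_soup.py before and after the extra conversion step would show whether that step is where the escapes change.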

Sorry for wasting your time.

Regards

Leonard Richardson (leonardr) on 2018-07-14

Changed in beautifulsoup:
status:	Incomplete → Invalid
