__repr__ of a Tag object should return an ASCII-encoded string on Python 2

Bug #1420131 reported by Roman Bolshakov
Affects: Beautiful Soup
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Beautiful Soup currently returns a UTF-8 encoded string when __repr__ is called on a Tag object.
This causes problems when __repr__ is invoked implicitly, for example as part of str(list), through the %r format specifier, or inside an application or library that does not expect non-ASCII characters from __repr__.
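
To illustrate the implicit call paths, here is a minimal sketch. FakeTag is a hypothetical stand-in, not Beautiful Soup's Tag class, and the snippet is meant to be run on Python 3, where __repr__ returns text; it only shows that str() of a list and %r both end up calling __repr__ on the object.

```python
class FakeTag:
    """Hypothetical stand-in for a Tag; not Beautiful Soup's class."""

    def __init__(self, markup):
        self.markup = markup
        self.repr_calls = 0  # counts how often __repr__ is invoked

    def __repr__(self):
        self.repr_calls += 1
        return self.markup


tag = FakeTag(u"<b>\u2603</b>")

# str() of a list formats each element with repr(), not str().
as_list = str([tag])

# The %r format specifier also goes through repr().
as_format = "%r" % (tag,)

print(tag.repr_calls)
```

Neither line calls repr() explicitly, yet the counter is incremented both times, so whatever __repr__ returns ends up in places the caller may not control.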

A rule of thumb for __repr__ is to return an ASCII-encoded string, which avoids ambiguity about the content of the repr string and mismatched expectations in client code. Beautiful Soup should expose the Unicode representation of a Tag object via the __unicode__ method and keep __repr__ and __str__ ASCII-only on Python 2. __repr__ could probably return a safely escaped byte string instead of a UTF-8 encoded one, which would carry the same information.
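
As a rough sketch of what a safely escaped byte string could look like, the snippet below compares the UTF-8 bytes with the output of Python's built-in unicode_escape codec. The helper name ascii_safe is an assumption made for illustration and is not part of Beautiful Soup.

```python
def ascii_safe(markup):
    """Hypothetical helper: escape non-ASCII characters into ASCII bytes."""
    return markup.encode("unicode_escape")


markup = u"<b>\N{SNOWMAN}</b>"

# UTF-8 output contains bytes outside the ASCII range.
utf8_bytes = markup.encode("utf-8")

# The escaped form stays within ASCII and spells out the code point.
escaped = ascii_safe(markup)

# Decoding with the same codec recovers the original text.
roundtrip = escaped.decode("unicode_escape")

print(utf8_bytes)
print(escaped)
```

The escaped form is unambiguous about which character was present, and the same information can be recovered by decoding it again. Note that unicode_escape also escapes backslashes and control characters such as newlines, so the output is not always identical to the markup even for ASCII-only input.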

The issue caused a side bug in pytest:
https://bitbucket.org/hpk42/pytest/issue/678/pytest-cannot-deal-with-utf-8-encoded

While the issue above could be fixed in pytest, there could be other cases where Beautiful Soup causes similar issues.

Related reading:
http://kmike.ru/python-with-strings-attached/

Leonard Richardson (leonardr) wrote :

Thanks for filing this bug report.

I agree with you about __repr__. I've modified Tag.__repr__ so that the default encoding is 'unicode-escape' instead of 'utf-8'. In Python 2, repr(tag) will now return an ASCII-encoded bytestring instead of a UTF-8 bytestring. In Python 3, repr() will return a Unicode string instead of a UTF-8 bytestring.

Here's the code and the test I wrote for repr(). I'd appreciate it if you could sanity-check this.

    def __repr__(self, encoding="unicode-escape"):
        """Renders this tag as a string."""
        if PY3K:
            # "The return value must be a string object", i.e. Unicode
            return self.decode()
        else:
            # "The return value must be a string object", i.e. a bytestring.
            # By convention, the return value of __repr__ should also be
            # an ASCII string.
            return self.encode(encoding)

    def test_repr(self):
        html = u"<b>\N{SNOWMAN}</b>"
        soup = self.soup(html)
        if PY3K:
            self.assertEqual(html, repr(soup))
        else:
            self.assertEqual(b'<b>\\u2603</b>', repr(soup))

I don't agree with you that __str__ is supposed to contain only ASCII symbols. I'm happy to be corrected if you have a reference, but even if it's true, I think changing the behavior of __str__ would break too much existing code, so I don't think I'll be changing that.

Leonard Richardson (leonardr) wrote :

Fix is in revision 365. I'm not closing this issue yet because I'm considering how much code it would break to make str() return a Unicode object in Python 3.

Changed in beautifulsoup:
status: New → Fix Committed
Changed in beautifulsoup:
status: Fix Committed → Fix Released