__repr__ of a Tag object should return an ASCII-encoded string on Python 2
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Fix Released | Undecided | Unassigned |
Bug Description
beautifulsoup currently returns a UTF-8 encoded string when the __repr__ method is called on a Tag object.
That causes issues when __repr__ is invoked implicitly, for example as part of str(list), through the %r format specifier, or within an application or library that does not expect non-ASCII characters from __repr__.
A rule of thumb for __repr__ is to return an ASCII-encoded string, to avoid ambiguity in the repr content and mismatched expectations in clients. beautifulsoup should expose the Unicode representation of a Tag object via the __unicode__ method and keep __repr__ and __str__ ASCII-only on Python 2. It could probably return a safely escaped byte string instead of a UTF-8 encoded one, which would carry the same meaning.
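For illustration, here is a minimal sketch of how __repr__ gets picked up implicitly. The Item class is hypothetical and not part of beautifulsoup; it only shows which hook each call uses.

```python
class Item(object):
    """Hypothetical class used only to show which hook each call uses."""

    def __repr__(self):
        return "REPR"

    def __str__(self):
        return "STR"


item = Item()

# str() on a container formats each element with repr(), not str().
list_text = str([item])

# %r goes through __repr__, while %s goes through __str__.
percent_r_text = "%r" % (item,)
percent_s_text = "%s" % (item,)

print(list_text)
print(percent_r_text)
print(percent_s_text)
```

So even code that never calls repr() directly can end up with whatever bytes __repr__ returns.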
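As a rough sketch of the difference between the two encodings (the sample string and the is_ascii helper are only illustrative assumptions, not beautifulsoup code):

```python
text = u"<b>\N{SNOWMAN}</b>"  # sample markup with one non-ASCII character

utf8_bytes = text.encode("utf-8")
escaped_bytes = text.encode("unicode-escape")


def is_ascii(data):
    """Return True if every byte in data is below 128."""
    return all(byte < 128 for byte in bytearray(data))


print(is_ascii(utf8_bytes))     # False
print(is_ascii(escaped_bytes))  # True

# The escaped form is lossless: decoding it restores the original text.
round_tripped = escaped_bytes.decode("unicode-escape")
```

The escaped byte string stays within ASCII but can still be decoded back to the original text, so nothing is lost compared to the UTF-8 form.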
The issue caused a side bug in pytest:
https:/
While the issue above could be fixed in pytest, there could be other cases where beautifulsoup causes similar issues.
Related reading:
http://
Changed in beautifulsoup:
status: New → Fix Committed
Changed in beautifulsoup:
status: Fix Committed → Fix Released
Thanks for filing this bug report.
I agree with you about __repr__. I've modified Tag.__repr__ so that the default encoding is 'unicode-escape' instead of 'utf-8'. In Python 2, repr(tag) will now return an ASCII-encoded bytestring instead of a UTF-8 bytestring. In Python 3, repr() will return a Unicode string instead of a UTF-8 bytestring.
Here's the code and the test I wrote for repr(). I'd appreciate it if you could sanity-check this.
def __repr__(self, encoding="unicode-escape"):
    """Renders this tag as a string."""
    if PY3K:
        # "The return value must be a string object", i.e. Unicode
        return self.decode()
    else:
        # "The return value must be a string object", i.e. a bytestring.
        # By convention, the return value of __repr__ should also be
        # an ASCII string.
        return self.encode(encoding)

def test_repr(self):
    html = u"<b>\N{SNOWMAN}</b>"
    soup = self.soup(html)
    if PY3K:
        self.assertEqual(html, repr(soup))
    else:
        self.assertEqual(b'<b>\\u2603</b>', repr(soup))
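
To try the same pattern outside Beautiful Soup, here is a rough standalone sketch. FakeTag and its markup attribute are made up for illustration, and PY3K is assumed to be a plain version check; the real Tag class does much more than this.

```python
import sys

PY3K = sys.version_info[0] >= 3  # assumed definition of the flag


class FakeTag(object):
    """Hypothetical stand-in for Tag that just wraps a Unicode string."""

    def __init__(self, markup):
        self.markup = markup

    def decode(self):
        # Unicode rendering of the tag.
        return self.markup

    def encode(self, encoding="utf-8"):
        # Bytestring rendering in the requested encoding.
        return self.decode().encode(encoding)

    def __repr__(self, encoding="unicode-escape"):
        if PY3K:
            # Python 3 requires __repr__ to return a Unicode str.
            return self.decode()
        # Python 2: return an ASCII-only bytestring.
        return self.encode(encoding)


tag = FakeTag(u"<b>\N{SNOWMAN}</b>")
escaped = tag.encode("unicode-escape")
print(escaped)
```

Only the default encoding of __repr__ differs from encode() here, so explicit calls to encode() keep producing UTF-8.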
I don't agree with you that __str__ is supposed to contain only ASCII symbols. I'm happy to be corrected if you have a reference, but even if it's true, I think changing the behavior of __str__ would break too much existing code, so I don't think I'll be changing that.