analyse_oops_reports raised an UnicodeEncodeError loading an oops from the filesystem

Bug #891186 reported by Diogo Matsubara
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
python-oops-tools
Fix Released
Critical
Martin Packman

Bug Description

Traceback (most recent call last):
  File "bin/analyse_error_reports", line 49, in <module>
    oopstools.scripts.analyse_error_reports.main()
  File "/srv/lp-oops.canonical.com/cgi-bin/lpoops/src/oopstools/scripts/analyse_error_reports.py", line 137, in main
    summary.renderHTML(fp)
  File "/srv/lp-oops.canonical.com/cgi-bin/lpoops/src/oopstools/oops/dbsummaries.py", line 681, in renderHTML
    section.renderHTML(fp)
  File "/srv/lp-oops.canonical.com/cgi-bin/lpoops/src/oopstools/oops/dbsummaries.py", line 307, in renderHTML
    self._renderGroups(fp, html=True)
  File "/srv/lp-oops.canonical.com/cgi-bin/lpoops/src/oopstools/oops/dbsummaries.py", line 281, in _renderGroups
    group.renderHTML(fp)
  File "/srv/lp-oops.canonical.com/cgi-bin/lpoops/src/oopstools/oops/dbsummaries.py", line 177, in renderHTML
    % (self.count, _escape(self.etype), _escape(self.evalue)))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 78: ordinal not in range(128)

The oops that caused the error: OOPS-fc231a0a7471d55e72e091f0bcc135eb

Related branches

Changed in python-oops-tools:
status: New → Triaged
importance: Undecided → Critical
Revision history for this message
Martin Packman (gz) wrote :

This is a pretty simple problem.

  oopstools/oops/models.py line 153
    exception_value = models.CharField(max_length=MAX_EVALUE_LEN)

Django deals in unicode.

  oopstools/scripts/analyse_error_reports.py lines 136-138
    fp = open(options.html, 'wb')
    summary.renderHTML(fp)
    fp.close()

Opens a file with no encoding specified.

  oopstools/oops/dbsummaries.py lines 175-176
    fp.write('<div class="exc">%d <b>%s</b>: %s</div>\n'
             % (self.count, _escape(self.etype), _escape(self.evalue)))

Here `_escape()` is just a dumb cgi.escape wrapper that handles '&<>' and nothing else, and returns unicode when given unicode. Any value being written that can't be coerced to ascii will throw.

Using a better means of creating html here would be nice, but just settling on an encoding and making string values written conform to it would help.

no longer affects: oops-tools
Revision history for this message
Martin Packman (gz) wrote :
Revision history for this message
Robert Collins (lifeless) wrote :

That fix assumes all values will be unicode - something that isn't (currently) guaranteed. referrer is probably (or should be) bytestrings [its an http header: they are bytestrings on the wire, and for all upstream python posturing, a unicode http header is ill defined; particularly as there is now a bis recommendation for how to put unicode into a header: headers that are badly encoded will still exist].

Essentially we have a mix of data sources, so I think this needs to be type checking.

Revision history for this message
Robert Collins (lifeless) wrote :

What would really help here is a failing test. The attached patch passes all existing tests, so isn't a regression.

Revision history for this message
Robert Collins (lifeless) wrote :

I've attached a branch as well, I won't propose for merge because its lacking a test - that might be something to add. I'd create an OOPS on the fly for this test I think.

Revision history for this message
Martin Packman (gz) wrote :

Adding an isinstance there is one way Robert, but you still get a page out at the end with no well-defined encoding, which may display incorrectly in browsers and such like. If any current inputs aren't unicode, or may have other problematic bytes such as control characters, it would be best to escape those to a form that will actually display usefully in a webpage before writing them out. That could either be done in the _escape function, or ideally much earlier.

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 891186] Re: analyse_oops_reports raised an UnicodeEncodeError loading an oops from the filesystem

On Thu, Nov 17, 2011 at 2:35 PM, Martin Packman
<email address hidden> wrote:
> Adding an isinstance there is one way Robert, but you still get a page
> out at the end with no well-defined encoding, which may display

Thats not (at all) true AFAICT.

Revision history for this message
Selene ToyKeeper (toykeeper) wrote :

I tried to merge a fix for the same problem a while back... it got rejected because it allows oops-tools to accept out-of-spec data (which was kind of the point).

Anyway, the implementation was very simple, and I've attached it.

Revision history for this message
Martin Packman (gz) wrote :

Yes Selene, your merge proposal was mentioned on IRC and is basically the same just from the other end. Will try and make sure something gets landed this time around. :)

Revision history for this message
Diogo Matsubara (matsubara) wrote :

I added a branch with a test case that reproduces the problem: lp:~matsubara/oops-tools/891186-unicodeencodeerror

Revision history for this message
Diogo Matsubara (matsubara) wrote :

When I merge in your patch Robert, I get another failure, this time with the Contains matcher.

======================================================================
ERROR: test_renderHTML_with_unicode_data (test_dbsummaries.TestWebAppErrorSummary)
test_dbsummaries.TestWebAppErrorSummary.test_renderHTML_with_unicode_data
----------------------------------------------------------------------
_StringException: Traceback (most recent call last):
  File "/home/matsubara/devel/canonical/oops-tools/891186-unicodeencodeerror/src/oopstools/oops/test/test_dbsummaries.py", line 52, in test_renderHTML_with_unicode_data
    self.assertThat(fp.read(), Contains(u'some unicode char (\xe1)'))
  File "/home/matsubara/devel/canonical/oops-tools/trunk/eggs/testtools-0.9.12-py2.7.egg/testtools/testcase.py", line 400, in assertThat
    mismatch = matcher.match(matchee)
  File "/home/matsubara/devel/canonical/oops-tools/trunk/eggs/testtools-0.9.12-py2.7.egg/testtools/matchers.py", line 606, in match
    if self.needle not in matchee:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2272: ordinal not in range(128)

Revision history for this message
Martin Packman (gz) wrote :

You can only read and write bytes from cStringIO.StringIO, so the problem with this assertion is it's comparing bytes with unicode:

    fp.seek(0)
    self.assertThat(fp.read(), Contains(u'some unicode char (\xe1)'))

I'd write it instead asserting the UTF-8 byte string:

    self.assertThat(fp.getvalue() Contains("some unicode char (\xc3\xa1)"))

Or with `.decode("utf-8")` after the getvalue if you want to keep the unicode Contains which is neater, especially with older testtools versions.

Revision history for this message
Martin Packman (gz) wrote :

Robert, the problem with your version of the change is you're baking in the bug for later. If non-ascii bytestrings need to be output (which doesn't seem to be the case currently), you get the following situation:

>>> "%s %s" % (_escape(u"\xa7"), _escape("\xa7"))
'\xc2\xa7 \xa7'
>>> _.decode("utf-8")
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa7 in position 3: invalid start byte

Creating a html with any non-utf-8 inputs will result in a page that is likely to display incorrectly, because it contains a mix of different encodings. Avoiding that by ensuring text inputs are all unicode is neater.

Revision history for this message
Robert Collins (lifeless) wrote :

On Tue, Nov 22, 2011 at 3:54 AM, Martin Packman
<email address hidden> wrote:
> Robert, the problem with your version of the change is you're baking in
> the bug for later. If non-ascii bytestrings need to be output (which
> doesn't seem to be the case currently), you get the following situation:
>
>>>> "%s %s" % (_escape(u"\xa7"), _escape("\xa7"))
> '\xc2\xa7 \xa7'
>>>> _.decode("utf-8")
> Traceback (most recent call last):
>  ...
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa7 in position 3: invalid start byte
>
> Creating a html with any non-utf-8 inputs will result in a page that is
> likely to display incorrectly, because it contains a mix of different
> encodings. Avoiding that by ensuring text inputs are all unicode is
> neater.

The reality is that they aren't all unicode, and we need to escape
them in some sensible fashion. I don't want to go around in circles on
this - I will have a look today at the escaping more closely; the
basic constraints I see are;
 - crap data should be understandable (e.g. repr() style output)
 - the report should generate
 - we don't want any xss attack vectors.

Right now, because we're *not* using the django template engine for
this, its quite possible that all three goals are not reached.

-Rob

Martin Packman (gz)
Changed in python-oops-tools:
assignee: nobody → Martin Packman (gz)
status: Triaged → In Progress
Martin Packman (gz)
Changed in python-oops-tools:
status: In Progress → Fix Committed
Changed in python-oops-tools:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.