Bug #501166 “testtools should be able to cope with non-ascii tra...” : Bugs : testtools

Martin Packman (gz) wrote on 2009-12-28:

#1

Example script demonstrating the issue (200 bytes, text/x-python)

Martin Packman (gz) wrote on 2009-12-28:

#2

Result of running the example script (2.6 KiB, text/plain)

Robert Collins (lifeless) wrote on 2009-12-29: Re: [Bug 501166] [NEW] testtools should be able to cope with non-ascii tracebacks

#3

On Mon, 2009-12-28 at 23:35 +0000, Martin [gz] wrote:
> Public bug reported:
>
> The testtools.content.TracebackContent class is not robust against
> tracebacks where either the exception value, a script filename, or a
> script line contains non-ascii text. The particular problem is with the
> lines:
>
>     value = self._result._exc_info_to_string(err, test)
>     super(TracebackContent, self).__init__(content_type,
>         lambda:[value.encode("utf8")])
>
> Here value is a str object, trying to encode it uses the default
> encoding (generally ascii) to decode it to unicode before encoding as
> UTF-8 and throws if the str is not 7-bit clean.
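
The failure described in the quoted report can be reproduced in miniature. This is only a sketch, written for Python 3, where the decode step has to be spelled out; in Python 2, calling .encode("utf8") on a str performs the same ASCII decode implicitly. The helper name py2_style_encode and the sample bytes are made up for illustration and do not come from the attached script.

```python
# Python 3 sketch of what Python 2 does implicitly in value.encode("utf8")
# when value is a byte string (a Python 2 str).
raw = b"caf\xc3\xa9"  # hypothetical traceback bytes: UTF-8 encoded, not 7-bit clean


def py2_style_encode(value, default_encoding="ascii"):
    """Mimic Python 2 str.encode: decode with the default codec, then encode."""
    return value.decode(default_encoding).encode("utf8")


try:
    py2_style_encode(raw)
    failed = False
except UnicodeDecodeError:
    # Raised at the first byte outside the 7-bit range.
    failed = True

# A 7-bit clean byte string passes through unchanged.
clean = py2_style_encode(b"plain ascii")
```

The non-ASCII input fails in the decode half of the round trip, before any UTF-8 encoding happens, which matches the behaviour the report describes.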

So, for value to be a 'str' and not 7-bit clean:
- you need an ascii file system encoding
- you need non-ascii paths in the file system
- and ascii symbol names throughout

or
- test data that raises an AssertionError which has a str that is not
ascii.

Are there other causes for a non-ascii 'str' value there?

Anyhow, clearly we should fix this

status triaged
importance critical

I suggest that fixing it 'the right way' will be tricky for Python 2.x,
because up to Python 2.7 str(Exception) blows up badly with unicode
args in a few ways. So we probably need to change the stringification
logic to handle this case and return a 'unicode' value, not a 'str'
value. However, I think we currently use that largely as-is from Python
core. I can't look at this right now, but I think it's very important to
fix.

> This code is fixable, however I think the mime-types model for test
> metadata is overly complicated, fragile, and generally a bad idea.

Well, it's been hugely successful for me already in the few places I've
deployed it, so with respect: I disagree. For instance, it's already
solved a major bug with subunit + bzr selftest which shows up readily in
selftest --parallel. The code is simple, nearly trivial, and the
specifications (e.g. for MIME) are readily available, allowing rampant
reuse by other tools.

Backtraces and test data in /general/ are for showing to humans but also
need to be machine processable (e.g. to look for common patterns or
encode and transmit to remote machines) and MIME has had many many
successes at bridging these two needs including both HTTP and SMTP.

-Rob

Changed in testtools:
importance:	Undecided → Critical
status:	New → Triaged

Martin Packman (gz) wrote on 2009-12-29:

#4

As mentioned, I can also get testtools to break with a non-ascii test script name (no other requirements), or non-ascii text on a line of the script where an exception is raised (either latin-1 or as per the encoding set at the top of the file).

This is Python 2 specific, but assuming that every stringified traceback conforms to a single encoding is incorrect, even if this particular mistake of trying to encode a str object is fixed. As there is no meaningful encoding, content-type, or anything else mime-y, the envelope is pointless. Also note that the 'language' header used by this code 1) is deprecated, 2) should be a two-letter code, and 3) is for actual languages, not programming languages.

Robert Collins (lifeless) wrote on 2010-05-13:

#5

From the duplicate:

Martin [gz] says:
- str(traceback) is bytes
in those bytes [here]:
- filenames are in mbcs
- code lines are in the encoding of the file that the source is from
- and the bytes for the error are in the encoding of the error - e.g. totally arbitrary.

We have to defend against three sets of different encodings.
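
One way to picture that defence is to decode each part of a traceback with its own encoding and fall back to replacement characters when that fails. This is a rough sketch only, in Python 3 syntax; the helper name to_text, the sample bytes, and the choice of cp1252 as a stand-in for mbcs are assumptions for illustration, not the code that went into testtools.

```python
def to_text(data, encoding):
    """Decode bytes with the given encoding, never raising on bad input."""
    try:
        return data.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Unknown codec or undecodable bytes: keep the ASCII, mark the rest.
        return data.decode("ascii", "replace")


# Hypothetical pieces of one traceback, each in a different encoding.
filename = "t\xebst.py".encode("cp1252")             # filesystem encoding
source_line = "raise Error('\xe9')".encode("utf-8")  # encoding declared by the file
message = b"\xff\xfe broken"                         # arbitrary bytes from the error

parts = [
    to_text(filename, "cp1252"),
    to_text(source_line, "utf-8"),
    to_text(message, "utf-8"),
]
report = "\n".join(parts)
```

Each piece ends up as text regardless of where its bytes came from, so the combined report can be encoded once at the end.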

Robert Collins wrote 20 hours ago: #2
Looks like we need to port a copy of the underlying things, and get the encoding etc sorted out in the root data.

Martin Packman (gz) wrote on 2010-05-25:

#6

Ugh, okay, I've got this. The fix discussed with Robert at UDS is to essentially backport the Python 3 string semantics to Python 2, so tracebacks are formatted as unicode. This involves messing with a lot of core modules, but can be done without monkey patching.
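
For reference, the Python 3 behaviour being backported looks roughly like the following. This is a sketch of the target semantics using the standard traceback module on Python 3, not the backport itself; the helper name format_exc_as_text is hypothetical.

```python
import sys
import traceback


def format_exc_as_text(exc_info):
    """Return the formatted traceback as a single text (unicode) string."""
    return "".join(traceback.format_exception(*exc_info))


try:
    raise ValueError("caf\xe9")  # exception value with non-ascii text
except ValueError:
    text = format_exc_as_text(sys.exc_info())

# Encoding happens once, at the boundary, on a value that is already text.
payload = text.encode("utf8")
```

Because the traceback is text from the start, the final encode to UTF-8 never has to guess at a source encoding.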

Changed in testtools:
assignee:	nobody → Martin [gz] (gz)

Jonathan Lange (jml) on 2010-06-19

Changed in testtools:
status:	Triaged → In Progress

Robert Collins (lifeless) on 2010-06-24

Changed in testtools:
milestone:	none → next
status:	In Progress → Fix Released
