Comment 75 for bug 893091

In , Barry Warsaw (barry) wrote:

On Dec 12, 2011, at 04:58 PM, Barry Warsaw wrote:

>>@@ +593,5 @@
>>> }
>>> + y = *(unsigned char *)PyBytes_AS_STRING(obj);
>>> + }
>>> + else if (PyUnicode_Check(obj)) {
>>> + PyObject *obj_as_bytes = PyUnicode_AsUTF8String(obj);
>>
>>This logic (to make a Byte from a single-byte unicode string) seems really
>>strange, and a bit inefficient.
>>
>>If you're using UTF-8 (but why is UTF-8 special here?), then it'd be way more
>>efficient to check that the length of the Unicode is 1 and the first (and
>>only) character is < 128. (Otherwise, it'd encode to more than one UTF-8
>>byte.)
>>
>>But I think more sensible semantics would be to check that the length of the
>>Unicode is 1, and use the numeric value of the first *codepoint* - so it's an
>>error if it isn't in the latin-1 range, between U+0000 and U+00FF?
>>
>>(Optimization: even if Python is encoding its strings in UTF-16 like it does
>>on Windows, it's enough to get the first-and-only Py_UNICODE - either a UCS-4
>>32-bit codepoint, or a 16-bit unit of UTF-16 which is either itself or half
>>of a surrogate pair - and check that it's in the range 0 to 255. If it is,
>>the right answer is that byte; if not, error.)

Finally coming back to this issue.

The (possible) inefficiency doesn't bother me at all. I don't think this code
is performance-critical, and while I haven't measured it, I'll bet the normal
Python overhead will outweigh the cost of any conversion from unicode to
bytes.

The semantic question is more interesting, though. Just what should it mean to
append a unicode object for a byte ('y') signature?
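
For concreteness, the check you suggest would look something like this (just a
sketch against the Py_UNICODE API you mention, not what's in my branch; the
helper name and error messages are made up):

    #include <Python.h>

    /* Sketch of the suggested semantics: accept a length-1 unicode string
     * and use the numeric value of its only code unit as the byte, erroring
     * out above U+00FF.  On a narrow build that also rejects surrogate
     * halves, per your optimization note. */
    static int
    byte_from_unicode(PyObject *obj, unsigned char *out)
    {
        Py_UNICODE ch;

        if (PyUnicode_GET_SIZE(obj) != 1) {
            PyErr_SetString(PyExc_ValueError,
                            "expected a unicode string of length 1");
            return -1;
        }
        ch = PyUnicode_AS_UNICODE(obj)[0];
        if (ch > 0xFF) {
            PyErr_SetString(PyExc_ValueError,
                            "character outside the U+0000..U+00FF range");
            return -1;
        }
        *out = (unsigned char)ch;
        return 0;
    }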

As an experiment, I commented out the PyUnicode_Check() stanza, and there were
a few test failures in my current GitHub branch. Looking at it more carefully,
though, I think it's better to only allow appending a length-1 Python bytes
object or an integer for a byte ('y') signature. Given that the semantics you
outline above are questionable, "in the face of ambiguity, refuse the
temptation to guess".
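
Concretely, I'm thinking of something along these lines (again, only a sketch
with made-up names, not the actual patch):

    #include <Python.h>

    /* Sketch of the stricter rule: for a 'y' signature, accept only a
     * length-1 bytes object or an int in range(0, 256); anything else,
     * including unicode, raises TypeError. */
    static int
    byte_value(PyObject *obj, unsigned char *out)
    {
        if (PyBytes_Check(obj)) {
            if (PyBytes_GET_SIZE(obj) != 1) {
                PyErr_SetString(PyExc_ValueError,
                                "expected a bytes object of length 1");
                return -1;
            }
            *out = *(unsigned char *)PyBytes_AS_STRING(obj);
            return 0;
        }
        if (PyLong_Check(obj)) {
            long v = PyLong_AsLong(obj);
            if (v == -1 && PyErr_Occurred())
                return -1;
            if (v < 0 || v > 255) {
                PyErr_SetString(PyExc_ValueError,
                                "integer outside of range(0, 256)");
                return -1;
            }
            *out = (unsigned char)v;
            return 0;
        }
        PyErr_Format(PyExc_TypeError,
                     "expected a length-1 bytes object or an int, not %s",
                     Py_TYPE(obj)->tp_name);
        return -1;
    }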

So I'm all for disallowing unicode objects here. The downside is that users
will have to change their code when porting from Python 2 to Python 3, because
a native string (i.e., an unadorned 8-bit string in Python 2) will not work in
Python 3. It means prepending a b-prefix to mark byte strings.

This doesn't seem bad to me, especially because we're already requiring a
similar change for ByteArray instantiations. In the optimistic hope that you
agree, I'll make this change (disallow unicode objects with 'y' signatures)
and update the user-visible changes section to reflect it. It's easy enough to
back out, of course.

Cheers.