Comment 11 for bug 1233501

Matthew S (matts8484) wrote : Re: Security group names cannot contain at characters

The only problem I see is this: you're right that in most of the ISO 8859-x encodings (including -1), bytes with values 0x7f - 0x9f are control characters. In the UTF-8 encoding of Unicode, however, every byte with the high bit set (>= 0x80) is part of a multi-byte character sequence: 0x80 - 0xbf are continuation bytes and 0xc2 - 0xf4 are lead bytes, while 0xc0, 0xc1 and 0xf5 - 0xff are guaranteed never to appear in valid UTF-8 encoded data.

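So if the regex disallows the ISO 8859-1 control characters byte by byte, some valid UTF-8 data will also be rejected, because the second and later bytes of a multi-byte sequence can fall in the range 0x80 - 0x9f.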
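
As a minimal sketch of the byte-level case, assuming Python 3 and a hypothetical pattern (CONTROL_BYTES and rejected_as_bytes are illustrative names, not the regex from the actual patch):

```python
import re

# Hypothetical byte-level pattern: C0 controls, DEL and the ISO 8859-1 C1 range.
CONTROL_BYTES = re.compile(rb"[\x00-\x1f\x7f-\x9f]")


def rejected_as_bytes(name: str) -> bool:
    """Return True if the UTF-8 encoding of name contains a 'control' byte."""
    return CONTROL_BYTES.search(name.encode("utf-8")) is not None


print("À".encode("utf-8"))       # b'\xc3\x80' (second byte is 0x80)
print(rejected_as_bytes("À"))    # True: a valid name is rejected
print(rejected_as_bytes("é"))    # False: b'\xc3\xa9' has no byte in 0x7f - 0x9f
print(rejected_as_bytes("web"))  # False: plain ASCII
```

Whether a given non-ASCII character trips the pattern depends only on whether one of its encoded bytes happens to land in 0x80 - 0x9f, which is why the rejections would look arbitrary to a user.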

But all of this is assuming that the code is operating on this string as bytes, not as characters. If something has already decoded this into a Python unicode string à la u'foo', then characters with these values really are control characters, and not bytes that could be part of a UTF-8 multi-byte sequence. I don't know Django well enough to know what's going on here (sorry).
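
For comparison, a sketch of the character-level case, assuming Python 3 (where str plays the role of u'foo') and again a hypothetical pattern rather than the one in the patch:

```python
import re
import unicodedata

# Hypothetical character-level pattern applied to an already-decoded string.
CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f-\x9f]")


def has_control_char(name: str) -> bool:
    """Return True if name contains a C0, DEL or C1 control character."""
    return CONTROL_CHARS.search(name) is not None


decoded = b"\xc3\x80".decode("utf-8")  # one character, U+00C0
print(has_control_char(decoded))       # False: no code point in the control range
print(has_control_char("a\x85b"))      # True: U+0085 is a genuine control character
print(unicodedata.category("\x85"))    # Cc
```

Here the same range only matches real control characters, so the check is safe once the input has been decoded.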

The key thing to grok is that 7-bit ASCII is a proper subset of UTF-8 - all ASCII is valid UTF-8.
But not all 8-bit 'extended ASCII' such as ISO 8859-1 is valid UTF-8.
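
A quick way to check both statements, sketched in Python 3 (is_valid_utf8 is an illustrative helper, not something from the code under discussion):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8("default".encode("ascii")))   # True: ASCII is valid UTF-8
print(is_valid_utf8("café".encode("iso-8859-1")))  # False: b'caf\xe9' is a truncated sequence
print(is_valid_utf8("café".encode("utf-8")))       # True: b'caf\xc3\xa9'
print(is_valid_utf8(b"\xc0\x80"))                  # False: 0xc0 never appears in valid UTF-8
```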

On my Debian box, 'man utf-8' explains this much better than my attempt above.

All of this is already quite confusing to me. I apologise if I have increased the complexity/confusion yet further!