Comment 23 for bug 605543

Captain Chaos (launchpad-chaos) wrote :

The tweets are mainly in Dutch, English and Japanese.

You say that "before the data is parsed by htmllib.HTMLParser it must be unicode", but your modification actually turns the string into a UTF-8-encoded 8-bit string, not unicode. What's more, a "print type(s)" in unescape() reveals that *without* the modification the type of the string passed in *is* unicode.
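
To make the distinction concrete, here is a minimal sketch in Python 3 terms, where Python 2's unicode type is called str and the 8-bit string type is called bytes. The sample text is made up for illustration; it is not taken from an actual tweet.

```python
# Python 3 naming: str is what Python 2 calls unicode,
# bytes is what Python 2 calls an 8-bit str.
text = "caf\u00e9"                 # hypothetical tweet text, a unicode string
encoded = text.encode("utf-8")    # encoding yields an 8-bit string, not unicode

print(type(text).__name__)        # name of the unicode type
print(type(encoded).__name__)     # name of the 8-bit type
print(len(text), len(encoded))    # character count versus byte count
```

Encoding moves the data away from unicode rather than towards it, which is why the modification cannot be described as "making the data unicode".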

I'm pretty sure that what happens is this:

* unescape() invokes HTMLParser.save_bgn(), which initialises HTMLParser.savedata to an empty 8-bit string
* unescape() invokes HTMLParser.feed (inherited from SGMLParser) with a unicode string (m["text"], verified with a "print type(s)")
* the string is concatenated to SGMLParser.rawdata, which started out as an empty 8-bit string but now becomes unicode
* feed() invokes goahead()
* goahead() searches rawdata for HTML tags and invokes handle_data() (implemented by HTMLParser) for the text parts in between
* handle_data() concatenates the unicode string to savedata, which started out as an empty 8-bit string but now becomes unicode
* when goahead() encounters an entity tag, it invokes handle_entityref()
* handle_entityref() invokes convert_entityref() to convert the tag name to the corresponding character, using the entitydefs table, which HTMLParser imports from htmlentitydefs.py. The table holds each entity tag's character as an 8-bit string in the latin-1 encoding, or as a character reference if the character does not exist in latin-1
* handle_entityref() then invokes handle_data() to append the character referenced by the entity tag, passing in the latin-1 encoded 8-bit string it got from convert_entityref()
* handle_data() does this:

self.savedata = self.savedata + data

At this point savedata is unicode but data is an 8-bit string, so Python has to convert the 8-bit string to unicode before it can append it. It uses the "default encoding" for this. On my system the default encoding at this point appears to be utf8 (this is borne out by the error message). The utf8 codec tries to interpret the latin-1 encoded character as utf8 and (correctly) fails.
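
The failing step can be reproduced in isolation. This is only a sketch, written for Python 3, which refuses to mix the two string types; it therefore performs explicitly the decode that Python 2 attempts implicitly during the concatenation. The sample text, the choice of the eacute entity and the helper name try_append are assumptions for illustration, not code from Gwibber or htmllib.

```python
# Latin-1 byte for e-acute, the kind of 8-bit value described above for the
# entity table (assumption: the text contains the "eacute" entity).
entity_bytes = "\u00e9".encode("latin-1")
savedata = "caf"  # unicode text collected so far


def try_append(saved, data, encoding):
    """Decode data with the given codec, then append it to saved.

    Hypothetical helper: it mimics the conversion Python 2 performs
    implicitly when unicode + 8-bit string is evaluated.
    """
    try:
        return saved + data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "failed: " + exc.reason


for codec in ("utf-8", "ascii", "latin-1"):
    print(codec, "->", try_append(savedata, entity_bytes, codec))
```

Only the latin-1 codec yields the intended character. Both utf-8 and ascii reject the byte 0xE9, which matches the reasoning that the concatenation can only succeed when the default encoding happens to be latin-1.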

The questions that need answering at this point are:

* Why is the default encoding utf8? Could it have to do with my locale setting (which is en_US.utf8)?
* Interestingly, according to the Python documentation the regular default encoding is ascii, which would also fail, so why doesn't everyone have this problem?
* HTMLParser doesn't work correctly when all of the following hold: 1) the default encoding is not latin-1, 2) you offer it unicode strings and 3) the strings contain entity tags. My fix remedies this. Is this not a bug which needs fixing?
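
The first two questions can at least be narrowed down with a small diagnostic. This is a sketch using two standard library calls, sys.getdefaultencoding() and locale.getpreferredencoding(); the function name encoding_report is hypothetical. Under Python 3 the default encoding is always utf-8, so the result is only informative when run with the same Python 2 interpreter and environment that Gwibber uses.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings relevant to implicit 8-bit/unicode conversion."""
    return {
        # Codec used for implicit conversions between the two string types.
        "default": sys.getdefaultencoding(),
        # Encoding suggested by the locale settings (e.g. en_US.utf8).
        "locale": locale.getpreferredencoding(),
    }


report = encoding_report()
print("default encoding: " + report["default"])
print("locale encoding:  " + report["locale"])
```

If the default encoding differs between a plain interpreter session and a running Gwibber process, that would suggest something in the process is changing it, rather than the locale alone.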

I'm reverting to my original fix. It's the only one so far which results in no error messages at all (at least for Twitter). As much as I would like to, I don't have the time to learn Python and become a Gwibber developer and unicode expert to get this bug fixed, especially since I don't think the problem is actually in Gwibber itself.

It would be great if an Ubuntu & Python expert could look at my reasoning above and see if it holds water and if sgmllib.py and htmllib.py need to be fixed.