Comment 31 for bug 191199

Gavin Panella (allenap) wrote :

On 11 Feb 2008, at 14:17, James Henstridge wrote:
> On 11/02/2008, Gavin Panella <email address hidden> wrote:
>> On 9 Feb 2008, at 21:26, Christian Robottom Reis wrote:
>>> Because I don't have anything better to do I also looked at the
>>> checkwatches failures today. They are not all bad. But there's one
>>> which bothered me; elinks.cz fails when pulling information for
>>> bug 987.
>>>
>>> 10:58:23 INFO Updating 1 watches on http://bugzilla.elinks.cz
>>>
>>> 10:58:26 ERROR Failed to parse XML description for http://bugzilla.elinks.cz bugs [u'987']: syntax error: line 10, column 62
>>>
>>> Now this is failing because somebody decided it would be a good
>>> idea to put a non-printable character in a bug comment:
>>>
>>> http://bugzilla.elinks.cz/show_bug.cgi?id=987#c2
>>>
>>> What should our long-term plan be for this sort of situation? Get it
>>> fixed upstream? Or replace unprintables when importing comments? Or
>>> blacklist bugs so they stop spamming our logs?
>>
>> Bugzilla shouldn't create invalid XML, so it should ideally be fixed
>> there. See:
>>
>> https://bugzilla.mozilla.org/show_bug.cgi?id=105960
>>
>> But... it's been open since 2001 and has actually been commented on
>> by James H as recently as April 2002. Looks like this one is not
>> getting fixed, and we should probably try to unfuck the XML from
>> Bugzilla ourselves.
>
> My opinion is that since Bugzilla does not guarantee that it will
> produce valid XML, we should not treat said data as XML.
>
> I'd suggest using the BeautifulSoup.BeautifulStoneSoup class
> (BeautifulSoup minus HTML specific tweaks) to do the parsing. This
> should give us some data even for invalid pages:
>
> >>> import urllib2
> >>> from BeautifulSoup import BeautifulStoneSoup
> >>> data = urllib2.urlopen(
> ...     'http://bugzilla.elinks.cz/xml.cgi?id=987').read()
> >>> soup = BeautifulStoneSoup(data)
> >>> for comment in soup.findAll('long_desc'):
> ...     print repr(comment.find('thetext').renderContents())
> ...
> 'Patch against elinks-0.11 GIT based on https://bugs.launchpad.net/bugs/64590'
> 'Created an attachment (id=423)\nTypos and language corrections in ELinks strings\n'
> 'Looks good. Should I credit\x01Malcolm Parsons in the AUTHORS file?'
> 'Yes, please credit Malcolm.'
>
> We still need to work out what to do about character encodings, but
> that is necessary anyway: as mentioned in the bug report, old Bugzilla
> had no concept of character encoding, so old bug data can be
> misencoded (one of the sources of invalid XML from Bugzilla).
>
> James.
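
For comparison, the "replace unprintables" option Christian raised could look roughly like the sketch below. This is not what checkwatches does today. The function names are made up for illustration, it uses the standard library's ElementTree rather than BeautifulSoup, and it assumes the document is in an ASCII-compatible encoding (UTF-8, Latin-1 and friends), so it would mangle UTF-16. It only deals with the control characters that XML 1.0 forbids and does nothing about misencoded text.

```python
import re
import xml.etree.ElementTree as ET

# C0 control bytes that XML 1.0 forbids. Tab (0x09), LF (0x0A) and
# CR (0x0D) are legal and are left alone.
_ILLEGAL_XML_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_illegal_xml_bytes(data: bytes, replacement: bytes = b" ") -> bytes:
    """Replace forbidden control bytes so a strict parser accepts the data.

    Hypothetical helper: assumes an ASCII-compatible encoding, where these
    byte values never occur inside a multi-byte character.
    """
    return _ILLEGAL_XML_BYTES.sub(replacement, data)


def parse_bugzilla_xml(data: bytes) -> ET.Element:
    """Parse Bugzilla's XML export after cleaning it (hypothetical name)."""
    return ET.fromstring(strip_illegal_xml_bytes(data))


# A cut-down stand-in for the elinks.cz response, with the same stray
# \x01 that appears in comment 2 of bug 987.
sample = (
    b"<bugzilla><bug><long_desc><thetext>"
    b"Should I credit\x01Malcolm Parsons?"
    b"</thetext></long_desc></bug></bugzilla>"
)

root = parse_bugzilla_xml(sample)
comments = [node.text for node in root.iter("thetext")]
print(comments)
```

Replacing with a space rather than deleting keeps words from running together, as they would in the comment above. The trade-off against James's suggestion is that this still fails on any other kind of malformed XML, whereas a lenient parser would give us some data back.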