On 11 Feb 2008, at 14:17, James Henstridge wrote:
> On 11/02/2008, Gavin Panella <email address hidden> wrote:
>> On 9 Feb 2008, at 21:26, Christian Robottom Reis wrote:
>>> Because I don't have anything better to do I also looked at the
>>> checkwatches failures today. They are not all bad. But there's one
>>> which bothered me; elinks.cz fails when pulling information for bug 987.
>>>
>>> 10:58:23 INFO Updating 1 watches on http://bugzilla.elinks.cz
>>>
>>> 10:58:26 ERROR Failed to parse XML description for
>>> http://bugzilla.elinks.cz bugs [u'987']: syntax error: line 10,
>>> column 62
>>>
>>> Now this is failing because somebody decided it would be a good
>>> idea to put a non-printable character in a bug comment:
>>>
>>> http://bugzilla.elinks.cz/show_bug.cgi?id=987#c2
>>>
>>> What should our long-term plan be for this sort of situation? Get it
>>> fixed upstream? Or replace unprintables when importing comments? Or
>>> blacklist bugs so they stop spamming our logs?
>>
>> Bugzilla shouldn't create invalid XML, so it should ideally be fixed
>> there. See:
>>
>> https://bugzilla.mozilla.org/show_bug.cgi?id=105960
>>
>> But... it's been open since 2001 and has actually been commented on
>> by James H as recently as April 2002. Looks like this one is not
>> getting fixed, and we should probably try to unfuck the XML from
>> Bugzilla ourselves.
>
> My opinion is that since Bugzilla does not guarantee that it will
> produce valid XML, we should not treat said data as XML.
>
> I'd suggest using the BeautifulSoup.BeautifulStoneSoup class
> (BeautifulSoup minus HTML specific tweaks) to do the parsing. This
> should give us some data even for invalid pages:
>
> >>> import urllib2
> >>> from BeautifulSoup import BeautifulStoneSoup
> >>> data = urllib2.urlopen(
> ...     'http://bugzilla.elinks.cz/xml.cgi?id=987').read()
> >>> soup = BeautifulStoneSoup(data)
> >>> for comment in soup.findAll('long_desc'):
> ...     print repr(comment.find('thetext').renderContents())
> ...
> 'Patch against elinks-0.11 GIT based on https://bugs.launchpad.net/bugs/64590'
> 'Created an attachment (id=423)\nTypos and language corrections in ELinks strings\n'
> 'Looks good. Should I credit\x01Malcolm Parsons in the AUTHORS file?'
> 'Yes, please credit Malcolm.'
>
> We still need to work out what to do about character encodings, but
> that is necessary anyway: as mentioned in the bug report old Bugzilla
> had no concept of character encoding, so old bug data can be
> misencoded (one of the sources of invalid XML from bugzilla).
>
> James.
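
For comparison, here is a minimal sketch of the other option raised above:
replacing unprintable characters on import and keeping a strict XML parser.
It is written for Python 3 rather than the Python 2 of the session above. The
function names are hypothetical, and the Latin-1 fallback for misencoded old
data is an assumption, not something Bugzilla promises.

```python
import re
import xml.etree.ElementTree as ET

# Control characters that XML 1.0 forbids: everything below U+0020
# except tab (\x09), newline (\x0a) and carriage return (\x0d).
_ILLEGAL_XML_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')


def decode_bug_data(raw):
    """Decode bytes as UTF-8, falling back to Latin-1 (an assumption)."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never fails,
        # but it may still guess the wrong encoding for old bug data.
        return raw.decode('latin-1')


def sanitize_bug_xml(raw, replacement=' '):
    """Return the decoded text with XML-illegal control characters replaced."""
    return _ILLEGAL_XML_CHARS.sub(replacement, decode_bug_data(raw))


def comment_texts(raw):
    """Parse sanitized Bugzilla XML and return the text of each comment."""
    root = ET.fromstring(sanitize_bug_xml(raw))
    return [node.findtext('thetext') for node in root.iter('long_desc')]
```

Replacing with a space rather than deleting keeps words such as
"credit\x01Malcolm" from running together. Either choice changes the comment
text slightly, which is the trade-off against the lenient-parser approach.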
OOPS-830CCW7 shows the problem; the exception type is UnparseableBugData.
A newer instance: OOPS-1633CCW184
We expect proper escaping for HTML templates; we should do the same for XML
templates. Template Toolkit has the 'html' filter. Does it have an 'xml'
filter, or are they the same?
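
To illustrate what XML escaping does and does not cover, here is a small
Python sketch. It uses Python's standard library, not Template Toolkit, so it
says nothing about how TT's own filters behave. The helper name `parses` is
hypothetical. Escaping takes care of markup characters, but a control
character such as \x01 passes through unchanged, and XML 1.0 does not accept
it even as a character reference, so escaping alone would probably not have
fixed bug 987.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# A comment containing markup characters and the control character
# seen in the elinks.cz bug.
comment = 'a < b & c, credit\x01Malcolm'

# escape() rewrites &, < and > but leaves \x01 untouched.
escaped = escape(comment)


def parses(document):
    """Return True if the string is well-formed XML, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Markup characters alone are handled by escaping.
markup_only_ok = parses('<thetext>%s</thetext>' % escape('a < b & c'))

# The control character still breaks the document after escaping,
# and so does a numeric character reference to it.
control_char_ok = parses('<thetext>%s</thetext>' % escaped)
char_ref_ok = parses('<thetext>&#1;</thetext>')
```

So a template filter would need to drop or replace such characters in
addition to escaping markup.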