2008-02-12 09:17:26 |
Gavin Panella |
bug |
|
|
added bug |
2008-02-12 10:41:36 |
Gavin Panella |
bug |
|
|
added subscriber Graham Binns |
2008-02-27 13:16:48 |
Diogo Matsubara |
malone: status |
New |
Confirmed |
|
2008-04-16 21:45:34 |
Diogo Matsubara |
description |
On 11 Feb 2008, at 14:17, James Henstridge wrote:
> On 11/02/2008, Gavin Panella <gavin.panella@canonical.com> wrote:
>> On 9 Feb 2008, at 21:26, Christian Robottom Reis wrote:
>>> Because I don't have anything better to do I also looked at the
>>> checkwatches failures today. They are not all bad. But there's one
>>> which bothered me; elinks.cz fails when pulling information for
>>> bug 987.
>>>
>>> 10:58:23 INFO Updating 1 watches on http://bugzilla.elinks.cz
>>>
>>> 10:58:26 ERROR Failed to parse XML description for http://bugzilla.elinks.cz bugs [u'987']: syntax error: line 10, column 62
>>>
>>> Now this is failing because somebody decided it would be a good
>>> idea to put a non-printable character in a bug comment:
>>>
>>> http://bugzilla.elinks.cz/show_bug.cgi?id=987#c2
>>>
>>> What should our long-term plan be for this sort of situation? Get it
>>> fixed upstream? Or replace unprintables when importing comments? Or
>>> blacklisting bugs so they stop spamming our logs?
>>
>> Bugzilla shouldn't create invalid XML, so it should ideally be fixed
>> there. See:
>>
>> https://bugzilla.mozilla.org/show_bug.cgi?id=105960
>>
>> But... it's been open since 2001 and has actually been commented on
>> by James H as recently as April 2002. Looks like this one is not
>> getting fixed, and we should probably try to unfuck the XML from
>> Bugzilla ourselves.
>
> My opinion is that since Bugzilla does not guarantee that it will
> produce valid XML, we should not treat said data as XML.
>
> I'd suggest using the BeautifulSoup.BeautifulStoneSoup class
> (BeautifulSoup minus HTML specific tweaks) to do the parsing. This
> should give us some data even for invalid pages:
>
> >>> import urllib2
> >>> from BeautifulSoup import BeautifulStoneSoup
> >>> data = urllib2.urlopen(
> ...     'http://bugzilla.elinks.cz/xml.cgi?id=987').read()
> >>> soup = BeautifulStoneSoup(data)
> >>> for comment in soup.findAll('long_desc'):
> ...     print repr(comment.find('thetext').renderContents())
> ...
> 'Patch against elinks-0.11 GIT based on https://bugs.launchpad.net/bugs/64590'
> 'Created an attachment (id=423)\nTypos and language corrections in ELinks strings\n'
> 'Looks good. Should I credit\x01Malcolm Parsons in the AUTHORS file?'
> 'Yes, please credit Malcolm.'
>
> We still need to work out what to do about character encodings, but
> that is necessary anyway: as mentioned in the bug report old Bugzilla
> had no concept of character encoding, so old bug data can be
> misencoded (one of the sources of invalid XML from bugzilla).
>
> James. |
On 11 Feb 2008, at 14:17, James Henstridge wrote:
> On 11/02/2008, Gavin Panella <gavin.panella@canonical.com> wrote:
>> On 9 Feb 2008, at 21:26, Christian Robottom Reis wrote:
>>> Because I don't have anything better to do I also looked at the
>>> checkwatches failures today. They are not all bad. But there's one
>>> which bothered me; elinks.cz fails when pulling information for
>>> bug 987.
>>>
>>> 10:58:23 INFO Updating 1 watches on http://bugzilla.elinks.cz
>>>
>>> 10:58:26 ERROR Failed to parse XML description for http://bugzilla.elinks.cz bugs [u'987']: syntax error: line 10, column 62
>>>
>>> Now this is failing because somebody decided it would be a good
>>> idea to put a non-printable character in a bug comment:
>>>
>>> http://bugzilla.elinks.cz/show_bug.cgi?id=987#c2
>>>
>>> What should our long-term plan be for this sort of situation? Get it
>>> fixed upstream? Or replace unprintables when importing comments? Or
>>> blacklisting bugs so they stop spamming our logs?
>>
>> Bugzilla shouldn't create invalid XML, so it should ideally be fixed
>> there. See:
>>
>> https://bugzilla.mozilla.org/show_bug.cgi?id=105960
>>
>> But... it's been open since 2001 and has actually been commented on
>> by James H as recently as April 2002. Looks like this one is not
>> getting fixed, and we should probably try to unfuck the XML from
>> Bugzilla ourselves.
>
> My opinion is that since Bugzilla does not guarantee that it will
> produce valid XML, we should not treat said data as XML.
>
> I'd suggest using the BeautifulSoup.BeautifulStoneSoup class
> (BeautifulSoup minus HTML specific tweaks) to do the parsing. This
> should give us some data even for invalid pages:
>
> >>> import urllib2
> >>> from BeautifulSoup import BeautifulStoneSoup
> >>> data = urllib2.urlopen(
> ...     'http://bugzilla.elinks.cz/xml.cgi?id=987').read()
> >>> soup = BeautifulStoneSoup(data)
> >>> for comment in soup.findAll('long_desc'):
> ...     print repr(comment.find('thetext').renderContents())
> ...
> 'Patch against elinks-0.11 GIT based on https://bugs.launchpad.net/bugs/64590'
> 'Created an attachment (id=423)\nTypos and language corrections in ELinks strings\n'
> 'Looks good. Should I credit\x01Malcolm Parsons in the AUTHORS file?'
> 'Yes, please credit Malcolm.'
>
> We still need to work out what to do about character encodings, but
> that is necessary anyway: as mentioned in the bug report old Bugzilla
> had no concept of character encoding, so old bug data can be
> misencoded (one of the sources of invalid XML from bugzilla).
>
> James.
OOPS-830CCW7 shows the problem and the Exception type is UnparseableBugData |
|
2008-05-06 14:04:21 |
Graham Binns |
bug |
|
|
assigned to bugzilla |
2008-05-06 14:10:25 |
Bug Watch Updater |
bugzilla: status |
Unknown |
Confirmed |
|
2008-12-30 02:34:21 |
Bug Watch Updater |
bugzilla: status |
Confirmed |
Fix Released |
|
2008-12-31 11:29:13 |
Gavin Panella |
malone: statusexplanation |
|
The root bug has been marked as Fix Released in Bugzilla, so we should revisit this issue, if only to prioritise and schedule it. |
|
2008-12-31 11:29:13 |
Gavin Panella |
malone: milestone |
|
2.2.1 |
|
2009-02-05 20:47:29 |
Björn Tillenius |
malone: statusexplanation |
The root bug has been marked as Fix Released in Bugzilla, so we should revisit this issue, if only to prioritise and schedule it. |
|
|
2009-02-05 20:47:29 |
Björn Tillenius |
malone: milestone |
2.2.1 |
|
|
2010-01-25 10:47:54 |
Gavin Panella |
tags |
bugwatch oops |
bugwatch oops story-reliable-bug-syncing |
|
2010-09-18 18:28:33 |
Bug Watch Updater |
bugzilla: importance |
Unknown |
Medium |
|
2010-09-18 18:28:38 |
Bug Watch Updater |
bug watch added |
|
https://bugzilla-test.mozilla.org/show_bug.cgi?id=384 |
|
2010-09-18 18:28:38 |
Bug Watch Updater |
bug watch added |
|
https://bugzilla-test.mozilla.org/show_bug.cgi?id=267 |
|
2010-09-18 18:28:38 |
Bug Watch Updater |
bug watch added |
|
http://landfill.bugzilla.org/bugzilla-tip/show_bug.cgi?id=5032 |
|
2010-09-18 18:28:38 |
Bug Watch Updater |
bug watch added |
|
https://bugzilla.gnome.org/show_bug.cgi?id=417196 |
|
2010-09-18 18:28:38 |
Bug Watch Updater |
bug watch added |
|
https://bugs.eclipse.org/bugs/show_bug.cgi?id=140108 |
|
2010-11-26 06:15:28 |
Curtis Hovey |
malone: status |
Confirmed |
Triaged |
|
2010-11-26 06:15:33 |
Curtis Hovey |
malone: importance |
Undecided |
Low |
|
2011-01-12 18:15:06 |
Robert Collins |
launchpad: importance |
Low |
Critical |
|
2011-03-05 15:27:41 |
Curtis Hovey |
description |
On 11 Feb 2008, at 14:17, James Henstridge wrote:
> On 11/02/2008, Gavin Panella <gavin.panella@canonical.com> wrote:
>> On 9 Feb 2008, at 21:26, Christian Robottom Reis wrote:
>>> Because I don't have anything better to do I also looked at the
>>> checkwatches failures today. They are not all bad. But there's one
>>> which bothered me; elinks.cz fails when pulling information for
>>> bug 987.
>>>
>>> 10:58:23 INFO Updating 1 watches on http://bugzilla.elinks.cz
>>>
>>> 10:58:26 ERROR Failed to parse XML description for http://bugzilla.elinks.cz bugs [u'987']: syntax error: line 10, column 62
>>>
>>> Now this is failing because somebody decided it would be a good
>>> idea to put a non-printable character in a bug comment:
>>>
>>> http://bugzilla.elinks.cz/show_bug.cgi?id=987#c2
>>>
>>> What should our long-term plan be for this sort of situation? Get it
>>> fixed upstream? Or replace unprintables when importing comments? Or
>>> blacklisting bugs so they stop spamming our logs?
>>
>> Bugzilla shouldn't create invalid XML, so it should ideally be fixed
>> there. See:
>>
>> https://bugzilla.mozilla.org/show_bug.cgi?id=105960
>>
>> But... it's been open since 2001 and has actually been commented on
>> by James H as recently as April 2002. Looks like this one is not
>> getting fixed, and we should probably try to unfuck the XML from
>> Bugzilla ourselves.
>
> My opinion is that since Bugzilla does not guarantee that it will
> produce valid XML, we should not treat said data as XML.
>
> I'd suggest using the BeautifulSoup.BeautifulStoneSoup class
> (BeautifulSoup minus HTML specific tweaks) to do the parsing. This
> should give us some data even for invalid pages:
>
> >>> import urllib2
> >>> from BeautifulSoup import BeautifulStoneSoup
> >>> data = urllib2.urlopen(
> ...     'http://bugzilla.elinks.cz/xml.cgi?id=987').read()
> >>> soup = BeautifulStoneSoup(data)
> >>> for comment in soup.findAll('long_desc'):
> ...     print repr(comment.find('thetext').renderContents())
> ...
> 'Patch against elinks-0.11 GIT based on https://bugs.launchpad.net/bugs/64590'
> 'Created an attachment (id=423)\nTypos and language corrections in ELinks strings\n'
> 'Looks good. Should I credit\x01Malcolm Parsons in the AUTHORS file?'
> 'Yes, please credit Malcolm.'
>
> We still need to work out what to do about character encodings, but
> that is necessary anyway: as mentioned in the bug report old Bugzilla
> had no concept of character encoding, so old bug data can be
> misencoded (one of the sources of invalid XML from bugzilla).
>
> James.
OOPS-830CCW7 shows the problem and the Exception type is UnparseableBugData |
On 11 Feb 2008, at 14:17, James Henstridge wrote:
> On 11/02/2008, Gavin Panella <gavin.panella@canonical.com> wrote:
>> On 9 Feb 2008, at 21:26, Christian Robottom Reis wrote:
>>> Because I don't have anything better to do I also looked at the
>>> checkwatches failures today. They are not all bad. But there's one
>>> which bothered me; elinks.cz fails when pulling information for
>>> bug 987.
>>>
>>> 10:58:23 INFO Updating 1 watches on http://bugzilla.elinks.cz
>>>
>>> 10:58:26 ERROR Failed to parse XML description for http://bugzilla.elinks.cz bugs [u'987']: syntax error: line 10, column 62
>>>
>>> Now this is failing because somebody decided it would be a good
>>> idea to put a non-printable character in a bug comment:
>>>
>>> http://bugzilla.elinks.cz/show_bug.cgi?id=987#c2
>>>
>>> What should our long-term plan be for this sort of situation? Get it
>>> fixed upstream? Or replace unprintables when importing comments? Or
>>> blacklisting bugs so they stop spamming our logs?
>>
>> Bugzilla shouldn't create invalid XML, so it should ideally be fixed
>> there. See:
>>
>> https://bugzilla.mozilla.org/show_bug.cgi?id=105960
>>
>> But... it's been open since 2001 and has actually been commented on
>> by James H as recently as April 2002. Looks like this one is not
>> getting fixed, and we should probably try to unfuck the XML from
>> Bugzilla ourselves.
>
> My opinion is that since Bugzilla does not guarantee that it will
> produce valid XML, we should not treat said data as XML.
>
> I'd suggest using the BeautifulSoup.BeautifulStoneSoup class
> (BeautifulSoup minus HTML specific tweaks) to do the parsing. This
> should give us some data even for invalid pages:
>
> >>> import urllib2
> >>> from BeautifulSoup import BeautifulStoneSoup
> >>> data = urllib2.urlopen(
> ...     'http://bugzilla.elinks.cz/xml.cgi?id=987').read()
> >>> soup = BeautifulStoneSoup(data)
> >>> for comment in soup.findAll('long_desc'):
> ...     print repr(comment.find('thetext').renderContents())
> ...
> 'Patch against elinks-0.11 GIT based on https://bugs.launchpad.net/bugs/64590'
> 'Created an attachment (id=423)\nTypos and language corrections in ELinks strings\n'
> 'Looks good. Should I credit\x01Malcolm Parsons in the AUTHORS file?'
> 'Yes, please credit Malcolm.'
>
> We still need to work out what to do about character encodings, but
> that is necessary anyway: as mentioned in the bug report old Bugzilla
> had no concept of character encoding, so old bug data can be
> misencoded (one of the sources of invalid XML from bugzilla).
>
> James.
OOPS-830CCW7 shows the problem and the Exception type is UnparseableBugData
A newer instance: OOPS-1633CCW184 |
|
2011-06-07 21:26:09 |
Benji York |
launchpad: assignee |
|
Benji York (benji) |
|
2011-06-09 17:14:10 |
Benji York |
branch linked |
|
lp:~benji/launchpad/bug-191199 |
|
2011-06-15 21:20:48 |
Launchpad QA Bot |
tags |
bugwatch lp-bugs oops story-reliable-bug-syncing |
bugwatch lp-bugs oops qa-needstesting story-reliable-bug-syncing |
|
2011-06-15 21:20:50 |
Launchpad QA Bot |
launchpad: status |
Triaged |
Fix Committed |
|
2011-06-16 22:40:54 |
Benji York |
tags |
bugwatch lp-bugs oops qa-needstesting story-reliable-bug-syncing |
bugwatch lp-bugs oops qa-untestable story-reliable-bug-syncing |
|
2011-06-17 02:01:33 |
William Grant |
launchpad: status |
Fix Committed |
Fix Released |
|
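Editorial sketch, not part of the recorded activity: one of the options raised in the description is to replace unprintable characters before parsing the Bugzilla XML. A minimal illustration of that idea follows, written for Python 3 with only the standard library. The function names (strip_illegal_xml_chars, parse_lenient) and the sample document are hypothetical, and nothing here is claimed to match the change that landed in lp:~benji/launchpad/bug-191199.

```python
import re
import xml.etree.ElementTree as ET

# XML 1.0 forbids the C0 control characters other than tab (\x09),
# line feed (\x0a) and carriage return (\x0d).
_ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_illegal_xml_chars(text, replacement=" "):
    """Replace control characters that an XML 1.0 parser would reject."""
    return _ILLEGAL_XML_CHARS.sub(replacement, text)


def parse_lenient(text):
    """Parse XML after replacing the characters that would break the parser."""
    return ET.fromstring(strip_illegal_xml_chars(text))


# Hypothetical sample modelled on the comment quoted in the description;
# it embeds the same \x01 character between "credit" and "Malcolm".
sample = (
    "<bugzilla><bug><long_desc><thetext>"
    "Looks good. Should I credit\x01Malcolm Parsons in the AUTHORS file?"
    "</thetext></long_desc></bug></bugzilla>"
)

root = parse_lenient(sample)
print(root.find(".//thetext").text)
```

Replacing with a space rather than deleting keeps adjacent words apart, as in the \x01 case above. This only addresses stray control characters; the character-encoding problems mentioned in the description would need separate handling.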