py-lp-bugs fails on parsing bugreports containing strange chars

Bug #137574 reported by Markus Korn
Affects: python-launchpad-bugs
Status: Fix Released
Importance: Medium
Assigned to: Markus Korn

Bug Description

"[...]traceTop:�� () fro[...]" causes libxml2 to stop parsing a bugreport.

Traceback (most recent call last):
  File "./bughelper", line 200, in <module>
    main()
  File "./bughelper", line 165, in main
    cl.options.case_sensitive):
  File "/home/daniel/bzr/bughelper.main/bugHelper/infoFiles.py", line 110, in clue_matches
    return self.condition_matches(condition, bug_or_attachment, case_sensitive)
  File "/home/daniel/bzr/bughelper.main/bugHelper/infoFiles.py", line 129, in condition_matches
    search_text = bug_or_attachment.text
  File "/var/www/daniel.holba.ch/htdocs/sponsoring/launchpadbugs/bugbase.py", line 17, in <lambda>
    if fget : fget = lambda s, n=fget.__name__ : getattr(s, n)()
  File "/var/www/daniel.holba.ch/htdocs/sponsoring/launchpadbugs/html_bug.py", line 1554, in get_text
    return "%s\n%s" %(self.description,"\n".join([c.text for c in self.comments]))
  File "/var/www/daniel.holba.ch/htdocs/sponsoring/launchpadbugs/bugbase.py", line 17, in <lambda>
    if fget : fget = lambda s, n=fget.__name__ : getattr(s, n)()
  File "/var/www/daniel.holba.ch/htdocs/sponsoring/launchpadbugs/html_bug.py", line 152, in func
    x.parse()
  File "/var/www/daniel.holba.ch/htdocs/sponsoring/launchpadbugs/html_bug.py", line 1191, in parse
    c.set_attr(__nr,__user,__date,__attachments)
  File "/var/www/daniel.holba.ch/htdocs/sponsoring/launchpadbugs/html_bug.py", line 1091, in set_attr
    get_bug(self._anker_list).attachments._ref_comment((self.__nr, attachments))
  File "/var/www/daniel.holba.ch/htdocs/sponsoring/launchpadbugs/html_bug.py", line 997, in _ref_comment
    self[i]._comment = nr
  File "/var/www/daniel.holba.ch/htdocs/sponsoring/launchpadbugs/html_bug.py", line 944, in __getitem__
    raise IndexError, "could not find '%s' in attachments ('%s')" %(key, self.__url) #list index out of range"
IndexError: could not find '8563184' in attachments ('https://bugs.launchpad.net/ubuntu/+source/rhythmbox/+bug/128162')

Daniel Holbach (dholbach) wrote :

What was the outcome of discussing this with libxml2 upstream? Maybe we should forward the bug to http://bugzilla.gnome.org/enter_bug.cgi?product=libxml2

Markus Korn (thekorn) wrote :

According to this mail [1], this seems to be the expected result of a 'fix' :(
The attached patch adds a workaround for this issue: it simply removes all unparseable chars from the bug page.
I have not received any feedback from upstream yet.
IMHO this is really a bug in libxml2.

[1] http://mail.gnome.org/archives/xml/2001-October/msg00122.html
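
A minimal sketch of what "removing all unparseable chars" could look like, written in Python 3 for illustration. The attached patch itself is not shown in this report, so the helper name `strip_unparseable` and the definition of "unparseable" used here (bytes that are not valid UTF-8, plus control characters that XML 1.0 disallows) are assumptions, not the actual workaround.

```python
import re

# Assumption: "unparseable" means bytes that are not valid UTF-8, plus
# control characters that XML 1.0 does not allow in documents.
_ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def strip_unparseable(raw: bytes) -> str:
    """Hypothetical helper: drop undecodable bytes and illegal control chars."""
    # errors="ignore" silently discards byte sequences that are not valid UTF-8.
    text = raw.decode("utf-8", errors="ignore")
    return _ILLEGAL_XML_CHARS.sub("", text)


page = b"traceTop:\xff\xfe () from"
print(strip_unparseable(page))  # traceTop: () from
```

Dropping characters keeps the parser going but loses the original bytes, so legitimate non-ASCII text that is valid UTF-8 is deliberately left untouched in this sketch.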

Daniel Holbach (dholbach) wrote :

I uploaded the workaround, thanks a lot for finding it and liaising with upstream.

Changed in python-launchpad-bugs:
status: New → Fix Released
Markus Korn (thekorn) wrote :

According to the current entries in the logfile, we need to add another character to the workaround; this patch does that.
I created an upstream bug report and hope to find a better solution there.

Markus

Markus Korn (thekorn) wrote :

As reported by Brian this is still an issue

Changed in python-launchpad-bugs:
assignee: nobody → thekorn
importance: Undecided → Medium
status: Fix Released → In Progress
Markus Korn (thekorn) wrote :

The attached patch fixes this issue, finally (?), by replacing all chars with ord(c)>128 in a bug report with "??".
As I understood Daniel Veillard, this seems to be the only way to prevent libxml2 from aborting the parsing of HTML pages.

I hope that we do not lose too much information with this workaround.

Markus
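
For illustration, the replacement rule described above can be sketched in a few lines of Python 3. The function name `mask_high_chars` is hypothetical and the actual patch is not reproduced in this report; only the `ord(c)>128` test and the "??" placeholder come from the comment.

```python
def mask_high_chars(text: str, placeholder: str = "??") -> str:
    """Hypothetical helper: replace every char with ord(c) > 128 by a placeholder."""
    return "".join(placeholder if ord(c) > 128 else c for c in text)


print(mask_high_chars("Kov\u0161e crashed"))  # Kov??e crashed
```

Note that with the threshold as quoted, the character with code point 128 itself passes through unchanged; a strict ASCII-only filter would test `ord(c) > 127` instead.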

Changed in python-launchpad-bugs:
status: In Progress → Fix Committed
Brian Murray (brian-murray) wrote :

I applied the patch and received the following error message:

bdmurray@flash:~/source_trees/bughelper.main$ ./bugnumbers -p hal --ns ">5"
Traceback (most recent call last):
  File "./bugnumbers", line 170, in <module>
    main()
  File "./bugnumbers", line 129, in main
    bugs_dict[x] = Bug(x)
  File "/home/bdmurray/source_trees/bughelper.main/launchpadbugs/connector.py", line 56, in __call__
    return self.cls(bug, url, connection=self.connection)
  File "/home/bdmurray/source_trees/bughelper.main/launchpadbugs/html_bug.py", line 1710, in __init__
    self.xmldoc = libxml2.htmlParseDoc(self.__sen_text, "UTF-8")
  File "/var/lib/python-support/python2.5/libxml2.py", line 763, in htmlParseDoc
    ret = libxml2mod.htmlParseDoc(cur, encoding)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0161' in position 6744: ordinal not in range(128)
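
The traceback is consistent with a unicode string reaching a C-level function that expects a byte string, in which case Python 2 most likely fell back to the ASCII codec. The sketch below uses Python 3 and does not call libxml2; it only reproduces the same class of error explicitly and shows the usual remedy of encoding to UTF-8 before handing the text over. The sample string is made up.

```python
text = "Kova\u0161"  # contains U+0161, the character named in the traceback

try:
    # Roughly what an implicit ASCII conversion amounts to.
    text.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc
    print(exc)

# Encoding explicitly produces bytes that match a declared "UTF-8" encoding.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)  # b'Kova\xc5\xa1'
```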

Markus Korn (thekorn) wrote :

I removed the attachment from comment 6 because it did not fix the issue.

Markus

Changed in python-launchpad-bugs:
status: Fix Committed → In Progress
Markus Korn (thekorn) wrote :

I added a branch with a fix to this bug report. The fix needs some more testing, so please test it :)

Markus

Markus Korn (thekorn) wrote :

Fixed upstream: http://bugzilla.gnome.org/show_bug.cgi?id=474205

Once this fix has landed in Ubuntu universe, we won't need to work around this issue anymore.

Markus

Markus Korn (thekorn) wrote :

This should not be an issue anymore, because:
 * the fix in libxml2 has landed in Ubuntu
 * py-lp-bugs has an internal workaround if libxml2 still fails

Please open a new bug report if something like this happens again.

Markus
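
The "internal workaround if libxml2 still fails" is not spelled out in this report. One plausible shape for such a fallback, sketched in Python 3 under that assumption, is to try the page as-is and retry once on a sanitized copy. All names below (`parse_with_fallback`, `strict_parse`, `ascii_only`) are hypothetical stand-ins; none of them is py-lp-bugs or libxml2 code.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def parse_with_fallback(
    parse: Callable[[str], T], sanitize: Callable[[str], str], page: str
) -> T:
    """Hypothetical pattern: parse the page, retry once on a sanitized copy."""
    try:
        return parse(page)
    except ValueError:
        return parse(sanitize(page))


# Stand-in parser for illustration only: it rejects any non-ASCII input.
def strict_parse(page: str) -> str:
    if not page.isascii():
        raise ValueError("parser gave up on non-ASCII input")
    return page.upper()


# Stand-in sanitizer: replace every non-ASCII character with "?".
def ascii_only(page: str) -> str:
    return page.encode("ascii", errors="replace").decode("ascii")


print(parse_with_fallback(strict_parse, ascii_only, "bug \u0161 report"))  # BUG ? REPORT
```

The point of this arrangement is that pages the parser already handles are never altered; information is only lost on the retry path.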

Changed in python-launchpad-bugs:
status: In Progress → Fix Released