Plot isn't being pulled from IMDB for some movies

Bug #224546 reported by Neil Burlock
Affects: Griffith
Status: Fix Released
Importance: Medium
Assigned to: Michael

Bug Description

On the final release version of Hardy with Griffith 0.9.6, Get From Web skips the plot for some movies. The following error appears in the terminal while attempting to fetch info on an affected movie:

/usr/share/griffith/lib/add.py:516: GtkWarning: gtk_text_buffer_emit_insert: assertion `g_utf8_validate (text, len, NULL)' failed
  plot_buffer.set_text(gutils.convert_entities(self.movie.plot))

Example movies that are causing this problem:

Corpse Bride
The Hitcher
The Departed

My locale is set to en_AU.UTF-8

Revision history for this message
Neil Burlock (malone) wrote :

I've tried the latest release and this problem still exists, so I did some investigating. What I've found is that the gutils.convert_entities function isn't working the way I think it's supposed to.

Certain IMDB plot summaries contain non-ASCII characters, for example Aeon Flux, which has a Latin "AE" character at index 3399/3400. The plot is read correctly and then passed to convert_entities, which returns it unchanged. The result is then assigned to the plot_buffer via set_text, which causes the error.
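To illustrate the failure described above, here is a minimal Python 3 sketch (the Griffith code in this report is Python 2). The sample sentence and the helper name is_valid_utf8 are made up for illustration and are not part of Griffith. It shows that the latin-1 byte for the "AE" character is not valid UTF-8, which is the kind of input the g_utf8_validate assertion rejects.

```python
# Hypothetical sample text; the real plot summary is much longer.
plot_text = "Æon Flux is set in the year 2415."

# Assumption: the page delivered the plot as latin-1 bytes.
latin1_bytes = plot_text.encode("latin-1")
# GTK text buffers expect UTF-8 bytes instead.
utf8_bytes = plot_text.encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(latin1_bytes))  # the lone 0xC6 byte is rejected
print(is_valid_utf8(utf8_bytes))    # the two-byte sequence 0xC3 0x86 is accepted
```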

I've been able to get plots to import correctly on my system by changing the code to no longer call convert_entities - this is what I've done.

In populate_with_results, around line 207, I changed it from:

    if 'plot' in fields_to_fetch:
        plot_buffer = w['plot'].get_buffer()
        plot_buffer.set_text(gutils.convert_entities(self.movie.plot))
        fields_to_fetch.pop(fields_to_fetch.index('plot'))

to (adding import unicodedata to the top of the file):

    if 'plot' in fields_to_fetch:
        plot_buffer = w['plot'].get_buffer()
        plot_buffer.set_text(unicodedata.normalize('NFD', self.movie.plot.decode('latin-1')).encode('utf-8'))
        fields_to_fetch.pop(fields_to_fetch.index('plot'))

This works without error on my system and correctly converts the AE character, so the plot is now saved in the DB. I'm not really familiar with Python, so I can't decipher what convert_entities is doing. After sticking in some print statements to trace the flow, though, I can see that the function decides nothing needs to be done to the plot and returns it unchanged, apparently in its original latin-1 encoding. That is why set_text fails whenever the string contains non-ASCII characters: it expects UTF-8.
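For readers on Python 3, the same workaround might look roughly like the sketch below. The function name latin1_to_utf8 is hypothetical, and the input is assumed to be raw latin-1 bytes. Note that the decode/encode pair is what changes the byte encoding; NFD normalization is kept only to mirror the workaround, and it leaves the "AE" character itself unchanged because that character has no canonical decomposition.

```python
import unicodedata


def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode latin-1 bytes as UTF-8, normalizing to NFD on the way."""
    text = raw.decode("latin-1")                 # bytes -> str, never fails for latin-1
    text = unicodedata.normalize("NFD", text)    # mirrors the workaround in this report
    return text.encode("utf-8")                  # str -> UTF-8 bytes for the text buffer


converted = latin1_to_utf8(b"\xc6on Flux")
print(converted)
```

Plain ASCII input passes through unchanged, so applying this to every plot should be harmless as long as the source really is latin-1.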

I found the fix on the following page, where it is explained in more detail why this works:

http://blog.magnetk.com/2008/05/06/finessing-international-characters-out-of-python

I don't have any idea why this is happening, but it has affected three different 64-bit Ubuntu systems running Feisty and Hardy since I started using Griffith early this year. The only thing I can think of is that convert_entities, the way it currently works, doesn't handle my Australian locale.

Revision history for this message
Owyn (i-leacy) wrote :

The problem also occurs in 0.9.9 on Windows, e.g. Match Point (2005).

Revision history for this message
Michael (mikej06) wrote :

Fixed in revision 1178.
The web page encoding set for the plugin was wrong; changed it from utf8 to iso8859-1.
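As a rough illustration of why the declared page encoding matters, the Python 3 sketch below decodes the same bytes with both codecs. The helper decode_page and the sample bytes are assumptions made for this example; how the Griffith plugin actually declares and applies its encoding is not shown in this report.

```python
def decode_page(raw: bytes, encoding: str) -> str:
    """Decode raw page bytes, substituting U+FFFD for undecodable bytes."""
    return raw.decode(encoding, errors="replace")


# Hypothetical fragment of a page served as iso8859-1 (latin-1).
raw_page = b"<p>\xc6on Flux</p>"

as_utf8 = decode_page(raw_page, "utf8")         # wrong codec: the AE byte is lost
as_latin1 = decode_page(raw_page, "iso8859-1")  # right codec: the AE character survives

print(as_utf8)
print(as_latin1)
```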

Changed in griffith:
assignee: nobody → mikej06
importance: Undecided → Medium
status: New → Fix Committed
Revision history for this message
Owyn (i-leacy) wrote :

Thanks. Pulled 1179 plugin (to get country newline fix as well) and moved to 0.10-beta2 plugins.

Fixed this and several other lookup problems.

Revision history for this message
Piotr Ożarowski (piotr) wrote :

0.10-rc1 released

Changed in griffith:
status: Fix Committed → Fix Released