Files cache data extracted for text indexing

Bug #1309688 reported by Tres Seaver
This bug affects 1 person
Affects | Status       | Importance | Assigned to      | Milestone
KARL3   | Fix Released | Medium     | Christian Theune |

Bug Description

For hysterical raisins, the file objects cache the text returned by an extractor program as an attribute.

This caching defeats the point of using blobs, making the instances' pickles (and RAM consumption) huge.

Rip it out.
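
A minimal sketch of why the cached text hurts, assuming the object's state is pickled from its `__dict__` (the default for persistent objects). The attribute names `blob_path` and `extracted_text` are hypothetical and are not taken from the KARL source:

```python
import pickle

# Hypothetical state of a file object that only references its blob.
state_without_cache = {"blob_path": "blobs/0x01"}

# The same state with the extractor's output cached as an attribute.
text = "lorem ipsum " * 100000  # stand-in for extracted document text
state_with_cache = {"blob_path": "blobs/0x01", "extracted_text": text}

# The pickle has to carry every attribute, so the cached text is stored
# (and loaded into RAM) along with the object itself.
size_without = len(pickle.dumps(state_without_cache, protocol=2))
size_with = len(pickle.dumps(state_with_cache, protocol=2))

print(size_without, size_with)
```

The blob keeps the file's bytes out of the object record, but the cached text puts a comparable amount of data straight back in.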

Tres Seaver (tseaver) wrote :

Added a script to delete the attribute from extant instances. Run via:

  $ bin/karlserve remove_extracted_data <instance>
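
The script itself is not shown in this report. As a rough sketch of the idea only, assuming the cached text lives in a single instance attribute (the name `_extracted_data` below is hypothetical, as are the helper and the demo class):

```python
def remove_cached_attribute(objects, attr_name):
    """Delete attr_name from every object that carries it; return the count."""
    removed = 0
    for obj in objects:
        # Look only at the instance's own state, not at class attributes.
        if attr_name in vars(obj):
            delattr(obj, attr_name)
            removed += 1
    return removed


class DemoFile:
    """Stand-in for a file object; not the real KARL class."""

    def __init__(self, cached=None):
        if cached is not None:
            self._extracted_data = cached


files = [DemoFile("some text"), DemoFile(), DemoFile("more text")]
removed = remove_cached_attribute(files, "_extracted_data")
print(removed)
```

Against a real database the changes would also need to be committed in a transaction, which this sketch leaves out.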

Changed in karl3:
importance: Undecided → Medium
status: New → Fix Committed
Tres Seaver (tseaver) wrote :

Software for doing this is now on production. I'd like to get some
before-cleanup stats (zoid / state of biggest objects, aggregate sizes,
etc.) before doing the actual pruning, so as to compare with afterward.
At the moment, I can't get anything besides:

 select zoid, state_size from public.object_state
  order by state_size desc
  limit 20;
   zoid | state_size
----------+------------
  5847658 | 6832938
 35866538 | 4938964
 52181521 | 3407603
 49659668 | 2129789
 49659829 | 2124742
 53705386 | 2098508
 49582361 | 2097863
  2872333 | 2097743
   181176 | 2097728
 33765744 | 2097726
  4722173 | 2097724
 30799170 | 2097721
   325183 | 2097720
  4828707 | 2097709
  4248059 | 2097688
  4393028 | 2097688
  3359491 | 2097687
  2231968 | 2097667
 22349334 | 2097637
 13698547 | 2097634
(20 rows)
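
One way the before/after comparison might be done is to reduce rows like those above to a few aggregate figures. This is a sketch only: the `summarize` helper is hypothetical and simply takes `(zoid, state_size)` tuples, such as the result of `cursor.fetchall()` for the query above:

```python
def summarize(rows):
    """Reduce (zoid, state_size) rows to a few comparable figures."""
    rows = list(rows)
    largest_zoid, largest_size = max(rows, key=lambda row: row[1])
    return {
        "count": len(rows),
        "total_bytes": sum(size for _zoid, size in rows),
        "largest_zoid": largest_zoid,
        "largest_bytes": largest_size,
    }


# The three biggest rows from the listing above.
top_rows = [(5847658, 6832938), (35866538, 4938964), (52181521, 3407603)]
before = summarize(top_rows)
print(before)
```

Running the same function after the cleanup gives a second dictionary to compare against `before`.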

When I try to look at the actual pickle for the biggest zoid, I'm seeing binary guff:

  select substr(state, 1, 30) from public.object_state
    where zoid = 5847658;
                             substr
----------------------------------------------------------------
 \x636b61726c2e636f6e74656e742e6d6f64656c732e66696c65730a436f6d
(1 row)

Theune, can you help?

Changed in karl3:
assignee: Tres Seaver (tseaver) → Christian Theune (ct-gocept)
status: Fix Committed → Fix Released
Christian Theune (ctheune) wrote :

I got to the actual pickle by accessing the database with a raw 5-line psycopg script - the string value of this result will be a good pickle. The PostgreSQL CLI client is doing some "magic" there ...
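
The "magic" is presumably psql's hex rendering of `bytea` values (`\x` followed by two hex digits per byte, the default output format since PostgreSQL 9.0). As a small sketch, the value shown earlier can be decoded by hand; `decode_bytea_hex` is a hypothetical helper:

```python
def decode_bytea_hex(text):
    """Turn psql's hex-format bytea output back into raw bytes."""
    if not text.startswith("\\x"):
        raise ValueError("not hex-format bytea output")
    return bytes.fromhex(text[2:])


# The value psql displayed for zoid 5847658.
shown = r"\x636b61726c2e636f6e74656e742e6d6f64656c732e66696c65730a436f6d"
raw = decode_bytea_hex(shown)
print(raw)  # b'ckarl.content.models.files\nCom'
```

The leading `c` is the pickle GLOBAL opcode, followed by the module name, a newline, and the first three characters of the class name, so the stored state is an ordinary pickle after all.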

Tres Seaver (tseaver) wrote :

FTR, here is the psycopg2 script I am using::

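    import psycopg2

    threshold = 2000000  # example value only: minimum state_size in bytes
    # Replace the <...> placeholders with real connection settings.
    db = psycopg2.connect(host='<HOST>', database='<DATABASE>',
                          user='<USER>', password='<PASSWORD>')
    c = db.cursor()
    # Pass the threshold as a bound parameter instead of interpolating it.
    c.execute('select substr(state, 1, 500) from object_state '
              'where state_size > %s', (threshold,))
    rows = c.fetchall()
    # psycopg2 returns bytea as a buffer (memoryview), not a bytes object.
    print(bytes(rows[0][0]))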
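
Once the raw bytes are in hand, the standard library's `pickletools` can show what is taking up the space without importing any KARL classes. A sketch, assuming a complete pickle is passed in (the 500-byte prefix fetched above is truncated and would likely fail part-way through); the `largest_strings` helper and the sample data are hypothetical:

```python
import pickle
import pickletools


def largest_strings(data, top=3):
    """Return (length, position, opcode name) for the biggest strings."""
    found = []
    # genops walks the opcodes of one pickle without executing them.
    for opcode, arg, pos in pickletools.genops(data):
        if isinstance(arg, (bytes, str)):
            found.append((len(arg), pos, opcode.name))
    return sorted(found, reverse=True)[:top]


# Stand-in for an object state carrying a large cached text attribute.
sample = pickle.dumps(
    {"title": "report.pdf", "cached_text": "x" * 100000}, protocol=2)
report = largest_strings(sample)
print(report[0])
```

Note that `genops` stops at the first STOP opcode, so a record made of several concatenated pickles would need one call per pickle.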
