Implement 64k limit on extracted_text

Bug #1340295 reported by Paul Everitt
This bug affects 1 person
Affects: KARL3
Status: Fix Released
Importance: Low
Assigned to: Chris Rossi

Bug Description

As the followup to lp:1338271 let's take the following steps:

- Change the software to have a configurable upper size limit (set at 64k): extracted text is stored on the resource as extracted_text only when it is within that limit

- *AS AN OPTION*, gzip the string

- Write a console script which, in a production-friendly way(*), wipes out extracted_text on existing objects

(*) Production-friendly means:

a. Can be run without making a site update take forever for an evolve, getting stuck on evolve errors, etc.

b. Doesn't accidentally generate a bunch of email alerts or feed entries
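
The first two steps above could be sketched roughly as follows. This is a minimal illustration, not KARL's actual code: the names MAX_CACHED_TEXT_SIZE and cacheable_text are hypothetical, the limit is assumed to apply to the UTF-8 encoded size before compression, and zlib is used for the optional compression.

```python
import zlib

# Hypothetical setting name; 64k is the value proposed in the description.
MAX_CACHED_TEXT_SIZE = 64 * 1024


def cacheable_text(text, limit=MAX_CACHED_TEXT_SIZE, compress=False):
    """Return the bytes to store as extracted_text, or None if over the limit.

    Assumption: the limit is checked against the UTF-8 encoded size,
    before any compression is applied.
    """
    raw = text.encode("utf-8")
    if len(raw) > limit:
        # Too large to cache on the resource; the caller stores nothing.
        return None
    # Compression is optional; level 1 favours speed over ratio.
    return zlib.compress(raw, 1) if compress else raw
```

A caller would skip setting extracted_text whenever the function returns None, so oversized documents are re-extracted on demand instead of being stored on the resource.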

Tags: r3.131
Changed in karl3:
milestone: m138 → m139
Changed in karl3:
status: New → In Progress
Chris Rossi (chris-archimedeanco) wrote :

Committed on branch: better-extracted-text-caching-lp1340295

Changed in karl3:
status: In Progress → Fix Committed
tags: added: r3.130
tags: added: r3.131
removed: r3.130
Paul Everitt (paul-agendaless) wrote :

Ran for a while, then hit a UnicodeEncodeError in zlib.

Traceback (most recent call last):
  File "bin/karlserve", line 112, in <module>
    karlserve.scripts.main.main()
  File "/srv/karlstaging/staging/48/eggs/karlserve-1.27-py2.6.egg/karlserve/scripts/main.py", line 86, in main
    func(args)
  File "/srv/karlstaging/staging/48/eggs/karlserve-1.27-py2.6.egg/karlserve/scripts/main.py", line 204, in wrapper
    return func(args)
  File "/srv/karlstaging/staging/48/eggs/karlserve-1.27-py2.6.egg/karlserve/scripts/reindex_text.py", line 51, in main
    reindex_text(args, site)
  File "/srv/karlstaging/staging/48/eggs/karlserve-1.27-py2.6.egg/karlserve/scripts/reindex_text.py", line 131, in reindex_text
    reindex_batch(args, site)
  File "/srv/karlstaging/staging/48/eggs/karlserve-1.27-py2.6.egg/karlserve/scripts/reindex_text.py", line 199, in reindex_batch
    new_index.index_doc(docid, doc)
  File "/srv/karlstaging/staging/48/eggs/perfmetrics-2.0-py2.6.egg/perfmetrics/__init__.py", line 133, in call_with_metric
    return f(*args, **kw)
  File "/srv/karlstaging/staging/48/eggs/repoze.pgtextindex-1.2-py2.6.egg/repoze/pgtextindex/index.py", line 140, in index_doc
    value = self.discriminator(obj, _missing)
  File "/srv/karlstaging/staging/48/eggs/karl-3.131-py2.6.egg/karl/models/site.py", line 232, in get_weighted_textrepr
    texts = _get_texts(obj, default)
  File "/srv/karlstaging/staging/48/eggs/karl-3.131-py2.6.egg/karl/models/site.py", line 188, in _get_texts
    texts = adapter()
  File "/srv/karlstaging/staging/48/eggs/karl-3.131-py2.6.egg/karl/content/models/adapters.py", line 55, in __call__
    value = attr(self.context)
  File "/srv/karlstaging/staging/48/eggs/karl-3.131-py2.6.egg/karl/content/models/adapters.py", line 113, in _extract_and_cache_file_data
    context._extracted_data = cached_data = _CachedData(data)
  File "/srv/karlstaging/staging/48/eggs/karl-3.131-py2.6.egg/karl/content/models/adapters.py", line 91, in __init__
    self.data = zlib.compress(data, 1)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 218: ordinal not in range(128)

Changed in karl3:
status: Fix Committed → In Progress
Chris Rossi (chris-archimedeanco) wrote :

Just committed a fix on master. Sorry about that.
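
The committed fix itself is not shown in this report. The traceback is consistent with a unicode string being passed to zlib.compress under Python 2, which implicitly encodes it as ASCII and fails on u'\u2013'. One way to avoid that, sketched here under the assumption that UTF-8 is an acceptable storage encoding, is to encode explicitly before compressing. CachedText is a hypothetical stand-in for the _CachedData class named in the traceback, not its real implementation.

```python
import zlib


class CachedText(object):
    """Hypothetical stand-in for _CachedData; not the actual KARL class."""

    def __init__(self, text):
        # zlib operates on bytes. Encode text explicitly instead of relying
        # on Python 2's implicit ASCII encoding of unicode strings.
        if not isinstance(text, bytes):
            text = text.encode("utf-8")
        self.data = zlib.compress(text, 1)

    def get(self):
        # Reverse both steps: decompress, then decode back to text.
        return zlib.decompress(self.data).decode("utf-8")
```

With this shape, text containing an en dash round-trips through the cache unchanged.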

Changed in karl3:
status: In Progress → Fix Committed
Changed in karl3:
status: Fix Committed → Fix Released