Write script to reindex extracted text based on updated support for Office 2010

Bug #1045401 reported by Paul Everitt
This bug affects 1 person
Affects: KARL3
Status: Won't Fix
Importance: Medium
Assigned to: Carlos de la Guardia

Bug Description

Most likely you can also focus it only on File objects.

JimPGlenn (jpglenn09)
Changed in karl3:
milestone: m117 → m118
Changed in karl3:
assignee: Chris Rossi (chris-archimedeanco) → Carlos de la Guardia (cguardia)
Revision history for this message
Carlos de la Guardia (cguardia) wrote :

Just a question here. Why can't we just use bin/karlserve reindex_text? Seems to work.

Changed in karl3:
status: New → In Progress
Paul Everitt (paul-agendaless) wrote :

That's a question for Chipp.

--Paul

On Sep 12, 2012, at 4:54 PM, Carlos de la Guardia <email address hidden> wrote:

> Just a question here. Why can't we just use bin/karlserve reindex_text?
> Seems to work.

Paul Everitt (paul-agendaless) wrote :

Oops, I mean, Chris.

--Paul


Carlos de la Guardia (cguardia) wrote :

bin/karlserve reindex_text does work

Paul Everitt (paul-agendaless) wrote :

Carlos, did you see the questions I asked in IRC when we talked about the existing script? Basically, we want something that:

- Does *not* respect the cached version of the text extraction

- Runs only against File objects

- Can run in batches; otherwise we have to shut KARL down to avoid conflict errors. Perhaps something that does all the docids ending in 0, then commits, then those ending in 1, then commits, and so on. Or perhaps just date-based: from the time you start, do the first 10 that haven't been updated since then, then the next 10.
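
The docid-based batching idea above could be sketched roughly as follows. This is only an illustration under stated assumptions, not KARL code: `reindex_one` and `commit` are hypothetical callables standing in for whatever re-extracts the text of one document and commits the transaction.

```python
from collections import defaultdict


def batches_by_last_digit(docids):
    """Group integer docids into up to ten batches by their last decimal digit."""
    groups = defaultdict(list)
    for docid in docids:
        groups[abs(docid) % 10].append(docid)
    # Batches come out in digit order (0, 1, ..., 9); empty ones are skipped.
    return [groups[digit] for digit in range(10) if groups[digit]]


def reindex_in_batches(docids, reindex_one, commit):
    """Reindex one batch at a time, committing after each batch.

    reindex_one and commit are hypothetical stand-ins supplied by the caller.
    Returns the number of documents processed.
    """
    done = 0
    for batch in batches_by_last_digit(docids):
        for docid in batch:
            reindex_one(docid)
            done += 1
        # Short transactions keep the window for conflict errors small.
        commit()
    return done
```

Committing after each group keeps every transaction small, which is the point of the suggestion: a conflict then costs at most one batch of work rather than the whole run.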

JimPGlenn (jpglenn09)
Changed in karl3:
milestone: m118 → m119
JimPGlenn (jpglenn09)
Changed in karl3:
milestone: m119 → m120
Paul Everitt (paul-agendaless) wrote :

Carlos, got a status report on this one?

Carlos de la Guardia (cguardia) wrote :

I have the script but need write access to karldev to commit and deploy my branch. Already emailed Chris Rossi about this tonight.

Paul Everitt (paul-agendaless) wrote :

Just to be clear, did you add the optimizations that we discussed? E.g. re-indexing only File objects that are Office files ending in "x", making sure we redo the text extraction, and working in chunks to avoid conflict errors.

--Paul

On Oct 9, 2012, at 3:00 AM, Carlos de la Guardia <email address hidden> wrote:

> I have the script but need write access to karldev to commit and deploy
> my branch. Already emailed Chris Rossi about this tonight.

Carlos de la Guardia (cguardia) wrote :

Currently it reindexes all community files, not just 'x'. It ignores cached text and supports both an 'interval' parameter to commit every 'n' files and a 'path' parameter to only reindex files on a given path, e.g. '/communities'.

We had said all files, but I can change it easily to look only for 'x' files.
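
A minimal sketch of how the 'interval' and 'path' parameters described above might behave. The actual script is not shown in this report, so the function name, its signature, and the `reindex_one` and `commit` callables are all assumptions made for illustration.

```python
def reindex_files(paths, reindex_one, commit, interval=10, path_prefix="/"):
    """Reindex every path under path_prefix, committing every `interval` files.

    reindex_one and commit are hypothetical callables supplied by the caller.
    Returns the number of files reindexed.
    """
    # Normalise so that "/communities" does not also match "/communities2/...".
    prefix = path_prefix.rstrip("/") + "/"
    pending = 0
    total = 0
    for path in paths:
        if not path.startswith(prefix):
            continue
        reindex_one(path)
        total += 1
        pending += 1
        if pending >= interval:
            commit()
            pending = 0
    if pending:
        # Commit the final, partially filled batch.
        commit()
    return total
```

A lower `interval` means more commits and shorter transactions; a narrower `path_prefix` reduces how much of the site a single run touches.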

Paul Everitt (paul-agendaless) wrote :

If you have the interval parameter then I suppose we can skip the "x" part.

What happens if someone adds a new file while this thing is running?

Perhaps we should change it so that it can be run multiple times without repeating the same work. Can you think of a strategy for that? Perhaps date ranges on modification date.

--Paul

On Oct 9, 2012, at 10:04 AM, Carlos de la Guardia <email address hidden> wrote:

> Currently it reindexes all community files, not just 'x'. It ignores
> cached text and supports both an 'interval' parameter to commit every
> 'n' files and a 'path' parameter to only reindex files on a given path,
> e.g. '/communities'.
>
> We had said all files, but I can change it easily to look only for 'x'
> files.

Carlos de la Guardia (cguardia) wrote :

Chris says scripts are no longer configured inside karlserve, so I changed the script accordingly. Deployed to cguardia-1045401-reindex-docs. Testing will require shell access, however.

Changed in karl3:
status: In Progress → Fix Committed
Carlos de la Guardia (cguardia) wrote :

If someone adds a new file while this runs it won't be picked up, because the list of paths is generated before the loop starts.

For running multiple times, an easy thing to do would be to add a 'newer' parameter with a date and only index files added after that date. Other strategies would require keeping some sort of state each time the script is run.
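
To illustrate both points, here is a small sketch of a 'newer' filter applied to a list that is built once, before any processing starts. The `(path, created)` pairs and the `select_newer` helper are hypothetical; how the real script gathers paths and dates is not shown in this report.

```python
from datetime import datetime


def select_newer(files, newer=None):
    """Return a snapshot list of paths whose creation date is after `newer`.

    files is an iterable of (path, created) pairs, a hypothetical stand-in
    for whatever the catalog would return. With newer=None, every path is kept.
    """
    return [path for path, created in files if newer is None or created > newer]


files = [
    ("/communities/a/old.docx", datetime(2012, 9, 1)),
    ("/communities/a/new.docx", datetime(2012, 10, 9)),
]
# The list is built here, before the loop, so later additions are not seen.
snapshot = select_newer(files, newer=datetime(2012, 10, 1))
files.append(("/communities/a/late.docx", datetime(2012, 10, 10)))
```

Because `snapshot` is a plain list computed up front, the file appended afterwards is not part of this run; a second run with a later `newer` date would be needed to pick it up.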

Paul Everitt (paul-agendaless) wrote :

I suppose we can bail on the running-multiple-times thing then. We'll set the "interval" low enough that we should be able to avoid a conflict error.

--Paul

On Oct 9, 2012, at 7:13 PM, Carlos de la Guardia <email address hidden> wrote:

> If someone adds a new file while this runs it won't be picked up,
> because the list of paths is generated before the loop starts.
>
> For running multiple times, an easy thing to do would be to add a
> 'newer' parameter with a date and only index files added after that
> date. Other strategies would require keeping some sort of state each
> time the script is run.

Carlos de la Guardia (cguardia) wrote :

The 'path' parameter can also be useful to reduce the possibility of conflict by concentrating on just an area of the site.

Paul Everitt (paul-agendaless) wrote :

We did this another way.

Changed in karl3:
milestone: m120 → m122
status: Fix Committed → Won't Fix