authority_control_fields script's all option should pull a list of IDs from the database

Bug #1650409 reported by Jane Sandberg on 2016-12-15
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Evergreen
Undecided
Unassigned
3.1
Undecided
Unassigned
3.2
Undecided
Unassigned

Bug Description

If you run the authority_control_fields.pl script with the -a or --all flag, it will connect to OpenSRF to get a list of all the undeleted bib ids in the consortium. This is undesirable, because it can be too much data for OpenSRF to handle, resulting in the following error message:

"Can't use an undefined value as an ARRAY reference at /openils/bin/authority_control_fields.pl line 77."

If you run the same script with the --days_back flag, it connects directly to the database to get a list of IDs. This is pretty handy, as it won't time out; it would be helpful if the --all flag also retrieved these IDs directly from the database.

Rogan Hamby (rogan-hamby) wrote :

Jane,

I'm willing to work on this. What is the volume of authorities you are seeing this die on? I ask for testing purposes because the standard test database doesn't have the volume so I'll want to load up a VM to the point that I know will break.

Jane Sandberg (sandbej) wrote :

Fantastic! Thanks, Rogan. We have ~250,000 undeleted authorities in our system.

Changed in evergreen:
assignee: nobody → Rogan Hamby (rogan-hamby)
Rogan Hamby (rogan-hamby) wrote :

Jane,

I'm about ready to submit my patch but I wanted to mention a couple of other features. One is a bit more robust output if you uncomment the print line in the loop so that it says something like:

record: 4000012 20 of 63979
record: 4000011 21 of 63979
record: 4000010 22 of 63979
record: 4000009 23 of 63979
record: 4000008 24 of 63979
record: 4000007 25 of 63979
record: 4000006 26 of 63979

That way you can capture output to a log or it gives more information if you're doing it manually in a screen session.

I've also added a bit of a tweak to have it do bibs newest to oldest under the assumption that newest tend to be most important for a lot of users and that some order is better than random. If people are fine with random of course it really doesn't make any difference to them.

Patch forthcoming.

Rogan Hamby (rogan-hamby) wrote :

Patch is here:

/user/rogan/lp_1650409_auth_ctrl_linking_script

tags: added: pull
tags: added: pullrequest
removed: pull

Rogan, it looks like your patch also fixes the days back logic, thanks. Before if you gave it 2 days, it would grab the records that were updated today along with those that were updated 2 days ago, skipping records that were updated 1 day ago. That was very confusing when I started using the script.

There is another fix that I would like to see in here, a check to see if a record is marked deleted before it is processed, or a new way of handling start-id and end-id arguments. Currently if you give it a start-id and an end-id it puts all the ids between those two numbers into the records list, and tries to load each one and process them, including deleted records.

So that could be handled by using a db query to get the range of non deleted bibs between the start and end, or by checking for deletedness after loading each record. I modified our script locally to check each record after it is loaded, but it probably makes more sense to just use another query to only grab a list of non-deleted queries.

One last thing, the help message for the script states that -r can be used for both --refresh and --record, but I don't think that works. So maybe the docs should be updated to -ref and -rec for short versions of those arguments, showing that they can only be abbreviated to a unique value.

Josh

One more feature that I would like to see is a way to grab records based on what matches new authority records that have been loaded.

I think a common occurrence is that new authority records get loaded for headings that are currently not controlled.

So it would be nice to have a mode that grabs the new/edited authority records for the past x days and then process all bibs that have headings that match.

An easier/alternate way might be to have a --heading option that selects records based on metabib.browse_entry and metabib.browse_entry_def_map. That way I could manually grab all the bibs related to a particular heading and process them. For my purposes I'm fine ignoring the definition id, if some extra bibs get re-processed that isn't a big deal.

Josh

Hi Josh,

Those sound like good ideas. I'd even be willing to work on some of them
as I have time though I think they should be separate launchpad tickets.
I'm tempted to reply with thoughts to some of your ideas but I'll reserve
that for their launchpad entries.

On Tue, Jun 5, 2018 at 12:28 PM, Josh Stompro <email address hidden> wrote:

> One more feature that I would like to see is a way to grab records based
> on what matches new authority records that have been loaded.
>
> I think a common occurrence is that new authority records get loaded for
> headings that are currently not controlled.
>
> So it would be nice to have a mode that grabs the new/edited authority
> records for the past x days and then process all bibs that have headings
> that match.
>
> An easier/alternate way might be to have a --heading option that selects
> records based on metabib.browse_entry and metabib.browse_entry_def_map.
> That way I could manually grab all the bibs related to a particular
> heading and process them. For my purposes I'm fine ignoring the
> definition id, if some extra bibs get re-processed that isn't a big
> deal.
>
> Josh
>
> --
> You received this bug notification because you are a bug assignee.
> Matching subscriptions: evergreenbugs
> https://bugs.launchpad.net/bugs/1650409
>
> Title:
> authority_control_fields script's all option should pull a list of IDs
> from the database
>
> Status in Evergreen:
> New
>
> Bug description:
> If you run the authority_control_fields.pl script with the -a or --all
> flag, it will connect to OpenSRF to get a list of all the undeleted
> bib ids in the consortium. This is undesirable, because it can be too
> much data for OpenSRF to handle, resulting in the following error
> message:
>
> "Can't use an undefined value as an ARRAY reference at
> /openils/bin/authority_control_fields.pl line 77."
>
> If you run the same script with the --days_back flag, it connects
> directly to the database to get a list of IDs. This is pretty handy,
> as it won't time out; it would be helpful if the --all flag also
> retrieved these IDs directly from the database.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/evergreen/+bug/1650409/+subscriptions
>

Changed in evergreen:
assignee: Rogan Hamby (rogan-hamby) → nobody
Mark (markmg) on 2018-10-16
Changed in evergreen:
assignee: nobody → Mark (markmg)
Rogan Hamby (rogan-hamby) wrote :

Josh,

I'm finally looping back around to this. Were you doing to do a sign off on this and maybe file hose other tickets?

Jane Sandberg (sandbej) wrote :

Thanks for your work on this, Rogan. I ran your version of the script on our training server, as well as the one that is currently in master. Your version totally fixes the ARRAY reference error.

Also, uncommenting the print line and watching it work is very satisfying.

I have tested this code and consent to signing off on it with my name, Jane Sandberg and my email address, <email address hidden>.

tags: added: signedoff
Changed in evergreen:
assignee: Mark (markmg) → nobody
Michele Morgan (mmorgan) on 2019-04-16
Changed in evergreen:
milestone: none → 3.3.1
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers