Network Error when deleting all copies in bucket

Bug #921812 reported by Jason Stephenson on 2012-01-25
This bug affects 6 people
Affects Status Importance Assigned to Milestone

Bug Description

Evergreen version: master 20111205 + mvlc modifications (affects all versions)
OpenSRF version: 2.0.1ish
PostgreSQL version: 9.1.2
Linux distribution: Ubuntu 10.04.3 LTS

When staff try to delete all copies from a bucket using the client, they very often get a network error when there are more than a few hundred copies. Specifically, I've seen it this week with two buckets each having 2,000+ copies each.

While looking through the client code, I see that a fleshed retrieve is done before a call to update fleshed copies. I imagine that the network error comes from all of the fleshed data being sent from the client to the server. The resulting XMPP message likely exceeds our max_stanza_size parameter, which I believe is presently set to 2,000,000.

In lieu of setting the max_stanza_size even higher, I would like to modify the client to use a more space efficient deletion algorithm. A couple of options come to mind:

1. Create a batch copy delete by id method in that accepts a list of copy ids. The client could then just send the list of ids back to the server, thus sending the bare minimum of information.

2. Retrieve only the acp information for each copy and send that back, thus avoiding all of the fleshed data taking up extra space in transit.

I prefer number 1, but can see that number 2 might be made to work with fewer to no back end additions.

Also, I wonder if there is some reason to continue using the update fleshed copies method here that I don't see.

I'll start a branch to work on solution #1 unless/until someone with more insight says its a bad idea because....

Jason Stephenson (jstephenson) wrote :

Well, I came up with a quicker and probably better solution.

I've added a FM_ACP_UNFLESHED_RETRIEVE to the client to just retrieve the unfleshed copy information. This works with the batch copy update call already being used.

Along the way, I've noticed the string for cat.batch_operation_failed has a %1$2 placeholder tacked onto the end that isn't used anywhere in the client code. I'll remove that as well.

I still sometimes get a network error when doing a bucket with over 200 copies, but the deletion still succeeds. I imagine that this network error comes from a timeout value somewhere. I'll see if I can't also do something about that.

Jason Stephenson (jstephenson) wrote :

I have pushed code to working/user/dyrcon/lp921812

This code makes the first changes, but doesn't attempt to track down and fix any time out values. If I do make any changes for that, I'll add another commit to the above branch and add a comment here.

Jason Stephenson (jstephenson) wrote :

Doh...... Stupid distractions.....

I have pushed code to working/user/dyrcona/lp921812

tags: added: pullrequest
Changed in evergreen:
milestone: none → 2.2.0alpha2
assignee: Jason Stephenson (jstephenson) → nobody
status: New → In Progress
Jason Stephenson (jstephenson) wrote :

I've decided on a new approach. I'll an method to delete the underlying database entries for the contents of a container. The client can call this when deleting copies and the heavy lifting is done on the backend.

If done right, this should for any kind of container and allow one to clear the bucket itself or keep the deleted entries in the bucket: $authtoken, $container_type, $container_id, $remove_from_bucket

Changed in evergreen:
assignee: nobody → Jason Stephenson (jstephenson)
tags: removed: pullrequest
Jason Stephenson (jstephenson) wrote :

"Here we go 'round in circles!"

I put the original branch back....

tags: added: pullrequest
Jason Stephenson (jstephenson) wrote :

All right. We're fighting with this a lot this week. I am going to make a new backend method after all.

The new one will be like described in comment #4. It will take the bucket id and delete the copies from the bucket. I will not try to make it generic enough to handle other kinds of buckets.

tags: removed: pullrequest
Jason Stephenson (jstephenson) wrote :

We have also seen this occur when uploading a CSV file to item status and deleting from there.

Still working on a solution, but the nature of the solution is evolving as the problem evolves.

Jason Stephenson (jstephenson) wrote :

I am posting my wip branch at;a=shortlog;h=refs/heads/lp921812

If anyone feels like taking a look. I have not signed off on it, yet, as I'm not done. I'm not too satisfied with the client's behavior during batch operations like this, but it also may be my ignorance of the client's design.

Jason Stephenson (jstephenson) wrote :

I force pushed an update to my wip branch above today.

I have added code to use the new batch delete in copy_status.js.

For the most part, I like this solution better than the way client currently handles these things. It still has some rough edges at the moment, though, so I've not signed off on the code or applied a pull request tag, yet.

I would appreciate it if someone else could take a look and make some suggestions for improvements, such as not having the items deleted message popup twice when deleting from item status, etc.

I may just try the atomic version of the delete to see what happens.

Jason Stephenson (jstephenson) wrote :

Rebased and pushed to working/user/dyrcona/lp921812.

From the commit message:

    Add as a streaming method.

    Use this method with appropriate callbacks in server/cat/util.js,
    server/cat/copy_buckets.js, and circ/copy_status.js. This should
    improve the performance of batch deletes in the client and allow
    staff to delete more items at once.

    This patch requires that working/user/dyrcona/lp939535 be applied
    to OpenSRF in order to function properly.

Also, I think some irrelevant whitespace changes crept in as I was cleaning up some that I accidentally made. Sorry, Dan!

tags: added: pullrequest
Changed in evergreen:
status: In Progress → New
Changed in evergreen:
assignee: Jason Stephenson (jstephenson) → nobody
Changed in evergreen:
milestone: 2.2.0alpha2 → 2.2.0beta1
Jason Stephenson (jstephenson) wrote :

I have forced pushed a new version of the branch that removes the irrelevant whitespace changes. I decided to do this after having merge conflicts with some other branches that I believe are related to the whitespace changes that I didn't mean to introduce.

Changed in evergreen:
status: New → Confirmed
importance: Undecided → Medium
Changed in evergreen:
milestone: 2.2.0beta1 → 2.2.0rc1

It's regrettable that this has been neglected. This addresses a real problem, but is the solution a new feature or a bugfix (and therefore a-OK for rel_2_2)? Seeking opinions.

Changed in evergreen:
milestone: 2.2.0rc1 → 2.2.0

Jason, are you guys running this code today? Is it solving the original problem? Having recently heard the problem reported from other sites, I'm more inclined to look at this as bug fix rather than new feature, and to get it tested by a second signer-offer.

Jason Stephenson (jstephenson) wrote :

We have not actually used it in production. I have used it in development/testing.

If you want to look at it as a bug fix, feel free. Despite having the pullrequest tag, I think it might be a bit rough around the edges.

For instance, I've seen the client timeout and throw a network error while the server is still chugging away on the deletions with this patch. If you wait a few minutes and refresh the view, you'll that everything did in fact delete.

I think the client and backend issues with batch processing go much deeper than batch deletes, but this is where we noticed it first, when someone tried to delete all 5,000 copies in a bucket.

Changed in evergreen:
milestone: 2.2.0 → none

We are experiencing this same type of behavior with small numbers of records in copy buckets, i.e. greater than 25 records produces the network error when deleting all records from a copy bucket. This is a consistent number of records and the limit greatly reduces our work effectiveness.

Jason Stephenson (jstephenson) wrote :


Do you have a chance to try the code in the working/user/dyrcona/lp921812 branch and report on your findings? If not, that's ok, but if you have a test server where you could try it out, that would be very helpful.


Jason Stephenson (jstephenson) wrote :

I removed the pullrequest tag because I was never completely satisfied with my "solution." It would delete everything from the bucket, but the client would still recieve a timeout error and display a dialog to the user. However, if you're patient and wait a little while, then referesh the view, you'd see that everything was deleted.

I have no idea if the branch merges cleanly today or not.

tags: removed: pullrequest
Terran McCanna (tmccanna) wrote :

I believe this is working in the web staff client, so marking Won't Fix. If it is still a problem, it probably deserves a new bug with web client details.

Changed in evergreen:
status: Confirmed → Won't Fix
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers