Acquisitions EDI Fetch Should Have Option to Delete Remote Files
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Evergreen | Confirmed | Wishlist | Unassigned |
Bug Description
Evergreen Versions: 3.11, 3.12, main
OpenSRF Version: N/A
PostgreSQL version: N/A
When fetching EDI files for acquisitions, the remote files are left on the server for the vendor to clean up. Several vendors do not remove fetched files, so they linger on the server for years. This means that the fetcher must handle an ever-increasing backlog of previously fetched files with each run. (C/W MARS has one account that currently reports over 1,700 files that get skipped in each fetcher run.) Given how the EDI accounts are set up, multiple accounts could be looking at the same files, further increasing the amount of work to be done (cf. bug 1836908).
The EDI fetch process should at least allow the option to delete files from the remote server when they are retrieved. This could work in a number of ways:
1. Delete all remote files determined to be duplicates by the fetcher.
2. Delete all remote files after they are picked up.
3. A combination of 1 & 2.
The above could be made the de facto behavior of the fetcher, or it could be controlled by one or more actor org_unit settings or command line switches to the edi_fetcher.pl program.
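To make the three options concrete, here is an illustrative sketch of how a fetcher loop could apply each deletion policy. This is not Evergreen code (the real fetcher is edi_fetcher.pl, in Perl); the names `RemoteFile`, `process_remote_files`, `already_processed`, and the policy strings are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RemoteFile:
    name: str

def process_remote_files(files, already_processed, policy):
    """Apply one of the proposed deletion policies.

    policy: 'delete_duplicates' (option 1), 'delete_after_fetch' (option 2),
            or 'delete_both' (option 3).
    Returns (fetched, deleted) lists of filenames.
    """
    fetched, deleted = [], []
    for f in files:
        if f.name in already_processed:
            # Options 1 and 3: remove known duplicates instead of just skipping.
            if policy in ('delete_duplicates', 'delete_both'):
                deleted.append(f.name)
            continue
        fetched.append(f.name)
        # Options 2 and 3: remove files once they are successfully retrieved.
        if policy in ('delete_after_fetch', 'delete_both'):
            deleted.append(f.name)
    return fetched, deleted
```

Whether `policy` comes from an actor.org_unit_setting or a command-line switch is exactly the open question above; the loop itself is the same either way.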
I'd like some feedback on these options before beginning work on the code.
Changed in evergreen:
status: New → Confirmed
This would be a great improvement. Right now I'm just manually going into the vendor FTP folders periodically and deleting things older than X weeks/months.
Personally, I lean toward this being the default behavior. Not only would I like to avoid adding more org unit settings, but I also think this is simply logical behavior for the fetcher.
I'm leaning toward option #1, deleting files that are determined to be duplicates. IIRC, the fetcher accesses the folder, reads the filename, determines that it's a duplicate, and skips it. So maybe instead of skipping, it could delete?
I also wonder if there's benefit to adding a time component: checking that a file is a duplicate and that it's older than X before deleting it. Sometimes, when someone says they're missing invoices, I'll access the FTP folders to see when the last files were placed there and whether the missing invoices are present. That tells me whether something went wrong in Evergreen when creating the invoice, or whether the vendor simply hasn't put the files on the server yet.
If that sounds like a good idea to anyone else, my vote would be for things older than 1 month. That's plenty of time, IMO, and it would still drastically cut down on the files that need to be skipped as dupes, which is the original intent of the bug, I think.
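The "duplicate AND older than X" rule suggested above could be expressed as a simple predicate. This is only a sketch of the proposed logic; the function name, the use of file modification time, and the 30-day cutoff are assumptions for illustration, not existing Evergreen behavior.

```python
import time

# Roughly one month, per the suggestion in the comment above (an assumption).
RETENTION_SECONDS = 30 * 24 * 60 * 60

def should_delete(filename, mtime, already_processed, now=None):
    """Delete only files that are both known duplicates and older than the cutoff.

    mtime: the remote file's modification time as a Unix timestamp.
    already_processed: set of filenames the fetcher has already retrieved.
    """
    now = time.time() if now is None else now
    is_duplicate = filename in already_processed
    is_old = (now - mtime) > RETENTION_SECONDS
    return is_duplicate and is_old
```

A recent duplicate would still be skipped rather than deleted, preserving the ability to eyeball recent files on the vendor's server when troubleshooting missing invoices.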