Aged transactions can be de-anonymized using post code and birth year

Bug #1861239 reported by Jeff Davis
This bug affects 4 people
Affects: Evergreen
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

EG 3.3

Aged circs and aged holds include the user's postal/zip code and birth year. This is enough to uniquely identify the user in many cases. For example, over 10% of user accounts here at Sitka have a unique combination of postal code and birth year.
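A figure like the one above can be estimated by counting how many accounts share each (postal code, birth year) pair. The following is a minimal Python sketch under stated assumptions: the sample records, the field names, and the helper `unique_fraction` are all hypothetical, and this is not the query that was run at Sitka.

```python
from collections import Counter


def unique_fraction(users):
    """Return the share of users whose (post_code, birth_year) pair is unique."""
    pairs = [(u["post_code"], u["birth_year"]) for u in users]
    if not pairs:
        return 0.0
    counts = Counter(pairs)
    # A user is re-identifiable from these two fields alone when nobody
    # else shares the same combination.
    unique = sum(1 for p in pairs if counts[p] == 1)
    return unique / len(pairs)


# Hypothetical accounts, for illustration only.
sample_users = [
    {"post_code": "V6B 1A1", "birth_year": 1980},
    {"post_code": "V6B 1A1", "birth_year": 1980},
    {"post_code": "V6B 1A1", "birth_year": 1975},
    {"post_code": "V0N 2M0", "birth_year": 1990},
]

print(unique_fraction(sample_users))  # 0.5
```

Two of the four sample accounts have a combination nobody else shares, so an aged transaction carrying those two fields would point back to exactly one person.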

In my opinion, aged transactions should not retain this information at all. If there is a need for aggregate information on transactions by postal region or birth year, that information should be aggregated by other means before transactions are aged.

Anna Goben (agoben) wrote :

If this info is going to be pulled, it needs to be on an opt-in basis. We use this info for demographic analysis relating to circulations.

Changed in evergreen:
assignee: nobody → Rogan Hamby (rogan-hamby)
Rogan Hamby (rogan-hamby) wrote :

Patch at user/rogan/lp1861239_anonymize_options

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=commit;h=45c9598a5f0cc47cd586241b67679563ef654222

Adds four org unit settings (post code and birth year, for holds and circs respectively) and updates the aging functions to use them. Normally I'd prefer that exporting information be opt-in, but given that there is an existing behavior and changing it would disrupt things, I took the less disruptive route: the default is to have no value set and to behave as it does currently.
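The behavior described in this comment can be sketched roughly as follows. This is an illustrative Python model only, not the patch itself (the real change lives in the database aging functions). The setting names, the field names, and the function `age_transaction` are hypothetical; the only things taken from the comment are that there is one setting per field per transaction type and that an unset value keeps the current behavior.

```python
def age_transaction(kind, xact, org_settings):
    """Return an aged copy of a transaction dict.

    kind is "circ" or "hold". org_settings maps setting names to values
    for the relevant org unit. All names here are hypothetical.
    """
    aged = dict(xact)
    aged.pop("usr", None)  # aging removes the direct link to the patron
    for field in ("post_code", "birth_year"):
        setting = f"{kind}.aged.drop_{field}"  # hypothetical setting name
        # Unset (or false) means "behave as it does currently": keep the value.
        if org_settings.get(setting):
            aged[field] = None
    return aged


circ = {"id": 1, "usr": 42, "post_code": "V6B 1A1", "birth_year": 1980}

# No settings: existing behavior, both fields are retained.
print(age_transaction("circ", circ, {}))

# Site chooses to stop retaining post code on aged circs.
print(age_transaction("circ", circ, {"circ.aged.drop_post_code": True}))
```

With this default, a site that never touches the settings keeps retaining both fields, which is the point debated in the rest of the thread.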

Changed in evergreen:
assignee: Rogan Hamby (rogan-hamby) → nobody
tags: added: privacy
tags: added: pullrequest
Jeff Davis (jdavis-sitka) wrote :

Thanks, Rogan.

I think retention of post code/birth year needs to be opt-in, even though it means sites will need to take active measures to preserve existing behavior. If it's opt-out, any sites that aren't aware of the setting (or don't realize that post code and birth year are enough to de-anonymize) will end up unintentionally retaining personally identifying information, which is contrary to the purpose of aging transactions. We should err on the side of privacy.

For sites that are using this data, how important is it to retain at the level of individual transactions? Could the same goal be achieved by recording aggregate data instead, e.g. monthly circ counts by birth year range, monthly circ counts by postal region?
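As a rough illustration of the aggregate alternative raised here, the sketch below counts circs per month by birth year range and by postal region, so that the per-transaction values would not need to be kept. It is a minimal Python example with assumptions throughout: the field names, the ten-year bucket width, the three-character postal prefix, and the helper names are all hypothetical choices, not anything Evergreen provides.

```python
from collections import Counter
from datetime import date


def birth_year_range(year, width=10):
    """Bucket a birth year into a range label such as '1980-1989'."""
    start = (year // width) * width
    return f"{start}-{start + width - 1}"


def postal_region(post_code, prefix_len=3):
    """Coarsen a postal code to its leading characters (assumed region key)."""
    return post_code.replace(" ", "").upper()[:prefix_len]


def aggregate_circs(circs):
    """Count circs per (month, birth year range) and per (month, region)."""
    by_year_range = Counter()
    by_region = Counter()
    for c in circs:
        month = c["xact_start"].strftime("%Y-%m")
        if c["birth_year"] is not None:
            by_year_range[(month, birth_year_range(c["birth_year"]))] += 1
        if c["post_code"]:
            by_region[(month, postal_region(c["post_code"]))] += 1
    return by_year_range, by_region


# Hypothetical circulations, for illustration only.
sample_circs = [
    {"xact_start": date(2020, 1, 5), "birth_year": 1984, "post_code": "V6B 1A1"},
    {"xact_start": date(2020, 1, 20), "birth_year": 1989, "post_code": "V6B 2C3"},
    {"xact_start": date(2020, 2, 2), "birth_year": 1990, "post_code": "V0N 2M0"},
]

by_year_range, by_region = aggregate_circs(sample_circs)
print(by_year_range[("2020-01", "1980-1989")])  # 2
print(by_region[("2020-01", "V6B")])            # 2
```

Keeping the two counts separate, rather than counting by the combined pair, avoids recreating the same postal code and birth year combination at a coarser grain.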

Jason Stephenson (jstephenson) wrote :

Following the principle of least surprise, I agree with the way Rogan has described it working.

We're already trying to figure out what to do with another feature that had no pre-upgrade opt-in or opt-out before our upgrade to 3.4.

Anna Goben (agoben) wrote :

We purge patron records regularly, so deanonymizing would be rather difficult to do with any degree of certainty. But we regularly need demographic circulation information over time for things like grants and other funding, which is why wiping that info out would be problematic.

Jane Sandberg (sandbergja) wrote :

I just learned about Latanya Sweeney's research in a training. According to her paper, 87% of people in the United States can be uniquely identified with a combination of ZIP code, date of birth, and gender: https://dataprivacylab.org/projects/identifiability/paper1.pdf -- we don't usually store gender in Evergreen, but it certainly suggests that we should be careful with the postal code and date of birth.

If we do use the opt-in approach, we really need to publicize it well!

Rogan Hamby (rogan-hamby) wrote :

If we were adding the data as a new feature, I would agree that inclusion should be opt-in. In fact, I wish that had been the case. However, when there is an existing feature with an existing behavior, my opinion is that we should be as minimally disruptive as possible. The scenario Anna is describing is one where the data is required for state reports, and funding is put at risk because a library doesn't realize for an extended period after an upgrade that the data is missing. I don't think we should ignore that.

All of that said, I'll be glad to write something up for the release notes that says in screaming caps "hey, if you don't need it, turn this off", and we can provide recommended SQL for people to wipe what is there.

tags: added: needsdiscussion
Changed in evergreen:
milestone: none → 3.next
Rogan Hamby (rogan-hamby) wrote :

Following up on the discussion during the Hack-A-Way, I will redo this branch with release notes; it also needs rebasing against current master. So I'm going to remove the pull request tag until I add the new branch.

tags: removed: pullrequest
Rogan Hamby (rogan-hamby) wrote :

After a couple of thinkos, here is a new branch, rebased against master and with sparkly new release notes: https://git.evergreen-ils.org/?p=working/Evergreen.git;a=commit;h=489c3fc6a9f927d993140a8dacc35a3076ad2378

lp1861239_anonymize_year_and_postcode

tags: added: pullrequest
tags: removed: needsdiscussion
Ted Peterson (devted) wrote :

Commit 489c3fc6a9f927d993140a8dacc35a3076ad2378 had a conflict, and I included Rogan's addition for file 950.data.seed-values.sql on the MOBIUS bugsquash server.

Rogan Hamby (rogan-hamby) wrote :

Following up on the discussion at the Hack-A-Way, I'm going to keep the methodology as opt-in, as the support was strong for that. I will retest it against a current build of Evergreen and see what conflicts have snuck in over time. For now I'm removing the pull request tag until I can retest it.

tags: removed: pullrequest
Rogan Hamby (rogan-hamby) wrote :

I've pushed an update to 006ba3987768d6cf06e14d0a32c93df6ba754eee to resolve the merge conflict (which was seed data that had been added since this was posted).

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=commit;h=006ba3987768d6cf06e14d0a32c93df6ba754eee

tags: added: pullrequest
Kathy Lussier (klussier) wrote :

Rogan,

I'm jumping into this discussion 2 1/2 years later (sorry, I've been busy), and I just want to clarify what you said in comment #11. Jeff was asking that retention of those data points be opt-in, and in comment #11 you said you were keeping the methodology as opt-in, but I *think* what you meant is that sites will need to opt into the new feature. In other words, it will continue to behave as it currently behaves unless a site decides to make an adjustment to this setting. Is my understanding correct? Also, despite my desire to minimize the collection of data where we can, I agree with that approach.

Also, would you be willing to rebase it again? I'm guessing more merge conflicts have developed since 2021, and I'm hoping I'll generate some enthusiasm for getting this tested during my conference privacy session.

Rogan Hamby (rogan-hamby) wrote :

I will check to see if it needs rebasing.

Rogan Hamby (rogan-hamby) wrote :

Well, I cloned a fresh copy of the repo and cherry-picked the branch and it picked cleanly, even on the seed data.
