pcrud search can fail to retrieve rows that the user has access to

Bug #1847805 reported by Galen Charlton
42
This bug affects 9 people
Affects Status Importance Assigned to Milestone
Evergreen
Confirmed
High
Unassigned
3.7
Confirmed
High
Unassigned
3.8
Confirmed
High
Unassigned

Bug Description

PCRUD search works by constructing an executing a query, then scanning the results and skipping any rows that the user does not have permission to retrieve.

In cases where the user is likely to have permission to access all of the rows retrieved by the query, this isn't a problem as far as accuracy of the results are concerned.

However, it's often the case that the user may not have permission to retrieve a significant number of rows in the results. This can lead to incomplete results in AngularJS and Angular grids, which directly or (via Flattener) indirectly run broad PCRUD queries with offsets and limits. It is particularly problematic for grids (e.g., the EDI messages viewer) that don't have an OU filter that's automatically set to the user's workstation.

For example, imagine a consortium where one system generates 95% of the EDI message traffic and another generates only 5%. A user from the second system whose provider retrieval permissions are restricted to only their own system might open the EDI message page... and not see any results because all of the rows in the initial query window belong to the first system.

It would be better if PCRUD search could construct the base SQL query so it includes the joins and filter conditions necessary to restrict the results to rows that the user has permission to access. While this probably wouldn't obviate the need for post-query permissions checks for each row in the results, it would make it much less likely that the user wouldn't see results that they are entitled to.

Evergreen 3.2+

Galen Charlton (gmc)
Changed in evergreen:
importance: Undecided → High
tags: added: pcrud
Revision history for this message
Galen Charlton (gmc) wrote :

Some brainstorming Mike Rylander and I had about this:

<miker> gmcharlt: off my call now, looking at your bug. yeah, empty first page... gotcha. another option is to 1) move to cursors (so we don't retrieve ALL THE COPIES into pcrud RAM up front) 2) fetch a row or 10 at a time 3) implement limit/offset in pcrud, post-perm-check
<miker> some pros/cons for that:
<miker> pros: generally PG picks a "fast-start" plan for cursors (conditions apply), first page of results never empty, first page generally faster, lower RAM use (no chance of blowing out the backend osrf object slab allocator)
<miker> cons: subsequent pages /might/ take longer to return, especially when perm+location caching is defeated by a very broad object ownership spread
<gmcharlt> as well as the degenerate case that only a handful of rows meet the permissions conditions and you've effectively done a full scan of the table, though at least not without pooching pcrud's memory usage unnecessarily
<gmcharlt> so the cursor-based approach might be a win regardless
<miker> biggest problem I see with embedding perm checking in the query is plan stability/optimization ... we're introducing a significant variable to PG planning that we haven't really done before
<gmcharlt> yeah, at minimum I figure we'd turn up cases where we would want additonal indexes on context OU columns
<miker> aye. and user-ownership is tricky to OR into that...
<miker> see: subscription buckets :)
<gmcharlt> and as yet another edge case, object-owership
<gmcharlt> not, fortunately, that we have a ton of it in active use, but we do have some
<gmcharlt> of course, for specific cases, a workaround is to embed the additional OU-ownership conditions in the client's source pcrud query
<gmcharlt> but I'd hate for that to be done more than absolutely necessary
<miker> yeah, not a fan of that... we can make the tools better

Revision history for this message
Mike Rylander (mrylander) wrote :

I have a branch implementing what Galen and I chatted about in IRC, available at:

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/miker/lp-1847805-pcrud_cursor_implementation

Time for more eyeballs, please!

tags: added: pullrequest
Revision history for this message
Mike Rylander (mrylander) wrote :

I've updated the above-linked branch to avoid using cursors when we know we're only going to retrieve one row. This should reduce any measurable, new overhead.

From the first commit message, in case it will produce some interest:

Because we check authz for rows retrieved by open-ils.pcrud calls after the underlying query is executed, including any limit/offset-based paging, we can fail to return results when the first "page" is full of rows that the calling user cannot see.

With this commit, we switch to SQL cursors and page through results in the application logic, which guarantees that the situation above will not impact legitimate row visibility.

More generally, this will should improve memory use by pcrud (along with cstore and other similar services) by avoiding initial full retrieval of results. Only those that are needed for the final result will consume memory for the full lifetime of the request.

Additionally, this may prove to increase speed of some queries, especially because Postgres has the option to optimize for "fast-start" plans.

Changed in evergreen:
milestone: none → 3.next
status: New → Confirmed
Changed in evergreen:
milestone: 3.next → 3.5-alpha
Revision history for this message
Galen Charlton (gmc) wrote :

I've done some initial testing and have some comments:

- using the following query on a Concerto database with a BR3 circulator, I got the following results:

  request open-ils.pcrud open-ils.pcrud.search.au "$SES", {"id":{"!=":null}}, {}

  * the number of records returned were the same both pre- and post- patch
  * pre-patch, it took ~2 seconds on my test system; post-patch, ~3.5 seconds
  * bumping the default cursor window size from 5 to 50 brought it back down to ~2.2 seconds
  * it made no difference to the timing whether or not the query to fetch each row for permissions checks used an interior cursor or not; the additional time it took seemed to be caused strictly by the outer cursor

- I have a major concern about opening the cursor "WITH HOLD". Per https://www.postgresql.org/docs/9.6/sql-declare.html, "In the current implementation, the rows represented by a held cursor are copied into a temporary file or memory area so that they remain available for subsequent transactions." I've verified that creating such cursors does indeed create files in pgsql_tmp for large files; a glitch (or attack) that fired off a number of cstore or pcrud searches of large tables could cause disk usage and/or I/O problems on a database server.

Creating a cursor WITHOUT HOLD inside of a transaction should work without creating temporary files as often as WITH HOLD could, but if that were done, cursors either become something that gets used only within atomic cstore and pcrud calls or making all searches effectively atomic, which would have its own tradeoffs.

- as a minor point, I think it's leaking the cursor_name string

Revision history for this message
Mike Rylander (mrylander) wrote :

Thanks for testing, Galen.

I've pushed an update that addresses your WITH HOLD concern by making sure that we're either inside an explicit transaction initiated by the caller, or creating a transaction to wrap the DECLARE/FETCH/CLOSE portion of the logic. With a transaction in place, we now use WITHOUT HOLD.

I also confirmed that cursor_name was leaking, and fixed that. The overall simplest thing to do is make it a global, since we don't currently interleave fetching and fleshing, though cursors do open that option, and could mean a faster TTFB with immediate streaming responses. That's a patch for another day, though.

If testing the two commits works well, I'd recommend we squash them into one before merging to master.

Thanks!

Michele Morgan (mmorgan)
Changed in evergreen:
milestone: 3.5-beta → 3.5.0
Changed in evergreen:
milestone: 3.5.0 → none
Changed in evergreen:
milestone: none → 3.5.1
no longer affects: evergreen/3.5
Changed in evergreen:
milestone: 3.5.1 → 3.5.2
Revision history for this message
Mike Rylander (mrylander) wrote :

Just a note about the forced atomic-ness Galen mentioned above (a transaction is required for WITHOUT HOLD cursors), we should actually be fine there.

If we're already inside a transaction, we're already OK, and if we're not, we'll start one before the main query and roll it back after. That will be a read-only transaction, which is allowed on readonly replicas (if anyone's using those for load balancing) and won't consume transaction IDs (XIDs) in modern PG (using XIDs here could lead to premature forced full vacuums, but that only happens for read-write transactions).

The only overhead that the "internal" transactions will create is the round-trip to issue the BEGIN.

Changed in evergreen:
milestone: 3.5.2 → 3.6.1
Changed in evergreen:
milestone: 3.6.1 → 3.6.2
Revision history for this message
Mike Rylander (mrylander) wrote :

I've added a commit to the above branch that protects the cursor from recursive calls used in open-ils.pcrud to check permissions on retrieved objects.

Changed in evergreen:
milestone: 3.6.2 → 3.6.3
Changed in evergreen:
milestone: 3.6.3 → 3.6.4
Changed in evergreen:
milestone: 3.6.4 → 3.7.2
no longer affects: evergreen/3.2
no longer affects: evergreen/3.3
no longer affects: evergreen/3.4
no longer affects: evergreen/3.5
Changed in evergreen:
milestone: 3.7.2 → 3.7.3
Changed in evergreen:
milestone: 3.7.3 → 3.9-beta
Changed in evergreen:
assignee: nobody → Jason Stephenson (jstephenson)
Changed in evergreen:
assignee: Jason Stephenson (jstephenson) → nobody
tags: added: needswork
removed: pullrequest
Revision history for this message
Jason Stephenson (jstephenson) wrote :

I was able to verify the bug with produciton data using the following in srfsh:

request open-ils.pcrud open-ils.pcrud.search.au "authtoken", {"id": {"!=": null}}, {"limit": 10}

I get 0 results back. If I change the search to use the home_ou of patrons equal to the work/permission ou of the logged in user, then I get 10 results.

After I apply the branch, neither prcrud search works for me. I get a database error:

Apr 6 09:12:46 egdev open-ils.pcrud: [ERR :16277:oils_sql.c:6134:16492507311661
12] open-ils.pcrud: Error retrieving actor::user with query [DECLARE "kaSXSIwPNO
2lNblD" NO SCROLL CURSOR WITHOUT HOLD FOR SELECT "au".active, "au".alert_messag
e, "au".barred, "au".billing_address, "au".card, "au".claims_returned_count, "au
".claims_never_checked_out_count, "au".create_date, "au".credit_forward_balance,
 "au".day_phone, "au".dob, "au".email, "au".evening_phone, "au".expire_date, "au
".family_name, "au".first_given_name, "au".home_ou, "au".id, "au".ident_type, "a
u".ident_type2, "au".ident_value, "au".ident_value2, "au".last_xact_id, "au".mai
ling_address, "au".master_account, "au".net_access_level, "au".other_phone, "au"
.photo_url, "au".prefix, "au".profile, "au".second_given_name, "au".standing, "a
u".suffix, "au".super_user, "au".usrgroup, "au".usrname, "au".alias, "au".juveni
le, "au".last_update_time, "au".pref_prefix, "au".pref_first_given_name, "au".pr
ef_second_given_name, "au".pref_family_name, "au".pref_suffix, "au".guardian, "a
u".name_keywords, "au".name_kw_tsvector, "au".deleted FROM actor.usr AS "au" WHERE "au".id = '424484';]: 6844179 6844179: ERROR: cursor "kaSXSIwPNO2lNblD" already exists

and

Apr 6 09:13:18 egdev open-ils.pcrud: [ERR :16277:oils_sql.c:6257:1649250731166113] open-ils.pcrud: Error retrieving actor::user with query [DECLARE "rXfnIqKJukEGhAaM" NO SCROLL CURSOR WITHOUT HOLD FOR SELECT "au".active, "au".alert_message, "au".barred, "au".billing_address, "au".card, "au".claims_returned_count, "au".claims_never_checked_out_count, "au".create_date, "au".credit_forward_balance, "au".day_phone, "au".dob, "au".email, "au".evening_phone, "au".expire_date, "au".family_name, "au".first_given_name, "au".home_ou, "au".id, "au".ident_type, "au".ident_type2, "au".ident_value, "au".ident_value2, "au".last_xact_id, "au".mailing_address, "au".master_account, "au".net_access_level, "au".other_phone, "au".photo_url, "au".prefix, "au".profile, "au".second_given_name, "au".standing, "au".suffix, "au".super_user, "au".usrgroup, "au".usrname, "au".alias, "au".juvenile, "au".last_update_time, "au".pref_prefix, "au".pref_first_given_name, "au".pref_second_given_name, "au".pref_family_name, "au".pref_suffix, "au".guardian, "au".name_keywords, "au".name_kw_tsvector, "au".deleted FROM actor.usr AS "au" WHERE "au".id IS NOT NULL;]: 5225472 5225472: ERROR: cursor "(null)" does not exist

The patch was applied to both Evergreen 3.7.2 and 3.5.3 with production data on a PostgreSQL 10 database. The above errors are from the Evergreen 3.7.2 installation, but 3.5.3 gave similar errors.

Changed in evergreen:
milestone: 3.9-beta → none
no longer affects: evergreen/3.6
Revision history for this message
Galen Charlton (gmc) wrote :

Mike and I had a discussion about an alternative implementation and landed on teaching open-ils.pcrud how to automatically filter by permissions along these lines:

- Add an option that's peer to flesh (et al) in PCRUD search call to specify that the query should get filter conditions automatically added to whatever the client has supplied.
- This would include a dedicated function that can generate an appropriate where clause addition ( {$context_field : { 'in' : $perms_at_array }} or { -exists : { $linked_context_subquery }} or, for no-perms required, just TRUE ) to be unconditionally ANDed by inserting {'-and':[$generated,$user_supplied}] above the user-supplied filter hash.
- All this should require is the core class hint and the user context.
- The list of OUs could be generated by a version of permission.usr_has_perm_at_all() that knows how to check multiple permissions at once. Depending on how this performs, it could either be invoked right before running the main query or directly included in the where conditions.
- The most complication will come from the few linked perm contexts that use "jump". hold notification, acq admin, payments/billings, surveys, link checking, and serials are the users of that construct. acq LID in particular specifies 2 "jump"ed context fields and 4 possible perms, so a general solution would have to support multiple OR'd generated filters.
- We should also support the owning_user perm construct, but that seems a natural extension to the above.
- Even the user-object perm map should be easy to support, and since the table behind that in normally empty or tiny, it shouldn't have a performance impact in practice.
- It would be good, rather than doing all the work on every call, to invent pcrud-mode startup logic that generates an appropriate set of check templates for each class. The problem with that is knowing how to stick the user id and org list into all the right places of the cloned perm-checking templates at call time. Likely we can record the jsonPath for those when needed and use jsonObjectFindPath() or findMultiPath[Recurse]() from libosrf_json (osrf_json_tools.c) at call time.

Another thing to look at during the implementation would be to see if adding this permission-based filtering logic can be cheaply done when retrieving rows for fleshing.

Revision history for this message
Galen Charlton (gmc) wrote :

Noting that doing permissions-based prefiltering in open-ils.pcrud could obviate a need for a patch for bug 2007322, but I can imagine client-side implementations coming sooner, as adding this to PCRUD will require at least medium-sized surgery.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.