Propose SQL ACLs and filtered queries
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
KARL4 |
Fix Released
|
Low
|
Jim Fulton |
Bug Description
Jim will do some R&D investigation on karlstaging:
- Access to the snapshot database
- Access to the application so he can get to the relstorage.
summary: |
- Propose SQL ACLs and filteried queries + Propose SQL ACLs and filtered queries |
Paul Everitt (paul-agendaless) wrote : | #1 |
Jim Fulton (jim-zope) wrote : | #2 |
OK, I have something working and it seems to provide reasonable performance.
I tested with queries as I believe they'd be generated by the pgtextindex. With a query for text containing "zuma", and also limiting to the african-advocacy community, the basic search takes around 10ms and yields 519 rows on the staging server.
Adding security filtering adds 10-20ms, depending on the filter.
If I don't filter for community, the raw search is ~18ms and the security-filtered search takes around 105ms. Note though that I only added ACE data for the african-advocacy community. It's possible/likely the filtered search would be faster for the larger data set if the full ACE data were present.
I added 2 (temporary for my testing) tables:
parents (docid int, parent_docid int)")
ace (docid int, allowed bool, who varchar, permission varchar, ord int)
I used a fairly straightforward recursive query, which I won't include here. Currently, I have a Python function that generates the query. I have some basic tests using psycopg2 and that connect to a (unspecified, so local, or controlled via environment variables) postgres database.
Note that pgtextindex doesn't use sqlalchemy, so that shouldn't be an issue. :)
Would you like me to check what I have in somewhere? Wanna get together online to discuss?
Jim Fulton (jim-zope) wrote : | #3 |
BTW, I chose an approach that required minimal database updates.
In particular, ACEs are only stored on the objects for which there were defined in the app. This requires the recursive queries.
Another option would be to flatten them, to avoid the recursive query. This would require a lot of updates when ACLs are updated on nodes with lot's of descendents.
The materialized path idea sounds interesting. I didn't explore that, but would be happy to. Let me know if you want me to pursue that.
Paul Everitt (paul-agendaless) wrote : Re: [Bug 1638295] Re: Propose SQL ACLs and filtered queries | #4 |
Some questions:
1) How would you envision getting class ACL information into the query?
2) Let’s say you do a search that says, give me the BlogEntry types in SomeCommunity with a text search of “africa”, ranked by relevance, that I have permission to see, and just the first 20. Are you confident that the PG optimizer does the right thing at the right time? Meaning, the expensive step (ACL lookup) is done late?
3) Did you manage to think of any kind of index, e.g. an expression index, that could be used?
4) When you say “the raw search is ~18ms”, does that include the relevance ranking ordering? (FWIW, that’s the part that kills us in PG, requires a table scan.)
5) For the materialized path, I think we might have you look at it, just to have in our back pocket, but after we’ve gone through some other things.
6) Does your query do anything like “cache” ACL lookups on intermediate nodes? (Or that might be my stupid procedural thinking still kicking around.)
7) Re: your idea about flattening…rather than store the actual ACL on all the children, could you store a single ACL, and point all the children at it?
- parent1 has ACL with an ACL of root-parent1
- child1, child2, grandchild3 all point at ACL root-parent1
- Later, parent1 changes its ACL information, but child1, etc. don’t have to change anything and just have 1 join to do, no recursion
Yep, I can chat any time this afternoon. I just hopped on #freenode, poke zopepaul when you’re around.
—Paul
> On Nov 3, 2016, at 1:54 PM, Jim Fulton <email address hidden> wrote:
>
> OK, I have something working and it seems to provide reasonable
> performance.
>
> I tested with queries as I believe they'd be generated by the
> pgtextindex. With a query for text containing "zuma", and also limiting
> to the african-advocacy community, the basic search takes around 10ms
> and yields 519 rows on the staging server.
>
> Adding security filtering adds 10-20ms, depending on the filter.
>
> If I don't filter for community, the raw search is ~18ms and the
> security-filtered search takes around 105ms. Note though that I only
> added ACE data for the african-advocacy community. It's possible/likely
> the filtered search would be faster for the larger data set if the full
> ACE data were present.
>
> I added 2 (temporary for my testing) tables:
>
> parents (docid int, parent_docid int)")
>
> ace (docid int, allowed bool, who varchar, permission varchar, ord
> int)
>
> I used a fairly straightforward recursive query, which I won't include
> here. Currently, I have a Python function that generates the query. I
> have some basic tests using psycopg2 and that connect to a (unspecified,
> so local, or controlled via environment variables) postgres database.
>
> Note that pgtextindex doesn't use sqlalchemy, so that shouldn't be an
> issue. :)
>
> Would you like me to check what I have in somewhere? Wanna get together
> online to discuss?
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https:/
>
> Title:
> Propose SQL ACLs and filtered queries
>
> Status in KARL4:
> New
>
> Bug description:
> Jim will do some R&D...
Jim Fulton (jim-zope) wrote : | #5 |
On Thu, Nov 3, 2016 at 2:41 PM, Paul Everitt <email address hidden> wrote:
> Some questions:
>
> 1) How would you envision getting class ACL information into the query?
>
I don't. Tres said you weren't using class information.
>
> 2) Let’s say you do a search that says, give me the BlogEntry types in
> SomeCommunity with a text search of “africa”, ranked by relevance, that
> I have permission to see, and just the first 20. Are you confident that
> the PG optimizer does the right thing at the right time? Meaning, the
> expensive step (ACL lookup) is done late?
>
Yes.
>
> 3) Did you manage to think of any kind of index, e.g. an expression
> index, that could be used?
>
Yes, I used an index for docid on the parent table and on docid, permission
and principal, on the ace table. I don't see a need for an expression index.
>
> 4) When you say “the raw search is ~18ms”, does that include the
> relevance ranking ordering? (FWIW, that’s the part that kills us in PG,
> requires a table scan.)
>
Yes. Here's the query I used for the "raw" search not limited to the one
community:
WITH _filtered AS (
SELECT docid, coefficient, text_vector
FROM pgtextindex
WHERE text_vector @@ 'zuma'::tsquery),
_counter AS (SELECT count(1) AS n FROM _filtered),
_ranked AS (
SELECT docid, coefficient * (
FROM _filtered, _counter)
SELECT docid, rank
FROM _ranked;
This is more or less the query used by pgtextindex.
>
> 5) For the materialized path, I think we might have you look at it, just
> to have in our back pocket, but after we’ve gone through some other
> things.
>
I can imagine it will be a win over the recursive approach. I wish I'd
thought of it first.
> 6) Does your query do anything like “cache” ACL lookups on intermediate
> nodes? (Or that might be my stupid procedural thinking still kicking
> around.)
>
No. Here's template I use for the filtered query:
with recursive
text_results as (%(search)s),
allowed(docid, id, parent_docid, allowed %(extra)s) as (
select docid, id, parent_docid, allowed %(extra)s from (
select distinct on (r.docid)
from text_results r
join parents p using (docid)
left join ace a on (
a.docid = r.docid and
a.who in %(principals)s
)
order by r.docid, a.ord
) base
union all
select docid, id, parent_docid, allowed %(extra)s from (
select distinct on (p.docid)
from (select allowed.docid, p.docid as id, p.parent_docid %(extra)s
from allowed, parents p
left join ace a on (
Paul Everitt (paul-agendaless) wrote : | #6 |
Two non-search queries to measure the impact of ACL filtering:
1) Look for all content authored by plonepaul, that I'm allowed to see (as group:KarlStaff and as group:KarlAffil
** This brings up an interesting point...the user making the request might have lots of groups assigned, one for each community they are in. This affects the size of the assembled query. Does that need to be measured?
2) Show me the first 20 BlogEntry resources in a community, that I'm allowed to see, reverse sorted on creation date.
Paul Everitt (paul-agendaless) wrote : | #7 |
Out of curiosity, you did each ACE in the ACL as a row in a table for acls. Why not do them directly on the object as an array column, and not have to do the join (and risk a dangling reference)?
Jim Fulton (jim-zope) wrote : | #8 |
FTR, I spend 13 hours and 20 minutes on this
Jim Fulton (jim-zope) wrote : | #9 |
On Thu, Nov 3, 2016 at 4:36 PM, Paul Everitt <email address hidden> wrote:
> Out of curiosity, you did each ACE in the ACL as a row in a table for
> acls.
Yes.
Why not do them directly on the object
What object? Do you mean the parents table? The pgtextindex table?
> as an array column, and not
> have to do the join
Most of the joins are done to walk the hierarchy. I might have avoided
some joins by putting the parent and acl information into the pgtextindex
table. But then I'd need to do the filtering in sql expressions. I doubt
that would be faster than the join-based approach. Perhaps it would have
been. I chose to go with a more relational model.
The joins are supported by indexes and are fast. Relational databases are
good at this. :)
Where an array-based approach might be a win would be to avoid using select
distinct on and order by to deal with the ordered evaluate of ACLs. IOW
another design might be for the ace table to look like:
ace (docid int, permission varchar, access Access[])
where Access is a custom type having principal ids and boolean access flags.
But then I'd have to figure out how to search the array in order and stop
when I found something. Skimming the documentation doesn't reveal anything.
Assuming I figured this out, it might not be any faster than what I'm doing
now. <shrug>
> (and risk a dangling reference)?
>
I'm not sure what you mean. Most objects don't acls. In the
african-advocacy community, almost 2/3 of the objects lack acls. The
algorithm explicitly handles missing ACLs and would have to do so
regardless of whether they were represented as empty arrays or as null's
resulting from outer joins.
Jim
--
Jim Fulton
http://
Jim Fulton (jim-zope) wrote : | #10 |
On Fri, Nov 4, 2016 at 10:48 AM, Jim Fulton <email address hidden> wrote:
...
> But then I'd have to figure out how to search the array in order and stop
> when I found something. Skimming the documentation doesn't reveal anything.
> Assuming I figured this out,
>
This would involve a PL/pgSQL function containing a for loop, apparently.
Paul Everitt (paul-agendaless) wrote : | #11 |
I think my comments about doing ACL as a column on the catalog entry, instead of rows in a table that are joined, show that I continue to think procedurally about this. I think your points have helped me see it better.
Paul Everitt (paul-agendaless) wrote : | #12 |
FWIW, Kotti (a Zope-style CMS in SQL and Pyramid) also does ACLs as rows in another table, I believe.
Jim Fulton (jim-zope) wrote : | #13 |
On Sat, Nov 5, 2016 at 10:06 AM, Paul Everitt <email address hidden> wrote:
> I think my comments about doing ACL as a column on the catalog entry,
> instead of rows in a table that are joined, show that I continue to
> think procedurally about this. I think your points have helped me see it
> better.
>
Cool, but <shrug> :). You ask very good questions.
Arrays are very tempting for folks like us, because we want databases to be
higher level. It's rather surprising how hard it is to leverage postgres
arrays. Makes me suspect my ignorance, but consider:
http://
which suggests to search based on array ordering, you need to effectively
disassemble them.
Jim
--
Jim Fulton
http://
Paul Everitt (paul-agendaless) wrote : | #14 |
> On Nov 5, 2016, at 10:26 AM, Jim Fulton <email address hidden> wrote:
>
> On Sat, Nov 5, 2016 at 10:06 AM, Paul Everitt <email address hidden>
> wrote:
>
>> I think my comments about doing ACL as a column on the catalog entry,
>> instead of rows in a table that are joined, show that I continue to
>> think procedurally about this. I think your points have helped me see it
>> better.
>>
>
> Cool, but <shrug> :). You ask very good questions.
>
> Arrays are very tempting for folks like us, because we want databases to be
> higher level. It's rather surprising how hard it is to leverage postgres
> arrays. Makes me suspect my ignorance, but consider:
> http://
> which suggests to search based on array ordering, you need to effectively
> disassemble them.
Let’s say you stored docids, instead of __name__. Thus, your path index would do /22342323/42355.
If you wanted to find anything at /a/b, then you’d need to get the docid of b. At that point, your query is a contains, as 42355 is guaranteed to be unique. I suspect an index on the array, to support asking if the array contains 42355, would be fast.
—Paul
Jim Fulton (jim-zope) wrote : | #15 |
See https:/
In addition to exploring materialized paths, I looked at using Postgres arrays ACEs, which seemed to be the best approach, mainly because ur tied for being the fastest and provided the cleanest database representation. https:/
Paul Everitt (paul-agendaless) wrote : | #16 |
I’m not a fan of extending tasks indefinitely. You’ve produced exactly (in fact, even better) than I hoped. I propose we close this one and do some planning for next work.
—Paul
> On Nov 7, 2016, at 8:58 AM, Jim Fulton <email address hidden> wrote:
>
> See https:/
>
> In addition to exploring materialized paths, I looked at using Postgres
> arrays ACEs, which seemed to be the best approach, mainly because ur
> tied for being the fastest and provided the cleanest database
> representation. https:/
> #recursive-
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https:/
>
> Title:
> Propose SQL ACLs and filtered queries
>
> Status in KARL4:
> New
>
> Bug description:
> Jim will do some R&D investigation on karlstaging:
>
> - Access to the snapshot database
>
> - Access to the application so he can get to the relstorage.
>
> To manage notifications about this bug go to:
> https:/
Jim Fulton (jim-zope) wrote : | #17 |
On Mon, Nov 7, 2016 at 10:02 AM, Paul Everitt <email address hidden> wrote:
> I’m not a fan of extending tasks indefinitely. You’ve produced exactly (in
> fact, even better) than I hoped. I propose we close this one and do some
> planning for next work.
>
Sounds good. Let me know about planning.
--
Jim Fulton
http://
Changed in karl4: | |
status: | New → Fix Committed |
Paul Everitt (paul-agendaless) wrote : | #18 |
I believe this research project completed.
Changed in karl4: | |
status: | Fix Committed → Fix Released |
Jim, one thing I wasn't clear about...I pointed you at https:/ /github. com/pauleveritt /pyramid_ sqltraversal/ tree/master/ docs which uses the Adjacency List approach for hierarchies. There are other approaches, e.g. materialized path. I used adjacency list because it was well supported and seemed to have good performance.
Point is, you can use whatever you want. These things all have deep implications for which I'm not a good arbiter.