Propose SQL ACLs and filtered queries

Bug #1638295 reported by Paul Everitt
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
KARL4
Fix Released
Low
Jim Fulton

Bug Description

Jim will do some R&D investigation on karlstaging:

- Access to the snapshot database

- Access to the application so he can get to the relstorage.

summary: - Propose SQL ACLs and filteried queries
+ Propose SQL ACLs and filtered queries
Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Jim, one thing I wasn't clear about...I pointed you at https://github.com/pauleveritt/pyramid_sqltraversal/tree/master/docs which uses the Adjacency List approach for hierarchies. There are other approaches, e.g. materialized path. I used adjacency list because it was well supported and seemed to have good performance.

Point is, you can use whatever you want. These things all have deep implications for which I'm not a good arbiter.

Revision history for this message
Jim Fulton (jim-zope) wrote :

OK, I have something working and it seems to provide reasonable performance.

I tested with queries as I believe they'd be generated by the pgtextindex. With a query for text containing "zuma", and also limiting to the african-advocacy community, the basic search takes around 10ms and yields 519 rows on the staging server.

Adding security filtering adds 10-20ms, depending on the filter.

If I don't filter for community, the raw search is ~18ms and the security-filtered search takes around 105ms. Note though that I only added ACE data for the african-advocacy community. It's possible/likely the filtered search would be faster for the larger data set if the full ACE data were present.

I added 2 (temporary for my testing) tables:

  parents (docid int, parent_docid int)")

  ace (docid int, allowed bool, who varchar, permission varchar, ord int)

I used a fairly straightforward recursive query, which I won't include here. Currently, I have a Python function that generates the query. I have some basic tests using psycopg2 and that connect to a (unspecified, so local, or controlled via environment variables) postgres database.

Note that pgtextindex doesn't use sqlalchemy, so that shouldn't be an issue. :)

Would you like me to check what I have in somewhere? Wanna get together online to discuss?

Revision history for this message
Jim Fulton (jim-zope) wrote :

BTW, I chose an approach that required minimal database updates.

In particular, ACEs are only stored on the objects for which there were defined in the app. This requires the recursive queries.

Another option would be to flatten them, to avoid the recursive query. This would require a lot of updates when ACLs are updated on nodes with lot's of descendents.

The materialized path idea sounds interesting. I didn't explore that, but would be happy to. Let me know if you want me to pursue that.

Revision history for this message
Paul Everitt (paul-agendaless) wrote : Re: [Bug 1638295] Re: Propose SQL ACLs and filtered queries
Download full text (3.4 KiB)

Some questions:

1) How would you envision getting class ACL information into the query?

2) Let’s say you do a search that says, give me the BlogEntry types in SomeCommunity with a text search of “africa”, ranked by relevance, that I have permission to see, and just the first 20. Are you confident that the PG optimizer does the right thing at the right time? Meaning, the expensive step (ACL lookup) is done late?

3) Did you manage to think of any kind of index, e.g. an expression index, that could be used?

4) When you say “the raw search is ~18ms”, does that include the relevance ranking ordering? (FWIW, that’s the part that kills us in PG, requires a table scan.)

5) For the materialized path, I think we might have you look at it, just to have in our back pocket, but after we’ve gone through some other things.

6) Does your query do anything like “cache” ACL lookups on intermediate nodes? (Or that might be my stupid procedural thinking still kicking around.)

7) Re: your idea about flattening…rather than store the actual ACL on all the children, could you store a single ACL, and point all the children at it?

- parent1 has ACL with an ACL of root-parent1
- child1, child2, grandchild3 all point at ACL root-parent1
- Later, parent1 changes its ACL information, but child1, etc. don’t have to change anything and just have 1 join to do, no recursion

Yep, I can chat any time this afternoon. I just hopped on #freenode, poke zopepaul when you’re around.

—Paul

> On Nov 3, 2016, at 1:54 PM, Jim Fulton <email address hidden> wrote:
>
> OK, I have something working and it seems to provide reasonable
> performance.
>
> I tested with queries as I believe they'd be generated by the
> pgtextindex. With a query for text containing "zuma", and also limiting
> to the african-advocacy community, the basic search takes around 10ms
> and yields 519 rows on the staging server.
>
> Adding security filtering adds 10-20ms, depending on the filter.
>
> If I don't filter for community, the raw search is ~18ms and the
> security-filtered search takes around 105ms. Note though that I only
> added ACE data for the african-advocacy community. It's possible/likely
> the filtered search would be faster for the larger data set if the full
> ACE data were present.
>
> I added 2 (temporary for my testing) tables:
>
> parents (docid int, parent_docid int)")
>
> ace (docid int, allowed bool, who varchar, permission varchar, ord
> int)
>
> I used a fairly straightforward recursive query, which I won't include
> here. Currently, I have a Python function that generates the query. I
> have some basic tests using psycopg2 and that connect to a (unspecified,
> so local, or controlled via environment variables) postgres database.
>
> Note that pgtextindex doesn't use sqlalchemy, so that shouldn't be an
> issue. :)
>
> Would you like me to check what I have in somewhere? Wanna get together
> online to discuss?
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1638295
>
> Title:
> Propose SQL ACLs and filtered queries
>
> Status in KARL4:
> New
>
> Bug description:
> Jim will do some R&D...

Read more...

Revision history for this message
Jim Fulton (jim-zope) wrote :
Download full text (4.1 KiB)

On Thu, Nov 3, 2016 at 2:41 PM, Paul Everitt <email address hidden> wrote:

> Some questions:
>
> 1) How would you envision getting class ACL information into the query?
>

I don't. Tres said you weren't using class information.

>
> 2) Let’s say you do a search that says, give me the BlogEntry types in
> SomeCommunity with a text search of “africa”, ranked by relevance, that
> I have permission to see, and just the first 20. Are you confident that
> the PG optimizer does the right thing at the right time? Meaning, the
> expensive step (ACL lookup) is done late?
>

Yes.

>
> 3) Did you manage to think of any kind of index, e.g. an expression
> index, that could be used?
>

Yes, I used an index for docid on the parent table and on docid, permission
and principal, on the ace table. I don't see a need for an expression index.

>
> 4) When you say “the raw search is ~18ms”, does that include the
> relevance ranking ordering? (FWIW, that’s the part that kills us in PG,
> requires a table scan.)
>

Yes. Here's the query I used for the "raw" search not limited to the one
community:

WITH _filtered AS (
            SELECT docid, coefficient, text_vector
            FROM pgtextindex
            WHERE text_vector @@ 'zuma'::tsquery),
        _counter AS (SELECT count(1) AS n FROM _filtered),
        _ranked AS (
            SELECT docid, coefficient * (
                CASE WHEN n <= 6000 THEN
                    ts_rank_cd(text_vector, 'zuma'::tsquery)
                ELSE 1 END) AS rank
            FROM _filtered, _counter)
        SELECT docid, rank
        FROM _ranked;

This is more or less the query used by pgtextindex.

>
> 5) For the materialized path, I think we might have you look at it, just
> to have in our back pocket, but after we’ve gone through some other
> things.
>

I can imagine it will be a win over the recursive approach. I wish I'd
thought of it first.

> 6) Does your query do anything like “cache” ACL lookups on intermediate
> nodes? (Or that might be my stupid procedural thinking still kicking
> around.)
>

No. Here's template I use for the filtered query:

with recursive
     text_results as (%(search)s),
     allowed(docid, id, parent_docid, allowed %(extra)s) as (
       select docid, id, parent_docid, allowed %(extra)s from (
         select distinct on (r.docid)
                r.docid, r.docid as id, p.parent_docid, a.allowed %(extra)s
         from text_results r
         join parents p using (docid)
         left join ace a on (
           a.docid = r.docid and
           a.permission in ('%(permission)s', '*') and
           a.who in %(principals)s
           )
         order by r.docid, a.ord
         ) base
     union all
       select docid, id, parent_docid, allowed %(extra)s from (
         select distinct on (p.docid)
                p.docid, p.id, p.parent_docid, a.allowed %(extra)s
         from (select allowed.docid, p.docid as id, p.parent_docid %(extra)s
               from allowed, parents p
               where allowed.allowed is null and
                     allowed.parent_docid = p.docid) p
              left join ace a on (
                a.docid = p.id and
                a.permission in ('%(perm...

Read more...

Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Two non-search queries to measure the impact of ACL filtering:

1) Look for all content authored by plonepaul, that I'm allowed to see (as group:KarlStaff and as group:KarlAffiliiate), sort reverse by creation date, and show me the first 10.

** This brings up an interesting point...the user making the request might have lots of groups assigned, one for each community they are in. This affects the size of the assembled query. Does that need to be measured?

2) Show me the first 20 BlogEntry resources in a community, that I'm allowed to see, reverse sorted on creation date.

Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Out of curiosity, you did each ACE in the ACL as a row in a table for acls. Why not do them directly on the object as an array column, and not have to do the join (and risk a dangling reference)?

Revision history for this message
Jim Fulton (jim-zope) wrote :

FTR, I spend 13 hours and 20 minutes on this

Revision history for this message
Jim Fulton (jim-zope) wrote :

On Thu, Nov 3, 2016 at 4:36 PM, Paul Everitt <email address hidden> wrote:

> Out of curiosity, you did each ACE in the ACL as a row in a table for
> acls.

Yes.

Why not do them directly on the object

What object? Do you mean the parents table? The pgtextindex table?

> as an array column, and not
> have to do the join

Most of the joins are done to walk the hierarchy. I might have avoided
some joins by putting the parent and acl information into the pgtextindex
table. But then I'd need to do the filtering in sql expressions. I doubt
that would be faster than the join-based approach. Perhaps it would have
been. I chose to go with a more relational model.

The joins are supported by indexes and are fast. Relational databases are
good at this. :)

Where an array-based approach might be a win would be to avoid using select
distinct on and order by to deal with the ordered evaluate of ACLs. IOW
another design might be for the ace table to look like:

  ace (docid int, permission varchar, access Access[])

where Access is a custom type having principal ids and boolean access flags.

But then I'd have to figure out how to search the array in order and stop
when I found something. Skimming the documentation doesn't reveal anything.
Assuming I figured this out, it might not be any faster than what I'm doing
now. <shrug>

> (and risk a dangling reference)?
>

I'm not sure what you mean. Most objects don't acls. In the
african-advocacy community, almost 2/3 of the objects lack acls. The
algorithm explicitly handles missing ACLs and would have to do so
regardless of whether they were represented as empty arrays or as null's
resulting from outer joins.

Jim

--
Jim Fulton
http://jimfulton.info

Revision history for this message
Jim Fulton (jim-zope) wrote :

On Fri, Nov 4, 2016 at 10:48 AM, Jim Fulton <email address hidden> wrote:
...

> But then I'd have to figure out how to search the array in order and stop
> when I found something. Skimming the documentation doesn't reveal anything.
> Assuming I figured this out,
>

This would involve a PL/pgSQL function containing a for loop, apparently.

Revision history for this message
Paul Everitt (paul-agendaless) wrote :

I think my comments about doing ACL as a column on the catalog entry, instead of rows in a table that are joined, show that I continue to think procedurally about this. I think your points have helped me see it better.

Revision history for this message
Paul Everitt (paul-agendaless) wrote :

FWIW, Kotti (a Zope-style CMS in SQL and Pyramid) also does ACLs as rows in another table, I believe.

Revision history for this message
Jim Fulton (jim-zope) wrote :

On Sat, Nov 5, 2016 at 10:06 AM, Paul Everitt <email address hidden> wrote:

> I think my comments about doing ACL as a column on the catalog entry,
> instead of rows in a table that are joined, show that I continue to
> think procedurally about this. I think your points have helped me see it
> better.
>

Cool, but <shrug> :). You ask very good questions.

Arrays are very tempting for folks like us, because we want databases to be
higher level. It's rather surprising how hard it is to leverage postgres
arrays. Makes me suspect my ignorance, but consider:
http://stackoverflow.com/questions/2486725/postgresql-join-with-array-type-with-array-elements-order-how-to-implement
which suggests to search based on array ordering, you need to effectively
disassemble them.

Jim

--
Jim Fulton
http://jimfulton.info

Revision history for this message
Paul Everitt (paul-agendaless) wrote :

> On Nov 5, 2016, at 10:26 AM, Jim Fulton <email address hidden> wrote:
>
> On Sat, Nov 5, 2016 at 10:06 AM, Paul Everitt <email address hidden>
> wrote:
>
>> I think my comments about doing ACL as a column on the catalog entry,
>> instead of rows in a table that are joined, show that I continue to
>> think procedurally about this. I think your points have helped me see it
>> better.
>>
>
> Cool, but <shrug> :). You ask very good questions.
>
> Arrays are very tempting for folks like us, because we want databases to be
> higher level. It's rather surprising how hard it is to leverage postgres
> arrays. Makes me suspect my ignorance, but consider:
> http://stackoverflow.com/questions/2486725/postgresql-join-with-array-type-with-array-elements-order-how-to-implement
> which suggests to search based on array ordering, you need to effectively
> disassemble them.

Let’s say you stored docids, instead of __name__. Thus, your path index would do /22342323/42355.

If you wanted to find anything at /a/b, then you’d need to get the docid of b. At that point, your query is a contains, as 42355 is guaranteed to be unique. I suspect an index on the array, to support asking if the array contains 42355, would be fast.

—Paul

Revision history for this message
Jim Fulton (jim-zope) wrote :

See https://github.com/jimfulton/acl-filtered-search

In addition to exploring materialized paths, I looked at using Postgres arrays ACEs, which seemed to be the best approach, mainly because ur tied for being the fastest and provided the cleanest database representation. https://github.com/jimfulton/acl-filtered-search#recursive-search-representing-acls-as-postgres-arrays

Revision history for this message
Paul Everitt (paul-agendaless) wrote :

I’m not a fan of extending tasks indefinitely. You’ve produced exactly (in fact, even better) than I hoped. I propose we close this one and do some planning for next work.

—Paul

> On Nov 7, 2016, at 8:58 AM, Jim Fulton <email address hidden> wrote:
>
> See https://github.com/jimfulton/acl-filtered-search
>
> In addition to exploring materialized paths, I looked at using Postgres
> arrays ACEs, which seemed to be the best approach, mainly because ur
> tied for being the fastest and provided the cleanest database
> representation. https://github.com/jimfulton/acl-filtered-search
> #recursive-search-representing-acls-as-postgres-arrays
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1638295
>
> Title:
> Propose SQL ACLs and filtered queries
>
> Status in KARL4:
> New
>
> Bug description:
> Jim will do some R&D investigation on karlstaging:
>
> - Access to the snapshot database
>
> - Access to the application so he can get to the relstorage.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/karl4/+bug/1638295/+subscriptions

Revision history for this message
Jim Fulton (jim-zope) wrote :

On Mon, Nov 7, 2016 at 10:02 AM, Paul Everitt <email address hidden> wrote:

> I’m not a fan of extending tasks indefinitely. You’ve produced exactly (in
> fact, even better) than I hoped. I propose we close this one and do some
> planning for next work.
>

Sounds good. Let me know about planning.

--
Jim Fulton
http://jimfulton.info

Changed in karl4:
status: New → Fix Committed
Revision history for this message
Paul Everitt (paul-agendaless) wrote :

I believe this research project completed.

Changed in karl4:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.