Bug #1896125 “LDAP paging leaks memory” : Bugs : OpenStack Identity (keystone)

Lance Bragstad (lbragstad) on 2020-09-17

tags:

added: ldap

Lance Bragstad (lbragstad) on 2020-09-17

description:

updated

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-09-26: Fix proposed to keystone (master)

#1

Fix proposed to branch: master
Review: https://review.opendev.org/754488

Changed in keystone:
assignee:	nobody → Lance Bragstad (lbragstad)
status:	New → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-10-02: Fix proposed to keystone (stable/victoria)

#2

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/755734

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-10-02: Fix proposed to keystone (stable/ussuri)

#3

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/755735

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-10-02: Fix proposed to keystone (stable/train)

#4

Fix proposed to branch: stable/train
Review: https://review.opendev.org/755736

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-10-15: Fix merged to keystone (master)

#5

Download full text (7.3 KiB)

Reviewed: https://review.opendev.org/754488
Committed: https://git.openstack.org/cgit/openstack/keystone/commit/?id=e98d1ac622f74cbbb41872226ab755bf4ff3ed84
Submitter: Zuul
Branch: master

commit e98d1ac622f74cbbb41872226ab755bf4ff3ed84
Author: Lance Bragstad <email address hidden>
Date: Tue Sep 29 10:20:13 2020 -0500

Implement more robust connection handling for asynchronous LDAP calls

    Keystone's paging implementation contains a memory leak. The issue is
    noticeable if you integrate keystone with an LDAP server that supports
    paging and set `keystone.conf [ldap] page_size` to a low integer
    (e.g., 5).

    Keystone's LDAP backend uses `python-ldap` to interact with LDAP
    servers. For paged requests, it uses `search_ext()`, which is an
    asynchronous API [0]. The server responds with a message ID, which the
    client uses to retrieve all data for the request. In keystone's case,
    the `search_ext()` method is invoked with a page control that tells
    the server to deliver responses in increments according to the page
    size configured with `keystone.conf [ldap] page_size`. So long as the
    client has the original connection used to fetch the message ID, it
    can request the rest of the information associated to the request.

    Keystone's paging implementation loops continuously for paged
    requests. It takes the message ID it gets from `search_ext()` and
    calls `result3()`, asking the server for the data associated with that
    specific message. Keystone continues to do this until the server sends
    an indicator that it has no more data relevant to the query (via a
    cookie). The `search_ext()` and `result3()` methods must use the same
    LDAP connection.

    Given the above information, keystone uses context managers to provide
    connections. This is relevant when deploying connection pools, where
    certain connections are re-used from a pool. Keystone relies on Python
    context managers to handle connections, which is pretty typical
    use-case for context managers. Connection managers allow us to do the
    following (assuming pseudocode):

      with self.get_connection as conn:
          response = conn.search_s()
          return format(response)

    The above snippet assumes the `get_connection` method provides a
    connection object and a callable that implements `search_s`. Upon
    exiting the `with` statement, the connection is disconnected, or put
    back into the pool, or whatever the implementation of the context
    manager decides to do. Most connections in the LDAP backend are
    handled in this fashion.

    Unfortunately, the LDAP driver is somewhat oblivious to paging, it's
    control implementation, or the fact that it uses an asynchronous API.
    Instead, the driver leaves it up to the handler objects it uses for
    connections to determine if the request should be controlled via
    paging. This is an anti-pattern since the backend establishes the
    connection for the request but doesn't ensure that connection is
    safely handled for asynchronous APIs.

This forces the `search_ext()` and `result3()` implementations...

Reviewed:  https://review.opendev.org/754488
Committed: https://git.openstack.org/cgit/openstack/keystone/commit/?id=e98d1ac622f74cbbb41872226ab755bf4ff3ed84
Submitter: Zuul
Branch:    master

commit e98d1ac622f74cbbb41872226ab755bf4ff3ed84
Author: Lance Bragstad <lbragstad@gmail.com>
Date:   Tue Sep 29 10:20:13 2020 -0500

Implement more robust connection handling for asynchronous LDAP calls
    
    Keystone's paging implementation contains a memory leak. The issue is
    noticeable if you integrate keystone with an LDAP server that supports
    paging and set `keystone.conf [ldap] page_size` to a low integer
    (e.g., 5).
    
    Keystone's LDAP backend uses `python-ldap` to interact with LDAP
    servers. For paged requests, it uses `search_ext()`, which is an
    asynchronous API [0]. The server responds with a message ID, which the
    client uses to retrieve all data for the request. In keystone's case,
    the `search_ext()` method is invoked with a page control that tells
    the server to deliver responses in increments according to the page
    size configured with `keystone.conf [ldap] page_size`. So long as the
    client has the original connection used to fetch the message ID, it
    can request the rest of the information associated to the request.
    
    Keystone's paging implementation loops continuously for paged
    requests. It takes the message ID it gets from `search_ext()` and
    calls `result3()`, asking the server for the data associated with that
    specific message. Keystone continues to do this until the server sends
    an indicator that it has no more data relevant to the query (via a
    cookie). The `search_ext()` and `result3()` methods must use the same
    LDAP connection.
    
    Given the above information, keystone uses context managers to provide
    connections. This is relevant when deploying connection pools, where
    certain connections are re-used from a pool. Keystone relies on Python
    context managers to handle connections, which is pretty typical
    use-case for context managers. Connection managers allow us to do the
    following (assuming pseudocode):
    
      with self.get_connection as conn:
          response = conn.search_s()
          return format(response)
    
    The above snippet assumes the `get_connection` method provides a
    connection object and a callable that implements `search_s`. Upon
    exiting the `with` statement, the connection is disconnected, or put
    back into the pool, or whatever the implementation of the context
    manager decides to do. Most connections in the LDAP backend are
    handled in this fashion.
    
    Unfortunately, the LDAP driver is somewhat oblivious to paging, it's
    control implementation, or the fact that it uses an asynchronous API.
    Instead, the driver leaves it up to the handler objects it uses for
    connections to determine if the request should be controlled via
    paging. This is an anti-pattern since the backend establishes the
    connection for the request but doesn't ensure that connection is
    safely handled for asynchronous APIs.
    
    This forces the `search_ext()` and `result3()` implementations in the
    PooledLDAPHandler to know how to handle connections and context
    managers, since it needs to ensure the same connection is used for
    paged requests. The current code tried to clean up the context
    manager responsible for connections after the results are collected
    from the server using the message ID. I believe it does this because
    it needs to get a new connection for each message in the paged
    results, even though it already operates from within a connection
    established via a context manager and the PooledLDAPHandler almost
    always returns the same connection object from the pool. The code
    tries to use a weak reference to create a callback that tears down the
    context manager when nothing else references it. At a high-level, the
    idea is similar to the following pseudocode:
    
      with self.get_connection as conn:
          while True:
            ldap_data = []
            context_manager = self.get_connection()
            connection = context_manager.__enter__()
            message_id = connection.search_ext()
            results = connection.result3(message_id)
            ldap_data.append(results)
            context_manager.__exit__()
    
    I wasn't able to see the callback get invoked or work as described in
    comments, resulting in memory bloat, especially with low page sizes
    which results in more requests. A weak reference invokes the callback
    when the weak reference is called, but there are no other references
    to the original object [1]. In our case, I don't think we invoke that
    path because we don't actually do anything with the weak reference. We
    assume it's going to run the callback when the object is garbage
    collected.
    
    This commit attempts to address this issue by using the concept of a
    finalizer [2], which was designed for similar cases. It also attempts
    to hide the cleanup implementation in the AsynchronousMessage object,
    so that callers don't have to worry about making sure they invoke the
    finalizer.
    
    An alternative approach would be to push more of the paging logic and
    implementation up into the LDAP driver. This would make it easier to
    put the entire asynchronous API flow for paging into a `with`
    statement and relying on the normal behavior of context managers to
    clean up accordingly. This approach would remove the manual cleanup
    invocation, regardless of using weak references or finalizer objects.
    However, this approach would likely require a non-trivial amount of
    design work to refactor the entire LDAP backend. The LDAP backend has
    other issues that would complicate the re-design process:
    
      - Handlers and connection are generalized to mean the same thing
      - Method names don't follow a convention
      - Domain-specific language from python-ldap bleeds into keystone's
        implementation (e.g., get_all, _ldap_get_all, add_member) at
        different points in the backend (e.g., UserApi (BaseLdap), GroupApi
        (BaseLdap), KeystoneLDAPHandler, PooledLDAPHandler,
        PythonLDAPHandler)
      - Backend contains dead code from when keystone supported writeable
        LDAP backends
      - Responsibility for connections and connection handling is spread
        across objects (BaseLdap, LDAPHandler)
      - Handlers will invoke methods differently based on configuration at
        runtime, which is a sign that the relationship between the driver,
        handlers, and connection objects isn't truely polymorphic
    
    While keeping the logic for properly handling context managers and
    connections in the Handlers might not be ideal, it is a relatively
    minimal fix in comparison to a re-design or backend refactor. These
    issues can be considered during a refactor of the LDAP backend if or
    when the community decides to re-design the LDAP backend.
    
    [0] https://www.python-ldap.org/en/python-ldap-3.3.0/reference/ldap.html#ldap.LDAPObject.search_ext
    [1] https://docs.python.org/3/library/weakref.html#weakref.ref
    [2] https://docs.python.org/3/library/weakref.html#finalizer-objects
    
    Closes-Bug: 1896125
    Change-Id: Ia45a45ff852d0d4e3a713dae07a46d4ff8d370f3

Changed in keystone:
status:	In Progress → Fix Released

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-10-21: Fix merged to keystone (stable/victoria)

#6

Download full text (7.3 KiB)

Reviewed: https://review.opendev.org/755734
Committed: https://git.openstack.org/cgit/openstack/keystone/commit/?id=a26a40d441e7f211fd68e0c1b9c779d965c19dea
Submitter: Zuul
Branch: stable/victoria

commit a26a40d441e7f211fd68e0c1b9c779d965c19dea
Author: Lance Bragstad <email address hidden>
Date: Tue Sep 29 10:20:13 2020 -0500