Caching with stale data when a server disconnects due to network partition and reconnects

Bug #1819957 reported by Morgan Fainberg
This bug affects 2 people

Affects                         Status        Importance  Assigned to      Milestone
OpenStack Identity (keystone)   Invalid       High        Morgan Fainberg
OpenStack Security Advisory     Won't Fix     Undecided   Unassigned
keystonemiddleware              Triaged       High        Morgan Fainberg
oslo.cache                      Fix Released  High        Morgan Fainberg

Bug Description

The flush_on_reconnect optional flag is not used. This can cause stale data to be read from a cache server that disconnected due to a network partition and later reconnected. The security concerns are as follows:

1. Password changes/user changes may be reverted for the cache TTL.
   1a. A user may get locked out if PCI-DSS is enabled and the password change happens during
       the network partition.
2. Grant changes may be reverted for the cache TTL.
3. Resources (all types) may become "undeleted" for the cache TTL.
4. Tokens (KSM) may become valid again during the cache TTL.

As noted in the python-memcached library:

    @param flush_on_reconnect: optional flag which prevents a
            scenario that can cause stale data to be read: If there's more
            than one memcached server and the connection to one is
            interrupted, keys that mapped to that server will get
            reassigned to another. If the first server comes back, those
            keys will map to it again. If it still has its data, get()s
            can read stale data that was overwritten on another
            server. This flag is off by default for backwards
            compatibility.
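The failure mode in that docstring can be illustrated with a small, self-contained simulation (no real memcached involved; the FakeServer class and the modulo key mapping below are illustrative assumptions, not python-memcached's actual hashing):

```python
# Simulates the stale-read scenario: a key's owning server drops out
# of the pool, the key is rewritten on another server, and the old
# value resurfaces when the original server reconnects.

class FakeServer:
    def __init__(self, name):
        self.name = name
        self.store = {}
        self.alive = True

def h(key):
    # Deterministic toy hash (real clients use a consistent-hash ring).
    return sum(key.encode())

def pick(servers, key):
    # Keys map onto whichever servers are currently alive, so a server
    # dropping out reassigns its keys to the survivors.
    alive = [s for s in servers if s.alive]
    return alive[h(key) % len(alive)]

servers = [FakeServer("A"), FakeServer("B")]
KEY = "user:42:password-hash"

pick(servers, KEY).store[KEY] = "old-secret"
owner = pick(servers, KEY)        # server that currently holds the key
owner.alive = False               # network partition: owner drops out
pick(servers, KEY).store[KEY] = "new-secret"  # write lands on the other server
owner.alive = True                # partition heals, owner reconnects
print(pick(servers, KEY).store[KEY])          # stale "old-secret" is read again
```

Because the surviving server never sees the delete/overwrite for keys that mapped back to the reconnected server, the pre-partition value wins for up to the cache TTL.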

The solution is to explicitly pass flush_on_reconnect as an optional argument. A concern with this model is that the memcached servers may be utilized by other tooling and may lose cache state (in the case the oslo.cache connection is the only thing affected by the network partitioning).

This similarly needs to be addressed in pymemcache when it is utilized in lieu of python-memcached.
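As a sketch of what the flag changes (the ReconnectingClient class below is hypothetical; the real fix simply passes flush_on_reconnect=True through to python-memcached's Client):

```python
# Hypothetical in-memory client showing the semantics of
# flush_on_reconnect: when the flag is set, a server rejoining the
# pool has its (possibly stale) contents flushed rather than served
# again. Names here are illustrative, not python-memcached internals.

class ReconnectingClient:
    def __init__(self, flush_on_reconnect=False):
        self.flush_on_reconnect = flush_on_reconnect
        # Data left over from before the network partition.
        self.store = {"grant:user42": "stale-role"}

    def reconnect(self):
        # Trade a cold cache for consistency, as the oslo.cache fix does.
        if self.flush_on_reconnect:
            self.store.clear()

client = ReconnectingClient(flush_on_reconnect=True)
client.reconnect()
print(client.store.get("grant:user42"))  # None: the stale grant is gone
```

This is the trade-off discussed below: an empty but consistent cache after reconnect, at the cost of evicting entries that were still valid.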

tags: added: caching security
Changed in keystone:
importance: Undecided → High
Changed in keystonemiddleware:
importance: Undecided → High
Changed in oslo.cache:
importance: Undecided → High
Changed in keystone:
assignee: nobody → Morgan Fainberg (mdrnstm)
Changed in keystonemiddleware:
assignee: nobody → Morgan Fainberg (mdrnstm)
Changed in oslo.cache:
assignee: nobody → Morgan Fainberg (mdrnstm)
Ben Nemec (bnemec) wrote :

"A concern with this model is that the memcached servers may be utilized by other tooling and may lose cache state (in the case the oslo.cache connection is the only thing affected by the network partitioning)."

That may be, but I'd rather have an empty but consistent cache than a full but incorrect one. Hopefully network partitions aren't a particularly common occurrence anyway.

So I guess +1 on setting this option.

Colleen Murphy (krinkle)
Changed in keystone:
milestone: none → stein-rc1
Colleen Murphy (krinkle)
Changed in keystone:
status: New → Triaged
Changed in keystonemiddleware:
status: New → Triaged
Jeremy Stanley (fungi) wrote :

Unless there's a way for a malicious actor to trigger and take advantage of this condition, this is probably a class D (security hardening opportunity) report: https://security.openstack.org/vmt-process.html#incident-report-taxonomy

Changed in ossa:
status: New → Won't Fix
information type: Public Security → Public
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.cache (master)

Fix proposed to branch: master
Review: https://review.openstack.org/644774

Changed in oslo.cache:
status: New → In Progress
Colleen Murphy (krinkle)
Changed in keystone:
milestone: stein-rc1 → stein-rc2
Morgan Fainberg (mdrnstm) wrote :

Keystone is fixed by the oslo.cache fix; marked as Invalid for keystone.

Changed in keystone:
status: Triaged → Invalid
Colleen Murphy (krinkle)
Changed in keystone:
milestone: stein-rc2 → none
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.cache (master)

Reviewed: https://review.openstack.org/644774
Committed: https://git.openstack.org/cgit/openstack/oslo.cache/commit/?id=1192f185a5fd2fa6177655f157146488a3de81d1
Submitter: Zuul
Branch: master

commit 1192f185a5fd2fa6177655f157146488a3de81d1
Author: Morgan Fainberg <email address hidden>
Date: Fri Mar 22 12:35:16 2019 -0700

    Pass `flush_on_reconnect` to memcache pooled backend

    If a memcache server disappears and then reconnects when multiple memcache
    servers are used (specific to the python-memcached based backends) it is
    possible that the server will contain stale data. The default is now to
    supply the ``flush_on_reconnect`` optional argument to the backend. This
    means that when the service connects to a memcache server, it will flush
    all cached data in the server. The pooled backend is more likely to
    run into issues with this as it does not explicitly use a thread.local
    for the client. The non-pooled backend was not touched, it is not
    the recommended production use-case.

    See the help from python-memcached:

        @param flush_on_reconnect: optional flag which prevents a
            scenario that can cause stale data to be read: If there's more
            than one memcached server and the connection to one is
            interrupted, keys that mapped to that server will get
            reassigned to another. If the first server comes back, those
            keys will map to it again. If it still has its data, get()s
            can read stale data that was overwritten on another
            server. This flag is off by default for backwards
            compatibility.

    Change-Id: I3e335261f749ad065e8abe972f4ac476d334e6b3
    closes-bug: #1819957

Changed in oslo.cache:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.cache 1.34.0

This issue was fixed in the openstack/oslo.cache 1.34.0 release.

Matthew Thode (prometheanfire) wrote :

is this going to be backported to stein/rocky? (the oslo.cache fix)

OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.cache (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/656419

OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.cache (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/656420

OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.cache (stable/stein)

Reviewed: https://review.opendev.org/656419
Committed: https://git.openstack.org/cgit/openstack/oslo.cache/commit/?id=aeb95779c7bea4117ccb69b17eab31d452397964
Submitter: Zuul
Branch: stable/stein

commit aeb95779c7bea4117ccb69b17eab31d452397964
Author: Morgan Fainberg <email address hidden>
Date: Fri Mar 22 12:35:16 2019 -0700

    Pass `flush_on_reconnect` to memcache pooled backend

    If a memcache server disappears and then reconnects when multiple memcache
    servers are used (specific to the python-memcached based backends) it is
    possible that the server will contain stale data. The default is now to
    supply the ``flush_on_reconnect`` optional argument to the backend. This
    means that when the service connects to a memcache server, it will flush
    all cached data in the server. The pooled backend is more likely to
    run into issues with this as it does not explicitly use a thread.local
    for the client. The non-pooled backend was not touched, it is not
    the recommended production use-case.

    See the help from python-memcached:

        @param flush_on_reconnect: optional flag which prevents a
            scenario that can cause stale data to be read: If there's more
            than one memcached server and the connection to one is
            interrupted, keys that mapped to that server will get
            reassigned to another. If the first server comes back, those
            keys will map to it again. If it still has its data, get()s
            can read stale data that was overwritten on another
            server. This flag is off by default for backwards
            compatibility.

    Change-Id: I3e335261f749ad065e8abe972f4ac476d334e6b3
    closes-bug: #1819957
    (cherry picked from commit 1192f185a5fd2fa6177655f157146488a3de81d1)
    Signed-off-by: Matthew Thode <email address hidden>

tags: added: in-stable-stein
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.cache (stable/rocky)

Reviewed: https://review.opendev.org/656420
Committed: https://git.openstack.org/cgit/openstack/oslo.cache/commit/?id=1d6d08f4911edb50debab4639e6a2825260b9e09
Submitter: Zuul
Branch: stable/rocky

commit 1d6d08f4911edb50debab4639e6a2825260b9e09
Author: Morgan Fainberg <email address hidden>
Date: Fri Mar 22 12:35:16 2019 -0700

    Pass `flush_on_reconnect` to memcache pooled backend

    If a memcache server disappears and then reconnects when multiple memcache
    servers are used (specific to the python-memcached based backends) it is
    possible that the server will contain stale data. The default is now to
    supply the ``flush_on_reconnect`` optional argument to the backend. This
    means that when the service connects to a memcache server, it will flush
    all cached data in the server. The pooled backend is more likely to
    run into issues with this as it does not explicitly use a thread.local
    for the client. The non-pooled backend was not touched, it is not
    the recommended production use-case.

    See the help from python-memcached:

        @param flush_on_reconnect: optional flag which prevents a
            scenario that can cause stale data to be read: If there's more
            than one memcached server and the connection to one is
            interrupted, keys that mapped to that server will get
            reassigned to another. If the first server comes back, those
            keys will map to it again. If it still has its data, get()s
            can read stale data that was overwritten on another
            server. This flag is off by default for backwards
            compatibility.

    Change-Id: I3e335261f749ad065e8abe972f4ac476d334e6b3
    closes-bug: #1819957
    (cherry picked from commit 1192f185a5fd2fa6177655f157146488a3de81d1)
    Signed-off-by: Matthew Thode <email address hidden>

tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.cache 1.33.3

This issue was fixed in the openstack/oslo.cache 1.33.3 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.cache 1.30.4

This issue was fixed in the openstack/oslo.cache 1.30.4 release.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.cache (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/704508

OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.cache (stable/queens)

Reviewed: https://review.opendev.org/704508
Committed: https://git.openstack.org/cgit/openstack/oslo.cache/commit/?id=c31dd1aaac0a1dd8ca3f77b6da911ae85de6dc7a
Submitter: Zuul
Branch: stable/queens

commit c31dd1aaac0a1dd8ca3f77b6da911ae85de6dc7a
Author: Morgan Fainberg <email address hidden>
Date: Fri Mar 22 12:35:16 2019 -0700

    Pass `flush_on_reconnect` to memcache pooled backend

    If a memcache server disappears and then reconnects when multiple memcache
    servers are used (specific to the python-memcached based backends) it is
    possible that the server will contain stale data. The default is now to
    supply the ``flush_on_reconnect`` optional argument to the backend. This
    means that when the service connects to a memcache server, it will flush
    all cached data in the server. The pooled backend is more likely to
    run into issues with this as it does not explicitly use a thread.local
    for the client. The non-pooled backend was not touched, it is not
    the recommended production use-case.

    See the help from python-memcached:

        @param flush_on_reconnect: optional flag which prevents a
            scenario that can cause stale data to be read: If there's more
            than one memcached server and the connection to one is
            interrupted, keys that mapped to that server will get
            reassigned to another. If the first server comes back, those
            keys will map to it again. If it still has its data, get()s
            can read stale data that was overwritten on another
            server. This flag is off by default for backwards
            compatibility.

    Change-Id: I3e335261f749ad065e8abe972f4ac476d334e6b3
    closes-bug: #1819957
    (cherry picked from commit 1192f185a5fd2fa6177655f157146488a3de81d1)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.cache queens-eol

This issue was fixed in the openstack/oslo.cache queens-eol release.
