memcached socket not released upon lbaas API request

Bug #1892852 reported by Sergio Morant on 2020-08-25
This bug affects 2 people
Affects / Status / Importance / Assigned to / Milestone
OpenStack Security Advisory: Undecided, Unassigned
keystonemiddleware: Undecided, Unassigned
oslo.cache: Undecided, Unassigned

Bug Description

We have recently installed an OpenStack cluster on the Train release and have noticed unexpected behavior when Neutron contacts memcached upon some specific API requests. In our environment lbaas is not configured in Neutron (I don't know if this is still possible in the current Neutron version), but we have deployed a monitoring service based on Prometheus openstack-exporter which, by default, checks the accessibility of the lbaas part of the Neutron API.
Each time the check runs, the neutron server generates a 404, as this part of the API is not available. An example can be found at http://paste.openstack.org/show/797113/

The actual issue happens at the socket management level. Each time the check is performed, the socket between the Neutron server and memcached is not released as expected. This leads to a continuous increase in established sockets between Neutron and memcached until all available sockets are exhausted and cluster authentication locks up.
Removing the check on the lbaas service from the Prometheus openstack-exporter works around the issue, but we think that an issue like this (if confirmed) has significant security implications, as it can quite easily lead to a DoS (intentional or not).

For the record, we have tried to explicitly configure the memcached socket monitoring options in Neutron, but it looks like they are not applied under the current conditions:
memcache_pool_socket_timeout = 3
memcache_pool_unused_timeout = 60
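For context (not part of the original report): these pool options are normally read by oslo.cache from the `[cache]` section of the service config (keystonemiddleware accepts similarly named options in `[keystone_authtoken]`). A minimal sketch of where they would live, assuming the pooled memcache backend and a hypothetical server address:

```ini
[cache]
enabled = true
backend = oslo_cache.memcache_pool
memcache_servers = 192.0.2.10:11211
# close a socket if an individual operation blocks longer than 3 seconds
memcache_pool_socket_timeout = 3
# drop pooled connections left idle for more than 60 seconds
memcache_pool_unused_timeout = 60
```

Note that these timeouts only govern connections that actually make it back into the pool; they cannot reclaim a socket that an error path never returns, which is consistent with the reporter observing no effect.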

Below is the environment setup:

Deployer : OpenStack Ansible 20.1.2

Neutron version: neutron 15.1.1.dev2

Neutron config file http://paste.openstack.org/show/797117/

Memcached config file http://paste.openstack.org/show/797118/

Prometheus OpenStack Exporter : https://github.com/openstack-exporter/openstack-exporter

Prometheus OpenStack Exporter config : http://paste.openstack.org/show/797119/

Prometheus command WITH lbaas checking:
/usr/local/bin/openstack-exporter \
    --os-client-config=/etc/openstack-exporter/openstack-exporter.yml \
    --web.telemetry-path="/metrics" \
    --web.listen-address="0.0.0.0:9180" \
    --prefix="openstack" \
    --endpoint-type="public"

Prometheus command WITHOUT lbaas checking:
/usr/local/bin/openstack-exporter \
    --os-client-config=/etc/openstack-exporter/openstack-exporter.yml \
    --web.telemetry-path="/metrics" \
    --web.listen-address="0.0.0.0:9180" \
    --prefix="openstack" \
    --endpoint-type="public" \
    --disable-metric neutron-loadbalancers \
    --disable-metric neutron-loadbalancers_not_active \
    --disable-service.object-store \
    --disable-service.load-balancer \

I hope the provided information is enough to reproduce the issue.

Cheers
Sergio

Slawek Kaplonski (slaweq) wrote :

Hi,

Thanks for reporting this bug. I was just checking Neutron code and I'm a bit confused as we are not using memcached in Neutron code anywhere.
Is this memcached used by keystonemiddleware? And if it is, isn't that a keystonemiddleware bug instead of a Neutron one?

Sergio Morant (smorant) wrote :

Hello,
I guess you are right. The memcached configuration is related to authentication and thus to Keystone.
I didn't know that socket management was handled by the keystone middleware. Maybe this interaction with the middleware can somehow explain the fact that the socket is left open.

I do not know how I should proceed. Should I copy the content of this ticket and create a new one on the keystone component, or is there a way to move it?

Thanks a lot for your quick feedback
Sergio

Jeremy Stanley (fungi) wrote :

Thanks for the clarification, I'll reset the affected project to keystonemiddleware now.

description: updated
affects: neutron → keystonemiddleware
Changed in ossa:
status: New → Incomplete
Jeremy Stanley (fungi) wrote :

Since this report concerns a possible security risk, an incomplete
security advisory task has been added while the core security
reviewers for the affected project or projects confirm the bug and
discuss the scope of any vulnerability along with potential
solutions.

Jeremy Stanley (fungi) wrote :

Just so I understand, a user can call the /v2.0/lbaas/loadbalancers method repeatedly and each invocation creates a new socket which does not terminate, allowing them to (quickly?) exhaust system resources when the process reaches the system file descriptor limit or no ephemeral ports are available to source new connections to memcached. Is this an accurate summary?
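The failure mode Jeremy describes can be modeled with a toy connection pool (a deliberately simplified sketch, not the actual oslo.cache or python-memcached code): if the error path that produces the 404 skips the step that returns the connection, every failing request opens a fresh socket and none are ever reused.

```python
# Simplified model of a connection pool that leaks on the error path.
# Names (ConnectionPool, handle_request) are illustrative, not from oslo.cache.
class ConnectionPool:
    def __init__(self):
        self.created = 0   # total "sockets" ever opened
        self.free = []     # connections available for reuse

    def get(self):
        if self.free:
            return self.free.pop()
        self.created += 1  # nothing free: open a new connection
        return object()

    def put(self, conn):
        self.free.append(conn)  # hand the connection back for reuse

def handle_request(pool, fail=False):
    conn = pool.get()
    if fail:
        # The 404-producing path raises before put() runs,
        # so the connection is never returned to the pool.
        raise RuntimeError("404 handling path")
    pool.put(conn)

pool = ConnectionPool()
for _ in range(100):
    try:
        handle_request(pool, fail=True)  # every failing request leaks one
    except RuntimeError:
        pass
print(pool.created)  # → 100: one new socket per request, zero reuse
```

Against a real memcached backend each leaked connection also pins a file descriptor and an ephemeral port, which is how repeated polling eventually exhausts system resources.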

Gage Hugo (gagehugo) wrote :

Neutron has caching from a quick look:

https://github.com/openstack/neutron/blob/master/neutron/common/cache_utils.py

I am confused as to why this is keystonemiddleware; afaik the only thing ksm caches is the token, and we would be seeing this issue everywhere if it was leaving sockets open.

Sergio Morant (smorant) wrote :

Hello,
I will try to clarify this issue as much as possible. To answer Jeremy's question in #5, the summary is right, but I need to add that this feature (lbaas) is not enabled in the cluster. That's why there is a 404 message.

To answer Gage's comment in #6, we are also confused. Our team searched the internet for quite a while because we were pretty sure that the issue, if not related to an exotic setup, should be known by now. The complementary information I can provide is:
* We have two clusters running train version and the issue is present in both of them
* We have several clusters running the Ocata version with the same monitoring setup, and we haven't experienced the issue so far.

I don't know how much the keystone setup itself may impact this behavior, and I would like to avoid muddying the waters. However, I will point out that in the Train deployment we added an identity federation setup in keystone, with OIDC and Keycloak as the authentication backend for cluster users. From my point of view this shouldn't impact Neutron's authentication, but it is the only tuning we made with respect to the default authentication setup, and it is the main change compared to the Ocata deployment, which uses LDAP as the authentication backend for cluster users.

Please let me know if we can provide any traces or cluster configuration information that could help you understand what is going on.

Best regards
Sergio

Jeremy Stanley (fungi) wrote :

Could this be a duplicate of (or at least related to) bug 1883659 do you think?

Gage Hugo (gagehugo) wrote :

They do seem very similar, and there is a bit more discussion and work being done on the public one as well.

Gage Hugo (gagehugo) wrote :

Has there been any observation here with the other neutron APIs not releasing memcache sockets, or has it been confined to only lbaas API requests in this case?

This does closely resemble bug 1883659 from the descriptions.

Sergio Morant (smorant) wrote :

Hello,
I pretty much agree with you. The behavior described in bug 1883659 is quite similar to what we observe on our clusters. It would be interesting to check whether they also have requests that frequently hit unused parts of the neutron-server API.

Jeremy Stanley (fungi) wrote :

I've subscribed the oslo-coresec reviewers and added a new bugtask for oslo.cache, hopefully we can get some additional clarity that way. Ultimately, I'm not convinced keeping this bug private is helping anyone, and strongly suggest we switch it to Public Security so we can at least add the potential vulnerability concerns to the increasing amount of community discussion around bug 1883659.

I second that.

We were working on a possible fix for this in this patch[1], but there are still some missing changes to be made.

And Sergio, you have a cleartext memcache_secret_key in your Neutron config file.

[1] https://review.opendev.org/#/c/742193/
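The usual remedy for this class of leak (a sketch of the general pattern, not the actual change under review above) is to acquire the pooled connection through a context manager so the release step runs on every exit path, including exceptions:

```python
# Sketch of a leak-proof pool using contextlib; names are illustrative,
# not the oslo.cache implementation.
from contextlib import contextmanager

class ConnectionPool:
    def __init__(self):
        self.created = 0   # total "sockets" ever opened
        self.free = []     # connections available for reuse

    def get(self):
        if self.free:
            return self.free.pop()
        self.created += 1
        return object()

    def put(self, conn):
        self.free.append(conn)

    @contextmanager
    def acquire(self):
        conn = self.get()
        try:
            yield conn
        finally:
            self.put(conn)  # released on success and on error alike

pool = ConnectionPool()
for _ in range(100):
    try:
        with pool.acquire():
            raise RuntimeError("404 handling path")  # simulated error path
    except RuntimeError:
        pass
print(pool.created)  # → 1: the single connection is reused every time
```

With the `finally` clause in place, even a request that errors out returns its connection, so the pool stays at a handful of sockets instead of growing without bound.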

Jeremy Stanley (fungi) wrote :

It looks to me like change https://review.opendev.org/742193 and subsequent review discussion spells out the security risks fairly clearly, so this is basically a publicly known issue at this point. Since I have seen no objections to my proposal in comment #12 two weeks ago, I'm switching this report to Public Security so that it can weigh more clearly in any discussion of proposed fixes or mitigations.

description: updated
information type: Private Security → Public Security
Sergio Morant (smorant) wrote :

I fully agree with the analysis. The patch seems to address the issue we faced.

We will check whether the patch is backported to the Train release. In that case we will test it.

Thanks a lot to all of you for helping figure out the root cause.

PS: Regarding the comment in #14, I couldn't manage to update the paste-bin contents, but it is not a big deal.

Jeremy Stanley (fungi) wrote :

It looks like this may be the same as bug 1883659 and bug 1888394.

Jeremy Stanley (fungi) wrote :

At this point there's no clear exploit scenario and the description of this and the other two presumed related reports seems to be of a normal (albeit potentially crippling) bug. As such, the vulnerability management team is going to treat this as a class D report per our taxonomy and not issue an advisory once it's fixed, but if anyone disagrees we can reconsider the position: https://security.openstack.org/vmt-process.html#incident-report-taxonomy

Changed in ossa:
status: Incomplete → Won't Fix
Ben Nemec (bnemec) wrote :

The linked patch has merged, so I'm going to mark this as fixed. Feel free to reopen if the problem doesn't go away with the new oslo.cache release.

Changed in oslo.cache:
status: New → Fix Released