sysinv conductor running out of sockets

Bug #1838088 reported by Tyler Smith on 2019-07-26
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Medium
John Kung

Bug Description

Brief Description
-----------------
After many application operations (upload, apply, reapply, remove, delete, repeat x50) the sysinv conductor runs out of sockets and the sysinv log fills with exceptions, many sysinv commands start failing.

Severity
--------
Minor: System/Feature is usable with minor issue
Could lock/unlock host to fix it

Steps to Reproduce
------------------
Perform many repeated application operations

Expected Behavior
------------------
Sysinv-conductor should remain operational

Actual Behavior
----------------
System commands start being rejected, sysinv.log filled with exceptions (see below):

Reproducibility
---------------
Seen once so far

System Configuration
--------------------
Multi-node system, Dedicated storage, IPv4

Branch/Pull Time/Commit
-----------------------
###
### StarlingX
### Built from master
###

OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190720T013000Z"

JOB="STX_build_master_master"
<email address hidden>"
BUILD_NUMBER="186"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-07-20 01:30:00 +0000"

app tarball version:
stx-openstack-1.0-17-centos-stable-versioned.tgz

Timestamp/Logs
--------------

2019-07-26 17:32:50.587 524584 ERROR sysinv.openstack.common.rpc.amqp [req-c79048cd-8642-4823-9f79-eaab631a6e48 None None] Exception during message handling
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp Traceback (most recent call last):
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/amqp.py", line 438, in _process_data
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp **args)
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/dispatcher.py", line 172, in dispatch
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp result = getattr(proxyobj, method)(ctxt, **kwargs)
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sysinv/conductor/manager.py", line 8205, in ilvg_get_nova_ilvg_by_ihost
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp ilvgs = self.dbapi.ilvg_get_by_ihost(ihost_uuid)
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sysinv/objects/__init__.py", line 102, in wrapper
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp result = fn(*args, **kwargs)
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sysinv/db/sqlalchemy/api.py", line 3128, in ilvg_get_by_ihost
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp sort_key, sort_dir, query)
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sysinv/db/sqlalchemy/api.py", line 119, in _paginate_query
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp return query.all()
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2703, in all
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2855, in __iter__
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2876, in _execute_and_instances
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2885, in _get_bind_args
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2867, in _connection_from_session
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 998, in connection
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 1005, in _connection_for_bind
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 2112, in contextual_connect
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 2151, in _wrap_pool_connect
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1461, in _handle_dbapi_exception_noconnection
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 2147, in _wrap_pool_connect
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 387, in connect
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 766, in _checkout
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 516, in checkout
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 1138, in _do_get
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 66, in __exit__
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 1135, in _do_get
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 333, in _create_connection
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 461, in __init__
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 651, in __connect
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/strategies.py", line 105, in connect
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 393, in connect
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp File "/usr/lib64/python2.7/site-packages/psycopg2/__init__.py", line 164, in connect
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp OperationalError: (psycopg2.OperationalError) could not create socket: Too many open files
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp
2019-07-26 17:32:50.587 524584 TRACE sysinv.openstack.common.rpc.amqp
2019-07-26 17:32:50.589 524584 ERROR sysinv.openstack.common.rpc.common [req-c79048cd-8642-4823-9f79-eaab631a6e48 None None] Returning exception (psycopg2.OperationalError) could not create socket: Too many open files
 to caller

Seems the max open files is set to 1024
cat /proc/1914392/limits
Limit Soft Limit Hard Limit Units
...
Max open files 1024 4096 files

and sysinv-conductor has hit that limit:
ls -l /proc/524584/fd | wc -l
1025

vast majority of which are sockets:
...
lrwx------ 1 root root 64 Jul 26 15:02 974 -> socket:[106351332]
lrwx------ 1 root root 64 Jul 26 15:02 975 -> socket:[106351333]
lrwx------ 1 root root 64 Jul 26 15:02 976 -> socket:[106406580]
...

Test Activity
-------------
Developer Testing

Ghada Khalil (gkhalil) wrote :

Marking as medium priority / stx.2.0 - issue happens as a result of repetitive/stress testing; recoverable by manual action.

tags: added: stx.2.0 stx.config
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → John Kung (john-kung)
John Kung (john-kung) wrote :

The number of sockets held grows at the rate of 4 sockets per iteration, of 'system application-apply' as measured by ‘ss – xp |grep sysinv-conductor’. This socket growth occurs actually at the step armada_svc.exec_run(cmd) which calls into the docker library (installed on the controller at /usr/lib/python/site-packages/docker/client.py provided by rpm: python2-docker-3.3.0-1.el7.noarch).

The latest available rpm python-docker-3.7.0-2.el7.src.rpm was also attempted with the same results.

The following source code patch to docker/api/client.py:: _read_from_socket() removes the socket leak by closing the socket.:
[/usr/lib/python/site-packages/docker/api/:: diff client.py_FIX client.py_ORIG
373,376c373
< try:
< return six.binary_type().join(gen)
< finally:
< response.close()
---
> return six.binary_type().join(gen)

This solution would require bringing in the docker source code rpm via stx-tool and providing a source code patch to the build in stx-integ.

Fix proposed to branch: master
Review: https://review.opendev.org/675686

Changed in starlingx:
status: Triaged → In Progress

Reviewed: https://review.opendev.org/675686
Committed: https://git.openstack.org/cgit/starlingx/tools/commit/?id=b9ee1ac1c068583d74c22047772fb5cff4713f7d
Submitter: Zuul
Branch: master

commit b9ee1ac1c068583d74c22047772fb5cff4713f7d
Author: John Kung <email address hidden>
Date: Fri Aug 9 09:46:13 2019 -0400

    Fix socket growth on system application operations

    There is an issue whereby docker exec_run commands do not close the TCP socket.
    Thus on each kube application operation, such as application-apply, the docker
    exec_run consumes 4 TCP sockets per iteration. Thus, when it hits limit
    of 1024 sockets, system can no longer perform application operations.

    This updates the python-docker package. The source code was already in the
    cengn mirror; and this update adds a patch to finally close socket in
    docker client.

    Tests performed:
    Install AIO, Storage labs.
    system application-upload stx-openstack, apply, remove
    system application apply stress test of over 300 iterations
    and verify no growth in sockets held by sysinv-conductor.

    Change-Id: I8a8efbb372581c7ce13a0cf813f63c13d5186973
    Closes-Bug: 1838088
    Signed-off-by: John Kung <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released

Reviewed: https://review.opendev.org/675687
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=3f4a4391095d73c2d78dfb4db78934043e6b9aca
Submitter: Zuul
Branch: master

commit 3f4a4391095d73c2d78dfb4db78934043e6b9aca
Author: John Kung <email address hidden>
Date: Fri Aug 9 09:49:41 2019 -0400

    Fix socket growth on system application operations

    There is an issue whereby docker exec_run commands do not close the TCP socket.
    Thus on each kube application operation, such as application-apply, the docker
    exec_run consumes 4 TCP sockets per iteration. Thus, when it hits limit
    of 1024 sockets, system can no longer perform application operations.

    This updates the python-docker package. The source code was already in the
    cengn mirror; and this update adds a patch to finally close socket in
    docker client.

    Tests performed:
    Install AIO, Storage labs.
    system application-upload stx-openstack, apply, remove
    system application apply stress test of over 300 iterations
    and verify no growth in sockets held by sysinv-conductor.

    Change-Id: I21b9e486af1e3bb3b82fd79c3e3662c97b53b732
    Closes-Bug: 1838088
    Depends-On: https://review.opendev.org/#/c/675686/
    Signed-off-by: John Kung <email address hidden>

Reviewed: https://review.opendev.org/676479
Committed: https://git.openstack.org/cgit/starlingx/tools/commit/?id=3ae7d66753333c977a6c2af95bd26614ae130d68
Submitter: Zuul
Branch: r/stx.2.0

commit 3ae7d66753333c977a6c2af95bd26614ae130d68
Author: John Kung <email address hidden>
Date: Fri Aug 9 09:46:13 2019 -0400

    Fix socket growth on system application operations

    There is an issue whereby docker exec_run commands do not close the TCP socket.
    Thus on each kube application operation, such as application-apply, the docker
    exec_run consumes 4 TCP sockets per iteration. Thus, when it hits limit
    of 1024 sockets, system can no longer perform application operations.

    This updates the python-docker package. The source code was already in the
    cengn mirror; and this update adds a patch to finally close socket in
    docker client.

    Tests performed:
    Install AIO, Storage labs.
    system application-upload stx-openstack, apply, remove
    system application apply stress test of over 300 iterations
    and verify no growth in sockets held by sysinv-conductor.

    Change-Id: I8a8efbb372581c7ce13a0cf813f63c13d5186973
    Closes-Bug: 1838088
    Signed-off-by: John Kung <email address hidden>
    (cherry picked from commit b9ee1ac1c068583d74c22047772fb5cff4713f7d)

Reviewed: https://review.opendev.org/676481
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=847bd52a92bd71d4a6a81f3258ea74c3e1ea3a4c
Submitter: Zuul
Branch: r/stx.2.0

commit 847bd52a92bd71d4a6a81f3258ea74c3e1ea3a4c
Author: John Kung <email address hidden>
Date: Fri Aug 9 09:49:41 2019 -0400

    Fix socket growth on system application operations

    There is an issue whereby docker exec_run commands do not close the TCP socket.
    Thus on each kube application operation, such as application-apply, the docker
    exec_run consumes 4 TCP sockets per iteration. Thus, when it hits limit
    of 1024 sockets, system can no longer perform application operations.

    This updates the python-docker package. The source code was already in the
    cengn mirror; and this update adds a patch to finally close socket in
    docker client.

    Tests performed:
    Install AIO, Storage labs.
    system application-upload stx-openstack, apply, remove
    system application apply stress test of over 300 iterations
    and verify no growth in sockets held by sysinv-conductor.

    Change-Id: I21b9e486af1e3bb3b82fd79c3e3662c97b53b732
    Closes-Bug: 1838088
    Depends-On: https://review.opendev.org/#/c/676479/
    Signed-off-by: John Kung <email address hidden>
    (cherry picked from commit 3f4a4391095d73c2d78dfb4db78934043e6b9aca)

Ghada Khalil (gkhalil) on 2019-08-15
tags: added: in-r-stx20
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers