Many '/usr/bin/openstack' processes get stuck on controllers during deployment, leading to OOM

Bug #1502936 reported by Dennis Dmitriev on 2015-10-05
This bug affects 2 people
Affects: Fuel for OpenStack — Importance: High — Assigned to: Andrey Kurilin
Affects: 7.0.x — Importance: High — Assigned to: Dennis Dmitriev

Bug Description

During cluster deployment, Puppet manifests use the shell command '/usr/bin/python /usr/bin/openstack ...' to communicate with OpenStack components.

These processes never terminate, consuming a lot of memory and, as a result, invoking the oom-killer.

For example:
- a controller with 3 GB of memory and 3 GB of swap,
- after the deployment finished (with a failure), all memory and swap were filled; 3.5 GB was taken by /usr/bin/openstack processes.

Reproduced on CI: https://product-ci.infra.mirantis.net/job/8.0.system_test.ubuntu.ha_neutron_tun_scale/9/

        Scenario:
            1. Create cluster
            2. Add 1 controller node
            3. Deploy the cluster
            4. Add 2 controller nodes
            5. Deploy changes

Result: the re-deployment of the primary controller at step 5 failed.

Here is memory consumption on primary controller right after step 5: http://paste.openstack.org/show/475339/

Here is memory consumption after killing the /usr/bin/openstack processes: http://paste.openstack.org/show/475342/ (+2 GB of free memory and +1.5 GB of free swap space)

---------------------

Such an issue can lead to failures like the one described in https://bugs.launchpad.net/fuel/+bug/1493372.

Here is the error.log from Apache on the primary controller, taken when there was no free memory: http://paste.openstack.org/show/475327/
[Sun Oct 04 22:29:31.686995 2015] [core:notice] [pid 13201:tid 140384935454592] AH00052: child pid 27972 exit signal Segmentation fault (11)
[Sun Oct 04 22:29:31.687026 2015] [core:error] [pid 13201:tid 140384935454592] AH00546: no record of generation 0 of exiting child 27972
...

See 'atop' logs for node-5 in the diagnostic snapshot attached to the bug.

Matthew Mosesohn (raytrac3r) wrote :

This really needs to be addressed in openstackclient itself. If it hangs, it never dies, and it doesn't support a timeout.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → MOS Packaging Team (mos-packaging)
Changed in fuel:
status: New → Confirmed
tags: added: swarm-blocker
Changed in fuel:
assignee: MOS Packaging Team (mos-packaging) → Artem Silenkov (asilenkov)
Ivan Berezovskiy (iberezovskiy) wrote :

The main problem here is that openstackclient doesn't support timeouts. So when you request something via openstackclient and the connection hangs, the process never dies. We need a Python engineer to implement timeout support in openstackclient: a sensible default value (60 seconds, for example) plus the ability to override it with a '--timeout' argument. This would also let Puppet make use of these timeouts.
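A minimal sketch of the flag Ivan proposes, assuming a 60-second default and a '--timeout' override; the parser name and the `api_call` helper are hypothetical illustrations, not the actual openstackclient code:

```python
import argparse
import socket


def make_parser():
    # Hypothetical CLI wiring for the proposed flag: a 60 s default that
    # callers (e.g. Puppet) can override with '--timeout'.
    parser = argparse.ArgumentParser(prog="osc-sketch")
    parser.add_argument("--timeout", type=float, default=60.0,
                        help="abort any API call after this many seconds")
    return parser


def api_call(host, port, timeout):
    # With a timeout set on the socket, a hung connection raises
    # socket.timeout instead of blocking the process forever.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.settimeout(timeout)
        return sock.recv(1)


default = make_parser().parse_args([])
override = make_parser().parse_args(["--timeout", "5"])
print(default.timeout, override.timeout)  # → 60.0 5.0
```

The key point is that the timeout must reach every blocking network call, not just the argument parser.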

Dmitry Pyzhov (dpyzhov) on 2015-10-22
tags: added: area-build
Roman Podoliaka (rpodolyaka) wrote :

Ivan, another question would be why the Puppet manifests continue to start new openstackclient processes without checking whether the previous call succeeded (i.e. *waiting* for the process to *exit* with a correct error code). It looks like a retry loop in the manifest code keeps spawning more and more openstackclient processes, which then remain hanging forever.
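The pattern Roman asks for can be sketched as a retry loop that blocks on the child's exit code before spawning the next attempt; `retry_call` is a hypothetical helper, not the actual manifest code:

```python
import subprocess


def retry_call(cmd, attempts=3):
    # Each iteration waits for the child to exit and inspects its return
    # code before deciding whether to retry, so hung clients never pile up.
    for _ in range(attempts):
        proc = subprocess.Popen(cmd)
        if proc.wait() == 0:  # block here; never spawn a second copy early
            return True
    return False


print(retry_call(["true"]))               # → True
print(retry_call(["false"], attempts=2))  # → False
```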

Roman Podoliaka (rpodolyaka) wrote :

While I agree that adding support for --timeout would be useful, IMO we should also ensure that the caller (i.e. the Puppet manifests) waits for the openstackclient process to end, rather than just spawning new processes.

It would also be nice if we could enforce such a call timeout at the Puppet level, since not everything we call supports a timeout on its own. E.g. similar to subprocess.call() in Python: https://docs.python.org/3.4/library/subprocess.html#subprocess.call
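Enforcing the timeout from the caller's side, as Roman suggests, can be sketched with the standard-library `subprocess` module: `subprocess.run()` kills the child when the timeout expires. The `call_with_timeout` wrapper is an illustration of the idea, not Fuel's actual code:

```python
import subprocess


def call_with_timeout(cmd, timeout):
    # subprocess.run() kills the child process when the timeout expires
    # and raises TimeoutExpired, so a hung client cannot linger and eat
    # memory; the caller can then decide whether to retry.
    try:
        return subprocess.run(cmd, timeout=timeout).returncode
    except subprocess.TimeoutExpired:
        return None  # child was killed; safe to retry


print(call_with_timeout(["true"], timeout=5))           # → 0
print(call_with_timeout(["sleep", "10"], timeout=0.2))  # → None
```

A Puppet-level equivalent would wrap each `exec` of the client in such a timeout, so even tools without their own --timeout flag are bounded.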

Roman Podoliaka (rpodolyaka) wrote :

Do we have a reproducer? It would be nice to know why openstackclient gets stuck in the first place (although we still need a way to stop it forcibly).

Andrey Kurilin (andreykurilin) wrote :

Proposed fix for openstackclient: https://review.fuel-infra.org/#/c/13261/
Branch: openstack-ci/fuel-8.0/liberty

Reviewed: https://review.fuel-infra.org/13261
Submitter: Dmitry Mescheryakov <email address hidden>
Branch: openstack-ci/fuel-8.0/liberty

Commit: 44c59b9d22356a296b8a80611bd25ce1a642895c
Author: Andrey Kurilin <email address hidden>
Date: Thu Oct 29 13:56:54 2015

Add timeout option

This option should prevent situations when connection hangs.

Related-bug: #1502936
Change-Id: I952a923b056eef653fe45676216b563729b29133

Dmitry Mescheryakov (dmitrymex) wrote :

We expect the change referenced above to fix the issue.

Changed in fuel:
assignee: Artem Silenkov (asilenkov) → Andrey Kurilin (andreykurilin)
status: Confirmed → Fix Committed
Vitaly Sedelnik (vsedelnik) wrote :

Dennis, please update the description with MOS version and confirm whether this issue affects 7.0.

Tatyanka (tatyana-leontovich) wrote :

Moving to Invalid for 7.0, since the issue was not reproduced on the MOS 7.0 version.

Tatyanka (tatyana-leontovich) wrote :

Tried to reproduce on ISO builds 429 and 466; could not reproduce, so moving to Fix Released.

Changed in fuel:
status: Fix Committed → Fix Released

Related fix proposed to branch: 9.0/mitaka
Change author: Andrey Kurilin <email address hidden>
Review: https://review.fuel-infra.org/18721

Change abandoned by Alexander Evseev <email address hidden> on branch: 9.0/mitaka
Review: https://review.fuel-infra.org/18721
