Many '/usr/bin/openstack' processes get stuck on controllers during deployment, leading to OOM

Bug #1502936 reported by Dennis Dmitriev on 2015-10-05
This bug affects 2 people
Affects: Fuel for OpenStack — Importance: High — Assigned to: Andrey Kurilin
Affects: 7.0.x — Importance: High — Assigned to: Dennis Dmitriev

Bug Description

During cluster deployment, Puppet manifests use the shell command '/usr/bin/python /usr/bin/openstack ...' to communicate with OpenStack components.

These processes never terminate, consuming a lot of memory and, as a result, invoking the oom-killer.

For example:
- a controller with 3 GB of memory and 3 GB of swap,
- after the deployment finished (with a failure), all memory and swap were filled; 3.5 GB was taken by /usr/bin/openstack processes.

Reproduced on CI: https://product-ci.infra.mirantis.net/job/8.0.system_test.ubuntu.ha_neutron_tun_scale/9/

        Scenario:
            1. Create cluster
            2. Add 1 controller node
            3. Deploy the cluster
            4. Add 2 controller nodes
            5. Deploy changes

Result: the re-deployment of the primary controller at step 5 failed.

Here is memory consumption on primary controller right after step 5: http://paste.openstack.org/show/475339/

Here is memory consumption after killing the /usr/bin/openstack processes: http://paste.openstack.org/show/475342/ (+2 GB of free memory and +1.5 GB of free swap space)

---------------------

Such an issue can lead to failures like the one described in https://bugs.launchpad.net/fuel/+bug/1493372.

Here is the error.log from Apache on the primary controller, taken when there was no free memory: http://paste.openstack.org/show/475327/
[Sun Oct 04 22:29:31.686995 2015] [core:notice] [pid 13201:tid 140384935454592] AH00052: child pid 27972 exit signal Segmentation fault (11)
[Sun Oct 04 22:29:31.687026 2015] [core:error] [pid 13201:tid 140384935454592] AH00546: no record of generation 0 of exiting child 27972
...

See 'atop' logs for node-5 in the diagnostic snapshot attached to the bug.

Matthew Mosesohn (raytrac3r) wrote :

This really needs to be addressed in openstackclient itself. If it hangs, it never dies, and it doesn't support a timeout.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → MOS Packaging Team (mos-packaging)
Changed in fuel:
status: New → Confirmed
tags: added: swarm-blocker
Changed in fuel:
assignee: MOS Packaging Team (mos-packaging) → Artem Silenkov (asilenkov)
Ivan Berezovskiy (iberezovskiy) wrote :

The main problem here is that openstackclient doesn't support timeouts. So when you request something via openstackclient and the connection hangs, the process never dies. We need a Python engineer to implement timeout support in openstackclient: a sensible default value (60 seconds, for example) plus the ability to override it with a '--timeout' argument. This would also let Puppet make use of these timeouts.
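A minimal sketch of the flag Ivan proposes, assuming a 60-second default and a '--timeout' override; the parser name and the `api_call` helper are hypothetical illustrations, not the actual openstackclient code:

```python
import argparse
import socket


def make_parser():
    # Hypothetical CLI wiring for the proposed flag: a 60 s default that
    # callers (e.g. Puppet) can override with '--timeout'.
    parser = argparse.ArgumentParser(prog="osc-sketch")
    parser.add_argument("--timeout", type=float, default=60.0,
                        help="abort any API call after this many seconds")
    return parser


def api_call(host, port, timeout):
    # With a timeout set on the socket, a hung connection raises
    # socket.timeout instead of blocking the process forever.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.settimeout(timeout)
        return sock.recv(1)


default = make_parser().parse_args([])
override = make_parser().parse_args(["--timeout", "5"])
print(default.timeout, override.timeout)  # → 60.0 5.0
```

The key point is that the timeout must reach every blocking network call, not just the argument parser.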

Dmitry Pyzhov (dpyzhov) on 2015-10-22
tags: added: area-build
Roman Podoliaka (rpodolyaka) wrote :

Ivan, another question would be why the Puppet manifests continue to start new openstackclient processes without checking whether the previous call succeeded (i.e. *waiting* for the process to *exit* with a correct error code). It looks like a retry loop in the manifest code keeps spawning more and more openstackclient processes, which then remain hanging forever.
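The pattern Roman asks for can be sketched as a retry loop that blocks on the child's exit code before spawning the next attempt; `retry_call` is a hypothetical helper, not the actual manifest code:

```python
import subprocess


def retry_call(cmd, attempts=3):
    # Each iteration waits for the child to exit and inspects its return
    # code before deciding whether to retry, so hung clients never pile up.
    for _ in range(attempts):
        proc = subprocess.Popen(cmd)
        if proc.wait() == 0:  # block here; never spawn a second copy early
            return True
    return False


print(retry_call(["true"]))               # → True
print(retry_call(["false"], attempts=2))  # → False
```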

Roman Podoliaka (rpodolyaka) wrote :

While I agree that adding support for --timeout would be useful, IMO we should also ensure that the caller (i.e. the Puppet manifests) waits for the openstackclient process to end, rather than just spawning new processes.

It would also be nice if we could enforce such a call timeout at the Puppet level, since not everything we call supports a timeout on its own. E.g. similar to subprocess.call() in Python: https://docs.python.org/3.4/library/subprocess.html#subprocess.call
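Enforcing the timeout from the caller's side, as Roman suggests, can be sketched with the standard-library `subprocess` module: `subprocess.run()` kills the child when the timeout expires. The `call_with_timeout` wrapper is an illustration of the idea, not Fuel's actual code:

```python
import subprocess


def call_with_timeout(cmd, timeout):
    # subprocess.run() kills the child process when the timeout expires
    # and raises TimeoutExpired, so a hung client cannot linger and eat
    # memory; the caller can then decide whether to retry.
    try:
        return subprocess.run(cmd, timeout=timeout).returncode
    except subprocess.TimeoutExpired:
        return None  # child was killed; safe to retry


print(call_with_timeout(["true"], timeout=5))           # → 0
print(call_with_timeout(["sleep", "10"], timeout=0.2))  # → None
```

A Puppet-level equivalent would wrap each `exec` of the client in such a timeout, so even tools without their own --timeout flag are bounded.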

Roman Podoliaka (rpodolyaka) wrote :

Do we have a reproducer? It would be nice to know why openstackclient gets stuck in the first place (although we still need a way to stop it forcibly).

Andrey Kurilin (andreykurilin) wrote :

Proposed fix for openstackclient: https://review.fuel-infra.org/#/c/13261/
Branch: openstack-ci/fuel-8.0/liberty

Reviewed: https://review.fuel-infra.org/13261
Submitter: Dmitry Mescheryakov <email address hidden>
Branch: openstack-ci/fuel-8.0/liberty

Commit: 44c59b9d22356a296b8a80611bd25ce1a642895c
Author: Andrey Kurilin <email address hidden>
Date: Thu Oct 29 13:56:54 2015

Add timeout option

This option should prevent situations when connection hangs.

Related-bug: #1502936
Change-Id: I952a923b056eef653fe45676216b563729b29133

Dmitry Mescheryakov (dmitrymex) wrote :

We expect the change referenced above to fix the issue.

Changed in fuel:
assignee: Artem Silenkov (asilenkov) → Andrey Kurilin (andreykurilin)
status: Confirmed → Fix Committed
Vitaly Sedelnik (vsedelnik) wrote :

Dennis, please update the description with MOS version and confirm whether this issue affects 7.0.

Tatyanka (tatyana-leontovich) wrote :

Moving to Invalid for 7.0, since the issue was not reproduced on the MOS 7.0 version.

Tatyanka (tatyana-leontovich) wrote :

Tried to reproduce on ISO builds 429 and 466; could not reproduce, so moving to Fix Released.

Changed in fuel:
status: Fix Committed → Fix Released

Related fix proposed to branch: 9.0/mitaka
Change author: Andrey Kurilin <email address hidden>
Review: https://review.fuel-infra.org/18721

Change abandoned by Alexander Evseev <email address hidden> on branch: 9.0/mitaka
Review: https://review.fuel-infra.org/18721
