skydive_agent container not reaping zombies

Bug #1806167 reported by JL
This bug affects 4 people
Affects: kolla
Importance: Medium
Assigned to: Michal Nasiadka

Bug Description

The skydive_agent container creates zombie processes that are never reaped. It can create so many zombies that it exhausts the PID space, at which point no new processes can be started.

Trying to run a command at the bash prompt in this state will just give you the (inaccurate) error message "fork: Cannot allocate memory".

The zombies are all "ovs-ofctl".
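One way to confirm which command the zombies belong to is to tally processes in state `Z` from `/proc`. The following is a minimal sketch, assuming a Linux host with `/proc` mounted; the function names `parse_stat` and `count_zombies` are illustrative and not part of kolla or skydive.

```python
import os
from collections import Counter


def parse_stat(stat_text):
    """Return (command, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST ')' rather than on spaces.
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    command = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return command, state


def count_zombies(proc_root="/proc"):
    """Count zombie (state 'Z') processes per command name."""
    counts = Counter()
    try:
        entries = os.listdir(proc_root)
    except OSError:
        # No procfs at this path (for example, a non-Linux host).
        return counts
    for entry in entries:
        if not entry.isdigit():
            continue  # only numeric directories are PIDs
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                command, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            # The process vanished mid-scan, or the entry is unreadable.
            continue
        if state == "Z":
            counts[command] += 1
    return counts


if __name__ == "__main__":
    for command, n in count_zombies().most_common():
        print(f"{n:6d}  {command}")
```

On an affected compute node, the output would be expected to be dominated by a single `ovs-ofctl` line.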

"docker stop skydive_agent" will result in all the zombie processes being reaped, fixing up the machine (assuming you got there before the PID space is all used up!).

It sounds a lot like this problem: https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/
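For context, the reaping duty that the linked article describes can be sketched as a tiny init-style wrapper: start the real workload as a child, then keep calling `waitpid` so that every exited child, including orphans re-parented to PID 1, is collected. This is only an illustration under stated assumptions (POSIX, Python 3.9+ for `os.waitstatus_to_exitcode`); `run_as_init` is a hypothetical name, signal forwarding is omitted, and it is not how kolla or skydive addressed this bug.

```python
import os
import sys


def run_as_init(argv):
    """Run argv as the main child and reap every child until it exits.

    Returns the main child's exit code.
    """
    # Start the workload without waiting for it.
    main_pid = os.spawnvp(os.P_NOWAIT, argv[0], argv)
    while True:
        # Block until ANY child exits. When this process is PID 1,
        # that includes orphans the kernel has re-parented to it,
        # so they never linger as zombies.
        pid, status = os.waitpid(-1, 0)
        if pid == main_pid:
            return os.waitstatus_to_exitcode(status)
        # Any other PID was an adopted or secondary child: now reaped.


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(run_as_init(sys.argv[1:]))
```

A process that spawns helpers such as `ovs-ofctl` and never waits on them leaves each one as a zombie until its parent exits, which is consistent with `docker stop skydive_agent` clearing them.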

kolla-ansible 7.0.0
CentOS 7.5.1804 GenericCloud for the hardware
CentOS 7.5.1804 for the containers
kolla-build run locally, for centos binary -> 7.0.0 (I have made no changes to the skydive or openvswitch related containers)

Changed in kolla-ansible:
milestone: none → rocky-3
assignee: nobody → Michal Nasiadka (mnasiadka)
importance: Undecided → Medium
milestone: rocky-3 → none
status: New → Confirmed
affects: kolla-ansible → kolla
Dai Dang Van (daikk115) wrote :

Do NOT use skydive at this time unless you rebuild the skydive image from source to pick up the fix from the upstream skydive project [1].

We hit this problem when we deployed kolla-ansible with master branch images in a testbed environment: all compute nodes went down after about 1 to 2 hours, with more than 4k zombie processes.

[1] https://github.com/skydive-project/skydive/issues/1541

Michal Nasiadka (mnasiadka) wrote :

Source rebuild will NOT help.
The fix is not yet part of a release. I'm tracking this and will bump the version in kolla once it is included.

Changed in kolla:
status: Confirmed → In Progress
Michal Nasiadka (mnasiadka) wrote :

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla (master)

Fix proposed to branch: master
Review: https://review.openstack.org/645522

OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla (master)

Reviewed: https://review.openstack.org/645522
Committed: https://git.openstack.org/cgit/openstack/kolla/commit/?id=1ff012d2c5a0fd37ff7c9ac37bbbf9233ccbd86b
Submitter: Zuul
Branch: master

commit 1ff012d2c5a0fd37ff7c9ac37bbbf9233ccbd86b
Author: Michal Nasiadka <email address hidden>
Date: Fri Mar 22 10:52:57 2019 +0100

    Bump skydive version to 0.22.0

    Change-Id: Ic88ab9fabcfd0287d41e13606aba1290f6f7a011
    Closes-Bug: #1806167

Changed in kolla:
status: In Progress → Fix Released
Mark Goddard (mgoddard)
Changed in kolla:
milestone: none → 8.0.0
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla 8.0.0.0rc1

This issue was fixed in the openstack/kolla 8.0.0.0rc1 release candidate.
