Swift memory usage grows until it is killed

Bug #1595916 reported by Derek Higgins on 2016-06-24
This bug affects 2 people
Affects: tripleo (High, assigned to Derek Higgins), Liberty (High, Derek Higgins), Mitaka (High, Derek Higgins)

Bug Description

On RH2 we've deployed TripleO Mitaka on CentOS 7.

Everything was going well until we registered the cloud with nodepool; soon after, swift-proxy got killed by the OOM killer:
 Jun 24 01:06:56 overcloud-controller-0 kernel: Killed process 28870 (swift-proxy-ser) total-vm:80969136kB, anon-rss:80715676kB, file-rss:3004kB

It was using about 80 GB of RAM.

I restarted swift-proxy and after 30 minutes it was back up to 63 GB.

Nova instances are also failing to boot (the image sizes are nearly 6 GB, so they are chunked in Swift), as Glance throws errors while trying to get the image from Swift:

  File "/usr/lib/python2.7/site-packages/glance_store/_drivers/swift/store.py", line 456, in _get_object
    resp_chunk_size=self.CHUNKSIZE, headers=headers)
  File "/usr/lib/python2.7/site-packages/swiftclient/client.py", line 1666, in get_object
    headers=headers)
  File "/usr/lib/python2.7/site-packages/swiftclient/client.py", line 1565, in _retry
    service_token=self.service_token, **kwargs)
  File "/usr/lib/python2.7/site-packages/swiftclient/client.py", line 1100, in get_object
    conn.request(method, path, '', headers)
  File "/usr/lib/python2.7/site-packages/swiftclient/client.py", line 401, in request
    files=files, **self.requests_args)
  File "/usr/lib/python2.7/site-packages/swiftclient/client.py", line 384, in _request
    return self.request_session.request(*arg, **kwarg)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 426, in send
    raise ConnectionError(err, request=request)
ConnectionError: ('Connection aborted.', BadStatusLine("''",))

Dan Prince (dan-prince) wrote :

We could switch to using a Glance file backend instead?

Derek Higgins (derekh) wrote :

@dprince, yeah, I was thinking the same. I spent a few hours trying to reproduce the problem in a virt env but couldn't; a 6 GB image could be created and downloaded fine. So I've switched RH2 to use the file backend for now.

Changed in tripleo:
importance: Undecided → High
Ben Nemec (bnemec) wrote :

Note that tripleo-ci appears to be hitting a similar problem: http://logs.openstack.org/79/329079/4/check-tripleo/gate-tripleo-ci-centos-7-nonha/0ec8e2d/logs/undercloud/var/log/glance/api.txt.gz#_2016-06-27_13_38_16_395

So I don't think this is environment-specific. It looks like there is an actual bug here, although it may not always reproduce.

tags: added: alert
Changed in tripleo:
assignee: nobody → Emilien Macchi (emilienm)
clayg (clay-gerrard) wrote :

Would it be possible to attach the swift-proxy-server's configured pipeline for this context to this issue?

It's normally in the file /etc/swift/proxy-server.conf

There's no reason for the proxy server to consume gigabytes of RAM - I suspect it's something in the application's configured middleware.
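For readers unfamiliar with the file mentioned above, the pipeline is declared in an ini-style section of proxy-server.conf. A minimal, hypothetical sketch follows (the actual middleware list is deployment-specific, as seen later in this report):

```ini
[DEFAULT]
bind_port = 8080
workers = 4

[pipeline:main]
# Order matters: each request passes left-to-right through the middleware
# before reaching the proxy-server app at the end of the pipeline.
pipeline = catch_errors proxy-logging cache proxy-logging proxy-server

[filter:proxy-logging]
use = egg:swift#proxy_logging

[app:proxy-server]
use = egg:swift#proxy
```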

clayg (clay-gerrard) wrote :

well, that pipeline and proxy configuration look pretty reasonable to me.

You could reduce the workers to 1, and the same for the account/container/object services - but that's still no excuse for any individual worker bloating beyond a few megabytes... I monitor RSS by process group on some production Swift deployments, and all of the dozens of proxy workers combined hang around ~250 MB, which is a lot... but a far cry from 60-80 GiB.

Also, in all observed cases the object-server workers are much larger than the proxy-server workers...

I don't really have any other immediate hypothesis. I can't think of what else might be worth capturing on this system - netstat? lsof? swift-ring-builder output?

"I restarted swift-proxy and after 30 minutes its back using 63GB"

^ maybe it'd be possible to get on a sandbox environment with this configuration and observe that directly? Not sure what I'd dig around for exactly - short of instrumenting with heapy ;)
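The per-process-group RSS monitoring described above can be sketched with a small Python helper that reads /proc directly (a hypothetical illustration; it assumes Linux and matches processes by substring of their cmdline):

```python
import os
import re


def rss_kb(pid):
    """Return the resident set size (VmRSS) of a process in kB, or 0 if unreadable."""
    try:
        with open("/proc/%d/status" % pid) as f:
            for line in f:
                m = re.match(r"VmRSS:\s+(\d+)\s+kB", line)
                if m:
                    return int(m.group(1))
    except IOError:
        pass
    return 0


def total_rss_kb(name):
    """Sum VmRSS over all processes whose cmdline contains `name`."""
    total = 0
    for pid in (p for p in os.listdir("/proc") if p.isdigit()):
        try:
            with open("/proc/%s/cmdline" % pid) as f:
                cmdline = f.read()
        except IOError:
            continue
        if name in cmdline:
            total += rss_kb(int(pid))
    return total


if __name__ == "__main__":
    print("swift-proxy RSS: %d kB" % total_rss_kb("swift-proxy"))
```

Run periodically (e.g. from cron), this would have made the worker's climb from megabytes to tens of gigabytes visible well before the OOM killer fired.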

Reviewed: https://review.openstack.org/334555
Committed: https://git.openstack.org/cgit/openstack/instack-undercloud/commit/?id=b8c5ac736733e28315364a0c9e70465b6f41166d
Submitter: Jenkins
Branch: master

commit b8c5ac736733e28315364a0c9e70465b6f41166d
Author: Emilien Macchi <email address hidden>
Date: Mon Jun 27 12:18:37 2016 -0400

    glance: disable swift backend

    - Disable Swift backend in the list of stores available in Glance.
    - Set file backend by default.
    - include glance::backend::file

    We're having a lot of OOM in swift-proxy-server at this time, this patch
    aims to temporarily disable Swift backend for Glance.

    We'll reconsider enable it again in the future, but due to limited CI
    resources, let's disable it now.

    Related-Bug: #1595916
    Change-Id: I5e2feff7e5dc900849c9535f2b7ac05d3c8f93e1

Derek Higgins (derekh) wrote :

After redeploying several times to try to reproduce this, I've managed to narrow it down to the way we have "proxy-logging" configured in the proxy pipeline.

We have,
pipeline = catch_errors healthcheck cache ratelimit bulk tempurl formpost authtoken keystone staticweb proxy-logging proxy-server

With this ^^ if I try to swift download a certain image, the memory usage of swift-proxy just increases until it gets killed, every time.

Doing what is shown in several online examples and adding it twice, like this
pipeline = catch_errors healthcheck proxy-logging cache ratelimit bulk tempurl formpost authtoken keystone staticweb proxy-logging proxy-server

removes the problem, and the swift-proxy memory usage stays stable.

The problem also appears to be related to the image itself, as I can reliably reproduce this with one particular image and not others. Perhaps we started hitting this recently because the image went over a certain size...

I'll see if I can narrow this down a little further and then submit a patch to puppet-swift.

Fix proposed to branch: master
Review: https://review.openstack.org/336651

Changed in tripleo:
assignee: Emilien Macchi (emilienm) → Derek Higgins (derekh)
status: New → In Progress
Steven Hardy (shardy) on 2016-07-12
Changed in tripleo:
milestone: none → newton-3

Reviewed: https://review.openstack.org/336651
Committed: https://git.openstack.org/cgit/openstack/instack-undercloud/commit/?id=0b42b6df23f64a7b13d7f809ecd4c1642640e3a7
Submitter: Jenkins
Branch: master

commit 0b42b6df23f64a7b13d7f809ecd4c1642640e3a7
Author: Derek Higgins <email address hidden>
Date: Fri Jul 1 17:19:51 2016 +0100

    Add a second proxy-logging middleware entry to swift-proxy

    Its absence results in swift using up all the memory available to it when
    certain objects are requested, we are not sure exactly what triggers the
    problem but we know this fixes it.

    Change-Id: Ie4eeaaa83c4e0a181559af639fc13e7fc4939480
    Closes-Bug: #1595916

Changed in tripleo:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/340389
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=286b8e6f06e29450da483729acb61e0a470abc5c
Submitter: Jenkins
Branch: master

commit 286b8e6f06e29450da483729acb61e0a470abc5c
Author: Derek Higgins <email address hidden>
Date: Mon Jul 11 15:11:01 2016 +0100

    Add a second proxy-logging middleware entry to swift-proxy

    Its absence results in swift using up all the memory available to it when
    certain objects are requested, we are not sure exactly what triggers the
    problem but we know this fixes it.

    Change-Id: Iaf00a8a2a947e0683cc60fef2e75fd7c444d07a8
    Closes-Bug: #1595916

tags: added: liberty-backport-potential mitaka-backport-potential

Reviewed: https://review.openstack.org/341314
Committed: https://git.openstack.org/cgit/openstack/instack-undercloud/commit/?id=893d20dc11473d5ed72d968616210d69f9c1bd58
Submitter: Jenkins
Branch: stable/mitaka

commit 893d20dc11473d5ed72d968616210d69f9c1bd58
Author: Derek Higgins <email address hidden>
Date: Fri Jul 1 17:19:51 2016 +0100

    Add a second proxy-logging middleware entry to swift-proxy

    Its absence results in swift using up all the memory available to it when
    certain objects are requested, we are not sure exactly what triggers the
    problem but we know this fixes it.

    Change-Id: Ie4eeaaa83c4e0a181559af639fc13e7fc4939480
    Closes-Bug: #1595916
    (cherry picked from commit 0b42b6df23f64a7b13d7f809ecd4c1642640e3a7)

Reviewed: https://review.openstack.org/341315
Committed: https://git.openstack.org/cgit/openstack/instack-undercloud/commit/?id=19768bbd77cc4473122b14b33287dbffb8e4b1a2
Submitter: Jenkins
Branch: stable/liberty

commit 19768bbd77cc4473122b14b33287dbffb8e4b1a2
Author: Derek Higgins <email address hidden>
Date: Fri Jul 1 17:19:51 2016 +0100

    Add a second proxy-logging middleware entry to swift-proxy

    Its absence results in swift using up all the memory available to it when
    certain objects are requested, we are not sure exactly what triggers the
    problem but we know this fixes it.

    Change-Id: Ie4eeaaa83c4e0a181559af639fc13e7fc4939480
    Closes-Bug: #1595916
    (cherry picked from commit 0b42b6df23f64a7b13d7f809ecd4c1642640e3a7)

Reviewed: https://review.openstack.org/341321
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=721d8fd710cbcb9da0aeadfb1674a4ba74dbba38
Submitter: Jenkins
Branch: stable/mitaka

commit 721d8fd710cbcb9da0aeadfb1674a4ba74dbba38
Author: Derek Higgins <email address hidden>
Date: Mon Jul 11 15:11:01 2016 +0100

    Add a second proxy-logging middleware entry to swift-proxy

    Its absence results in swift using up all the memory available to it when
    certain objects are requested, we are not sure exactly what triggers the
    problem but we know this fixes it.

    Change-Id: Iaf00a8a2a947e0683cc60fef2e75fd7c444d07a8
    Closes-Bug: #1595916
    (cherry picked from commit 286b8e6f06e29450da483729acb61e0a470abc5c)

Reviewed: https://review.openstack.org/341322
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=56bf5b6cd25073317f75d915c1e493ad002388cc
Submitter: Jenkins
Branch: stable/liberty

commit 56bf5b6cd25073317f75d915c1e493ad002388cc
Author: Derek Higgins <email address hidden>
Date: Mon Jul 11 15:11:01 2016 +0100

    Add a second proxy-logging middleware entry to swift-proxy

    Its absence results in swift using up all the memory available to it when
    certain objects are requested, we are not sure exactly what triggers the
    problem but we know this fixes it.

    Change-Id: Iaf00a8a2a947e0683cc60fef2e75fd7c444d07a8
    Closes-Bug: #1595916
    (cherry picked from commit 286b8e6f06e29450da483729acb61e0a470abc5c)

Reviewed: https://review.openstack.org/336652
Committed: https://git.openstack.org/cgit/openstack/instack-undercloud/commit/?id=b4fa8fb2b6a0b8496d286583e4074b437855d531
Submitter: Jenkins
Branch: master

commit b4fa8fb2b6a0b8496d286583e4074b437855d531
Author: Derek Higgins <email address hidden>
Date: Fri Jul 1 17:25:25 2016 +0100

    Revert "glance: disable swift backend"

    This reverts commit b8c5ac736733e28315364a0c9e70465b6f41166d.
    Now that the swift problem is fixed we can re-enable it.

    Related-Bug: 1595916
    Closes-Bug: 1610935
    Change-Id: I3fe9eb54b4efec99b4e58a67b4ff2d531011ee90

Change abandoned by Christian Schwede (<email address hidden>) on branch: master
Review: https://review.openstack.org/344754
Reason: Abandoning - seems to be unneeded.

This issue was fixed in the openstack/instack-undercloud 4.2.0 release.

This issue was fixed in the openstack/tripleo-heat-templates 2.1.0 release.

This issue was fixed in the openstack/instack-undercloud 5.0.0.0b3 development milestone.

This issue was fixed in the openstack/tripleo-heat-templates 5.0.0.0b3 development milestone.


