Glance creating heavy CPU load on standby cluster

Bug #1463522 reported by Aleksandr Shaposhnikov
Affects             Status        Importance  Assigned to
Mirantis OpenStack  Invalid       High        Alexander Tivelkov
  6.0.x             Fix Released  High        Denis Meltsaykin
  6.1.x             Fix Released  High        Denis Meltsaykin
  7.0.x             Invalid       High        Alexander Tivelkov

Bug Description

Steps to reproduce:

1. Install a MOS OpenStack HA environment with 3 controllers.
2. Run "ps aux | grep glance-api" on a controller and note the CPU consumption of these processes (see the helper sketch below).
3. Run rally tests on the environment (for a description of what rally tests are, see the "User impact" section below).
4. Repeat step #2 on the same controller.

You will notice that the CPU consumption of every glance-api process has risen slightly (by less than 1% each). There are 12 glance-api processes, so the total consumption of the glance-api service grows by up to 12%. If you run the rally test again, consumption goes up a little more. Note that 1% here means 1% of a single CPU core, not of the overall machine compute capacity (which consists of several CPUs/cores). While reproducing this bug, a single rally run created/deleted around 1300 images in Glance.
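For step 2, a small helper like the following (a hypothetical sketch, not part of the original report) makes the drift easier to track by summing %CPU across all glance-api workers:

import subprocess

def glance_api_cpu_total():
    # 'ps -C' selects processes by command name; the trailing '=' in
    # '%cpu=' suppresses the header line. Raises CalledProcessError if
    # no glance-api processes are running.
    out = subprocess.check_output(['ps', '-C', 'glance-api', '-o', '%cpu='])
    return sum(float(v) for v in out.decode().split())

print('glance-api total %CPU: {:.1f}'.format(glance_api_cpu_total()))

Run it before and after each rally pass; per the numbers above, the total should creep upward by up to roughly 12% of a single core per run.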

Conditions for reproduction:

No additional conditions are required. We do not have much data (2 reproductions so far), but we believe the issue reproduces in 100% of cases.

User impact:

Rally tests emulate multi-user usage of a cloud by concurrently performing various actions against it, for example CRUD operations on users, instances, volumes, etc.

The bug has no visible impact on the user, other than that CPU consumption grows constantly while the cloud is in use. Eventually this will become a problem once the service consumes a considerable part of the machine's compute capacity.

Workaround:

Restart the affected service; this immediately drops its CPU consumption back to almost zero.

Current plan:

We continue to investigate the issue and test possible fixes (see the comments for details). We plan to fix the issue in updates for 6.1. Dina Belova from the Scale team agreed that the issue is not a blocker for the release but must be fixed in the 6.1 updates.

-----------------------------------------------
Original description by Aleksandr Shaposhnikov:

Basically, all controller nodes have many glance processes consuming a lot of CPU resources, even though there is no active provisioning or snapshotting on the cluster.

Here is some information:

root@node-8:~# glance image-list --all-tenants
+--------------------------------------+--------+-------------+------------------+----------+--------+
| ID                                   | Name   | Disk Format | Container Format | Size     | Status |
+--------------------------------------+--------+-------------+------------------+----------+--------+
| 44bd0e36-0a4e-44d6-b5b0-f16b38abd3db | TestVM | qcow2 | bare | 14024704 | active |
+--------------------------------------+--------+-------------+------------------+----------+--------+
root@node-8:~# top

Tasks: 443 total, 4 running, 439 sleeping, 0 stopped, 0 zombie
%Cpu(s): 35.0 us, 3.6 sy, 0.0 ni, 60.4 id, 0.1 wa, 0.0 hi, 0.8 si, 0.0 st
KiB Mem: 32913976 total, 32647224 used, 266752 free, 144372 buffers
KiB Swap: 16777212 total, 1476 used, 16775736 free. 15889604 cached Mem

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 5872 glance 20 0 2963992 159628 9316 R 24.2 0.5 96:22.79 glance-api
 5866 glance 20 0 2921064 113452 9316 S 18.2 0.3 110:35.66 glance-api
 5867 glance 20 0 3576688 115964 9316 S 18.2 0.4 103:15.95 glance-api
 5868 glance 20 0 5173904 144564 9316 S 18.2 0.4 104:06.07 glance-api
 5869 glance 20 0 2985956 175728 9316 S 18.2 0.5 83:01.87 glance-api
 5870 glance 20 0 2939812 135292 9316 S 18.2 0.4 111:22.43 glance-api
 5871 glance 20 0 2954588 151700 9320 S 18.2 0.5 119:42.82 glance-api
 5873 glance 20 0 2997468 192524 9292 S 18.2 0.6 109:54.66 glance-api
 5874 glance 20 0 2981164 179176 9316 S 18.2 0.5 105:45.58 glance-api
 5876 glance 20 0 3009780 200332 9292 S 18.2 0.6 99:36.75 glance-api
 5856 cinder 20 0 282696 87076 3920 S 12.1 0.3 20:23.18 cinder-api
 5865 glance 20 0 2988316 184412 9320 S 12.1 0.6 108:24.61 glance-api
 5875 glance 20 0 2988312 183496 9316 R 12.1 0.6 93:03.88 glance-api
    4 root 20 0 0 0 0 S 6.1 0.0 1:48.71 kworker/0:0

MOS 6.1 build #521

Will attach a snapshot later, once it has finished downloading.

tags: added: scale
Revision history for this message
Alexander Nevenchannyy (anevenchannyy) wrote :

strace on glance-api shows that we get many hundreds of events like this per second:
poll([{fd=5, events=POLLIN|POLLPRI|POLLERR|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLERR|POLLHUP}], 2, 0) = 0 (Timeout)
poll([{fd=5, events=POLLIN|POLLPRI|POLLERR|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLERR|POLLHUP}], 2, 0) = 0 (Timeout)
--- was eaten by mices ---
poll([{fd=5, events=POLLIN|POLLPRI|POLLERR|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLERR|POLLHUP}], 2, 0) = 0 (Timeout)
poll([{fd=5, events=POLLIN|POLLPRI|POLLERR|POLLHUP}], 1, 0) = 0 (Timeout)
recvfrom(8, 0x7f748ef547d4, 7, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)

root@node-8:~# time strace -p 5871 -o api.txt
Process 5871 attached
^CProcess 5871 detached

real 0m3.986s
user 0m0.153s
sys 0m0.310s

root@node-8:~# cat api.txt | wc -l
8916

Changed in mos:
assignee: nobody → MOS Glance (mos-glance)
ruhe (ruhe)
Changed in mos:
milestone: none → 6.1
importance: Undecided → High
tags: added: glance
ruhe (ruhe)
Changed in mos:
status: New → Confirmed
assignee: MOS Glance (mos-glance) → Mike Fedosin (mfedosin)
Revision history for this message
Inessa Vasilevskaya (ivasilevskaya) wrote :

Ceph or Swift? The first thought is about librbd + eventlet issues.

Revision history for this message
Mike Fedosin (mfedosin) wrote :

This looks like a known issue with eventlet and Glance logging to syslog. It was fixed in python-eventlet 0.17.3, while the version currently installed on the env is 0.15.2, so it is proposed to update this package.

See more information here:
https://bugs.launchpad.net/ubuntu/+source/python-eventlet/+bug/1452312
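A quick way to confirm which eventlet version the service actually runs with (a sketch, to be executed on a controller):

import eventlet
print(eventlet.__version__)  # the logging fix referenced above landed in 0.17.3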

Changed in mos:
assignee: Mike Fedosin (mfedosin) → Inessa Vasilevskaya (ivasilevskaya)
Revision history for this message
Inessa Vasilevskaya (ivasilevskaya) wrote :
Revision history for this message
Aleksandr Shaposhnikov (alashai8) wrote :
tags: added: 6.1rc2
Revision history for this message
Inessa Vasilevskaya (ivasilevskaya) wrote :

Unfortunately, the proposed fix does not resolve the issue.

Leontiy Istomin and I investigated the problem in the following manner: we gathered ps aux info every second after the Glance Rally tests (load factor = 5) were started. After the test job finished, we analyzed the load of the idle Glance cluster.

We only managed to gather results for 2 runs throughout a working day, but the tendency was clearly observable: the load kept rising and never returned to its initial values (as it should have in the "idle" state).

The results for node41 during the second run can be found here: https://docs.google.com/spreadsheets/d/1zG5kJRGlAB6mAZCgjmzr-e70zotLbIv0VB8cceSeLOw/edit
The state of idle node41 after test run is here: http://paste.openstack.org/show/284391/
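A rough reconstruction of the sampling loop described above (hypothetical; the log path and format are placeholders):

import subprocess
import time

# Append a timestamped 'ps aux' snapshot every second; the snapshots
# can later be filtered for glance-api lines to plot the CPU drift.
while True:
    snap = subprocess.check_output(['ps', 'aux']).decode()
    with open('/var/log/ps-samples.log', 'a') as f:
        f.write('--- {} ---\n{}'.format(time.strftime('%Y-%m-%d %H:%M:%S'), snap))
    time.sleep(1)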

Mike Fedosin (mfedosin)
Changed in mos:
assignee: Inessa Vasilevskaya (ivasilevskaya) → Mike Fedosin (mfedosin)
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/glance (openstack-ci/fuel-6.1/2014.2)

Fix proposed to branch: openstack-ci/fuel-6.1/2014.2
Change author: abhishekkekane <email address hidden>
Review: https://review.fuel-infra.org/7821

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote :

Fix proposed to branch: openstack-ci/fuel-6.1/2014.2
Change author: Mike Fedosin <email address hidden>
Review: https://review.fuel-infra.org/7824

Revision history for this message
Mike Fedosin (mfedosin) wrote :

Current situation: eventlet has an option called 'keepalive', which is True by default (http://eventlet.net/doc/modules/wsgi.html). It means the server never closes connections itself and keeps waiting for client actions after each response. This is bad practice, because clients can open many connections and never close them, which leads to a situation where many green threads stay active and consume CPU forever.

The fix is simple: add a keepalive parameter to the Glance config and pass it to the wsgi server. This was done during the Kilo release and the commit was merged (https://review.openstack.org/#/c/130839/). Backporting it to Juno is also under discussion (https://review.openstack.org/#/c/162964/).

Setting 'keepalive' to False partially fixes the problem but creates other issues: performance may be reduced because of frequent reconnections, and it does not prevent attackers from abusing the cloud, since they can open connections without ever sending requests. The result may be a complete failure of the cloud or parts of it.

That's why I proposed another commit limiting the socket timeout to 900 seconds. That is plenty of time to perform the needed operations, so it won't affect performance, and it guards the system against such attackers. The same issue occurred in Keystone, and they added this timeout as well (https://review.openstack.org/#/c/177670/).

Afterwards I'm going to propose the timeout commit to upstream Glance.
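At the eventlet level the two knobs look roughly like this (a minimal sketch, not the actual Glance code; the placeholder WSGI app and port are assumptions, and socket_timeout requires a sufficiently recent eventlet):

import eventlet
from eventlet import wsgi

def app(environ, start_response):
    # Trivial WSGI app standing in for the Glance API.
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'ok\n']

sock = eventlet.listen(('0.0.0.0', 9292))
wsgi.server(
    sock, app,
    keepalive=False,     # close the client socket after each response,
                         # releasing the green thread back to the pool
    socket_timeout=900,  # drop connections idle for more than 900 seconds
)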

Revision history for this message
Mike Fedosin (mfedosin) wrote :

It seems we have a bug similar to this one: https://bugs.launchpad.net/nova/+bug/1361360. I created a fix for it (https://review.fuel-infra.org/7824) based on the fix for Keystone (https://review.openstack.org/#/c/177670/) and a cherry-pick from upstream (https://review.fuel-infra.org/7821). We are going to test it on the scale lab.

description: updated
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack-build/glance-build (openstack-ci/fuel-6.1/2014.2)

Fix proposed to branch: openstack-ci/fuel-6.1/2014.2
Change author: Ivan Berezovskiy <email address hidden>
Review: https://review.fuel-infra.org/7889

tags: added: 6.1scale
removed: 6.1rc2 scale
tags: added: scale
removed: 6.1scale
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/glance (openstack-ci/fuel-6.1/2014.2)

Fix proposed to branch: openstack-ci/fuel-6.1/2014.2
Change author: abhishekkekane <email address hidden>
Review: https://review.fuel-infra.org/7891

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote :

Fix proposed to branch: openstack-ci/fuel-6.1/2014.2
Change author: Mike Fedosin <email address hidden>
Review: https://review.fuel-infra.org/7892

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/glance (openstack-ci/fuel-6.1/2014.2)

Change abandoned by Mike Fedosin <email address hidden> on branch: openstack-ci/fuel-6.1/2014.2
Review: https://review.fuel-infra.org/7892

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote :

Change abandoned by Mike Fedosin <email address hidden> on branch: openstack-ci/fuel-6.1/2014.2
Review: https://review.fuel-infra.org/7821

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote :

Change abandoned by Mike Fedosin <email address hidden> on branch: openstack-ci/fuel-6.1/2014.2
Review: https://review.fuel-infra.org/7824

description: updated
description: updated
description: updated
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/glance (openstack-ci/fuel-6.1/2014.2)

Fix proposed to branch: openstack-ci/fuel-6.1/2014.2
Change author: abhishekkekane <email address hidden>
Review: https://review.fuel-infra.org/7902

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote :

Fix proposed to branch: openstack-ci/fuel-6.1/2014.2
Change author: Mike Fedosin <email address hidden>
Review: https://review.fuel-infra.org/7903

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/glance (openstack-ci/fuel-6.1/2014.2)

Change abandoned by Mike Fedosin <email address hidden> on branch: openstack-ci/fuel-6.1/2014.2
Review: https://review.fuel-infra.org/7891

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack-build/glance-build (openstack-ci/fuel-6.1/2014.2)

Change abandoned by Ivan Berezovskiy <email address hidden> on branch: openstack-ci/fuel-6.1/2014.2
Review: https://review.fuel-infra.org/7889

description: updated
Revision history for this message
Dan Hata (dhata) wrote :

Requested by Eugene Bogdanov:

Clear steps to reproduce, and expected vs. actual result:

1. Install a MOS OpenStack HA environment with 3 controllers.
2. Run "ps aux | grep glance-api" on a controller and note the CPU consumption of these processes.
3. Run rally tests on the environment (for a description of what rally tests are, see the "User impact" section below).
4. Repeat step #2 on the same controller.

You will notice that the CPU consumption of every glance-api process has risen slightly (by less than 1% each). There are 12 glance-api processes, so the total consumption of the glance-api service grows by up to 12%. If you run the rally test again, consumption goes up a little more. Note that 1% here means 1% of a single CPU core, not of the overall machine compute capacity (which consists of several CPUs/cores). While reproducing this bug, a single rally run created/deleted around 1300 images in Glance.

Rough estimate of the probability of the user facing the issue:

We do not have much data (2 reproductions so far), but we believe the issue reproduces in 100% of cases.

What is the real user-facing impact/severity, and is there a workaround available?

User impact:

Rally tests emulate multi-user usage of a cloud by concurrently performing various actions against it, for example CRUD operations on users, instances, volumes, etc.

The bug has no visible impact on the user, other than that CPU consumption grows constantly while the cloud is in use. Eventually this will become a problem once the service consumes a considerable part of the machine's compute capacity.

Workaround:

Restart the affected service; this immediately drops its CPU consumption back to almost zero.

Can we deliver the fix later and apply it easily on a running env?
Yes.
It seems we have a bug similar to this one: https://bugs.launchpad.net/nova/+bug/1361360. I created a fix for it (https://review.fuel-infra.org/7824) based on the fix for Keystone (https://review.openstack.org/#/c/177670/) and a cherry-pick from upstream (https://review.fuel-infra.org/7821). We are going to test it on the scale lab.

Changed in mos:
milestone: 6.1 → 6.1-updates
tags: added: 6.1-mu-1
Alexey Khivin (akhivin)
Changed in mos:
status: Confirmed → Fix Committed
Changed in mos:
milestone: 6.1-updates → 6.1-mu-1
Changed in mos:
status: Fix Committed → In Progress
Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

The test scenario will not be included in the patching erratum, as it will be covered by a functional test included in Glance.

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to patching-tests (stable/6.1)

Fix proposed to branch: stable/6.1
Change author: Denis V. Meltsaykin <email address hidden>
Review: https://review.fuel-infra.org/9066

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

The fix for this issue needs to be validated on the scale lab; it is not possible to meet the 07/08 deadline, so retargeting to 6.1-updates, to be included in 6.1-mu-2.

Changed in mos:
milestone: 6.1-mu-1 → 6.1-updates
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/glance (openstack-ci/fuel-6.1/2014.2)

Fix proposed to branch: openstack-ci/fuel-6.1/2014.2
Change author: Mike Fedosin <email address hidden>
Review: https://review.fuel-infra.org/9083

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/glance (openstack-ci/fuel-6.1/2014.2)

Change abandoned by Mike Fedosin <email address hidden> on branch: openstack-ci/fuel-6.1/2014.2
Review: https://review.fuel-infra.org/9083

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/glance (openstack-ci/fuel-6.0-updates/2014.2)

Fix proposed to branch: openstack-ci/fuel-6.0-updates/2014.2
Change author: abhishekkekane <email address hidden>
Review: https://review.fuel-infra.org/9980

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/glance (openstack-ci/fuel-6.0-updates/2014.2)

Reviewed: https://review.fuel-infra.org/9980
Submitter: mos-infra-ci <>
Branch: openstack-ci/fuel-6.0-updates/2014.2

Commit: 76aa3af5f8745fb2895d759d5862bed8c36ce3ee
Author: abhishekkekane <email address hidden>
Date: Thu Jul 30 12:52:54 2015

Eventlet green threads not released back to pool

Presently, the wsgi server allows persistent connections. Hence even after
the response is sent to the client, it doesn't close the client socket
connection. Because of this problem, the green thread is not released
back to the pool.

In order to close the client socket connection explicitly after the
response is sent and read successfully by the client, you simply have to
set keepalive to False when you create a wsgi server.

DocImpact:
Added http_keepalive option (default=True).

Conflicts:
        doc/source/configuring.rst
        etc/glance-api.conf
        glance/common/wsgi.py
        glance/tests/unit/test_opts.py

SecurityImpact

Partial-Bug: #1463522
(cherry picked from commit 16a821e00d15520d2f6e940e184bd289b8782620)

Change-Id: Ib35d9ada36e491e5dbca3691c1ebef6464d8e39e

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/glance (openstack-ci/fuel-6.0-updates/2014.2)

Fix proposed to branch: openstack-ci/fuel-6.0-updates/2014.2
Change author: Mike Fedosin <email address hidden>
Review: https://review.fuel-infra.org/10120

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/glance (openstack-ci/fuel-6.1/2014.2)

Fix proposed to branch: openstack-ci/fuel-6.1/2014.2
Change author: Denis V. Meltsaykin <email address hidden>
Review: https://review.fuel-infra.org/10123

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/glance (openstack-ci/fuel-6.0-updates/2014.2)

Reviewed: https://review.fuel-infra.org/10120
Submitter: mos-infra-ci <>
Branch: openstack-ci/fuel-6.0-updates/2014.2

Commit: c756ffb1cbfc1129e2b73af8f3c390ce38e67e92
Author: Mike Fedosin <email address hidden>
Date: Wed Aug 5 16:08:20 2015

Add client_socket_timeout parameter

Add a parameter to take advantage of the new(ish) eventlet
socket timeout behaviour. Allows closing idle client
connections after a period of time.

Setting 'client_socket_timeout = 0' means no timeout.

DocImpact:
Added client_socket_timeout option (default=900)

SecurityImpact

Change-Id: I071b11e79d20cdb0426a5c72593b5f46bc09b39c
Closes-Bug: #1463522
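For operators, the resulting glance-api.conf settings would look roughly like this (a sketch; option names and defaults are taken from the commit messages in this thread):

[DEFAULT]
# Keep client connections open between requests (default True). Setting
# this to False closes the socket after each response and releases the
# green thread back to the pool.
http_keepalive = True

# Close client connections that stay idle longer than this many seconds;
# 0 disables the timeout (default 900).
client_socket_timeout = 900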

tags: added: 6.0-mu-5 done release-notes
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/glance (openstack-ci/fuel-6.1/2014.2)

Reviewed: https://review.fuel-infra.org/10123
Submitter: Vitaly Sedelnik <email address hidden>
Branch: openstack-ci/fuel-6.1/2014.2

Commit: 293b2370a88cbac526f1dcbbac7e1b3fb8408826
Author: Denis V. Meltsaykin <email address hidden>
Date: Wed Aug 5 16:20:19 2015

Eventlet green threads not released back to pool

Presently, the wsgi server allows persistent connections. Hence even after
the response is sent to the client, it doesn't close the client socket
connection. Because of this problem, the green thread is not released
back to the pool.

In order to close the client socket connection explicitly after the
response is sent and read successfully by the client, you simply have to
set keepalive to False when you create a wsgi server.

Add a parameter to take advantage of the new(ish) eventlet
socket timeout behaviour. Allows closing idle client
connections after a period of time.

DocImpact:
Added http_keepalive option (default=True).
Added client_socket_timeout option (default=900)

Conflicts:
        doc/source/configuring.rst
        etc/glance-api.conf
        glance/common/wsgi.py
        glance/tests/unit/test_opts.py

SecurityImpact

(cherry-picked from 16a821e00d15520d2f6e940e184bd289b8782620)
(cherry-picked from bdb66a569c5943f820e0d902990ec5dc64bf0713)

Change-Id: I509d30debe9cd0036f7f27582d6531d67ccbd3b0
Closes-Bug: #1463522

Revision history for this message
Vadim Rovachev (vrovachev) wrote :

Verified on 6.0. Fix works.

Revision history for this message
Mike Fedosin (mfedosin) wrote :

The fix was backported to upstream stable/kilo (https://review.openstack.org/#/q/I9e7edcbf25ece61dc16b8cd5a8bef5ed9a14e3d6,n,z) and carried into MOS 7.0, so I am changing the status of this bug to Invalid for 7.0.

Revision history for this message
Vadim Rovachev (vrovachev) wrote :

Verified on 6.1.
Used packages:
{glance-common,glance-api,glance-registry,python-glance}=2014.2.2-1~u14.04+mos9
In mirror:
http://mirror.fuel-infra.org/mos/snapshots/ubuntu-latest/ mos6.1-proposed/main amd64 Packages

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/glance (openstack-ci/fuel-6.1/2014.2)

Change abandoned by Mike Fedosin <email address hidden> on branch: openstack-ci/fuel-6.1/2014.2
Review: https://review.fuel-infra.org/7902

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on patching-tests (stable/6.1)

Change abandoned by Denis V. Meltsaykin <email address hidden> on branch: stable/6.1
Review: https://review.fuel-infra.org/9066
Reason: No errata anymore!

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/glance (openstack-ci/fuel-6.1/2014.2)

Change abandoned by Denis V. Meltsaykin <email address hidden> on branch: openstack-ci/fuel-6.1/2014.2
Review: https://review.fuel-infra.org/7903
Reason: Already merged at https://review.fuel-infra.org/#/c/10123/

Roman Rufanov (rrufanov)
tags: added: customer-found support