Glance took more than 7 minutes to delete 4 images

Bug #1578172 reported by Sergey Arkhipov
This bug affects 1 person
Affects: Mirantis OpenStack
Status: Invalid
Importance: High
Assigned to: MOS Ceph
Milestone: (none)

Bug Description

Detailed bug description:
It took almost 8 minutes for Glance to delete 4 images. The cluster had background load at the time.

The image IDs were:
46068590-44d2-4d7b-bfae-ae842f073938
97117fce-8eb9-41ea-ad59-af681b6f5b36
ce47bec7-3f83-4a8f-89f2-06ac1052e69c
38dca2b2-27ce-47cc-8c4e-b7167090dbd7

Steps to reproduce:
1. Deploy MOS 9.0, build 227
2. Run a Rally workload for 4-6 days and make several runs of Rally tests.
3. Try to remove the images left by the Rally test runs and workloads

Expected results:
Images have to be deleted pretty quickly.

Actual result:
It took almost 8 minutes to remove those four images.

Reproducibility:
Didn't try

Workaround:
No known workaround.

Impact:
N/A

* 10 baremetal nodes:
   - CPU: 12 x 2.10 GHz
   - Disks: 2 drives (SSD - 80 GB, HDD - 931.5 GB), 1006.0 GB total
   - Memory: 2 x 16.0 GB, 32.0 GB total
   - NUMA topology: 1 NUMA node
* Node roles:
  - 1 ElasticSearch / Kibana node
  - 1 InfluxDB / Grafana node
  - 3 controllers (1 is offline because of disk problems)
  - 5 computes
* Details:
  - OS: Mitaka on Ubuntu 14.04
  - Compute: KVM
  - Neutron with VLAN segmentation
  - Ceph RBD for volumes (Cinder)
  - Ceph RadosGW for objects (Swift API)
  - Ceph RBD for ephemeral volumes (Nova)
  - Ceph RBD for images (Glance)
* MOS 9.0, build 227

Additional information:
Diagnostic snapshot: http://mos-scale-share.mirantis.com/env14/fuel-snapshot-2016-05-04_08-12-02.tar.xz
Deletion started at 1462348656 and finished at 1462349130 (Unix timestamps), i.e. it took 474 seconds (almost 8 minutes)

Revision history for this message
Kairat Kushaev (kkushaev) wrote :

Here are the results of "top" on both controllers where the issue was reproduced:
https://paste.mirantis.net/show/2229/
https://paste.mirantis.net/show/2230/
It looks very suspicious that some processes periodically take almost 100% CPU on both controllers. This seems to be the reason why glance-api is so slow.

Revision history for this message
Kairat Kushaev (kkushaev) wrote :

Waiting for confirmation from the oslo.messaging and LMA teams that it is not normal to have the CPU 100% busy.

Revision history for this message
Sergey Arkhipov (sarkhipov) wrote :

It is OK. Some processes may use 100% of a single CPU, but each controller has 12 CPUs, so this is definitely not the cause.

tags: added: area-glance
Changed in mos:
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Sergey Arkhipov (sarkhipov) wrote :

Kairat took the environment for troubleshooting until May 5, 9:00 UTC.

Revision history for this message
Kairat Kushaev (kkushaev) wrote :

I inserted debug statements, and it seems that the trouble is in rbd: a lot of time is spent executing functions from the rbd library.
I inserted the following debug statements:
https://paste.mirantis.net/show/2237/
I also added debug statements to the Glance code, and there are no long-running operations in Glance itself.
The resulting logs are here:
https://paste.mirantis.net/show/2238/

rbd.Image(ioctx, image_name).__enter__ took almost 30 sec
image.unprotect_snap(snapshot_name) took 67 sec
image.remove_snap(snapshot_name) took 16 sec
rbd.Image(ioctx, image_name).__exit__ took almost 30 sec again
rbd.RBD().remove(ioctx, image_name) took 1 min 23 sec.

Unfortunately, these functions come from the rbd library, which is not part of Glance, so I would recommend that the storage folks look into it.
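
For reference, a minimal sketch of this kind of timing instrumentation around the rbd calls (this is not the actual patch from the pastes above; the pool name 'images', the image ID and the snapshot name 'snap' are assumptions):

import time
import rados
import rbd

def timed(label, fn, *args, **kwargs):
    # Run fn and report how long it took.
    start = time.time()
    result = fn(*args, **kwargs)
    print('%s took %.1f sec' % (label, time.time() - start))
    return result

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('images')              # Glance pool name: an assumption
    image = timed('rbd.Image() open', rbd.Image, ioctx, 'IMAGE_ID')
    try:
        timed('unprotect_snap', image.unprotect_snap, 'snap')
        timed('remove_snap', image.remove_snap, 'snap')
    finally:
        timed('image.close', image.close)
    timed('RBD().remove', rbd.RBD().remove, ioctx, 'IMAGE_ID')
    ioctx.close()
finally:
    cluster.shutdown()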

Revision history for this message
Kairat Kushaev (kkushaev) wrote :

Ceph folks,
after executing the Rally tests, requests to rbd through the Python library are quite slow on this node.
Could you please help with the problem?

Changed in mos:
assignee: MOS Glance (mos-glance) → MOS Ceph (mos-ceph)
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> 1. Deploy MOS 9.0, build 227
> 2. Run a Rally workload for 4-6 days and make several runs of Rally tests.
> 3. Try to remove the images left by the Rally test runs and workloads

I don't think such bug reports are of much use, since a typical developer has the resources neither to reproduce the problem nor to test a possible fix.

Please explain how to reproduce the problem in a more systematic way, for instance:

"1. Deploy ceph cluster with X OSD nodes (Y OSDs per node, FOOBAR 1 TB HDD, BLAH 80 GB SSD as a journal, 12 cores, 16 GB RAM, 10 Gb NIC) and Z monitors
2. Create a replicated pool with # of copies == 3 and PG number N
3. Create an rbd image, M GB size (specify the exact command)
4. Explain how the image has been used ("randomly write L GB, snapshot, write more XX GB, rollback, ...", etc)
5. Remove the image"
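
For illustration only (hypothetical pool and image names; the size and write pattern are placeholders), steps 3 to 5 of such a template could be scripted against the rbd Python bindings roughly as follows; the removal step exercises the same calls that were timed earlier in this thread:

import rados
import rbd

GiB = 1024 ** 3

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('images')                  # pool name is an assumption

rbd.RBD().create(ioctx, 'test-image', 20 * GiB)       # 3. create an M (= 20) GB image
with rbd.Image(ioctx, 'test-image') as image:
    image.write(b'\0' * (4 * 1024 * 1024), 0)         # 4. write some data ...
    image.create_snap('snap')                         #    ... then snapshot and protect it
    image.protect_snap('snap')

with rbd.Image(ioctx, 'test-image') as image:         # 5. remove the image (snapshot first)
    image.unprotect_snap('snap')
    image.remove_snap('snap')
rbd.RBD().remove(ioctx, 'test-image')

ioctx.close()
cluster.shutdown()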

> Diagnostic snapshot: http://mos-scale-share.mirantis.com/env14/fuel-snapshot-2016-05-04_08-12-02.tar.xz

A ~5 GB compressed tarball is not very useful either. Please use common sense and extract the relevant data, such as the cluster and image parameters, i.e.:

1. What are the journal and the data devices?
2. The output of ceph -s
3. The output of ceph health detail
4. The output of ceph mon_status
5. What is the image size? How many snapshots are there? (rbd info -p $POOL $IMG_ID)
6. What are the parameters of the backing pool (in particular the number of the placement groups)?
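
If it helps, items 5 and 6 can also be gathered through the Python bindings; a sketch (the pool name 'images' and the image ID are assumptions):

import json
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('images')                   # assumed Glance pool

with rbd.Image(ioctx, 'IMAGE_ID') as image:            # 5. image size and snapshots
    print('size (GB): %.1f' % (image.size() / float(1024 ** 3)))
    print('snapshots:', [snap['name'] for snap in image.list_snaps()])

cmd = json.dumps({'prefix': 'osd pool get', 'pool': 'images',
                  'var': 'pg_num', 'format': 'json'})
ret, out, errs = cluster.mon_command(cmd, b'')         # 6. pg_num of the backing pool
print('pg_num:', json.loads(out)['pg_num'])

ioctx.close()
cluster.shutdown()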

> image.unprotect_snap(snapshot_name) took 67 sec

This number itself does not indicate a problem. What are the cluster and the image parameters (see the above list)? Is there any more "heavy" IO running concurrently (ceph or otherwise)?

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> Expected results: Images have to be deleted pretty quickly

The expectation is somewhat misplaced. Deleting a snapshot is in general quite a heavy operation: the time is proportional to the image size, because an RBD image is striped over many RADOS objects (roughly 5,000 of them for a 20 GB image with the default 4 MB object size) and each of them has to be touched. And "pretty quickly" is way too vague.

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

Marking as Incomplete. Feel free to provide the necessary data (see https://bugs.launchpad.net/mos/+bug/1578172/comments/8) and reopen the bug.

Changed in mos:
status: Confirmed → Incomplete
Revision history for this message
Sergey Arkhipov (sarkhipov) wrote :

I understand that it is hard to troubleshoot or verify such an issue, but this bug was discovered during long-haul testing, and it is rather hard to describe the problem in a systematic, deterministic way. That is why I collected a diagnostic snapshot: it should contain all the available information.

Relevant Ceph information:

root@node-59:~# ceph -s
    cluster 9f660137-ea27-4e8e-88ed-409894217173
     health HEALTH_WARN
            6 requests are blocked > 32 sec
            too many PGs per OSD (422 > max 300)
            pool compute has too few pgs
            1 mons down, quorum 0,2 node-59,node-57
     monmap e3: 3 mons at {node-57=192.168.1.10:6789/0,node-59=192.168.1.6:6789/0,node-60=192.168.1.7:6789/0}
            election epoch 46, quorum 0,2 node-59,node-57
     osdmap e2724: 5 osds: 5 up, 5 in
      pgmap v577987: 704 pgs, 11 pools, 127 GB data, 32939 objects
            390 GB used, 4261 GB / 4652 GB avail
                 704 active+clean
  client io 301 kB/s rd, 118 MB/s wr, 82 op/s

root@node-59:~# ceph health detail
HEALTH_WARN 4 requests are blocked > 32 sec; 1 osds have slow requests; too many PGs per OSD (422 > max 300); pool compute has too few pgs; 1 mons down, quorum 0,2 node-59,node-57
4 ops are blocked > 32.768 sec
4 ops are blocked > 32.768 sec on osd.0
1 osds have slow requests
too many PGs per OSD (422 > max 300)
pool compute objects per pg (492) is more than 10.6957 times cluster average (46)
mon.node-60 (rank 1) addr 192.168.1.7:6789/0 is down (out of quorum)

root@node-59:~# ceph mon_status
{"name":"node-57","rank":2,"state":"peon","election_epoch":46,"quorum":[0,2],"outside_quorum":[],"extra_probe_peers":[],"sync_provider":[],"monmap":{"epoch":3,"fsid":"9f660137-ea27-4e8e-88ed-409894217173","modified":"2016-04-25 06:47:31.716201","created":"0.000000","mons":[{"rank":0,"name":"node-59","addr":"192.168.1.6:6789\/0"},{"rank":1,"name":"node-60","addr":"192.168.1.7:6789\/0"},{"rank":2,"name":"node-57","addr":"192.168.1.10:6789\/0"}]}}

Please check your email: I've sent you the credentials to the environment. Right now it shows the same issue described in this bug: Glance is slow because rbd is slow. I need some help to figure out the real root cause.

Changed in mos:
status: Incomplete → Confirmed
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

@Sergey,

thanks for the info. However, it's not enough to debug the problem. Could you please also collect the following data (as requested in comment #8):

> 1. What are the journal and the data devices?
> 5. What is the image size? How many snapshots are there? (rbd info -p $POOL $IMG_ID)
> 6. What are the parameters of the backing pool (in particular the number of the placement groups)?

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

Marking as Incomplete, as the required info has not been provided yet (and there is no systematic way to reproduce the bug).

Changed in mos:
status: Confirmed → Incomplete
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

The cluster is badly misconfigured.

1. There should be a separate SSD-backed journal partition or logical volume, but the OSDs in this cluster use an ordinary file as the journal:

root@node-58:~# ls -la /var/lib/ceph/osd/ceph-1/journal
-rw-r--r-- 1 root root 2147483648 May 5 13:15 /var/lib/ceph/osd/ceph-1/journal

First of all, an HDD-only setup is not recommended for a production environment: roughly speaking, it halves the throughput and substantially increases the latency (4x to 10x), since the data is written twice (first to the journal, and then to the actual storage).
Secondly, using an ordinary file as a journal introduces extra overhead, since the OSD journal gets fsynced very often (which makes most journaling filesystems very slow/unhappy). Also, asynchronous IO does not work with ordinary files on Linux, which further increases the latency.

Having a dedicated journal partition on an SSD drive is a must for a production environment.
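
A minimal sketch (the OSD path comes from the listing above) of how one might check whether an OSD journal points at a dedicated block device or at an ordinary file:

import os
import stat

journal = '/var/lib/ceph/osd/ceph-1/journal'   # path from the listing above
target = os.path.realpath(journal)             # resolve the symlink, if any
mode = os.stat(target).st_mode
if stat.S_ISBLK(mode):
    print('dedicated journal device:', target)
elif stat.S_ISREG(mode):
    print('journal is an ordinary file (no dedicated partition):', target)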

2. Using OSD nodes to run VMs is an awful idea; please don't do that.

> 58 | ready | compute_1 | 11 | 10.20.0.6 | 0c:c4:7a:1d:92:ae | ceph-osd, compute | | True | 11
> 53 | ready | compute_2 | 11 | 10.20.0.4 | 0c:c4:7a:1d:92:d6 | ceph-osd, compute | | True | 11
> 51 | ready | compute_3 | 11 | 10.20.0.5 | 0c:c4:7a:1d:93:da | ceph-osd, compute | | True | 11
> 52 | ready | compute_4 | 11 | 10.20.0.10 | 0c:c4:7a:1d:93:d6 | ceph-osd, compute | | True | 11
> 54 | ready | compute_0 | 11 | 10.20.0.3 | 0c:c4:7a:1d:91:64 | ceph-osd, compute | | True | 11

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

Ceph is badly misconfigured: there are no SSD-backed journals, and the OSD nodes are used to host VMs.
Feel free to reopen the bug if it can be reproduced in a sane environment (that is, with dedicated OSD nodes with SSD-backed journals).

Changed in mos:
status: Incomplete → Invalid