Glance took more than 7 minutes to delete 4 images

Bug #1578172 reported by Sergey Arkhipov
This bug affects 1 person
Affects: Mirantis OpenStack
Status: Invalid
Importance: High
Assigned to: MOS Ceph
Milestone: (none)

Bug Description

Detailed bug description:
It took almost 8 minutes for Glance to delete 4 images. The cluster had background load at the time.

The image IDs were:
46068590-44d2-4d7b-bfae-ae842f073938
97117fce-8eb9-41ea-ad59-af681b6f5b36
ce47bec7-3f83-4a8f-89f2-06ac1052e69c
38dca2b2-27ce-47cc-8c4e-b7167090dbd7

Steps to reproduce:
1. Deploy MOS 9.0, build 227
2. Run a Rally workload for 4-6 days and make several runs of Rally tests.
3. Try to remove the images left by the Rally test runs and workloads

Expected results:
Images have to be deleted pretty quickly.

Actual result:
It took almost 8 minutes to remove those four images.

Reproducibility:
Didn't try

Workaround:
No known workaround.

Impact:
N/A

* 10 baremetal nodes:
   - CPU: 12 x 2.10 GHz
   - Disks: 2 drives (SSD - 80 GB, HDD - 931.5 GB), 1006.0 GB total
   - Memory: 2 x 16.0 GB, 32.0 GB total
   - NUMA topology: 1 NUMA node
* Node roles:
  - 1 ElasticSearch / Kibana node
  - 1 InfluxDB / Grafana node
  - 3 controllers (1 is offline because of disk problems)
  - 5 computes
* Details:
  - OS: Mitaka on Ubuntu 14.04
  - Compute: KVM
  - Neutron with VLAN segmentation
  - Ceph RBD for volumes (Cinder)
  - Ceph RadosGW for objects (Swift API)
  - Ceph RBD for ephemeral volumes (Nova)
  - Ceph RBD for images (Glance)
* MOS 9.0, build 227

Additional information:
Diagnostic snapshot: http://mos-scale-share.mirantis.com/env14/fuel-snapshot-2016-05-04_08-12-02.tar.xz
Deletion started at 1462348656 and finished at 1462349130 (Unix timestamps), i.e. it took 474 seconds (almost 8 minutes)

Revision history for this message
Kairat Kushaev (kkushaev) wrote :

Here are the results of "top" on both controllers where the issue was reproduced:
https://paste.mirantis.net/show/2229/
https://paste.mirantis.net/show/2230/
It looks very suspicious that some processes periodically take almost 100% CPU on both controllers. This seems to be the reason why glance-api is so slow.

Revision history for this message
Kairat Kushaev (kkushaev) wrote :

Waiting for confirmation from the oslo.messaging and LMA teams that it is not normal to have the CPU 100% busy.

Revision history for this message
Sergey Arkhipov (sarkhipov) wrote :

It is OK. Some processes may use 100% of a single CPU, but each controller has 12 CPUs, so this is definitely not the cause.

tags: added: area-glance
Changed in mos:
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Sergey Arkhipov (sarkhipov) wrote :

Kairat took the environment for troubleshooting until May 5, 9:00 UTC.

Revision history for this message
Kairat Kushaev (kkushaev) wrote :

I inserted debug statements, and it seems that the trouble is in rbd: a lot of time is spent executing functions from the rbd library.
I inserted the following debug statements:
https://paste.mirantis.net/show/2237/
I also added debug statements to the Glance code, and there are no long-running operations in Glance itself.
The resulting logs are here:
https://paste.mirantis.net/show/2238/

rbd.Image(ioctx, image_name).__enter__ took almost 30 sec
image.unprotect_snap(snapshot_name) took 67 sec
image.remove_snap(snapshot_name) took 16 sec
rbd.Image(ioctx, image_name).__exit__ took almost 30 sec again
rbd.RBD().remove(ioctx, image_name) took 1 min 23 sec.

Unfortunately, these functions come from the rbd library, which is not part of Glance, so I would recommend that the storage folks look into it.
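
For reference, a minimal sketch of this kind of timing instrumentation around the rbd calls (this is not the actual patch from the pastes above; the pool name 'images', the image ID and the snapshot name 'snap' are assumptions):

import time
import rados
import rbd

def timed(label, fn, *args, **kwargs):
    # Run fn and report how long it took.
    start = time.time()
    result = fn(*args, **kwargs)
    print('%s took %.1f sec' % (label, time.time() - start))
    return result

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('images')              # Glance pool name: an assumption
    image = timed('rbd.Image() open', rbd.Image, ioctx, 'IMAGE_ID')
    try:
        timed('unprotect_snap', image.unprotect_snap, 'snap')
        timed('remove_snap', image.remove_snap, 'snap')
    finally:
        timed('image.close', image.close)
    timed('RBD().remove', rbd.RBD().remove, ioctx, 'IMAGE_ID')
    ioctx.close()
finally:
    cluster.shutdown()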

Revision history for this message
Kairat Kushaev (kkushaev) wrote :

Ceph folks,
after executing the Rally tests, requests to rbd through the Python library are quite slow on this node.
Could you please help with the problem?

Changed in mos:
assignee: MOS Glance (mos-glance) → MOS Ceph (mos-ceph)
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> 1. Deploy MOS 9.0, build 227
> 2. Run a Rally workload for 4-6 days and make several runs of Rally tests.
> 3. Try to remove the images left by the Rally test runs and workloads

I don't think such bug reports are of much use, since a typical developer has the resources neither to reproduce the problem nor to test a possible fix.

Please explain how to reproduce the problem in a more systematic way, for instance:

"1. Deploy ceph cluster with X OSD nodes (Y OSDs per node, FOOBAR 1 TB HDD, BLAH 80 GB SSD as a journal, 12 cores, 16 GB RAM, 10 Gb NIC) and Z monitors
2. Create a replicated pool with # of copies == 3 and PG number N
3. Create an rbd image, M GB size (specify the exact command)
4. Explain how the image has been used ("randomly write L GB, snapshot, write more XX GB, rollback, ...", etc)
5. Remove the image"
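
For illustration only (hypothetical pool and image names; the size and write pattern are placeholders), steps 3 to 5 of such a template could be scripted against the rbd Python bindings roughly as follows; the removal step exercises the same calls that were timed earlier in this thread:

import rados
import rbd

GiB = 1024 ** 3

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('images')                  # pool name is an assumption

rbd.RBD().create(ioctx, 'test-image', 20 * GiB)       # 3. create an M (= 20) GB image
with rbd.Image(ioctx, 'test-image') as image:
    image.write(b'\0' * (4 * 1024 * 1024), 0)         # 4. write some data ...
    image.create_snap('snap')                         #    ... then snapshot and protect it
    image.protect_snap('snap')

with rbd.Image(ioctx, 'test-image') as image:         # 5. remove the image (snapshot first)
    image.unprotect_snap('snap')
    image.remove_snap('snap')
rbd.RBD().remove(ioctx, 'test-image')

ioctx.close()
cluster.shutdown()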

> Diagnostic snapshot: http://mos-scale-share.mirantis.com/env14/fuel-snapshot-2016-05-04_08-12-02.tar.xz

A ~5 GB compressed tarball is not very useful either. Please use common sense and extract the relevant data, such as the cluster and image parameters, i.e.:

1. What are the journal and the data devices?
2. The output of ceph -s
3. The output of ceph health detail
4. The output of ceph mon_status
5. What is the image size? How many snapshots are there? (rbd info -p $POOL $IMG_ID)
6. What are the parameters of the backing pool (in particular the number of the placement groups)?
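
If it helps, items 5 and 6 can also be gathered through the Python bindings; a sketch (the pool name 'images' and the image ID are assumptions):

import json
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('images')                   # assumed Glance pool

with rbd.Image(ioctx, 'IMAGE_ID') as image:            # 5. image size and snapshots
    print('size (GB): %.1f' % (image.size() / float(1024 ** 3)))
    print('snapshots:', [snap['name'] for snap in image.list_snaps()])

cmd = json.dumps({'prefix': 'osd pool get', 'pool': 'images',
                  'var': 'pg_num', 'format': 'json'})
ret, out, errs = cluster.mon_command(cmd, b'')         # 6. pg_num of the backing pool
print('pg_num:', json.loads(out)['pg_num'])

ioctx.close()
cluster.shutdown()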

> image.unprotect_snap(snapshot_name) took 67 sec

This number itself does not indicate a problem. What are the cluster and the image parameters (see the above list)? Is there any more "heavy" IO running concurrently (ceph or otherwise)?

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> Expected results: Images have to be deleted pretty quickly

The expectation is somewhat misplaced. Deleting a snapshot is in general quite a heavy operation: the time is proportional to the image size, because an RBD image is striped over many RADOS objects (roughly 5,000 of them for a 20 GB image with the default 4 MB object size) and each of them has to be touched. And "pretty quickly" is way too vague.

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

Marking as Incomplete. Feel free to provide the necessary data (see https://bugs.launchpad.net/mos/+bug/1578172/comments/8) and reopen the bug.

Changed in mos:
status: Confirmed → Incomplete
Revision history for this message
Sergey Arkhipov (sarkhipov) wrote :

I understand that it is hard to troubleshoot or verify such an issue, but this bug was discovered during long-haul testing, and it is rather hard to describe the problem in a systematic, deterministic way. That is why I collected a diagnostic snapshot: it should contain all the available information.

Relevant Ceph information:

root@node-59:~# ceph -s
    cluster 9f660137-ea27-4e8e-88ed-409894217173
     health HEALTH_WARN
            6 requests are blocked > 32 sec
            too many PGs per OSD (422 > max 300)
            pool compute has too few pgs
            1 mons down, quorum 0,2 node-59,node-57
     monmap e3: 3 mons at {node-57=192.168.1.10:6789/0,node-59=192.168.1.6:6789/0,node-60=192.168.1.7:6789/0}
            election epoch 46, quorum 0,2 node-59,node-57
     osdmap e2724: 5 osds: 5 up, 5 in
      pgmap v577987: 704 pgs, 11 pools, 127 GB data, 32939 objects
            390 GB used, 4261 GB / 4652 GB avail
                 704 active+clean
  client io 301 kB/s rd, 118 MB/s wr, 82 op/s

root@node-59:~# ceph health detail
HEALTH_WARN 4 requests are blocked > 32 sec; 1 osds have slow requests; too many PGs per OSD (422 > max 300); pool compute has too few pgs; 1 mons down, quorum 0,2 node-59,node-57
4 ops are blocked > 32.768 sec
4 ops are blocked > 32.768 sec on osd.0
1 osds have slow requests
too many PGs per OSD (422 > max 300)
pool compute objects per pg (492) is more than 10.6957 times cluster average (46)
mon.node-60 (rank 1) addr 192.168.1.7:6789/0 is down (out of quorum)

root@node-59:~# ceph mon_status
{"name":"node-57","rank":2,"state":"peon","election_epoch":46,"quorum":[0,2],"outside_quorum":[],"extra_probe_peers":[],"sync_provider":[],"monmap":{"epoch":3,"fsid":"9f660137-ea27-4e8e-88ed-409894217173","modified":"2016-04-25 06:47:31.716201","created":"0.000000","mons":[{"rank":0,"name":"node-59","addr":"192.168.1.6:6789\/0"},{"rank":1,"name":"node-60","addr":"192.168.1.7:6789\/0"},{"rank":2,"name":"node-57","addr":"192.168.1.10:6789\/0"}]}}

Please check your email: I've sent you the credentials to the environment. Right now it shows the same issue described in this bug: Glance is slow because rbd is slow. I need some help to figure out the real root cause.

Changed in mos:
status: Incomplete → Confirmed
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

@Sergey,

thanks for the info. However, it's not enough to debug the problem. Could you please also collect the following data (as requested in comment #8):

> 1. What are the journal and the data devices?
> 5. What is the image size? How many snapshots are there? (rbd info -p $POOL $IMG_ID)
> 6. What are the parameters of the backing pool (in particular the number of the placement groups)?

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

Marking as Incomplete, as the required info has not been provided yet (and there is no systematic way to reproduce the bug).

Changed in mos:
status: Confirmed → Incomplete
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

The cluster is badly misconfigured.

1. There should be a separate SSD-backed journal partition or logical volume, but the OSDs in this cluster use an ordinary file as the journal:

root@node-58:~# ls -la /var/lib/ceph/osd/ceph-1/journal
-rw-r--r-- 1 root root 2147483648 May 5 13:15 /var/lib/ceph/osd/ceph-1/journal

First of all, an HDD-only setup is not recommended for a production environment: roughly speaking, it halves the throughput and substantially increases the latency (4x to 10x), since the data is written twice (first to the journal, and then to the actual storage).
Secondly, using an ordinary file as a journal introduces extra overhead, since the OSD journal gets fsynced very often (which makes most journaling filesystems very slow/unhappy). Also, asynchronous IO does not work with ordinary files on Linux, which further increases the latency.

Having a dedicated journal partition on an SSD drive is a must for a production environment.
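
A minimal sketch (the OSD path comes from the listing above) of how one might check whether an OSD journal points at a dedicated block device or at an ordinary file:

import os
import stat

journal = '/var/lib/ceph/osd/ceph-1/journal'   # path from the listing above
target = os.path.realpath(journal)             # resolve the symlink, if any
mode = os.stat(target).st_mode
if stat.S_ISBLK(mode):
    print('dedicated journal device:', target)
elif stat.S_ISREG(mode):
    print('journal is an ordinary file (no dedicated partition):', target)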

2. Using OSD nodes to run VMs is an awful idea; please don't do that.

> 58 | ready | compute_1 | 11 | 10.20.0.6 | 0c:c4:7a:1d:92:ae | ceph-osd, compute | | True | 11
> 53 | ready | compute_2 | 11 | 10.20.0.4 | 0c:c4:7a:1d:92:d6 | ceph-osd, compute | | True | 11
> 51 | ready | compute_3 | 11 | 10.20.0.5 | 0c:c4:7a:1d:93:da | ceph-osd, compute | | True | 11
> 52 | ready | compute_4 | 11 | 10.20.0.10 | 0c:c4:7a:1d:93:d6 | ceph-osd, compute | | True | 11
> 54 | ready | compute_0 | 11 | 10.20.0.3 | 0c:c4:7a:1d:91:64 | ceph-osd, compute | | True | 11

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

Ceph is badly misconfigured: there are no SSD-backed journals, and the OSD nodes are used to host VMs.
Feel free to reopen the bug if it can be reproduced in a sane environment (that is, with dedicated OSD nodes with SSD-backed journals).

Changed in mos:
status: Incomplete → Invalid