Bug #1710773 “scenario001 and 004 fails when Glance with rbd bac...” : Bugs : tripleo

John Fulton (jfulton-org) on 2017-08-15

Changed in tripleo:
assignee:	nobody → John Fulton (jfulton-org)

John Fulton (jfulton-org) wrote on 2017-08-15:

#1

I. Two other ways to describe this issue:
- Why did Glance return an HTTP 503 [0] when asked to upload a cirros image?
- Glance upload fails and logs "since image size is zero we will be doing resize-before-write for each chunk which will be considerably slower than normal"

II. Q/A from CI logs: 
- Glance logs show that the rbd scheme was enabled [1] 
- Glance logs show Glance creating image with order 23 and size 0 [2] in rbd.py [3]
- There have been cases where the ceph config was the root cause of this error [4] 
- Is the glance-api.conf correct? Yes [5][6] (but see open question A)
- Was the glance container started with the ceph.conf mounted? Yes [7]
- Is there a normal-looking ceph.conf on the subnode (i.e. the container host)? Yes [8]
- Is ceph.client.openstack.keyring on the subnode? Yes [8]
- Was a change made to support this? Yes, puppet-ceph still generates the configs and the glance container was changed to use them [9]

Two open questions:
A. In the glance conf, rbd_store_ceph_conf is commented out. That worked in the past because the option defaults to /etc/ceph/ceph.conf; might it be affecting us now?

B. Are the permissions on the ceph keyring set so that the glance user can read it?
   - CI logs do not confirm the 644 permissions, but they _should_ be correct...
   - They have been 644 in the past [10] and need to stay that way so that the container can read the key, so why would this change?
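
Open question B could be checked directly on the container host. A minimal sketch, assuming the keyring sits at /etc/ceph/ceph.client.openstack.keyring (directory and file name taken from [7] and [8]); the helper names `is_world_readable` and `describe_mode` are made up for illustration, and "readable by others" is only a proxy for "readable by the glance user inside the container":

```python
import os
import stat


def is_world_readable(path):
    """Return True if 'other' users have read permission on path (as with mode 644)."""
    mode = os.stat(path).st_mode
    return bool(mode & stat.S_IROTH)


def describe_mode(path):
    """Return the permission bits as an octal string such as '644'."""
    return format(stat.S_IMODE(os.stat(path).st_mode), "03o")


if __name__ == "__main__":
    keyring = "/etc/ceph/ceph.client.openstack.keyring"  # assumed path
    if os.path.exists(keyring):
        print(keyring, describe_mode(keyring), is_world_readable(keyring))
    else:
        print("keyring not found:", keyring)
```

A mode of 600 or 640 would make `is_world_readable` return False, which is the case worth ruling out here.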

Next Steps:
- Attempting to reproduce in my local environment

[0] http://logs.openstack.org/29/490129/4/check/gate-tripleo-ci-centos-7-scenario001-multinode-oooq-container/1d6c57a/logs/undercloud/home/jenkins/overcloud_validate.log.txt.gz#_2017-08-12_13_29_17
[1] http://logs.openstack.org/29/490129/4/check/gate-tripleo-ci-centos-7-scenario001-multinode-oooq-container/1d6c57a/logs/subnode-2/var/log/containers/glance/api.log.txt.gz#_2017-08-12_13_23_40_472
[2] http://logs.openstack.org/29/490129/4/check/gate-tripleo-ci-centos-7-scenario001-multinode-oooq-container/1d6c57a/logs/subnode-2/var/log/containers/glance/api.log.txt.gz#_2017-08-12_13_27_13_144
[3] https://github.com/openstack/glance_store/blob/master/glance_store/_drivers/rbd.py#L460-L461
[4] https://ask.openstack.org/en/question/78493/glance-image-create-ceph-problem/
[5] http://logs.openstack.org/29/490129/4/check/gate-tripleo-ci-centos-7-scenario001-multinode-oooq-container/1d6c57a/logs/subnode-2/var/log/config-data/glance_api/etc/glance/glance-api.conf.txt.gz
[6] glance-api.conf.txt
"""
[glance_store]
stores=http,rbd
default_store=rbd
rbd_store_pool=images
rbd_store_user=openstack
#rbd_store_ceph_conf = /etc/ceph/ceph.conf  # <--- this will default correctly to what's in comment
show_image_direct_url=True
"""
[7] http://logs.openstack.org/29/490129/4/check/gate-tripleo-ci-centos-7-scenario001-multinode-oooq-container/1d6c57a/logs/subnode-2/var/log/extra/docker/containers/glance_api/docker_info.log.txt.gz (note the "/etc/ceph:/var/lib/kolla/config_files/src-ceph:ro")
[8] http://logs.openstack.org/29/490129/4/check/gate-tripleo-ci-centos-7-scenario001-multinode-oooq-container/1d6c57a/logs/subnode-2/etc/ceph/
[9] https://review.openstack.org/#/c/482500
[10] https://github.com/openstack/puppet-ceph/blob/28e8f4525f4448a9f585f7ff4212fa9df58f4464/examples/nodes/client.yaml
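
For open question A, the effect of the commented-out option can be illustrated with a small sketch. It parses the excerpt from [6] with Python's configparser and falls back to /etc/ceph/ceph.conf when rbd_store_ceph_conf is absent. Glance itself reads its config through oslo.config, not configparser, so this only mimics the default-fallback behaviour; `effective_ceph_conf` is a hypothetical helper name.

```python
import configparser

# Excerpt from [6]; the rbd_store_ceph_conf line is a full-line comment.
GLANCE_API_CONF = """\
[glance_store]
stores=http,rbd
default_store=rbd
rbd_store_pool=images
rbd_store_user=openstack
#rbd_store_ceph_conf = /etc/ceph/ceph.conf
show_image_direct_url=True
"""


def effective_ceph_conf(conf_text, default="/etc/ceph/ceph.conf"):
    """Return the ceph.conf path the rbd store would use, assuming the given default."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.get("glance_store", "rbd_store_ceph_conf", fallback=default)


if __name__ == "__main__":
    print(effective_ceph_conf(GLANCE_API_CONF))
```

Under that assumption the commented-out line and an explicit `rbd_store_ceph_conf = /etc/ceph/ceph.conf` resolve to the same path, which would make question A an unlikely cause.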

Giulio Fidente (gfidente) wrote on 2017-08-16:

#2

I was unable to reproduce this on my dev env. The glance-api log I get locally is pretty much identical, but it proceeds further and finishes the image creation. Not sure if the version in CI is truncated?

It looks like there aren't lines after 13:27 while the client is receiving 503 codes at 13:29, but those seem to come from HAProxy; I wonder whether the /healthcheck that HAProxy uses to verify the backend is failing at that point?

Giulio Fidente (gfidente) wrote on 2017-08-16:

#3

In http://logs.openstack.org/29/490129/4/check/gate-tripleo-ci-centos-7-scenario001-multinode-oooq-container/1d6c57a/logs/subnode-2/etc/ceph/ceph.conf.txt.gz the osd_pool_default_size setting is wrong (set to 3) despite the scenario env files setting CephPoolDefaultSize to 1 ... this should be the issue, looking into a fix
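
The mismatch described above could be detected mechanically. A rough sketch, assuming ceph.conf is ini-like enough for configparser and that the setting lives in the [global] section; the sample text is hypothetical and not copied from the CI logs, and `pool_default_size` is an illustrative helper name:

```python
import configparser


def pool_default_size(ceph_conf_text):
    """Return osd_pool_default_size from the [global] section, or None if unset."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(ceph_conf_text)
    if not parser.has_section("global"):
        return None
    for key, value in parser.items("global"):
        # Option names may be written with spaces, underscores, or dashes.
        if key.replace(" ", "_").replace("-", "_") == "osd_pool_default_size":
            return int(value)
    return None


# Hypothetical ceph.conf content for illustration only.
SAMPLE = """\
[global]
osd_pool_default_size = 3
"""

if __name__ == "__main__":
    expected = 1  # CephPoolDefaultSize from the scenario environment files
    actual = pool_default_size(SAMPLE)
    if actual != expected:
        print(f"osd_pool_default_size is {actual}, expected {expected}")
```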

OpenStack Infra (hudson-openstack) wrote on 2017-08-16: Fix proposed to tripleo-heat-templates (master)

#4

Fix proposed to branch: master
Review: https://review.openstack.org/494176

Changed in tripleo:
assignee:	John Fulton (jfulton-org) → Giulio Fidente (gfidente)
status:	Triaged → In Progress

John Fulton (jfulton-org) wrote on 2017-08-16:

#5

- I reproduced the problem on my machine [1]
- Root cause seems to be that the ceph cluster was in HEALTH_ERR [2]
- I then dynamically changed the pool size from 3 to 1 (which fits a 1 OSD deploy) to get the cluster back to HEALTH_OK [3]
- The problem was then resolved and glance didn't give me the 503 error [4]
- Thus, I think Giulio's patch will fix this [5]

[1]
```
(overcloud) [stack@undercloud ~]$ ./test-glance.sh
Using existing images in /home/stack/cirros_images
503 Service Unavailable: No server is available to handle this request. (HTTP 503)
(overcloud) [stack@undercloud ~]$

```

[2]
```
[root@overcloud-controller-0 ~]# ceph -s
    cluster 4b5c8c0a-ff60-454b-a1b4-9747aa737d19
     health HEALTH_ERR
            288 pgs are stuck inactive for more than 300 seconds
            288 pgs degraded
            288 pgs stuck degraded
            288 pgs stuck inactive
            288 pgs stuck unclean
            288 pgs stuck undersized
            288 pgs undersized
            2 requests are blocked > 32 sec
     monmap e1: 1 mons at {overcloud-controller-0=192.168.24.16:6789/0}
            election epoch 3, quorum 0 overcloud-controller-0
     osdmap e17: 1 osds: 1 up, 1 in
            flags sortbitwise,require_jewel_osds
      pgmap v3073: 288 pgs, 8 pools, 0 bytes data, 0 objects
            20171 MB used, 31016 MB / 51187 MB avail
                 288 undersized+degraded+peered
[root@overcloud-controller-0 ~]# 
```
[3]
```
[root@overcloud-controller-0 ~]# ceph osd pool get images size 
size: 3
[root@overcloud-controller-0 ~]# for pool in rbd backups images manila_data manila_metadata metrics vms volumes ; do \
> ceph osd pool set $pool size 1; \
> ceph osd pool set $pool min_size 1; \
> done
set pool 0 size to 1
set pool 0 min_size to 1
set pool 1 size to 1
set pool 1 min_size to 1
set pool 2 size to 1
set pool 2 min_size to 1
set pool 3 size to 1
set pool 3 min_size to 1
set pool 4 size to 1
set pool 4 min_size to 1
set pool 5 size to 1
set pool 5 min_size to 1
set pool 6 size to 1
set pool 6 min_size to 1
set pool 7 size to 1
set pool 7 min_size to 1
[root@overcloud-controller-0 ~]# ceph -s
    cluster 4b5c8c0a-ff60-454b-a1b4-9747aa737d19
     health HEALTH_WARN
            1 requests are blocked > 32 sec
     monmap e1: 1 mons at {overcloud-controller-0=192.168.24.16:6789/0}
            election epoch 3, quorum 0 overcloud-controller-0
     osdmap e33: 1 osds: 1 up, 1 in
            flags sortbitwise,require_jewel_osds
      pgmap v9028: 288 pgs, 8 pools, 0 bytes data, 1 objects
            20680 MB used, 30507 MB / 51187 MB avail
                 288 active+clean
[root@overcloud-controller-0 ~]# ceph -s
    cluster 4b5c8c0a-ff60-454b-a1b4-9747aa737d19
     health HEALTH_OK
     monmap e1: 1 mons at {overcloud-controller-0=192.168.24.16:6789/0}
            election epoch 3, quorum 0 overcloud-controller-0
     osdmap e33: 1 osds: 1 up, 1 in
            flags sortbitwise,require_jewel_osds
      pgmap v9029: 288 pgs, 8 pools, 0 bytes data, 1 objects
            20681 MB used, 30506 MB / 51187 MB avail
                 288 active+clean
[root@overcloud-controller-0 ~]# 
```

[4]
```
(undercloud) [stack@undercloud ~]$ ./test-glance.sh
Using existing images in /home/stack/cirros_images
(undercloud) [stack@undercloud ~]$ 
```

[5] https://review.openstack.org/#/c/494176
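
The shell loop in [3] can also be written as a small script. This is only a sketch of the same workaround, using the pool names and the `ceph osd pool set` syntax shown in [3]; the function names and the dry-run default are illustrative choices, and nothing is executed against a cluster unless `dry_run=False` is passed.

```python
import subprocess

# Pool names as listed in the loop from [3].
POOLS = ["rbd", "backups", "images", "manila_data", "manila_metadata",
         "metrics", "vms", "volumes"]


def pool_resize_commands(pools, size=1):
    """Build the 'ceph osd pool set' argument lists for size and min_size of every pool."""
    commands = []
    for pool in pools:
        commands.append(["ceph", "osd", "pool", "set", pool, "size", str(size)])
        commands.append(["ceph", "osd", "pool", "set", pool, "min_size", str(size)])
    return commands


def apply(commands, dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    apply(pool_resize_commands(POOLS))  # dry run: only prints the commands
```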

OpenStack Infra (hudson-openstack) wrote on 2017-08-17: Fix merged to tripleo-heat-templates (master)

#6

Reviewed: https://review.openstack.org/494176
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=4abb8adf8ea0c4060aa5032eb3e703353f0ef939
Submitter: Jenkins
Branch: master

commit 4abb8adf8ea0c4060aa5032eb3e703353f0ef939
Author: Giulio Fidente <email address hidden>
Date: Wed Aug 16 13:41:28 2017 +0200

Set default OSD pool size to 1 in scenario 001/004 containers

When the OSD pool size is unset it defaults to 3, while we only
have a single OSD in CI so the pools are created but not writable.

We did set the default pool size to 1 in the non-containerized
scenarios but apparently missed it in the containerized version.

Change-Id: I1ac1fe5c2effd72a2385ab43d27abafba5c45d4d
Closes-Bug: #1710773
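
Why a pool of size 3 on a single OSD is "created but not writable" can be sketched with a simplified model. It assumes that a placement group accepts I/O only when it can place at least min_size replicas, that each OSD holds at most one replica, and that an unset min_size defaults to size - size // 2; these are assumptions of the sketch, not statements taken from the commit.

```python
def default_min_size(size):
    """Assumed default when min_size is not set explicitly: size - size // 2."""
    return size - size // 2


def pgs_can_go_active(num_osds, size, min_size=None):
    """Simplified model: PGs accept I/O only if at least min_size replicas can be placed."""
    if min_size is None:
        min_size = default_min_size(size)
    replicas = min(num_osds, size)  # at most one replica per OSD
    return replicas >= min_size


if __name__ == "__main__":
    print(pgs_can_go_active(num_osds=1, size=3))  # the failing CI layout
    print(pgs_can_go_active(num_osds=1, size=1))  # the layout after the fix
```

In this model the CI layout (1 OSD, size 3) never reaches min_size, which matches the `undersized+degraded+peered` state in comment #5, while size 1 does.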

Changed in tripleo:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2017-08-24: Fix included in openstack/tripleo-heat-templates 7.0.0.0rc1

#7

This issue was fixed in the openstack/tripleo-heat-templates 7.0.0.0rc1 release candidate.

tripleo

scenario001 and 004 fails when Glance with rbd backend is containerized but not Ceph
