pingtest fails on ipv6/ssl: glanceclient fails to upload pingtest_initramfs

Bug #1694847 reported by Emilien Macchi
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

Promotion jobs (ovb-updates) fail to finish the pingtest. The upload of the pingtest_initramfs image into Glance fails:

http://logs.openstack.org/84/469484/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/9e7fb5d/console.html#_2017-05-31_16_01_48_131828

Looking at the Glance logs, it appears the image-create action was started but never finished:
http://logs.openstack.org/84/469484/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/9e7fb5d/logs/overcloud-controller-0/var/log/glance/api.txt.gz#_2017-05-31_15_59_48_548

I'm wondering whether it's normal not to have more logs in Glance...
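
For reference, a minimal sketch of the kind of upload the pingtest performs (the command, image name and file path here are assumptions for manual reproduction, not copied from tripleo.sh):

    # hypothetical manual reproduction of the failing upload, run from the undercloud
    source ~/overcloudrc
    openstack image create pingtest_initramfs \
        --container-format ari --disk-format ari \
        --file ~/overcloud-full.initrd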

Revision history for this message
Cyril Roelandt (cyril-roelandt) wrote :

So apparently, some requests worked, and then we got a 504 (timeout) when uploading the image. Could it be a "real" timeout? Could the Glance service have been killed because there were not enough resources on the node? Does this happen every time?

Revision history for this message
Emilien Macchi (emilienm) wrote :

Yes, it happens every time.
I don't know why we hit this bug, to be honest; the other jobs are fine. Note: this job runs TripleO with SSL and IPv6.

Revision history for this message
Ben Nemec (bnemec) wrote :

We are also using Ceph in the updates job, which I don't believe any of the other jobs do right now. SSL is unlikely to be the culprit since I believe it is also used in the ha job (although with different certs and combined with IPv6 here, so I wouldn't rule it out completely either).

Revision history for this message
Ben Nemec (bnemec) wrote :

It does appear to be Ceph. I reproduced this locally by deploying with Ceph, but without IPv6 or SSL.

Revision history for this message
Ben Nemec (bnemec) wrote :

Looks like the ceph cluster is not happy:

[root@overcloud-controller-0 ceph]# ceph status
    cluster 56c1eb9e-47b9-11e7-9e94-fa163ed919ae
     health HEALTH_ERR
            288 pgs are stuck inactive for more than 300 seconds
            288 pgs degraded
            288 pgs stuck degraded
            288 pgs stuck inactive
            288 pgs stuck unclean
            288 pgs stuck undersized
            288 pgs undersized
            3 requests are blocked > 32 sec
     monmap e1: 1 mons at {overcloud-controller-0=172.18.0.18:6789/0}
            election epoch 3, quorum 0 overcloud-controller-0
     osdmap e19: 1 osds: 1 up, 1 in
            flags sortbitwise,require_jewel_osds
      pgmap v196: 288 pgs, 8 pools, 0 bytes data, 0 objects
            8390 MB used, 42797 MB / 51187 MB avail
                 288 undersized+degraded+peered

Probably because we only have 1 OSD? I thought we used to deploy an OSD onto the controller too, but I don't see any evidence of that now.
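
One way to confirm that would be to compare each pool's replica size against the number of OSDs; with the default replica size of 3 and a single OSD, every PG stays undersized. A sketch using standard ceph commands (the pool name below is assumed from the pools in the pgmap above):

    # a replica size/min_size larger than the OSD count (1 here) would explain
    # the undersized+degraded PGs
    ceph osd pool ls detail
    ceph osd pool get images size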

Revision history for this message
Ben Nemec (bnemec) wrote :

Confirmed that a broken Ceph cluster is the problem. On the functional ocata jobs, even with 1 OSD the cluster is only in HEALTH_WARN state: http://logs.openstack.org/42/462542/4/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/876a85a/logs/overcloud-controller-0/var/log/host_info.txt.gz

However, on the broken master jobs it is in HEALTH_ERR: http://logs.openstack.org/57/470057/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/5fd3a77/logs/overcloud-controller-0/var/log/host_info.txt.gz

You can also see "2 requests are blocked > 32 sec", which jibes with the Glance errors. Interestingly, both jobs have the same version of the ceph packages, so it must be something we changed in how we deploy it.
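
For anyone digging into a live reproduction, standard ceph commands should show which requests are blocked and confirm the single OSD (nothing job-specific assumed here):

    ceph health detail   # lists the stuck PGs and the blocked requests
    ceph osd df          # shows the OSDs backing the pools and their usage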

Revision history for this message
Christian Schwede (cschwede) wrote :

Yes, I found the same a few minutes ago. Here is how to get a reproducible env:

1. ./quickstart.sh -R tripleo-ci/master -N config/nodes/1ctlr_1comp_1ceph.yml -X -n 127.0.0.2

2. SSH into the undercloud

3. Put this into ~/environ:

OVERCLOUD_DEPLOY_ARGS="
    -e /usr/share/openstack-tripleo-heat-templates/environments/enable-swap.yaml
    -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml
    -e /opt/stack/new/tripleo-ci/test-environments/net-iso.yaml
    --ceph-storage-scale 1
    -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml
    -e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml
    -e /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml"
OVERCLOUD_UPDATE_ARGS=${OVERCLOUD_DEPLOY_ARGS}

4. git clone https://github.com/openstack-infra/tripleo-ci.git

5. cd tripleo-ci

6. ./scripts/tripleo.sh --overcloud-deploy ; ./scripts/tripleo.sh --overcloud-update

7. cp overcloudrc.v3 ~/overcloudrc; ./scripts/tripleo.sh --overcloud-pingtest
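
Once the update finishes, a quick sanity check that the reproduction matches this bug (assuming the controller is reachable as in the CI logs) is:

    # run on overcloud-controller-0; HEALTH_ERR with undersized+degraded PGs
    # matches the failure described above
    ceph status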

Revision history for this message
Ben Nemec (bnemec) wrote :

Deploying two OSD nodes got Ceph working for me. Trying the same in CI here: https://review.openstack.org/#/c/470409/

We don't really want to do that if we can help it (that's a lot of resources that would sit idle most of the time), but at least it will confirm that this is the problem.
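
For anyone reproducing locally with the quickstart steps above, the equivalent change would be bumping the storage scale in OVERCLOUD_DEPLOY_ARGS (and using a nodes config that provides a second ceph node); just a sketch:

    # replaces --ceph-storage-scale 1 in the deploy args above
    --ceph-storage-scale 2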

Revision history for this message
Ben Nemec (bnemec) wrote :

It's almost certainly https://review.openstack.org/#/c/464183 that broke this. I'm trying to figure out how it passed the updates job on that review, because it seems to me that it shouldn't have. Maybe there's another change that, combined with this one, messed things up.

Revision history for this message
Ben Nemec (bnemec) wrote :

This patch is probably what broke us: https://review.openstack.org/#/c/464183

My guess is that it passed the CI job because Ceph was initially deployed without the change, and after the stack update it must not have been restarted to pick up the change (or maybe, because this is a default, it doesn't take effect on update?).

In any case, this should be fixed by https://review.openstack.org/#/c/470418/
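
If the fix takes the route of matching the pool replica count to the single OSD in CI (my assumption about what that review does, not verified here), it would amount to passing an extra environment file, roughly:

    # hypothetical: the CephPoolDefaultSize parameter name is assumed from
    # tripleo-heat-templates and should be checked against the templates in use
    cat > ~/ceph-single-osd.yaml <<'EOF'
    parameter_defaults:
      CephPoolDefaultSize: 1
    EOF
    # then append "-e ~/ceph-single-osd.yaml" to OVERCLOUD_DEPLOY_ARGS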

Revision history for this message
John Fulton (jfulton-org) wrote :

FWIW, I think https://review.openstack.org/#/c/470418/ will fix this bug.

I reproduced the problem as described in step #7. I then ran the following on my cluster:

for pool in rbd backups images manila_data manila_metadata metrics vms volumes ; do
    ceph osd pool set $pool size 1
    ceph osd pool set $pool min_size 1
done

The above just sets the pool size to 1 without having to redeploy. I then re-ran the pingtest and it passed.
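
A quick way to double-check that the workaround took effect before re-running the pingtest (same pool list as above):

    for pool in rbd backups images manila_data manila_metadata metrics vms volumes ; do
        echo -n "$pool: " ; ceph osd pool get $pool size
    done
    ceph status   # should clear HEALTH_ERR once the PGs go active+clean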

Changed in tripleo:
status: Triaged → Fix Released
Revision history for this message
Alfredo Moralejo (amoralej) wrote :

It seems we are hitting the same issue in other jobs, such as periodic-tripleo-ci-centos-7-ovb-nonha; logs at:

http://logs.openstack.org/periodic/periodic-tripleo-ci-centos-7-ovb-nonha/da6eba2/console.html

We may need a similar fix for this job, but I'm not sure where the right place to push it is.

Revision history for this message
Alfredo Moralejo (amoralej) wrote :

It has already been fixed in https://review.openstack.org/#/c/471124/
