pingtest fails on ipv6/ssl: glanceclient fails to upload pingtest_initramfs

Bug #1694847 reported by Emilien Macchi
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

Promotion jobs (ovb-updates) fail to finish the pingtest. The upload of the pingtest_initramfs image into Glance fails:

http://logs.openstack.org/84/469484/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/9e7fb5d/console.html#_2017-05-31_16_01_48_131828

Looking at the Glance logs, it appears the image-create action was started but never finished:
http://logs.openstack.org/84/469484/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/9e7fb5d/logs/overcloud-controller-0/var/log/glance/api.txt.gz#_2017-05-31_15_59_48_548

I'm wondering whether it's normal not to have more logs in Glance...
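
For reference, a minimal sketch of the kind of upload the pingtest performs (the command, image name and file path here are assumptions for manual reproduction, not copied from tripleo.sh):

    # hypothetical manual reproduction of the failing upload, run from the undercloud
    source ~/overcloudrc
    openstack image create pingtest_initramfs \
        --container-format ari --disk-format ari \
        --file ~/overcloud-full.initrd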

Revision history for this message
Cyril Roelandt (cyril-roelandt) wrote :

So apparently, some requests worked, and then we got a 504 (timeout) when uploading the image. Could it be a "real" timeout? Could the Glance service have been killed because there were not enough resources on the node? Does this happen every time?

Revision history for this message
Emilien Macchi (emilienm) wrote :

Yes, it happens every time.
I don't know why we hit this bug, to be honest; the other jobs are fine. Note: this job runs TripleO with SSL and IPv6.

Revision history for this message
Ben Nemec (bnemec) wrote :

We are also using Ceph in the updates job, which I don't believe any of the other jobs do right now. SSL is unlikely to be the culprit since I believe it is also used in the ha job (although with different certs and combined with IPv6 here, so I wouldn't rule it out completely either).

Revision history for this message
Ben Nemec (bnemec) wrote :

It does appear to be Ceph. I reproduced this locally by deploying with Ceph, but without IPv6 or SSL.

Revision history for this message
Ben Nemec (bnemec) wrote :

Looks like the ceph cluster is not happy:

[root@overcloud-controller-0 ceph]# ceph status
    cluster 56c1eb9e-47b9-11e7-9e94-fa163ed919ae
     health HEALTH_ERR
            288 pgs are stuck inactive for more than 300 seconds
            288 pgs degraded
            288 pgs stuck degraded
            288 pgs stuck inactive
            288 pgs stuck unclean
            288 pgs stuck undersized
            288 pgs undersized
            3 requests are blocked > 32 sec
     monmap e1: 1 mons at {overcloud-controller-0=172.18.0.18:6789/0}
            election epoch 3, quorum 0 overcloud-controller-0
     osdmap e19: 1 osds: 1 up, 1 in
            flags sortbitwise,require_jewel_osds
      pgmap v196: 288 pgs, 8 pools, 0 bytes data, 0 objects
            8390 MB used, 42797 MB / 51187 MB avail
                 288 undersized+degraded+peered

Probably because we only have 1 OSD? I thought we used to deploy an OSD onto the controller too, but I don't see any evidence of that now.
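
One way to confirm that would be to compare each pool's replica size against the number of OSDs; with the default replica size of 3 and a single OSD, every PG stays undersized. A sketch using standard ceph commands (the pool name below is assumed from the pools in the pgmap above):

    # a replica size/min_size larger than the OSD count (1 here) would explain
    # the undersized+degraded PGs
    ceph osd pool ls detail
    ceph osd pool get images size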

Revision history for this message
Ben Nemec (bnemec) wrote :

Confirmed that a broken Ceph cluster is the problem. On the functional ocata jobs, even with 1 OSD the cluster is only in HEALTH_WARN state: http://logs.openstack.org/42/462542/4/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/876a85a/logs/overcloud-controller-0/var/log/host_info.txt.gz

However, on the broken master jobs it is in HEALTH_ERR: http://logs.openstack.org/57/470057/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/5fd3a77/logs/overcloud-controller-0/var/log/host_info.txt.gz

You can also see "2 requests are blocked > 32 sec", which jibes with the Glance errors. Interestingly, both jobs have the same version of the ceph packages, so it must be something we changed in how we deploy it.
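
For anyone digging into a live reproduction, standard ceph commands should show which requests are blocked and confirm the single OSD (nothing job-specific assumed here):

    ceph health detail   # lists the stuck PGs and the blocked requests
    ceph osd df          # shows the OSDs backing the pools and their usage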

Revision history for this message
Christian Schwede (cschwede) wrote :

Yes, I found the same a few minutes ago. Here is how to get a reproducible env:

1. ./quickstart.sh -R tripleo-ci/master -N config/nodes/1ctlr_1comp_1ceph.yml -X -n 127.0.0.2

2. SSH into the undercloud

3. Put this into ~/environ:

OVERCLOUD_DEPLOY_ARGS="
    -e /usr/share/openstack-tripleo-heat-templates/environments/enable-swap.yaml
    -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml
    -e /opt/stack/new/tripleo-ci/test-environments/net-iso.yaml
    --ceph-storage-scale 1
    -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml
    -e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml
    -e /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml"
OVERCLOUD_UPDATE_ARGS=${OVERCLOUD_DEPLOY_ARGS}

4. git clone https://github.com/openstack-infra/tripleo-ci.git

5. cd tripleo-ci

6. ./scripts/tripleo.sh --overcloud-deploy ; ./scripts/tripleo.sh --overcloud-update

7. cp overcloudrc.v3 ~/overcloudrc; ./scripts/tripleo.sh --overcloud-pingtest
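
Once the update finishes, a quick sanity check that the reproduction matches this bug (assuming the controller is reachable as in the CI logs) is:

    # run on overcloud-controller-0; HEALTH_ERR with undersized+degraded PGs
    # matches the failure described above
    ceph status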

Revision history for this message
Ben Nemec (bnemec) wrote :

Deploying two OSD nodes got Ceph working for me. Trying the same in CI here: https://review.openstack.org/#/c/470409/

We don't really want to do that if we can help it (that's a lot of resources that would sit idle most of the time), but at least it will confirm that this is the problem.
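
For anyone reproducing locally with the quickstart steps above, the equivalent change would be bumping the storage scale in OVERCLOUD_DEPLOY_ARGS (and using a nodes config that provides a second ceph node); just a sketch:

    # replaces --ceph-storage-scale 1 in the deploy args above
    --ceph-storage-scale 2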

Revision history for this message
Ben Nemec (bnemec) wrote :

It's almost certainly https://review.openstack.org/#/c/464183 that broke this. I'm trying to figure out how it passed the updates job on that review, because it seems to me that it shouldn't have. Maybe there's another change that, combined with this one, messed things up.

Revision history for this message
Ben Nemec (bnemec) wrote :

This patch is probably what broke us: https://review.openstack.org/#/c/464183

My guess is that it passed the CI job because Ceph was initially deployed without the change, and after the stack update it must not have been restarted to pick up the change (or maybe, because this is a default, it doesn't take effect on update?).

In any case, this should be fixed by https://review.openstack.org/#/c/470418/
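
If the fix takes the route of matching the pool replica count to the single OSD in CI (my assumption about what that review does, not verified here), it would amount to passing an extra environment file, roughly:

    # hypothetical: the CephPoolDefaultSize parameter name is assumed from
    # tripleo-heat-templates and should be checked against the templates in use
    cat > ~/ceph-single-osd.yaml <<'EOF'
    parameter_defaults:
      CephPoolDefaultSize: 1
    EOF
    # then append "-e ~/ceph-single-osd.yaml" to OVERCLOUD_DEPLOY_ARGS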

Revision history for this message
John Fulton (jfulton-org) wrote :

FWIW, I think https://review.openstack.org/#/c/470418/ will fix this bug.

I reproduced the problem as described in step #7. I then ran the following on my cluster:

for pool in rbd backups images manila_data manila_metadata metrics vms volumes ; do
    ceph osd pool set $pool size 1
    ceph osd pool set $pool min_size 1
done

The above just sets the pool size to 1 without having to redeploy. I then re-ran the pingtest and it passed.
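
A quick way to double-check that the workaround took effect before re-running the pingtest (same pool list as above):

    for pool in rbd backups images manila_data manila_metadata metrics vms volumes ; do
        echo -n "$pool: " ; ceph osd pool get $pool size
    done
    ceph status   # should clear HEALTH_ERR once the PGs go active+clean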

Changed in tripleo:
status: Triaged → Fix Released
Revision history for this message
Alfredo Moralejo (amoralej) wrote :

It seems we are hitting the same issue in other jobs, such as periodic-tripleo-ci-centos-7-ovb-nonha; logs at:

http://logs.openstack.org/periodic/periodic-tripleo-ci-centos-7-ovb-nonha/da6eba2/console.html

We may need a similar fix for this job, but I'm not sure where the right place to push it is.

Revision history for this message
Alfredo Moralejo (amoralej) wrote :

It has already been fixed in https://review.openstack.org/#/c/471124/
