Mitaka nonha ping test failing to upload image to glance

Bug #1654611 reported by Ben Nemec on 2017-01-06
This bug affects 1 person
Affects: tripleo
Importance: Critical
Assigned to: Ben Nemec

Bug Description

For quite a while now the mitaka job has been failing the ping test. It appears to be a problem uploading the image to glance, but it's not clear whether the problem is the same as https://bugs.launchpad.net/tripleo/+bug/1646750 because the glance logs in the mitaka jobs are almost empty (just a few deprecation warnings). We did happen to run a job with some debugging enabled, and that output can be seen here: http://logs.openstack.org/34/417134/3/check-tripleo/gate-tripleo-ci-centos-7-ovb-nonha-mitaka/fff0bb7/console.html

For posterity, the traceback from the client looks like this:

2017-01-06 13:51:09.193619 | Traceback (most recent call last):
2017-01-06 13:51:09.193724 | File "/usr/lib/python2.7/site-packages/openstackclient/shell.py", line 118, in run
2017-01-06 13:51:09.193784 | ret_val = super(OpenStackShell, self).run(argv)
2017-01-06 13:51:09.193834 | File "/usr/lib/python2.7/site-packages/cliff/app.py", line 226, in run
2017-01-06 13:51:09.193870 | result = self.run_subcommand(remainder)
2017-01-06 13:51:09.193926 | File "/usr/lib/python2.7/site-packages/openstackclient/shell.py", line 153, in run_subcommand
2017-01-06 13:51:09.193971 | ret_value = super(OpenStackShell, self).run_subcommand(argv)
2017-01-06 13:51:09.194022 | File "/usr/lib/python2.7/site-packages/cliff/app.py", line 346, in run_subcommand
2017-01-06 13:51:09.194053 | result = cmd.run(parsed_args)
2017-01-06 13:51:09.194117 | File "/usr/lib/python2.7/site-packages/openstackclient/common/command.py", line 38, in run
2017-01-06 13:51:09.194164 | return super(Command, self).run(parsed_args)
2017-01-06 13:51:09.194224 | File "/usr/lib/python2.7/site-packages/cliff/display.py", line 79, in run
2017-01-06 13:51:09.194267 | column_names, data = self.take_action(parsed_args)
2017-01-06 13:51:09.194326 | File "/usr/lib/python2.7/site-packages/openstackclient/image/v1/image.py", line 264, in take_action
2017-01-06 13:51:09.194365 | image = image_client.images.create(**kwargs)
2017-01-06 13:51:09.194417 | File "/usr/lib/python2.7/site-packages/glanceclient/v1/images.py", line 324, in create
2017-01-06 13:51:09.194443 | data=image_data)
2017-01-06 13:51:09.194495 | File "/usr/lib/python2.7/site-packages/glanceclient/common/http.py", line 278, in post
2017-01-06 13:51:09.194531 | return self._request('POST', url, **kwargs)
2017-01-06 13:51:09.194602 | File "/usr/lib/python2.7/site-packages/glanceclient/common/http.py", line 248, in _request
2017-01-06 13:51:09.194666 | **kwargs)
2017-01-06 13:51:09.194726 | File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 475, in request
2017-01-06 13:51:09.194760 | resp = self.send(prep, **send_kwargs)
2017-01-06 13:51:09.194808 | File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 585, in send
2017-01-06 13:51:09.194839 | r = adapter.send(request, **kwargs)
2017-01-06 13:51:09.194887 | File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 434, in send
2017-01-06 13:51:09.194921 | r = low_conn.getresponse(buffering=True)
2017-01-06 13:51:09.194962 | File "/usr/lib64/python2.7/httplib.py", line 1089, in getresponse
2017-01-06 13:51:09.194986 | response.begin()
2017-01-06 13:51:09.195026 | File "/usr/lib64/python2.7/httplib.py", line 444, in begin
2017-01-06 13:51:09.195062 | version, status, reason = self._read_status()
2017-01-06 13:51:09.195104 | File "/usr/lib64/python2.7/httplib.py", line 408, in _read_status
2017-01-06 13:51:09.195132 | raise BadStatusLine(line)
2017-01-06 13:51:09.195155 | BadStatusLine: ''

Tags: ci
tags: added: ci
Ben Nemec (bnemec) wrote :

This appears to be ceph-related. When I disabled ceph in the nonha job, this problem went away and we got the same failure as the ha job.

https://review.openstack.org/#/c/417462/

Alex Schultz (alex-schultz) wrote :

This seems like a timeout issue, since it happens almost exactly 60s after it starts.

Ben Nemec (bnemec) wrote :

I tried a ceph mitaka deploy locally and it looks like ceph is not even running. The logs aren't particularly helpful, but when I tried to start it manually I got:

[root@overcloud-cephstorage-0 ceph]# /etc/init.d/ceph start
=== osd.0 ===
/etc/init.d/ceph: line 365: /usr/bin/ceph-crush-location: No such file or directory
Invalid command: saw 0 of args(<string(goodchars [A-Za-z0-9-_.=])>) [<string(goodchars [A-Za-z0-9-_.=])>...], expected at least 1
osd crush create-or-move <osdname (id|osd.id)> <float[0.0-]> <args> [<args>...] : create entry or move existing entry for <name> <weight> at/to location <args>
Error EINVAL: invalid command
failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.0 --keyring=/var/lib/ceph/osd/ceph-0/keyring osd crush create-or-move -- 0 0.05 '

So I guess it's trying to call something that doesn't exist? I'm not sure how that would happen, but it kind of sounds like a packaging error. Either a binary was missed, or a mismatched init script was included.

I'll investigate further in the morning if someone hasn't already resolved this (hey, I can dream, right? ;-).

Sébastien Han (sebastien-han) wrote :

Ben, you should try using systemd to start the OSDs. Assuming it's OSD 0, just run the following:

systemctl start ceph-osd@0

You can find your OSD ids by looking in /var/lib/ceph/osd/

Ben Nemec (bnemec) wrote :

Thanks for the suggestion. That is also failing unfortunately:

[root@overcloud-cephstorage-0 ceph]# systemctl start ceph-osd@0
Failed to start ceph-osd@0.service: Unit not found.
[root@overcloud-cephstorage-0 ceph]# ls /var/lib/ceph/osd/
ceph-0

Keith Schincke (keith-schincke) wrote :

Are the ceph-related packages installed? /usr/bin/ceph-crush-location from comment #3 is from ceph-common.

Can you give a listing of the ceph and rbd packages?

Ben Nemec (bnemec) wrote :

I think we already talked on irc, but for posterity here's the list of packages:

[root@overcloud-cephstorage-0 ceph]# rpm -qa | egrep 'ceph|rbd'
ceph-0.94.5-1.el7.x86_64
libcephfs1-0.94.5-1.el7.x86_64
python-cephfs-0.94.5-1.el7.x86_64
python-rbd-0.94.5-1.el7.x86_64
ceph-common-0.94.5-1.el7.x86_64
librbd1-0.94.5-1.el7.x86_64

Ben Nemec (bnemec) wrote :

Okay, this is interesting. Comparing ceph packages to the last passing mitaka job I can find, they appear to be identical. I'm specifically looking at the cephstorage logs from http://logs.openstack.org/34/398234/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-nonha/ab50bcf/

So...I'm pretty much at a loss here. :-/

Alfredo Moralejo (amoralej) wrote :

This is related to the ceph libraries (ceph-common, librbd1, etc.) delivered in base CentOS 7.3, which are incompatible with the ceph hammer packages delivered by the CentOS Storage SIG. Unfortunately, they have the same NVR, and by default they are fetched from the base repo instead of the ceph one unless you increase the priority of the ceph repo.

You can check by comparing:

$ rpm -qlp http://mirror.centos.org/centos/7/os/x86_64/Packages/ceph-common-0.94.5-1.el7.x86_64.rpm|grep crush

$ rpm -qlp http://mirror.centos.org/centos/7/storage/x86_64/ceph-hammer/ceph-common-0.94.5-1.el7.x86_64.rpm|grep crush
/usr/bin/ceph-crush-location
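
The comparison above can also be scripted. The following is a minimal sketch, not something used in this report: it assumes `rpm` is on the PATH and that the two package URLs from the commands above are still reachable. The helper names `list_package_files`, `parse_file_list` and `files_missing_from` are hypothetical.

```python
import subprocess


def parse_file_list(rpm_output):
    """Turn `rpm -qlp` output (one path per line) into a set of paths.

    Lines that do not start with "/" (warnings, "(contains no files)")
    are ignored.
    """
    return {
        line.strip()
        for line in rpm_output.splitlines()
        if line.strip().startswith("/")
    }


def list_package_files(package):
    """Return the files in an RPM given as a local path or URL."""
    out = subprocess.check_output(
        ["rpm", "-qlp", package], universal_newlines=True
    )
    return parse_file_list(out)


def files_missing_from(base_files, sig_files):
    """Files shipped by the Storage SIG package but absent from the base one."""
    return sorted(sig_files - base_files)
```

Calling `files_missing_from(list_package_files(base_url), list_package_files(sig_url))` with the two URLs above would be expected to include /usr/bin/ceph-crush-location, as long as the packages still differ as described.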

This issue has been reported to the Storage SIG but it's not fixed yet, so you need to make sure you get all ceph packages from the ceph repo and not from base. As a workaround in puppet-ceph we used yum repo priorities (https://review.openstack.org/#/c/410823/).

I think in your case you are not using puppet-ceph, so a different fix is needed. Asking for a priority to be set in the release rpm provided by the Storage SIG may be the easiest.

Alfredo Moralejo (amoralej) wrote :

I've requested to apply priorities in the repo configuration as provided by the release rpm in https://github.com/CentOS-Storage-SIG/centos-release-ceph-hammer/issues/2

Ben Nemec (bnemec) wrote :

Okay, I can confirm that setting a priority on the Ceph-Hammer repo has gotten ceph working again in my mitaka environment. Thanks for looking into this.
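
For reference, the kind of change described here can be sketched as follows. This is not the actual workaround patch. It assumes the yum priorities plugin (`yum-plugin-priorities`) is installed and that the repo definition is an INI-style file under /etc/yum.repos.d/. The section name `centos-ceph-hammer`, the priority value, and the helper name `set_repo_priority` are illustrative assumptions.

```python
import configparser
import io


def set_repo_priority(repo_text, section, priority):
    """Return yum repo file text with `priority=N` set in one section.

    Note: configparser drops comments when the file is written back.
    """
    # interpolation=None so that "%" in URLs is left untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(repo_text)
    if not parser.has_section(section):
        raise KeyError("no repo section named %r" % section)
    parser.set(section, "priority", str(priority))
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()
```

With the priorities plugin, a lower number means a higher priority and repos without an explicit setting default to 99, so a call such as `set_repo_priority(text, "centos-ceph-hammer", 1)` would make the ceph repo win over base for packages with the same NVR.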

Ben Nemec (bnemec) on 2017-01-17
Changed in tripleo:
assignee: nobody → Ben Nemec (bnemec)
Ben Nemec (bnemec) wrote :

A workaround (https://review.openstack.org/420156) is in the gate, so dropping alert on this one.

tags: removed: alert
Ben Nemec (bnemec) on 2017-03-29
Changed in tripleo:
status: Triaged → Fix Released