Mitaka nonha ping test failing to upload image to glance

Bug #1654611 reported by Ben Nemec
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Ben Nemec
Milestone: (none)

Bug Description

For quite a while now the mitaka job has been failing the ping test. It appears to be a problem uploading the image to glance, but it's not clear whether the problem is the same as https://bugs.launchpad.net/tripleo/+bug/1646750 because the glance logs in the mitaka jobs are almost empty (just a few deprecation warnings). We did happen to run a job with some debugging enabled, and that output can be seen here: http://logs.openstack.org/34/417134/3/check-tripleo/gate-tripleo-ci-centos-7-ovb-nonha-mitaka/fff0bb7/console.html

For posterity, the traceback from the client looks like this:

2017-01-06 13:51:09.193619 | Traceback (most recent call last):
2017-01-06 13:51:09.193724 | File "/usr/lib/python2.7/site-packages/openstackclient/shell.py", line 118, in run
2017-01-06 13:51:09.193784 | ret_val = super(OpenStackShell, self).run(argv)
2017-01-06 13:51:09.193834 | File "/usr/lib/python2.7/site-packages/cliff/app.py", line 226, in run
2017-01-06 13:51:09.193870 | result = self.run_subcommand(remainder)
2017-01-06 13:51:09.193926 | File "/usr/lib/python2.7/site-packages/openstackclient/shell.py", line 153, in run_subcommand
2017-01-06 13:51:09.193971 | ret_value = super(OpenStackShell, self).run_subcommand(argv)
2017-01-06 13:51:09.194022 | File "/usr/lib/python2.7/site-packages/cliff/app.py", line 346, in run_subcommand
2017-01-06 13:51:09.194053 | result = cmd.run(parsed_args)
2017-01-06 13:51:09.194117 | File "/usr/lib/python2.7/site-packages/openstackclient/common/command.py", line 38, in run
2017-01-06 13:51:09.194164 | return super(Command, self).run(parsed_args)
2017-01-06 13:51:09.194224 | File "/usr/lib/python2.7/site-packages/cliff/display.py", line 79, in run
2017-01-06 13:51:09.194267 | column_names, data = self.take_action(parsed_args)
2017-01-06 13:51:09.194326 | File "/usr/lib/python2.7/site-packages/openstackclient/image/v1/image.py", line 264, in take_action
2017-01-06 13:51:09.194365 | image = image_client.images.create(**kwargs)
2017-01-06 13:51:09.194417 | File "/usr/lib/python2.7/site-packages/glanceclient/v1/images.py", line 324, in create
2017-01-06 13:51:09.194443 | data=image_data)
2017-01-06 13:51:09.194495 | File "/usr/lib/python2.7/site-packages/glanceclient/common/http.py", line 278, in post
2017-01-06 13:51:09.194531 | return self._request('POST', url, **kwargs)
2017-01-06 13:51:09.194602 | File "/usr/lib/python2.7/site-packages/glanceclient/common/http.py", line 248, in _request
2017-01-06 13:51:09.194666 | **kwargs)
2017-01-06 13:51:09.194726 | File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 475, in request
2017-01-06 13:51:09.194760 | resp = self.send(prep, **send_kwargs)
2017-01-06 13:51:09.194808 | File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 585, in send
2017-01-06 13:51:09.194839 | r = adapter.send(request, **kwargs)
2017-01-06 13:51:09.194887 | File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 434, in send
2017-01-06 13:51:09.194921 | r = low_conn.getresponse(buffering=True)
2017-01-06 13:51:09.194962 | File "/usr/lib64/python2.7/httplib.py", line 1089, in getresponse
2017-01-06 13:51:09.194986 | response.begin()
2017-01-06 13:51:09.195026 | File "/usr/lib64/python2.7/httplib.py", line 444, in begin
2017-01-06 13:51:09.195062 | version, status, reason = self._read_status()
2017-01-06 13:51:09.195104 | File "/usr/lib64/python2.7/httplib.py", line 408, in _read_status
2017-01-06 13:51:09.195132 | raise BadStatusLine(line)
2017-01-06 13:51:09.195155 | BadStatusLine: ''
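
If anyone wants to poke at this outside the ping test, this is roughly the step that is failing, so something like the following should reproduce it by hand (the image file and name here are just placeholders, and I'm assuming the usual overcloudrc credentials):

$ source ~/overcloudrc
$ openstack image list
$ openstack image create --disk-format qcow2 --container-format bare --file /path/to/some-image.qcow2 test-upload

The image list call is just to check that glance-api answers at all; the create is the call that dies with the BadStatusLine above.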

Tags: ci
tags: added: ci
Revision history for this message
Ben Nemec (bnemec) wrote :

This appears to be ceph-related. When I disabled ceph in the nonha job, the problem went away and we got the same failure as the ha job.

https://review.openstack.org/#/c/417462/

Revision history for this message
Alex Schultz (alex-schultz) wrote :

This seems like a timeout issue, since it happens almost exactly 60s after the upload starts.
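
If it really is a 60-second cutoff, one place worth checking is whatever sits in front of glance-api on the controller. This is just a guess at where to look (the haproxy config path is the usual TripleO one):

[root@overcloud-controller-0 ~]# grep -n 'glance' /etc/haproxy/haproxy.cfg
[root@overcloud-controller-0 ~]# grep -n 'timeout' /etc/haproxy/haproxy.cfg

i.e. look for a ~60s client/server timeout on the glance_api listener that could be cutting the upload off mid-request.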

Revision history for this message
Ben Nemec (bnemec) wrote :

I tried a ceph mitaka deploy locally and it looks like ceph is not even running. The logs aren't particularly helpful, but when I tried to start it manually I got:

[root@overcloud-cephstorage-0 ceph]# /etc/init.d/ceph start
=== osd.0 ===
/etc/init.d/ceph: line 365: /usr/bin/ceph-crush-location: No such file or directory
Invalid command: saw 0 of args(<string(goodchars [A-Za-z0-9-_.=])>) [<string(goodchars [A-Za-z0-9-_.=])>...], expected at least 1
osd crush create-or-move <osdname (id|osd.id)> <float[0.0-]> <args> [<args>...] : create entry or move existing entry for <name> <weight> at/to location <args>
Error EINVAL: invalid command
failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.0 --keyring=/var/lib/ceph/osd/ceph-0/keyring osd crush create-or-move -- 0 0.05 '

So I guess it's trying to call something that doesn't exist? I'm not sure how that would happen, but it kind of sounds like a packaging error. Either a binary was missed, or a mismatched init script was included.
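
A couple of quick checks along those lines (paths taken straight from the error above; nothing else assumed):

[root@overcloud-cephstorage-0 ceph]# ls -l /usr/bin/ceph-crush-location
[root@overcloud-cephstorage-0 ceph]# rpm -qf /etc/init.d/ceph
[root@overcloud-cephstorage-0 ceph]# rpm -ql ceph-common | grep crush
[root@overcloud-cephstorage-0 ceph]# yum provides '*/ceph-crush-location'

The first two confirm the helper really is gone and show which package shipped the init script that calls it; the last two show whether the installed ceph-common claims to own the file and which package/repo is supposed to provide it.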

I'll investigate further in the morning if someone hasn't already resolved this (hey, I can dream, right? ;-).

Revision history for this message
Sébastien Han (sebastien-han) wrote :

Ben, you should try using systemd to start the OSDs. Assuming it's OSD 0, just run the following:

systemctl start ceph-osd@0

You can find your OSD ids by looking in /var/lib/ceph/osd/.

Revision history for this message
Ben Nemec (bnemec) wrote :

Thanks for the suggestion. That is also failing unfortunately:

[root@overcloud-cephstorage-0 ceph]# systemctl start ceph-osd@0
Failed to start ceph-osd@0.service: Unit not found.
[root@overcloud-cephstorage-0 ceph]# ls /var/lib/ceph/osd/
ceph-0

Revision history for this message
Keith Schincke (keith-schincke) wrote :

Are the ceph-related packages installed? /usr/bin/ceph-crush-location from comment #3 is part of ceph-common.

Can you give a listing of the ceph and rbd packages?

Revision history for this message
Ben Nemec (bnemec) wrote :

I think we already talked on irc, but for posterity here's the list of packages:

[root@overcloud-cephstorage-0 ceph]# rpm -qa | egrep 'ceph|rbd'
ceph-0.94.5-1.el7.x86_64
libcephfs1-0.94.5-1.el7.x86_64
python-cephfs-0.94.5-1.el7.x86_64
python-rbd-0.94.5-1.el7.x86_64
ceph-common-0.94.5-1.el7.x86_64
librbd1-0.94.5-1.el7.x86_64

Revision history for this message
Ben Nemec (bnemec) wrote :

Okay, this is interesting. Comparing ceph packages to the last passing mitaka job I can find, they appear to be identical. I'm specifically looking at the cephstorage logs from http://logs.openstack.org/34/398234/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-nonha/ab50bcf/

So...I'm pretty much at a loss here. :-/

Revision history for this message
Alfredo Moralejo (amoralej) wrote :

This is related to the ceph libraries (ceph-common, librbd1, etc.) delivered in base CentOS 7.3, which are incompatible with the ceph hammer packages delivered by the CentOS Storage SIG. Unfortunately, they have the same NVR, and by default they are fetched from the base repo instead of the ceph one unless you increase the priority of the ceph repo.

You can check by comparing:

$ rpm -qlp http://mirror.centos.org/centos/7/os/x86_64/Packages/ceph-common-0.94.5-1.el7.x86_64.rpm | grep crush
(no output: the base CentOS package does not ship the file)

$ rpm -qlp http://mirror.centos.org/centos/7/storage/x86_64/ceph-hammer/ceph-common-0.94.5-1.el7.x86_64.rpm | grep crush
/usr/bin/ceph-crush-location

This issue has been reported to the Storage SIG but it isn't fixed yet, so you need to make sure you get all the ceph packages from the ceph repo and not from base. As a workaround in puppet-ceph we used yum repo priorities (https://review.openstack.org/#/c/410823/).

I think in your case you are not using puppet-ceph, so a different fix is needed; asking the Storage SIG to set the priority in the release rpm they provide might be the easiest option.
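
In the meantime, a rough sketch of the manual workaround on an affected node (the repo file name and section below come from the centos-release-ceph-hammer package and are assumptions on my side, so adjust to whatever the release rpm actually installs):

yum install -y yum-plugin-priorities
# add "priority=1" under the [centos-ceph-hammer] section of
# /etc/yum.repos.d/CentOS-Ceph-Hammer.repo (base repos default to priority 99
# and a lower number wins)
yum clean all
yum reinstall -y ceph-common
# with the priority in place, reinstall should pull the Storage SIG build that
# ships /usr/bin/ceph-crush-location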

Revision history for this message
Alfredo Moralejo (amoralej) wrote :

I've requested that priorities be applied in the repo configuration provided by the release rpm: https://github.com/CentOS-Storage-SIG/centos-release-ceph-hammer/issues/2

Revision history for this message
Ben Nemec (bnemec) wrote :

Okay, I can confirm that setting a priority on the Ceph-Hammer repo has gotten ceph working again in my mitaka environment. Thanks for looking into this.
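
For anyone hitting the same thing, a quick sanity check that the fix took (nothing assumed beyond the package names already listed above): yum info ceph-common should show a "From repo" line pointing at the ceph hammer repo rather than base, and /usr/bin/ceph-crush-location should exist again.

[root@overcloud-cephstorage-0 ~]# yum info ceph-common | grep -i 'from repo'
[root@overcloud-cephstorage-0 ~]# ls -l /usr/bin/ceph-crush-location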

Ben Nemec (bnemec)
Changed in tripleo:
assignee: nobody → Ben Nemec (bnemec)
Revision history for this message
Ben Nemec (bnemec) wrote :

A workaround (https://review.openstack.org/420156) is in the gate, so dropping alert on this one.

tags: removed: alert
Ben Nemec (bnemec)
Changed in tripleo:
status: Triaged → Fix Released