Puppet should exit with error if disk activate fails

Bug #1604728 reported by Marius Cornea
This bug affects 1 person
Affects: puppet-ceph
Status: Fix Released
Importance: Undecided
Assigned to: John Fulton

Bug Description

Ceph OSDs get created only on the first deployment. On any subsequent deployment the OSDs fail to get created.

Steps to reproduce:

First deployment:

source ~/stackrc
export THT='/home/stack/tripleo-heat-templates'
openstack overcloud deploy --templates $THT \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/hyperconverged-ceph.yaml \
-e $THT/environments/puppet-ceph-devel.yaml \
-e $THT/environments/puppet-pacemaker.yaml \
-e ~/templates/disk-layout.yaml \
--control-scale 3 \
--control-flavor controller \
--compute-scale 3 \
--compute-flavor compute \
--ntp-server clock.redhat.com \
--libvirt-type qemu

Make sure that the deployment was successful and OSDs got created:

[stack@undercloud ~]$ cat templates/disk-layout.yaml
parameter_defaults:
  ExtraConfig:
    ceph::profile::params::osds:
        '/dev/vdb': {}
        '/dev/vdc': {}
[root@overcloud-novacompute-0 heat-admin]# ceph osd tree
# id weight type name up/down reweight
-1 0.1199 root default
-2 0.03998 host overcloud-novacompute-2
0 0.01999 osd.0 up 1
3 0.01999 osd.3 up 1
-3 0.03998 host overcloud-novacompute-1
1 0.01999 osd.1 up 1
4 0.01999 osd.4 up 1
-4 0.03998 host overcloud-novacompute-0
2 0.01999 osd.2 up 1
5 0.01999 osd.5 up 1
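
For completeness, a quick hedged check on one of the compute nodes (the device names and grep pattern below are just illustrations for this environment) could look like:

# List the partitions ceph-disk prepared on the OSD devices
ceph-disk list
# Confirm that ceph-osd daemons are actually running for them
ps aux | grep '[c]eph-osd'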

Delete the existing deployment.

Redeploy using the initial deploy command.
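
A minimal sketch of the delete/redeploy cycle, assuming the overcloud Heat stack is named 'overcloud' (the exact delete command may vary with the client version in use):

source ~/stackrc
# Delete the existing overcloud stack; note that the Ceph disks on the nodes are NOT wiped
openstack stack delete overcloud
# Once the stack is gone, re-run the original 'openstack overcloud deploy ...' command unchanged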

Check OSDs:

[root@overcloud-novacompute-0 heat-admin]# ceph osd tree
# id weight type name up/down reweight
-1 0 root default

We can see the following errors in the os-collect-config journal:

Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: 2016-07-20 08:49:41.150384 7f2f29a8c700 0 librados: osd.5 authentication error (1) Operation not permitted
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: Error connecting to cluster: PermissionError
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: ceph-disk: Error: ceph osd start failed: Command '['/usr/sbin/service', 'ceph', '--cluster', 'ceph', 'start', 'osd.5']' returned non-zero exit status 1
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: 2016-07-20 08:49:42.000856 7f31f341f700 0 librados: osd.2 authentication error (1) Operation not permitted
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: Error connecting to cluster: PermissionError
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: ceph-disk: Error: ceph osd start failed: Command '['/usr/sbin/service', 'ceph', '--cluster', 'ceph', 'start', 'osd.2']' returned non-zero exit status 1
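
For reference, one hedged way to pull these lines out of the journal on an affected compute node (assuming os-collect-config runs as a systemd unit, as it does on TripleO images) is:

journalctl -u os-collect-config | grep 'ceph-osd-activate'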

As a workaround, we can erase the GPT partition tables of the OSD disks before redeploying:

sgdisk --zap /dev/vdb
sgdisk --zap /dev/vdc
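
A small sketch that applies the same workaround to every disk listed in disk-layout.yaml (the device list is specific to this environment; adjust it to match your ceph::profile::params::osds keys):

for disk in /dev/vdb /dev/vdc; do
  sgdisk --zap "$disk"   # erase the GPT structures left over from the previous deployment
done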

Revision history for this message
Marius Cornea (mcornea) wrote :

Speaking with gfidente about it: this is expected behavior, since the disks are not erased when the deployment is deleted, so the second deployment fails when running disk activate.

Nevertheless, the deployment completed fine even though disk activate failed, so I'm repurposing this bug to track that issue: the deployment should fail if disk activate fails.

Below is the full log:

Notice: /Stage[main]/Snmp/Service[snmpd]/ensure: ensure changed 'stopped' to 'running'
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: + test -b /dev/vdc
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: + test -b /dev/vdc
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: + test -b /dev/vdc1
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: + test -f /usr/lib/udev/rules.d/95-ceph-osd.rules.disabled
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: + test -b /dev/vdc1
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: + ceph-disk activate /dev/vdc1
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: === osd.5 ===
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: 2016-07-20 08:49:41.150384 7f2f29a8c700 0 librados: osd.5 authentication error (1) Operation not permitted
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: Error connecting to cluster: PermissionError
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.5 --keyring=/var/lib/ceph/osd/ceph-5/keyring osd crush create-or-move -- 5 0.02 host=overcloud-novacompute-0 root=default'
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: ceph-disk: Error: ceph osd start failed: Command '['/usr/sbin/service', 'ceph', '--cluster', 'ceph', 'start', 'osd.5']' returned non-zero exit status 1
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: + true
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: executed successfully
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: + test -b /dev/vdb
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: + test -b /dev/vdb
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: + test -b /dev/vdb1
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: + test -f /usr/lib/udev/rules.d/95-ceph-osd.rules.disabled
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: + test -b /dev/vdb1
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: + ceph-disk activate /dev/vdb1
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/...


affects: tripleo → puppet-ceph
summary: - Ceph osds get created only for the first deployment
+ Puppet should exit with error if disk activate fails
Changed in puppet-ceph:
status: New → Confirmed
Revision history for this message
Giulio Fidente (gfidente) wrote :

It looks like the problem is as follows:

after an initial deployment that used dedicated disks for Ceph, if we repeat the deployment and try to re-use those same disks without cleaning them up, the 'ceph-disk prepare' command run by puppet-ceph at [1] will exit 0 and continue, skipping 'ceph-disk activate' (which is supposed to be triggered via udev when using block devices), and finally attempt a 'systemctl start ceph-osd', which will also exit 0 (making Puppet think everything went fine) even though the ceph-osd daemon will later die.

1. https://github.com/openstack/puppet-ceph/blob/master/manifests/osd.pp#L102
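
In other words, Puppet reports success while the OSD is actually dead. A hedged way to confirm that symptom on a compute node right after the second deployment finishes (osd.5 is just the example from the log above) is:

# Puppet considered the exec successful, but no ceph-osd process is left running
ps aux | grep '[c]eph-osd'
# The cluster view agrees: the re-used OSDs never come back up
ceph osd tree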

Changed in puppet-ceph:
assignee: nobody → John Fulton (jfulton-org)
Changed in puppet-ceph:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-ceph (master)

Reviewed: https://review.openstack.org/371756
Committed: https://git.openstack.org/cgit/openstack/puppet-ceph/commit/?id=a46d5c5a2099eeeddad0c0ceaa931579f0872a3f
Submitter: Jenkins
Branch: master

commit a46d5c5a2099eeeddad0c0ceaa931579f0872a3f
Author: John Fulton <email address hidden>
Date: Fri Sep 16 14:29:25 2016 -0400

    Deployment should fail when trying to add another Ceph cluster's OSD

    This change explicitly adds the FSID to the $cluster_option variable
    and causes Puppet to exit if OSD preparation/activation will fail
    because the OSD belongs to a different Ceph cluster as determined by
    an FSID mismatch. FSID mismatch is a symptom of attempting to install
    over another deploy. The FSID mismatch failure will be logged so that
    the user may determine the reason for failure and then choose to zap
    away the old deploy before re-attempting deployment.

    Closes-Bug: 1604728
    Change-Id: I61d18400754842860372c4cc5f3b80d104d59706
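
A rough, hand-run sketch of the check the fix relies on: compare the cluster fsid from ceph.conf with the ceph_fsid recorded on the previously prepared data partition (the /dev/vdb1 device and the temporary mountpoint are assumptions based on this environment):

grep fsid /etc/ceph/ceph.conf      # fsid of the cluster the new deployment created
mnt=$(mktemp -d)
mount /dev/vdb1 "$mnt"             # data partition prepared by the previous deployment
cat "$mnt/ceph_fsid"               # fsid of the old cluster this OSD still belongs to
umount "$mnt" && rmdir "$mnt"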

Changed in puppet-ceph:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-ceph 2.2.0

This issue was fixed in the openstack/puppet-ceph 2.2.0 release.
