Deployment of ceph node fails: ceph-disk activate-all returned 1 instead of one of [0]

Bug #1541946 reported by Artem Panchenko
This bug affects 1 person
Affects                    Status         Importance   Assigned to       Milestone
Fuel for OpenStack         Fix Released   High         Andrew Woodward
Fuel for OpenStack 8.0.x   Invalid        High         MOS Ceph

Bug Description

Environment deployment fails because the puppet task 'ceph-osd.pp' returns an error on one of the compute+ceph nodes:

2016-02-04 16:02:02 ERROR [793] Task '{"priority"=>1400, "type"=>"puppet", "id"=>"top-role-ceph-osd", "parameters"=>{"retries"=>nil, "puppet_modules"=>"/etc/puppet/modules", "puppet_manifest"=>"/etc/puppet/modules/osnailyfacter/modular/ceph/ceph-osd.pp", "timeout"=>3600, "cwd"=>"/"}, "uids"=>["5"]}' failed on node 5
2016-02-04 16:02:02 ERROR [793] No more tasks will be executed on the node 5
2016-02-04 16:02:07 ERROR [793] Error running RPC method granular_deploy: Deployment failed on nodes 5

2016-02-04 16:01:59 +0000 Puppet (err): ceph-disk activate-all returned 1 instead of one of [0]
2016-02-04 16:01:59 +0000 /Stage[main]/Ceph::Osds/Exec[ceph-disk activate-all]/returns (err): change from notrun to 0 failed: ceph-disk activate-all returned 1 instead of one of [0]

root@node-5:~# ceph-disk activate-all
ceph-disk: Cannot discover filesystem type: device /dev/disk/by-parttypeuuid/4fbd7e29-9d25-41b8-afd0-062c0ceff05d.1ad7ffa9-7acc-48ce-bf87-ce8be4245017: Command '/sbin/blkid' returned non-zero exit status 2
ceph-disk: Cannot discover filesystem type: device /dev/disk/by-parttypeuuid/4fbd7e29-9d25-41b8-afd0-062c0ceff05d.3fcc99f7-8869-4f9b-9a42-87c3f58cb2f5: Command '/sbin/blkid' returned non-zero exit status 2
ceph-disk: Error: One or more partitions failed to activate
root@node-5:~# echo $?
1

root@node-5:~# blkid
/dev/vda3: UUID="0dd5e251-3b46-423d-a036-af69bfd271ae" TYPE="ext2"
/dev/vda4: UUID="2C9ojS-4YSv-0wEz-idlB-km0B-EDHS-xvK9F9" TYPE="LVM2_member"
/dev/vda5: UUID="fvGhqG-s193-cMJW-3dkD-bBUd-uNfF-oRkqhO" TYPE="LVM2_member"
/dev/vda6: LABEL="cidata" TYPE="iso9660"
/dev/mapper/vm-nova: UUID="c120e80b-00fc-43cf-9716-d2312f4794fe" TYPE="xfs"
/dev/mapper/os-root: UUID="b754ec2d-3ce9-45ca-82ec-be5e852396b3" TYPE="ext4"
/dev/mapper/os-swap: UUID="29929bba-c004-4ec9-9935-05b88edf5eef" TYPE="swap"
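
A minimal diagnostic sketch (device paths taken from the ceph-disk error output above): blkid exits with status 2 when it cannot identify any filesystem signature on a device, so querying the partitions behind the by-parttypeuuid links directly shows whether they were ever formatted.

# Resolve each Ceph data-partition link and ask blkid about the underlying device;
# an exit status of 2 means blkid found no recognizable filesystem on it.
for link in /dev/disk/by-parttypeuuid/4fbd7e29-9d25-41b8-afd0-062c0ceff05d.*; do
    dev=$(readlink -f "$link")
    echo "== $link -> $dev"
    blkid "$dev"
    echo "blkid exit status: $?"
done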

Steps to reproduce:

 1. Run 'ceph_rados_gw' system test (bvt_2 group)

Expected result: test passes
Actual result: test fails at the deployment step

Tags: area-mos
Revision history for this message
Artem Panchenko (apanchenko-8) wrote :
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

This is a really strange error. It looks like the disks just disappeared from the node during the test. Looking at the blkid output, there are no disks that could have been used as Ceph partitions. To me this looks like buggy hardware or some libvirt issue on the host that was used. I would also consider that some cleanup happened on the host node, e.g. the disks were purged or detached from the VM.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Here is the HW discovery data from node-5, which shows that there were two disks, vdb and vdc, at the beginning, but we cannot find them in the blkid output, which is rather strange.
{
 "mac": "64:48:F8:DA:75:9C",
 "ip": "10.109.45.6",
 "os_platform": "ubuntu",
 "manufacturer": "QEMU",
 "platform_name": "Standard PC (i440FX + PIIX, 1996)",
 "meta": {
  "system": {
   "manufacturer": "QEMU",
   "uuid": "78E8657E-75C6-470A-96B8-106B98E3700F",
   "product": "Standard PC (i440FX + PIIX, 1996)",
   "version": "pc-i440fx-2.4",
   "fqdn": "node-5.test.domain.local"
  },
  "interfaces": [{
    "name": "enp0s3",
    "state": "up",
    "mac": "64:48:f8:da:75:9c",
    "pxe": false,
    "driver": "e1000",
    "bus_info": "0000:00:03.0",
    "max_speed": 1000,
    "current_speed": 1000,
    "ip": "10.109.45.6",
    "netmask": "255.255.255.0",
    "offloading_modes": [{
     "name": "rx-all",
     "state": null,
     "sub": []
    }, {
     "name": "rx-fcs",
     "state": null,
     "sub": []
    }, {
     "name": "tx-nocache-copy",
     "state": null,
     "sub": []
    }, {
     "name": "rx-vlan-offload",
     "state": null,
     "sub": []
    }, {
     "name": "generic-receive-offload",
     "state": null,
     "sub": []
    }, {
     "name": "generic-segmentation-offload",
     "state": null,
     "sub": []
    }, {
     "name": "tcp-segmentation-offload",
     "state": null,
     "sub": [{
      "name": "tx-tcp-segmentation",
      "state": null,
      "sub": []
     }]
    }, {
     "name": "scatter-gather",
     "state": null,
     "sub": [{
      "name": "tx-scatter-gather",
      "state": null,
      "sub": []
     }]
    }, {
     "name": "tx-checksumming",
     "state": null,
     "sub": [{
      "name": "tx-checksum-ip-generic",
      "state": null,
      "sub": []
     }]
    }, {
     "name": "rx-checksumming",
     "state": null,
     "sub": []
    }]
   }, {
    "name": "enp0s4",
    "state": "down",
    "mac": "64:98:86:b8:3f:ed",
    "pxe": false,
    "driver": "e1000",
    "bus_info": "0000:00:04.0",
    "max_speed": 1000,
    "current_speed": 1000,
    "offloading_modes": [{
     "name": "rx-all",
     "state": null,
     "sub": []
    }, {
     "name": "rx-fcs",
     "state": null,
     "sub": []
    }, {
     "name": "tx-nocache-copy",
     "state": null,
     "sub": []
    }, {
     "name": "rx-vlan-offload",
     "state": null,
     "sub": []
    }, {
     "name": "generic-receive-offload",
     "state": null,
     "sub": []
    }, {
     "name": "generic-segmentation-offload",
     "state": null,
     "sub": []
    }, {
     "name": "tcp-segmentation-offload",
     "state": null,
     "sub": [{
      "name": "tx-tcp-segmentation",
      "state": null,
      "sub": []
     }]
    }, {
     "name": "scatter-gather",
     "state": null,
     "sub": [{
      "name": "tx-scatter-gather",
      "state": null,
      "sub": []
     }]
    }, {
     "name": "tx-checksumming",
     "state": null,
     "sub": [{
      "name": "tx-checksum-ip-generic",
      "state": null,
      "sub": []
     }]
    }, {
     "name": "rx-checksumming",
     "state": null,
     "sub": []
    }]
   }, {
    "name": "enp0s5",
    "state": "down",
    "mac": "6...

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

I checked the disks on the env - the partitions are in place, but it seems the deployment skipped the prepare step for partitions /dev/vdb3 and /dev/vdc3 - no file systems were created there. So we need to ensure that we did not miss important relationships in the puppet manifests.

Changed in fuel:
status: New → Confirmed
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Andrew Woodward (xarses)
status: Confirmed → In Progress
Revision history for this message
Andrew Woodward (xarses) wrote :
Revision history for this message
Andrew Woodward (xarses) wrote :

`ceph-disk activate-all` is used in conjunction with `udevadm trigger`
to prevent already prepared/provisioned disks from being re-formatted.

The error reported in bug 1541946 is that activate-all fails because
there is no filesystem on the devices yet. This is expected on nodes
where the OSD was not previously set up successfully. To address this
we should allow the command to return errors.
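
In shell terms the idea is roughly the following (a sketch only; the merged change below applies the same idea to the Exec[ceph-disk activate-all] resource in the Puppet manifest):

# Accept exit status 0 or 1 from activate-all: on a node whose OSDs were never
# prepared there is simply nothing to activate yet, which should not be fatal.
ceph-disk activate-all || [ $? -eq 1 ]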

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I believe the issue affects 8.0 as well

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/276409
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=d01f994d0e2ac8a0fdafe8834f835b8a49afdc2d
Submitter: Jenkins
Branch: master

commit d01f994d0e2ac8a0fdafe8834f835b8a49afdc2d
Author: Andrew Woodward <email address hidden>
Date: Thu Feb 4 10:47:11 2016 -0800

    Allow ceph-disk activate-all to return errors

    `ceph-disk activate-all` is used in conjuction with `udevadm trigger`
    to prevent allready prepapred/provisioned disks from being re-formatted.

    The error expressed in bug: 1541946 is that activate-all has errors
    because there is no filesystem on the devices. This is accurate on nodes
    where the OSD was not previously sucessful. To combat this we should
    allow for the command to return errors

    Change-Id: I5cf29fcba7b0443832489c90518a9197c065ac8c
    Closes-bug: 1541946

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Eugene Bogdanov (ebogdanov) wrote :

Downgraded to High - no data corruption and there is a workaround, so it doesn't match the Critical definition.

Changed in fuel:
importance: Critical → High
Mike Scherbakov (mihgen)
tags: added: area-mos
removed: area-library
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

I don't think the patch [1] is correct. ceph-disk exits via raising the SystemExit exception [2].
The exit code 1 means that the error code hasn't been explicitly specified:
"If the associated value is a plain integer, it specifies the system exit status (passed to C’s exit() function);
if it is None, the exit status is zero; if it has another type (such as a string), the object’s value is printed
and the exit status is one." [3] Thus the only possible error code is 1, and ignoring it is not safe at all.

[1] https://review.openstack.org/276409
[2] https://github.com/ceph/ceph/blob/hammer/src/ceph-disk#L2974-L2990
[3] https://docs.python.org/2/library/exceptions.html#exceptions.SystemExit
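
For reference, the SystemExit behaviour described above is easy to confirm from a shell (python here is python2, which ceph-disk is written in):

python -c 'import sys; sys.exit("some error text")'; echo $?   # prints the text, exit status 1
python -c 'import sys; sys.exit(3)'; echo $?                   # exit status 3
python -c 'import sys; sys.exit()'; echo $?                    # exit status 0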

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Per discussion with QA team:

This has never been seen on 8.0, and given that it is part of BVT (run 4 times a day) we would have hit it for sure if 8.0 were affected. We believe this regression was introduced in 9.0, so the 8.0.x task is marked as Invalid.

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

The issue reproduced again with the following cluster configuration:
Neutron VLAN; Ceph for volumes, images and Rados; Ironic selected
1 controller+ceph, 2 controllers+ceph+ironic, 1 compute, 1 ironic

Deployment failed on node-1

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Changed in fuel:
status: Fix Committed → Confirmed
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

Andrey,

> Issue reproduced again

Nope, it's a completely different problem:

2016-02-29 00:08:55 +0000 /Stage[main]/Ceph::Osds/Ceph::Osds::Osd[/dev/vdc3]/Exec[ceph-deploy osd prepare node-1:/dev/vdc3]/returns (notice): [ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf
2016-02-29 00:08:55 +0000 /Stage[main]/Ceph::Osds/Ceph::Osds::Osd[/dev/vdc3]/Exec[ceph-deploy osd prepare node-1:/dev/vdc3]/returns (notice): [ceph_deploy.cli][INFO ] Invoked (1.5.20): /usr/bin/ceph-deploy osd prepare node-1:/dev/vdc3
2016-02-29 00:08:55 +0000 /Stage[main]/Ceph::Osds/Ceph::Osds::Osd[/dev/vdc3]/Exec[ceph-deploy osd prepare node-1:/dev/vdc3]/returns (notice): [ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks node-1:/dev/vdc3:
2016-02-29 00:08:55 +0000 /Stage[main]/Ceph::Osds/Ceph::Osds::Osd[/dev/vdc3]/Exec[ceph-deploy osd prepare node-1:/dev/vdc3]/returns (notice): [ceph_deploy][ERROR ] RuntimeError: bootstrap-osd keyring not found; run 'gatherkeys'
2016-02-29 00:08:55 +0000 Puppet (err): ceph-deploy osd prepare node-1:/dev/vdc3 returned 1 instead of one of [0]

ceph-deploy osd prepare node-1:/dev/vdc3 fails because the bootstrap-osd keyring hasn't been created properly (there is no file node-1/root/ceph.bootstrap-osd.keyring). Also, there is no partition named /dev/vdc3.

[10.109.6.4] out: /dev/vda3 ext2 /boot ddf014a5-f8a4-4e6e-9b45-efbb457543cc
[10.109.6.4] out: /dev/vda4 LVM2_member (in use) TVkdrb-IPO8-nZuR-0esd-IA0f-DaoH-fFr7Y8
[10.109.6.4] out: /dev/vda5 LVM2_member (in use) oraQcF-IteM-xfFc-PylP-2X1Z-1HhU-HK5H8H
[10.109.6.4] out: /dev/vda6 LVM2_member (in use) djVcBD-VDUa-wlVg-1pUh-RRcQ-HYZO-XVmlx5
[10.109.6.4] out: /dev/vda7 LVM2_member (in use) Ng1If3-PgZG-5KNT-ZN0Z-mvCu-8tX3-ZTxoXu
[10.109.6.4] out: /dev/vda8 iso9660 cidata (not mounted)
[10.109.6.4] out: /dev/vdb3 LVM2_member (in use) t67Vcj-YptD-y0WP-fiQF-vSTs-l3GC-Q0E9hf

Please file a new bug (with a proper description) and attach the relevant data.
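
A rough triage sketch for this kind of failure (paths taken from the log and the comment above; adjust for the actual node):

# Is the keyring that ceph-deploy complains about actually present on the node?
ls -l /root/ceph.bootstrap-osd.keyring
# Was /dev/vdc3 ever created? It should show up in the partition listing.
lsblk /dev/vdc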

Changed in fuel:
status: Confirmed → Fix Committed
Revision history for this message
Ksenia Svechnikova (kdemina) wrote :

Verified on fuel-9.0-mos-393-2016-05-24.

Changed in fuel:
status: Fix Committed → Fix Released