[library] ceph-deploy osd prepare node-5:/dev/vdb2 returned 1 instead of one of [0]

Bug #1323343 reported by Egor Kotko
This bug affects 2 people
Affects                   Status         Importance  Assigned to  Milestone
Fuel for OpenStack        Fix Released   High        Ryan Moe
Fuel for OpenStack 5.0.x  Fix Committed  High        Ryan Moe

Bug Description

{
    "api": "1.0",
    "astute_sha": "a7eac46348dc77fc2723c6fcc3dbc66cc1a83152",
    "build_id": "2014-05-25_23-01-31",
    "build_number": "22",
    "fuellib_sha": "b9985e42159187853edec82c406fdbc38dc5a6d0",
    "fuelmain_sha": "db2d153e62cb2b3034d33359d7e3db9d4742c811",
    "mirantis": "yes",
    "nailgun_sha": "bd09f89ef56176f64ad5decd4128933c96cb20f4",
    "ostf_sha": "1f020d69acbf50be00c12c29564f65440971bafe",
    "production": "docker",
    "release": "5.0"
}

Deployment finished with errors in log:
http://paste.openstack.org/show/81532/

Steps to reproduce:
1) Create env:
Ubuntu HA Neutron-GRE
Ceph for both Cinder and Glance

2 x Controller + Ceph OSD
1 x Compute

Expected result:
Deployed cluster

Actual result:
Deployed with errors

Tags: ceph
Revision history for this message
Egor Kotko (ykotko) wrote :
Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Reproduced several times on bare-metal (DELL)

{
    "api": "1.0",
    "astute_sha": "a7eac46348dc77fc2723c6fcc3dbc66cc1a83152",
    "build_id": "2014-05-26_18-06-28",
    "build_number": "24",
    "fuellib_sha": "2f79c0415159651fc1978d99bd791079d1ae4a06",
    "fuelmain_sha": "d7f86968880a484d51f99a9fc439ef21139ea0b0",
    "mirantis": "yes",
    "nailgun_sha": "bd09f89ef56176f64ad5decd4128933c96cb20f4",
    "ostf_sha": "89bbddb78132e2997d82adc5ae5db9dcb7a35bcd",
    "production": "docker",
    "release": "5.0"
}

Environment:
multinode, 1 controller+ceph-osd, 1 compute+ceph-osd, 1 mongodb.
        "volumes_lvm": False,
        "volumes_ceph": True,
        "images_ceph": True,
        "murano": True,
        "sahara": True,
        "ceilometer": True,
        "net_provider": 'neutron',
        "net_segment_type": 'gre',
        "libvirt_type": "kvm"

Sporadic errors during deployments:
2014-05-27T10:14:01.414383+00:00 err: ceph-deploy osd prepare node-2:/dev/sda5 returned 1 instead of one of [0]

More detailed logs:
2014-05-27T10:14:01.398761+00:00 notice: (/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns) [node-2][INFO ] Running command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sda5
2014-05-27T10:14:01.399020+00:00 notice: (/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns) [node-2][ERROR ] Traceback (most recent call last):
2014-05-27T10:14:01.399728+00:00 notice: (/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns) [node-2][ERROR ] File "/usr/lib/python2.7/dist-packages/ceph_deploy/osd.py", line 126, in prepare_disk
2014-05-27T10:14:01.400017+00:00 notice: (/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns) [node-2][ERROR ] File "/usr/lib/python2.7/dist-packages/ceph_deploy/util/decorators.py", line 10, in inner
2014-05-27T10:14:01.400017+00:00 notice: (/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns) [node-2][ERROR ] def inner(*args, **kwargs):
2014-05-27T10:14:01.400265+00:00 notice: (/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns) [node-2][ERROR ] File "/usr/lib/python2.7/dist-packages/ceph_deploy/util/wrappers.py", line 6, in remote_call
2014-05-27T10:14:01.401206+00:00 notice: (/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns) [node-2][ERROR ] This allows us to only remote-execute the actual calls, not whole functions.
2014-05-27T10:14:01.401420+00:00 notice: (/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns) [node-2][ERROR ] File "/usr/lib/python2.7/subprocess.py", line 511, in check_call
2014-05-27T10:14:01.401420+00:00 notice: (/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns) [node-2][ERROR ] raise CalledProcessError(retcode, cmd)
2014-05-27T10:14:01.401519+00:00 notice: (/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns) [node-2][ERROR ] CalledProcessError: Command '['ceph-disk-prepare', '--fs-type', 'xfs', '--cluster', 'ceph', '--', '/dev/sda5']' returned non-zero exit status 1
2014-05-27T10:14:01.402453+00:00 notice: (/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns) [node-2][ERROR ] Traceback (most recent call last):
2014-05-27T10:14:01.402676+00:00...


Changed in fuel:
importance: Medium → Critical
milestone: 5.1 → 5.0
Revision history for this message
Aleksandr Didenko (adidenko) wrote :

The bare-metal bug is pretty specific and can be reproduced only with numerous re-deployments on the same hardware, so I'm lowering this back to Medium. I will create a separate bare-metal related bug.

Revision history for this message
Anastasia Palkina (apalkina) wrote :

This bug reproduced on master ISO

"build_id": "2014-05-30_00-35-28",
"mirantis": "yes",
"build_number": "230",
"ostf_sha": "3e709cba57df7e958d01484aeb80ba7a3c875133",
"nailgun_sha": "3e2c266bf285b86f6d222c60aca999ea6a745b50",
"production": "docker",
"api": "1.0",
"fuelmain_sha": "e6a54762e6fa49f9d24a2f658f9d30d6b84b43a1",
"astute_sha": "b1f8d0eafed110fd748e473ea74674e7e1c495eb",
"release": "5.1",
"fuellib_sha": "bafa771b4ccd9df266cc2238810b67c6cc7aa995"

1. Create new environment (Ubuntu, simple mode)
2. Choose GRE segmentation
3. Choose both Ceph
4. Choose Sahara and Ceilometer installation
5. Add controller+ceph, compute+ceph, mongo
6. Start deployment. It failed

Error in puppet.log on controller node:

(/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns) change from notrun to 0 failed: ceph-deploy osd prepare node-1:/dev/sdb2 node-1:/dev/sdc2 returned 1 instead of one of [0]

Error in puppet.log on compute node:

(/Stage[main]/Ceph::Conf/Exec[ceph-deploy gatherkeys remote]/returns) change from notrun to 0 failed: ceph-deploy gatherkeys node-1 returned 1 instead of one of [0]

Revision history for this message
Anastasia Palkina (apalkina) wrote :
Revision history for this message
Vladimir Grujic (hyperbaba) wrote :

The problem is in redeployment scenarios.
ceph-deploy creates a GPT partition on the OSD disk and tries to format it with mkfs.xfs.
mkfs.xfs itself checks for an already existing XFS filesystem on the partition and refuses to format it.
So it is not only LVM leftovers that break the Ceph recipe; this is a problem as well.
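The workaround implied above can be sketched with a file standing in for the real partition (all names here are illustrative, and the file substitute makes the commands safe to run): zeroing the head of the partition destroys any leftover superblock, so mkfs.xfs no longer refuses to format it.

```shell
# File-backed sketch: DEV is a stand-in for a real OSD partition such as /dev/sdb2.
DEV=./fake-partition.img

# Simulate leftover filesystem data from a previous deployment.
dd if=/dev/urandom of="$DEV" bs=1M count=4 2>/dev/null

# Wipe it: zero the region so no old superblock survives.
dd if=/dev/zero of="$DEV" bs=1M count=4 conv=notrunc 2>/dev/null

# On a real node the equivalent (destructive!) command would be something like:
#   dd if=/dev/zero of=/dev/sdb2 bs=1M count=10
# before re-running ceph-deploy osd prepare.
```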

Revision history for this message
Anastasia Palkina (apalkina) wrote :

Reproduced on ISO #112
"build_id": "2014-07-10_00-39-56",
"mirantis": "yes",
"build_number": "112",
"ostf_sha": "09b6bccf7d476771ac859bb3c76c9ebec9da9e1f",
"nailgun_sha": "f5ff82558f99bb6ca7d5e1617eddddf7142fe857",
"production": "docker",
"api": "1.0",
"fuelmain_sha": "293015843304222ead899270449495af91b06aed",
"astute_sha": "5df009e8eab611750309a4c5b5c9b0f7b9d85806",
"release": "5.0.1",
"fuellib_sha": "364dee37435cbdc85d6b814a61f57800b83bf22d"

1. Create new environment (CentOS, simple mode)
2. Choose VLAN segmentation
3. Choose Ceph for images
4. Add controller, compute, cinder, 2 ceph
5. Untag storage network and move it to other interface
6. Start deployment. It was successful
7. But there is an error in puppet.log on the ceph node (node-35):

2014-07-11 09:49:47 ERR

 (/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns) change from notrun to 0 failed: ceph-deploy osd prepare node-35:/dev/sdb4 node-35:/dev/sdc4 returned 1 instead of one of [0]

Revision history for this message
Anastasia Palkina (apalkina) wrote :
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

/root/ceph.log on node-35:

2014-07-11 09:49:28,503 [node-35][INFO ] Running command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdb4
2014-07-11 09:49:31,873 [node-35][ERROR ] Traceback (most recent call last):
2014-07-11 09:49:31,874 [node-35][ERROR ] File "/usr/lib/python2.6/site-packages/ceph_deploy/osd.py", line 126, in prepare_disk
2014-07-11 09:49:31,932 [node-35][ERROR ] File "/usr/lib/python2.6/site-packages/ceph_deploy/util/decorators.py", line 10, in inner
2014-07-11 09:49:31,948 [node-35][ERROR ] def inner(*args, **kwargs):
2014-07-11 09:49:31,960 [node-35][ERROR ] File "/usr/lib/python2.6/site-packages/ceph_deploy/util/wrappers.py", line 6, in remote_call
2014-07-11 09:49:31,982 [node-35][ERROR ] This allows us to only remote-execute the actual calls, not whole functions.
2014-07-11 09:49:31,990 [node-35][ERROR ] File "/usr/lib64/python2.6/subprocess.py", line 505, in check_call
2014-07-11 09:49:32,003 [node-35][ERROR ] raise CalledProcessError(retcode, cmd)
2014-07-11 09:49:32,013 [node-35][ERROR ] CalledProcessError: Command '['ceph-disk-prepare', '--fs-type', 'xfs', '--cluster', 'ceph', '--', '/dev/sdb4']' returned non-zero exit status 1
2014-07-11 09:49:32,048 [node-35][ERROR ] mkfs.xfs: /dev/sdb4 contains a mounted filesystem
2014-07-11 09:49:32,048 [node-35][ERROR ] Usage: mkfs.xfs
...
2014-07-11 09:49:32,061 [node-35][ERROR ] ceph-disk: Error: Command '['/sbin/mkfs', '-t', 'xfs', '-f', '-i', 'size=2048', '--', '/dev/sdb4']' returned non-zero exit status 1
2014-07-11 09:49:32,080 [ceph_deploy.osd][ERROR ] Failed to execute command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdb4
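The "contains a mounted filesystem" line above is the actual failure: mkfs.xfs refuses a target that is currently mounted. A minimal pre-check sketch (the device path is hypothetical, and the /proc/mounts check is Linux-specific):

```shell
DEV=/dev/sdb4   # hypothetical device, matching the log above

# Check /proc/mounts for the device before attempting to format it.
STATE=$(grep -qs "^$DEV " /proc/mounts && echo mounted || echo "not mounted")
echo "$STATE"
# If mounted, it must be unmounted first:  umount "$DEV"
```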

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :
Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/5.1.x
Changed in fuel:
milestone: 5.0.1 → 5.1
Dmitry Ilyin (idv1985)
summary: - ceph-deploy osd prepare node-5:/dev/vdb2 returned 1 instead of one of
- [0]
+ [library] ceph-deploy osd prepare node-5:/dev/vdb2 returned 1 instead of
+ one of [0]
Revision history for this message
Ryan Moe (rmoe) wrote :

This seems to be an intermittent issue (I had to delete and redeploy the same node six times for this to show up). Sometimes when a node is deleted from the environment the individual partitions are not wiped.

This node had OSDs deployed on vdb and vdc and was then redeployed with nothing allocated to those disks. After provisioning you can see that the previous partition data still exists.

[root@node-16 ~]# parted /dev/vdc print
Model: Virtio Block Device (virtblk)
Disk /dev/vdc: 32.2GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name     Flags
 1      17.4kB  25.2MB  25.1MB               primary  bios_grub
 2      25.2MB  235MB   210MB                primary  boot
 3      235MB   445MB   210MB   ext2         primary  boot

[root@node-16 ~]# parted -a none -s /dev/vdc unit MiB mkpart primary 424 30580

[root@node-16 ~]# parted /dev/vdc print
Model: Virtio Block Device (virtblk)
Disk /dev/vdc: 32.2GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name     Flags
 1      17.4kB  25.2MB  25.1MB               primary  bios_grub
 2      25.2MB  235MB   210MB                primary  boot
 3      235MB   445MB   210MB   ext2         primary  boot
 4      445MB   32.1GB  31.6GB  xfs          primary

(the new partition has an XFS file system formatted on it)

[root@node-16 ~]# mount /dev/vdc4 /mnt
[root@node-16 ~]# ls /mnt/
activate.monmap active ceph_fsid current fsid journal keyring magic ready store_version superblock sysvinit whoami
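The parted session above can be condensed into a file-backed illustration of why this happens (file names are hypothetical; "XFSB" is the XFS superblock magic): deleting a partition only edits the partition table, while the bytes at its old offset survive, so a new partition created at the same offset inherits the old filesystem.

```shell
DISK=./fake-disk.img
truncate -s 8M "$DISK"

# "Format" an OSD partition that starts at the 1 MiB mark by writing
# the XFS superblock magic at that offset.
printf 'XFSB' | dd of="$DISK" bs=1 seek=1048576 conv=notrunc 2>/dev/null

# The partition is deleted and later recreated at the same 1 MiB offset...
# ...and the old superblock magic is still sitting there:
MAGIC=$(dd if="$DISK" bs=1 skip=1048576 count=4 2>/dev/null)
echo "$MAGIC"
```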

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (master)

Fix proposed to branch: master
Review: https://review.openstack.org/107793

Changed in fuel:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/107793
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=9f1e69aa3a2fe7a6093fa50596d6931826d93a09
Submitter: Jenkins
Branch: master

commit 9f1e69aa3a2fe7a6093fa50596d6931826d93a09
Author: Ryan Moe <email address hidden>
Date: Thu Jul 17 11:30:29 2014 -0700

    Wipe each partition for all disks when removing nodes

    If partitions are not wiped then old formatted filesystems
    will show up if a partition is created at the same point as
    the previous one. This will cause ceph-deploy to fail when
    bringing up an OSD. Ceph will see an XFS filesystem and the
    correct partition GUID and auto-mount the partition prior to
    activating the OSD.

    Change-Id: I2b71ec429ac1982f3df585423ca0818b294d8210
    Closes-bug: #1323343

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (stable/5.0)

Fix proposed to branch: stable/5.0
Review: https://review.openstack.org/107826

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (stable/5.0)

Reviewed: https://review.openstack.org/107826
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=6db5f5031b74e67b92fcac1f7998eaa296d68025
Submitter: Jenkins
Branch: stable/5.0

commit 6db5f5031b74e67b92fcac1f7998eaa296d68025
Author: Ryan Moe <email address hidden>
Date: Thu Jul 17 11:30:29 2014 -0700

    Wipe each partition for all disks when removing nodes

    If partitions are not wiped then old formatted filesystems
    will show up if a partition is created at the same point as
    the previous one. This will cause ceph-deploy to fail when
    bringing up an OSD. Ceph will see an XFS filesystem and the
    correct partition GUID and auto-mount the partition prior to
    activating the OSD.

    Change-Id: I2b71ec429ac1982f3df585423ca0818b294d8210
    Closes-bug: #1323343
    (cherry picked from commit 9f1e69aa3a2fe7a6093fa50596d6931826d93a09)

Revision history for this message
Anastasia Palkina (apalkina) wrote :

Verified on ISO #347
"build_id": "2014-07-23_02-01-14",
"ostf_sha": "c1b60d4bcee7cd26823079a86e99f3f65414498e",
"build_number": "347",
"auth_required": false,
"api": "1.0",
"nailgun_sha": "f5775d6b7f5a3853b28096e8c502ace566e7041f",
"production": "docker",
"fuelmain_sha": "74b9200955201fe763526ceb51607592274929cd",
"astute_sha": "fd9b8e3b6f59b2727b1b037054f10e0dd7bd37f1",
"feature_groups": ["mirantis"],
"release": "5.1",
"fuellib_sha": "fb0e84c954a33c912584bf35054b60914d2a2360"

Changed in fuel:
status: Fix Committed → Fix Released