Volumes are in error state if ceph for volumes and for images - error connecting to ceph cluster

Bug #1352335 reported by Andrey Sledzinskiy
42
This bug affects 7 people
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
Fix Committed
High
Dmitry Borodaenko

Bug Description

{

    "build_id": "2014-08-03_02-01-14",
    "ostf_sha": "a3fa823ea0e4e03beb637ae07a91adea82c33182",
    "build_number": "389",
    "auth_required": true,
    "api": "1.0",
    "nailgun_sha": "bd0127be0061029f9f910547db5e633c82244942",
    "production": "docker",
    "fuelmain_sha": "e99879292cf6e96b8991300d947df76b69134bb1",
    "astute_sha": "ce86172e77661026c91fdf1ff8066d7df1f7d89d",
    "feature_groups": [
        "mirantis"
    ],
    "release": "5.1",
    "fuellib_sha": "4e3fdd75f8dabde8e5d07067545d8043a70a176b"

}

Create next cluster:
-Ubuntu, Simple, Flat Nova-network, Ceph for volumes and for images, 1 controller+ceph, 2 compute+ceph
-Deploy cluster
- After deployment try to create volume

Expected - volume is created successfully
Actual - volume creation failed with error in cinder-volume log on controller (node-1):
2014-08-03 16:22:32 ERROR

cinder.volume.manager [req-e401e969-ae6e-4e18-8207-762c25189d67 - - - - -] Bad or unexpected response from the storage volume backend API: error connecting to ceph cluster
2014-08-03 15:22:32.595 23084 TRACE cinder.volume.manager Traceback (most recent call last):
2014-08-03 15:22:32.595 23084 TRACE cinder.volume.manager File "/usr/lib/python2.7/dist-packages/cinder/volume/manager.py", line 243, in init_host
2014-08-03 15:22:32.595 23084 TRACE cinder.volume.manager self.driver.check_for_setup_error()
2014-08-03 15:22:32.595 23084 TRACE cinder.volume.manager File "/usr/lib/python2.7/dist-packages/cinder/volume/drivers/rbd.py", line 268, in check_for_setup_error
2014-08-03 15:22:32.595 23084 TRACE cinder.volume.manager raise exception.VolumeBackendAPIException(data=msg)
2014-08-03 15:22:32.595 23084 TRACE cinder.volume.manager VolumeBackendAPIException: Bad or unexpected response from the storage volume backend API: error connecting to ceph cluster
2014-08-03 15:22:32.595 23084 TRACE cinder.volume.manager

2014-08-03 16:22:32 ERROR

cinder.volume.manager [req-e401e969-ae6e-4e18-8207-762c25189d67 - - - - -] Error encountered during initialization of driver: RBDDriver

2014-08-03 16:22:32 ERROR

cinder.volume.drivers.rbd [req-e401e969-ae6e-4e18-8207-762c25189d67 - - - - -] error connecting to ceph cluster
2014-08-03 15:22:32.581 23084 TRACE cinder.volume.drivers.rbd Traceback (most recent call last):
2014-08-03 15:22:32.581 23084 TRACE cinder.volume.drivers.rbd File "/usr/lib/python2.7/dist-packages/cinder/volume/drivers/rbd.py", line 263, in check_for_setup_error
2014-08-03 15:22:32.581 23084 TRACE cinder.volume.drivers.rbd with RADOSClient(self):
2014-08-03 15:22:32.581 23084 TRACE cinder.volume.drivers.rbd File "/usr/lib/python2.7/dist-packages/cinder/volume/drivers/rbd.py", line 235, in __init__
2014-08-03 15:22:32.581 23084 TRACE cinder.volume.drivers.rbd self.cluster, self.ioctx = driver._connect_to_rados(pool)
2014-08-03 15:22:32.581 23084 TRACE cinder.volume.drivers.rbd File "/usr/lib/python2.7/dist-packages/cinder/volume/drivers/rbd.py", line 283, in _connect_to_rados
2014-08-03 15:22:32.581 23084 TRACE cinder.volume.drivers.rbd client.connect()
2014-08-03 15:22:32.581 23084 TRACE cinder.volume.drivers.rbd File "/usr/lib/python2.7/dist-packages/rados.py", line 419, in connect
2014-08-03 15:22:32.581 23084 TRACE cinder.volume.drivers.rbd raise make_ex(ret, "error calling connect")
2014-08-03 15:22:32.581 23084 TRACE cinder.volume.drivers.rbd ObjectNotFound: error calling connect
2014-08-03 15:22:32.581 23084 TRACE cinder.volume.drivers.rbd

Logs are attached

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Can't replicate in a CentOS environment I deployed from ISO #376.

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

There's two separate problems in the logs:

1) cinder-volume is trying and failing to connect to the Ceph cluster (log fragments from the bug description). This happens after ceph-mon is initialized, but before any osds were added to the cluster. This works fine on CentOS, but is reproducible in Ubuntu. Unfortunately cinder rbd driver exception handling is broken so there's no indication with it failed to connect to the ceph cluster.

Since this can be reproduced on Ubuntu, updating status to Confirmed. Since there's a trivial workaround (restarting cinder-volume service makes the problem go away), updating priority to Medium. Leaving target milestone as 5.1 for now due to likely impact on system tests (OSTF will be failing until cinder-volume is restarted).

2) when osd processes are started by ceph-deploy, they report following errors:

2014-08-03 15:23:46.163693 7f4f5b0da7c0 -1 asok(0x7f4f5e382230) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-osd.0.aso
2014-08-03 15:23:46.163729 7f4f5b0da7c0 0 filestore(/var/lib/ceph/osd/ceph-0) lock_fsid failed to lock /var/lib/ceph/osd/ceph-0/fsid, is another ceph-osd still running? (11) Resource temporarily unav
2014-08-03 15:23:46.163738 7f4f5b0da7c0 -1 filestore(/var/lib/ceph/osd/ceph-0) FileStore::mount: lock_fsid failed
2014-08-03 15:23:46.163743 7f4f5b0da7c0 -1 .[0;31m ** ERROR: error converting store /var/lib/ceph/osd/ceph-0: (16) Device or resource busy.[0m

Still, osd show up as up and in in ceph-mon logs, and at the end of deployment the cluster is operational, and TestVM image is successfully uploaded to Glance.

Problem #2 should be reported as a separate Low priority bug.

Changed in fuel:
status: New → Confirmed
importance: High → Medium
assignee: Fuel Library Team (fuel-library) → Ryan Moe (rmoe)
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

The failed to bind problem is reported in bug #1352524.

Revision history for this message
Ryan Moe (rmoe) wrote :

1) seems to be caused by cinder-volume starting before its pool and keyring have been created. Restarting cinder-volume after that has happened resolves the issue.

Changed in fuel:
assignee: Ryan Moe (rmoe) → Dmitry Borodaenko (dborodaenko)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/111836

Changed in fuel:
status: Confirmed → In Progress
Changed in fuel:
importance: Medium → High
Changed in fuel:
assignee: Dmitry Borodaenko (dborodaenko) → Vladimir Kuklin (vkuklin)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/111836
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=513ec5cdcdef74c7419d5bae967b9edc7da8dbd7
Submitter: Jenkins
Branch: master

commit 513ec5cdcdef74c7419d5bae967b9edc7da8dbd7
Author: Dmitry Borodaenko <email address hidden>
Date: Mon Aug 4 14:20:59 2014 -0700

    create pools in Ceph before enabling RBD backends

    If RBD backend for Cinder is enabled and cinder-volume service is
    started before the configured pool is created in Ceph and a keyring with
    permissions to access that pool is placed on the node, cinder-volume
    will fail to connect to Ceph and will refuse to create volumes until
    restarted.

    Change-Id: Ia7f0c937da896da734b7882534b126e4169ae4ed
    Closes-Bug: #1352335

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

Issue is still reproduced on CI
{

    "build_id": "2014-08-10_02-01-17",
    "ostf_sha": "acf52a59e04fa74d2ed2b68ea225f4d24403b264",
    "build_number": "420",
    "auth_required": true,
    "api": "1.0",
    "nailgun_sha": "2741cdc0f0615263db2f176899d406207ec4ac04",
    "production": "docker",
    "fuelmain_sha": "9d4463400b4924159c978af43855e48bcf2a84b2",
    "astute_sha": "b52910642d6de941444901b0f20e95ebbcb2b2e9",
    "feature_groups": [
        "mirantis"
    ],
    "release": "5.1",
    "fuellib_sha": "d699fc178559e98cfd7d53b58478b46553ffe39e"

}

Cluster configuration:
- Ubuntu, HA, Flat Nova Network, Ceph for volumes and for images, 3 controller+ceph, 2 compute+ceph, 1 ceph node
- Deploy cluster
- After deployment try to create volume

Expected - volume is created successfully
Actual - volume creation failed with error in cinder-volume log on controller (node-1):
cinder.volume.drivers.rbd [req-ec925ba0-797d-4a41-917b-dd4ae9b721a6 - - - - -] error connecting to ceph cluster
2014-08-10 12:34:32.902 9408 TRACE cinder.volume.drivers.rbd Traceback (most recent call last):
2014-08-10 12:34:32.902 9408 TRACE cinder.volume.drivers.rbd File "/usr/lib/python2.7/dist-packages/cinder/volume/drivers/rbd.py", line 263, in check_for_setup_error
2014-08-10 12:34:32.902 9408 TRACE cinder.volume.drivers.rbd with RADOSClient(self):
2014-08-10 12:34:32.902 9408 TRACE cinder.volume.drivers.rbd File "/usr/lib/python2.7/dist-packages/cinder/volume/drivers/rbd.py", line 235, in __init__
2014-08-10 12:34:32.902 9408 TRACE cinder.volume.drivers.rbd self.cluster, self.ioctx = driver._connect_to_rados(pool)
2014-08-10 12:34:32.902 9408 TRACE cinder.volume.drivers.rbd File "/usr/lib/python2.7/dist-packages/cinder/volume/drivers/rbd.py", line 283, in _connect_to_rados
2014-08-10 12:34:32.902 9408 TRACE cinder.volume.drivers.rbd client.connect()
2014-08-10 12:34:32.902 9408 TRACE cinder.volume.drivers.rbd File "/usr/lib/python2.7/dist-packages/rados.py", line 419, in connect
2014-08-10 12:34:32.902 9408 TRACE cinder.volume.drivers.rbd raise make_ex(ret, "error calling connect")
2014-08-10 12:34:32.902 9408 TRACE cinder.volume.drivers.rbd ObjectNotFound: error calling connect
2014-08-10 12:34:32.902 9408 TRACE cinder.volume.drivers.rbd
logs are attached

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Changed in fuel:
status: Fix Committed → Confirmed
Changed in fuel:
assignee: Vladimir Kuklin (vkuklin) → Dmitry Borodaenko (dborodaenko)
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Attached snapshot doesn't have any logs or files from the target nodes.

Changed in fuel:
status: Confirmed → Incomplete
assignee: Dmitry Borodaenko (dborodaenko) → Andrey Sledzinskiy (asledzinskiy)
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

Please, check these logs - in docker-logs/remote/node-1 in cinder-volume.log

Changed in fuel:
status: Incomplete → Confirmed
assignee: Andrey Sledzinskiy (asledzinskiy) → Dmitry Borodaenko (dborodaenko)
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Confirmed to be the same problem:

node-1 primary-controller:
cinder-volume.log:
2014-08-12T08:16:44.302095+01:00 err: 2014-08-12 07:16:44.293 8381 ERROR cinder.volume.drivers.rbd [req-a1f51185-28cc-4e8f-8b48-d57ce9dfe809 - - - - -] error connecting to ceph cluster
puppet-apply.log:
2014-08-12T08:12:20.852517+01:00 info: (Class[Cinder::Volume::Rbd]) Starting to evaluate the resource
2014-08-12T08:16:43.375591+01:00 info: (/Stage[main]/Cinder::Volume/Service[cinder-volume]) Evaluated in 0.27 seconds
2014-08-12T08:20:39.867294+01:00 info: (Ceph::Pool[volumes]) Starting to evaluate the resource

Looks like commits https://review.openstack.org/111836 and https://review.openstack.org/112324 were not enough to fix this bug, the new dependency is not enforced by Puppet.

Changed in fuel:
status: Confirmed → Triaged
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

This Puppet bug may be related:
https://tickets.puppetlabs.com/browse/PUP-1417

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Yes, ceph class is evaluated by Puppet parser before cinder::volume::rbd, so at the eval time Class['cinder::volume::rbd'] is not defined and the dependency is not declared.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/115097

Changed in fuel:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/115097
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=52f3ebfa968f0338e0584edf47cff10911109de5
Submitter: Jenkins
Branch: master

commit 52f3ebfa968f0338e0584edf47cff10911109de5
Author: Dmitry Borodaenko <email address hidden>
Date: Mon Aug 18 13:37:48 2014 -0700

    parse order independent dependency on ceph::pool

    ceph class is evaluated by Puppet parser before cinder::volume::rbd, so
    at the eval time Class['cinder::volume::rbd'] is not defined and the
    dependency was not declared.

    Change-Id: I4c07a9a9c3037e5e98f60e17bd916b20b7269d85
    Closes-Bug: #1352335

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Anastasia Palkina (apalkina) wrote :
Download full text (4.4 KiB)

Reproduced on ISO #11 for 5.1

"build_id": "2014-09-17_21-40-34",
"ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346",
"build_number": "11",
"auth_required": true,
"api": "1.0",
"nailgun_sha": "eb8f2b358ea4bb7eb0b2a0075e7ad3d3a905db0d",
"production": "docker",
"fuelmain_sha": "8ef433e939425eabd1034c0b70e90bdf888b69fd",
"astute_sha": "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13",
"feature_groups": ["mirantis"],
"release": "5.1",
"release_versions": {"2014.1.1-5.1": {"VERSION": {"build_id": "2014-09-17_21-40-34", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "11", "api": "1.0", "nailgun_sha": "eb8f2b358ea4bb7eb0b2a0075e7ad3d3a905db0d", "production": "docker", "fuelmain_sha": "8ef433e939425eabd1034c0b70e90bdf888b69fd", "astute_sha": "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13", "feature_groups": ["mirantis"], "release": "5.1", "fuellib_sha": "d9b16846e54f76c8ebe7764d2b5b8231d6b25079"}}}, "fuellib_sha": "d9b16846e54f76c8ebe7764d2b5b8231d6b25079"

1. Create new environment (CentOS, HA mode)
2. Choose nova-network, vlan manager
3. Choose both Ceph
4. Choose Sahara
5. Add 3 controller+ceph+cinder, 2 compute+ceph, 1 cinder
6. Configure interfaces (see screen) and untag management network
7. Start deployment. It was successful
8. Start OSTF tests. It was successful
9. But there is error on first controller (node-1) with ceph in puppet.log:

2014-09-19 12:31:24 ERR

 (/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd activate]/returns) change from notrun to 0 failed: ceph-deploy osd activate node-1:/dev/sdb4 node-1:/dev/sdc4 returned 1 instead of one of [0]

10. Also there are errors on cinder (node-6):

2014-09-19 12:35:09 WARNING

cinder.volume.manager [req-2dad97a0-b509-46d5-a239-06fa1344f2bf - - - - -] Unable to update stats, RBDDriver -1.1.0 driver is uninitialized.

2014-09-19 12:34:19 ERROR

cinder.volume.manager [req-8a4e3758-7d8f-4bfd-876a-d913163aaf83 - - - - -] Bad or unexpected response from the storage volume backend API: error connecting to ceph cluster
2014-09-19 11:34:19.495 26002 TRACE cinder.volume.manager Traceback (most recent call last):
2014-09-19 11:34:19.495 26002 TRACE cinder.volume.manager File "/usr/lib/python2.6/site-packages/cinder/volume/manager.py", line 243, in init_host
2014-09-19 11:34:19.495 26002 TRACE cinder.volume.manager self.driver.check_for_setup_error()
2014-09-19 11:34:19.495 26002 TRACE cinder.volume.manager File "/usr/lib/python2.6/site-packages/cinder/volume/drivers/rbd.py", line 268, in check_for_setup_error
2014-09-19 11:34:19.495 26002 TRACE cinder.volume.manager raise exception.VolumeBackendAPIException(data=msg)
2014-09-19 11:34:19.495 26002 TRACE cinder.volume.manager VolumeBackendAPIException: Bad or unexpected response from the storage volume backend API: error connecting to ceph cluster
2014-09-19 11:34:19.495 26002 TRACE cinder.volume.manager

2014-09-19 12:34:19 ERROR

cinder.volume.manager [req-8a4e3758-7d8f-4bfd-876a-d913163aaf83 - - - - -] Error encountered during initialization of driver: RBDDriver

2014-09-19 12:34:19 ERROR

cinder.volume.drivers.rbd [req-8a4e3758-7d8f-4bfd-876a-d913163aaf83 - - - - -] error connecting to ceph cluster...

Read more...

Revision history for this message
Anastasia Palkina (apalkina) wrote :
Changed in fuel:
status: Fix Committed → Confirmed
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Adding cinder nodes to an environment with Ceph RBD used as backend for volumes (instead of Cinder LVM backend) is not a valid configuration.

Changed in fuel:
status: Confirmed → Fix Committed
Revision history for this message
wale sanya (olawale24) wrote :

I am in tight corner learning ceph integration into cinder .
ceph -s
    cluster 5c10c447-8b27-42d2-b056-9366e9b56b31
     health HEALTH_OK
     monmap e1: 3 mons at {node1=192.168.1.240:6789/0,node2=192.168.1.241:6789/0,node3=192.168.1.242:6789/0}
            election epoch 24, quorum 0,1,2 node1,node2,node3
     osdmap e60: 3 osds: 3 up, 3 in
            flags sortbitwise,require_jewel_osds
      pgmap v17284: 248 pgs, 5 pools, 1130 MB data, 181 objects
            3473 MB used, 24141 MB / 27614 MB avail
                 248 active+clean

controller nodes has all this configuration has cinder sevices running

openstack-service status cinder
MainPID=1457 Id=openstack-cinder-api.service ActiveState=active
MainPID=1505 Id=openstack-cinder-backup.service ActiveState=active
MainPID=1449 Id=openstack-cinder-scheduler.service ActiveState=active
MainPID=4261 Id=openstack-cinder-volume.service ActiveState=active

But creating a volume after integration spins this error

017-02-27 18:04:38.204 19863 ERROR cinder.cmd.volume return import_class(import_str)(*args, **kwargs)
2017-02-27 18:04:38.204 19863 ERROR cinder.cmd.volume File "/usr/lib/python2.7/site-packages/oslo_utils/importutils.py", line 36, in import_class
2017-02-27 18:04:38.204 19863 ERROR cinder.cmd.volume traceback.format_exception(*sys.exc_info())))
2017-02-27 18:04:38.204 19863 ERROR cinder.cmd.volume ImportError: Class RBDVolumeDriver cannot be found (['Traceback (most recent call last):\n', ' File "/usr/lib/python2.7/site-packages/oslo_utils/importutils.py", line 32, in import_class\n return getattr(sys.modules[mod_str], class_str)\n', "AttributeError: 'module' object has no attribute 'RBDVolumeDriver'\n"])
2017-02-27 18:04:38.204 19863 ERROR cinder.cmd.volume
2017-02-27 18:04:38.208 19863 ERROR cinder.cmd.volume [req-977e2a7d-3d2a-4195-8835-998c9e68e3b5 - - - - -] No volume service(s) started successfully, terminating.
2017-02-27 18:04:42.046 19888 WARNING oslo_reports.guru_meditation_report [-] Guru meditation now registers SIGUSR1 and SIGUSR2 by default for backward compatibility. SIGUSR1 will no longer be registered in a future release, so please use SIGUSR2 to generate reports.
2017-02-27 18:04:42.049 19888 WARNING oslo_config.cfg [-] Option "rpc_backend" from group "DEFAULT" is deprecated for removal. Its value may be silently ignored in the future.

openstack is newton

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.