cephfs (possibly others): manila-share does not retry/restart if init_host() fails

Bug #1690159 reported by Thomas Bechtold
This bug affects 1 person
Affects: OpenStack Shared File Systems Service (Manila)
Status: Fix Released
Importance: Medium
Assigned to: Thomas Bechtold

Bug Description

This happens with manila 3.0.1dev27 (Newton release) on SLE12SP2.

# Problem
When manila-share is started with cephfs configured as the backend and init_host() hits a problem, the share service neither retries the initialization nor exits. The process keeps running, but manila-share is unusable.

# Reproducer
- configure a cephfs backend and set "cephfs_conf_path" to a file that does not exist (invalid config)
- start manila-share

Now it will be stuck (apart from the periodic jobs, which slowly fill the logs) with:

2017-05-11 14:43:24.870 31628 DEBUG oslo_service.service [req-5ff58f67-4813-41a3-9ee9-75cd9bcc7d97 - - - - -] database.use_db_reconnect = False log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:2678
2017-05-11 14:43:24.871 31628 DEBUG oslo_service.service [req-5ff58f67-4813-41a3-9ee9-75cd9bcc7d97 - - - - -] ******************************************************************************** log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:2680
2017-05-11 14:43:25.107 31637 DEBUG manila.service [req-50a9c741-1ac5-4c35-896a-d43fc5f88626 - - - - -] Creating RPC server for service manila-share. start /usr/lib/python2.7/site-packages/manila/service.py:110
2017-05-11 14:43:25.132 31637 DEBUG manila.share.manager [req-50a9c741-1ac5-4c35-896a-d43fc5f88626 - - - - -] Start initialization of driver: 'CephFSNativeDriver@d52-54-77-77-01-01@backend-cephfs-0' init_host /usr/lib/python2.7/site-packages/manila/share/manager.py:266
2017-05-11 14:43:25.133 31637 INFO manila.share.drivers.cephfs.cephfs_native [req-50a9c741-1ac5-4c35-896a-d43fc5f88626 - - - - -] [ceph}] Ceph client found, connecting...
2017-05-11 14:43:25.133 31637 DEBUG ceph_volume_client [req-50a9c741-1ac5-4c35-896a-d43fc5f88626 - - - - -] Connecting to RADOS with config /etc/ceph/ceph.conf2... connect /usr/lib/python2.7/site-packages/ceph_volume_client.py:451
2017-05-11 14:43:25.134 31637 ERROR manila.share.manager [req-50a9c741-1ac5-4c35-896a-d43fc5f88626 - - - - -] Error encountered during initialization of driver 'CephFSNativeDriver' on 'd52-54-77-77-01-01@backend-cephfs-0' host. error calling conf_read_file: error code 22
2017-05-11 14:43:25.134 31637 ERROR manila.share.manager Traceback (most recent call last):
2017-05-11 14:43:25.134 31637 ERROR manila.share.manager File "/usr/lib/python2.7/site-packages/manila/share/manager.py", line 269, in init_host
2017-05-11 14:43:25.134 31637 ERROR manila.share.manager self.driver.check_for_setup_error()
2017-05-11 14:43:25.134 31637 ERROR manila.share.manager File "/usr/lib/python2.7/site-packages/manila/share/drivers/cephfs/cephfs_native.py", line 88, in check_for_setup_error
2017-05-11 14:43:25.134 31637 ERROR manila.share.manager self.volume_client
2017-05-11 14:43:25.134 31637 ERROR manila.share.manager File "/usr/lib/python2.7/site-packages/manila/share/drivers/cephfs/cephfs_native.py", line 156, in volume_client
2017-05-11 14:43:25.134 31637 ERROR manila.share.manager self._volume_client.connect(premount_evict=premount_evict)
2017-05-11 14:43:25.134 31637 ERROR manila.share.manager File "/usr/lib/python2.7/site-packages/ceph_volume_client.py", line 456, in connect
2017-05-11 14:43:25.134 31637 ERROR manila.share.manager conf={}
2017-05-11 14:43:25.134 31637 ERROR manila.share.manager File "rados.pyx", line 525, in rados.Rados.__init__ (/home/abuild/rpmbuild/BUILD/ceph-10.2.6+git.1490339825.57146d8/src/build/rados.c:6081)
2017-05-11 14:43:25.134 31637 ERROR manila.share.manager File "rados.pyx", line 425, in rados.requires.wrapper.validate_func (/home/abuild/rpmbuild/BUILD/ceph-10.2.6+git.1490339825.57146d8/src/build/rados.c:4334)
2017-05-11 14:43:25.134 31637 ERROR manila.share.manager File "rados.pyx", line 568, in rados.Rados.__setup (/home/abuild/rpmbuild/BUILD/ceph-10.2.6+git.1490339825.57146d8/src/build/rados.c:6966)
2017-05-11 14:43:25.134 31637 ERROR manila.share.manager File "rados.pyx", line 425, in rados.requires.wrapper.validate_func (/home/abuild/rpmbuild/BUILD/ceph-10.2.6+git.1490339825.57146d8/src/build/rados.c:4334)
2017-05-11 14:43:25.134 31637 ERROR manila.share.manager File "rados.pyx", line 631, in rados.Rados.conf_read_file (/home/abuild/rpmbuild/BUILD/ceph-10.2.6+git.1490339825.57146d8/src/build/rados.c:8104)
2017-05-11 14:43:25.134 31637 ERROR manila.share.manager Error: error calling conf_read_file: error code 22
2017-05-11 14:43:25.134 31637 ERROR manila.share.manager
2017-05-11 14:43:25.136 31637 INFO ceph_volume_client [req-50a9c741-1ac5-4c35-896a-d43fc5f88626 - - - - -] disconnect
2017-05-11 14:45:17.140 31637 DEBUG oslo_service.periodic_task [req-63b66bec-d1c4-419b-9dd9-a698ffd70179 - - - - -] Running periodic task ShareManager._publish_service_capabilities run_periodic_tasks /usr/lib/python2.7/site-packages/oslo_service/periodic_task.py:215
2017-05-11 14:45:17.140 31637 DEBUG oslo_service.periodic_task [req-63b66bec-d1c4-419b-9dd9-a698ffd70179 - - - - -] Running periodic task ShareManager.migration_driver_continue run_periodic_tasks /usr/lib/python2.7/site-packages/oslo_service/periodic_task.py:215
2017-05-11 14:45:17.141 31637 ERROR oslo_service.periodic_task [req-63b66bec-d1c4-419b-9dd9-a698ffd70179 - - - - -] Error during ShareManager.migration_driver_continue
2017-05-11 14:45:17.141 31637 ERROR oslo_service.periodic_task Traceback (most recent call last):
2017-05-11 14:45:17.141 31637 ERROR oslo_service.periodic_task File "/usr/lib/python2.7/site-packages/oslo_service/periodic_task.py", line 220, in run_periodic_tasks
2017-05-11 14:45:17.141 31637 ERROR oslo_service.periodic_task task(self, context)
2017-05-11 14:45:17.141 31637 ERROR oslo_service.periodic_task File "/usr/lib/python2.7/site-packages/manila/utils.py", line 616, in wrapper
2017-05-11 14:45:17.141 31637 ERROR oslo_service.periodic_task raise exception.DriverNotInitialized(driver=driver_name)
2017-05-11 14:45:17.141 31637 ERROR oslo_service.periodic_task DriverNotInitialized: Share driver 'CephFSNativeDriver' not initialized.
2017-05-11 14:45:17.141 31637 ERROR oslo_service.periodic_task
2017-05-11 14:45:17.142 31637 DEBUG oslo_service.periodic_task [req-63b66bec-d1c4-419b-9dd9-a698ffd70179 - - - - -] Running periodic task ShareManager._report_driver_status run_periodic_tasks /usr/lib/python2.7/site-packages/oslo_service/periodic_task.py:215
2017-05-11 14:45:17.142 31637 ERROR oslo_service.periodic_task [req-63b66bec-d1c4-419b-9dd9-a698ffd70179 - - - - -] Error during ShareManager._report_driver_status
2017-05-11 14:45:17.142 31637 ERROR oslo_service.periodic_task Traceback (most recent call last):
2017-05-11 14:45:17.142 31637 ERROR oslo_service.periodic_task File "/usr/lib/python2.7/site-packages/oslo_service/periodic_task.py", line 220, in run_periodic_tasks
2017-05-11 14:45:17.142 31637 ERROR oslo_service.periodic_task task(self, context)
2017-05-11 14:45:17.142 31637 ERROR oslo_service.periodic_task File "/usr/lib/python2.7/site-packages/manila/utils.py", line 616, in wrapper
2017-05-11 14:45:17.142 31637 ERROR oslo_service.periodic_task raise exception.DriverNotInitialized(driver=driver_name)
2017-05-11 14:45:17.142 31637 ERROR oslo_service.periodic_task DriverNotInitialized: Share driver 'CephFSNativeDriver' not initialized.
2017-05-11 14:45:17.142 31637 ERROR oslo_service.periodic_task
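
The log above shows two things: the error in init_host() is logged but not propagated, and every later periodic task fails with DriverNotInitialized. A minimal sketch of that failure mode follows. It is not the manila code: SketchManager, ensure_initialized, and broken_setup are hypothetical names, and only the exception name and the error text are taken from the log.

```python
import functools


class DriverNotInitialized(Exception):
    """Stand-in for the exception named in the traceback above."""


def ensure_initialized(func):
    """Hypothetical guard: refuse to run a task on an uninitialized driver."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        if not self.driver_initialized:
            raise DriverNotInitialized(
                "Share driver %r not initialized." % self.driver_name)
        return func(self, *args, **kwargs)
    return wrapper


class SketchManager(object):
    """Toy manager used for illustration; not the real ShareManager."""

    def __init__(self, driver_name, setup):
        self.driver_name = driver_name
        self.driver_initialized = False
        self._setup = setup  # callable that raises on a bad config

    def init_host(self):
        try:
            self._setup()
        except Exception as exc:
            # The error is logged and swallowed: the process keeps running,
            # but nothing ever calls init_host() again.
            print("Error encountered during initialization: %s" % exc)
            return
        self.driver_initialized = True

    @ensure_initialized
    def report_driver_status(self):
        return "ok"


def broken_setup():
    # Mimics the failure caused by an invalid cephfs_conf_path.
    raise RuntimeError("error calling conf_read_file: error code 22")


manager = SketchManager("CephFSNativeDriver", broken_setup)
manager.init_host()  # does not raise, so the service looks started
```

After this, every call to manager.report_driver_status() raises DriverNotInitialized, which matches the repeating periodic-task errors in the log.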

Thomas Bechtold (toabctl) wrote :

For the record: the bugfix for https://bugs.launchpad.net/manila/+bug/1500964 introduced the problem.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to manila (master)

Fix proposed to branch: master
Review: https://review.openstack.org/464205

Changed in manila:
assignee: nobody → Thomas Bechtold (toabctl)
status: New → In Progress
Changed in manila:
milestone: none → pike-3
importance: Undecided → Medium
Changed in manila:
assignee: Thomas Bechtold (toabctl) → Jan Provaznik (jan-provaznik)
OpenStack Infra (hudson-openstack) wrote : Fix merged to manila (master)

Reviewed: https://review.openstack.org/465584
Committed: https://git.openstack.org/cgit/openstack/manila/commit/?id=1cf5ccdbddaa845dc06ddc458bf09ec75e39bea1
Submitter: Jenkins
Branch: master

commit 1cf5ccdbddaa845dc06ddc458bf09ec75e39bea1
Author: Thomas Bechtold <email address hidden>
Date: Wed May 17 13:48:17 2017 +0200

    Allow endless retry loops in the utility function

    This can be used if an endless loop is needed.
    Also add a new parameter to allow a maximum backoff sleep time.

    Partial-Bug: #1690159
    Change-Id: Ib544b5bd4781d116dd3dffc8f35f43323cc9e2db
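
The commit message describes a retry helper that can loop endlessly and cap its backoff sleep. A rough sketch of that idea is below. It is an illustration only, not the utility function from the commit: the parameter names (max_attempts, max_sleep, sleep) are assumptions chosen for this example, and max_attempts=None stands for the endless case.

```python
import functools
import time


def retry(exceptions, interval=1, backoff_rate=2, max_attempts=None,
          max_sleep=None, sleep=time.sleep):
    """Retry the wrapped call on `exceptions` with exponential backoff.

    max_attempts=None retries forever; max_sleep caps the delay between
    attempts. All parameter names here are hypothetical.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            attempt = 0
            delay = interval if max_sleep is None else min(interval, max_sleep)
            while True:
                attempt += 1
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if max_attempts is not None and attempt >= max_attempts:
                        raise
                    sleep(delay)
                    # Grow the delay, but never beyond the configured cap.
                    delay *= backoff_rate
                    if max_sleep is not None:
                        delay = min(delay, max_sleep)
        return wrapper
    return decorator


# Usage: record the sleeps instead of really waiting.
slept = []
calls = {"count": 0}


@retry(IOError, interval=1, backoff_rate=2, max_sleep=4, sleep=slept.append)
def flaky_connect():
    calls["count"] += 1
    if calls["count"] < 6:
        raise IOError("backend not reachable yet")
    return "connected"


result = flaky_connect()
```

Without the cap, an endless loop with exponential backoff would soon sleep for hours between attempts, which is presumably why the two options were added together.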

Changed in manila:
assignee: Jan Provaznik (jan-provaznik) → Goutham Pacha Ravi (gouthamr)
Changed in manila:
assignee: Goutham Pacha Ravi (gouthamr) → Thomas Bechtold (toabctl)
OpenStack Infra (hudson-openstack) wrote : Fix merged to manila (master)

Reviewed: https://review.openstack.org/464205
Committed: https://git.openstack.org/cgit/openstack/manila/commit/?id=f41e3c220c209aab35691e95a6db3a8ac303ea09
Submitter: Jenkins
Branch: master

commit f41e3c220c209aab35691e95a6db3a8ac303ea09
Author: Thomas Bechtold <email address hidden>
Date: Fri May 12 15:54:48 2017 +0200

    Retry backend initialization

    Since commit 4b87f6f40d1f09b4, exceptions in the init_host() call (which is
    called for every backend) are catched and backends might end up
    uninitialized and unusable. This might be okish (but is not good) in a
    multi-backend scenario but is definitely wrong in a single backend scenario.
    In that case, the manila-share process would successfully start but the
    backend would never be usable.
    So retry to initialize the driver for every backend in case there was an error
    during initialization. That way even a temporary broken backend can be
    initialized later without restarting manila-share.

    Change-Id: I2194c61fa9e9bdb32d252284eea1864151d9eef7
    Closes-Bug: #1690159
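
The behaviour described in this commit message (keep retrying the driver initialization for every backend, so a temporarily broken backend comes up later without a restart) can be sketched as follows. This is a simplified illustration, not the merged change: init_backends and the per-backend init callables are hypothetical, and a fixed retry interval is used to keep the example short.

```python
import time


def init_backends(backends, interval=1, sleep=time.sleep):
    """Call each backend's init function until all of them have succeeded.

    `backends` maps a backend name to a zero-argument callable that raises
    on failure. Both the function and the mapping are hypothetical.
    """
    pending = dict(backends)
    while pending:
        for name, init in list(pending.items()):
            try:
                init()
            except Exception as exc:
                print("init of %s failed, will retry: %s" % (name, exc))
            else:
                # Initialized backends are usable and are not retried.
                del pending[name]
        if pending:
            sleep(interval)


# Usage: the cephfs backend is broken for the first two attempts.
attempts = {"cephfs": 0}


def init_cephfs():
    attempts["cephfs"] += 1
    if attempts["cephfs"] < 3:
        raise RuntimeError("error calling conf_read_file: error code 22")


def init_generic():
    pass


waits = []
init_backends({"cephfs": init_cephfs, "generic": init_generic},
              sleep=waits.append)
```

In this sketch the healthy backend is initialized on the first pass, and only the failing one is retried until its configuration problem goes away.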

Changed in manila:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/manila 5.0.0.0b3

This issue was fixed in the openstack/manila 5.0.0.0b3 development milestone.
