Cinder filter scheduler not enabled with multiple storage back ends

Bug #1768231 reported by Stuart Grace
This bug affects 6 people
Affects: OpenStack-Ansible
Status: In Progress
Importance: Medium
Assigned to: Unassigned

Bug Description

Cinder docs here: https://docs.openstack.org/cinder/queens/admin/blockstorage-multi-backend.html#configure-block-storage-scheduler-multi-back-end
say the filter scheduler must be enabled in cinder.conf to use multiple storage backends. This requires a setting such as

[DEFAULT]
scheduler_default_filters = DriverFilter

but this is not written into cinder.conf. This might explain why, when we have two RBD backends linked to two different Ceph pools, one backend is up and working while the other is down and fails to create volumes. The storage hosts in openstack_user_config.yml are all defined like this:

storage_hosts:
  infra1:
    ip: 10.31.128.12
    container_vars: &_container_vars_
      cinder_backends:
        limit_container_types: cinder_volume
        rbd_nvme:
          volume_driver: cinder.volume.drivers.rbd.RBDDriver
          volume_backend_name: rbd_nvme
          rbd_pool: cinder-volumes-nvme
          rbd_ceph_conf: /etc/ceph/ceph.conf
          rbd_user: cinder
          rbd_secret_uuid: "{{ cinder_ceph_client_uuid }}"
        rbd_hdd:
          volume_driver: cinder.volume.drivers.rbd.RBDDriver
          volume_backend_name: rbd_hdd
          rbd_pool: cinder-volumes-hdd
          rbd_ceph_conf: /etc/ceph/ceph.conf
          rbd_user: cinder
          rbd_secret_uuid: "{{ cinder_ceph_client_uuid }}"
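For reference, the os_cinder role renders the two backends above into cinder.conf along roughly these lines (a sketch, assuming role defaults; exact rendered options come from the role's template, and the scheduler_default_filters line is the one this bug says is missing):

```ini
[DEFAULT]
# Each name listed here gets its own backend section below.
enabled_backends = rbd_nvme,rbd_hdd
# Setting described in the Cinder multi-backend docs; reportedly not
# written out by the role (the subject of this bug).
scheduler_default_filters = AvailabilityZoneFilter,CapacityFilter,CapabilitiesFilter,DriverFilter

[rbd_nvme]
volume_driver = cinder.volume.drivers.rbd.RBDDriver
volume_backend_name = rbd_nvme
rbd_pool = cinder-volumes-nvme
rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder

[rbd_hdd]
volume_driver = cinder.volume.drivers.rbd.RBDDriver
volume_backend_name = rbd_hdd
rbd_pool = cinder-volumes-hdd
rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder
```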

The cinder.conf file is attached. On each storage host, the rbd_nvme service is UP and rbd_hdd is DOWN after deployment. (Manually restarting the cinder-volume service brought both backends up successfully; we are not sure why.)

Revision history for this message
Stuart Grace (stuartgrace) wrote :
Changed in openstack-ansible:
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
Jean-Philippe Evrard (jean-philippe-evrard) wrote :

We discussed this in our triage meeting.
We are not convinced that the `scheduler_default_filters = DriverFilter` setting is needed, but the DOWN problem has happened to other users as well and should be fixed.

Revision history for this message
Jonathan Rosser (jrosser) wrote :

Here is a log snippet from trying to use the "down" backend:

infra3-cinder-api-container-e36ebc84 cinder-scheduler: 2018-04-30 11:07:10.592 3120 ERROR cinder.scheduler.flows.create_volume [req-fa3478fd-5ce5-4253-8fbb-7d8fb8353718 db1b61a8da364c85b793123d956f42aa f48410de667a4e5b92ce50e1b9de27f1 - - -] Failed to run task cinder.scheduler.flows.create_volume.ScheduleCreateVolumeTask;volume:create: No valid backend was found. No weighed backends available: NoValidBackend: No valid backend was found. No weighed backends available

Revision history for this message
Mohammed Naser (mnaser) wrote :

We just hit this, I'm going to dive into the issue.

Changed in openstack-ansible:
assignee: nobody → Mohammed Naser (mnaser)
Revision history for this message
Mohammed Naser (mnaser) wrote :

Do you have any ERRORs in the log files of the agents that went down? I've identified the issue here to be related to the race condition of stats running + a volume disappearing which can crash a thread:

https://review.openstack.org/#/q/I7d3d006b023ca4b7963c4c684e4c036399d1295c

I believe landing this patch will stop the threads from crashing and keep the agents from going into the down state.

Revision history for this message
Mohammed Naser (mnaser) wrote :

That patch helped with the threads dying, but I noticed that an agent process would still die:

INFO oslo_service.service [req-459698ea-0f43-4056-8729-d39fef627862 - - - - -] Child 567692 killed by signal 6
INFO cinder.service [req-459698ea-0f43-4056-8729-d39fef627862 - - - - -] Starting cinder-volume node (version 12.0.2)

Looking into the container logs...

cinder-volume[567641]: terminate called after throwing an instance of 'std::runtime_error'
cinder-volume[567641]: what(): random_device::random_device(const std::string&)

Revision history for this message
Jonathan Rosser (jrosser) wrote :

Here is a cinder-volume log from the point the deployment started. Once it completed, some of the backends were down.

Revision history for this message
Shivashankar C R (shivashankarcr) wrote :

Hi,

I have hit the same issue:

2018-07-27 12:44:08.174 3194 INFO cinder.scheduler.base_filter [req-24837425-6f45-42bd-8f41-46ac6e2ca530 073c02c1926b4d22bc8ac73f97b26305 3d0752c33c384e34be6feb41020c9e83 - default default] Filtering removed all hosts for the request with volume ID '4ffcb859-37a9-4b5e-a302-01f3dd74b678'. Filter results: AvailabilityZoneFilter: (start: 0, end: 0), CapacityFilter: (start: 0, end: 0), CapabilitiesFilter: (start: 0, end: 0)
2018-07-27 12:44:08.175 3194 WARNING cinder.scheduler.filter_scheduler [req-24837425-6f45-42bd-8f41-46ac6e2ca530 073c02c1926b4d22bc8ac73f97b26305 3d0752c33c384e34be6feb41020c9e83 - default default] No weighed backend found for volume with properties: None
2018-07-27 12:44:08.176 3194 INFO cinder.message.api [req-24837425-6f45-42bd-8f41-46ac6e2ca530 073c02c1926b4d22bc8ac73f97b26305 3d0752c33c384e34be6feb41020c9e83 - default default] Creating message record for request_id = req-24837425-6f45-42bd-8f41-46ac6e2ca530
2018-07-27 12:44:08.185 3194 ERROR cinder.scheduler.flows.create_volume [req-24837425-6f45-42bd-8f41-46ac6e2ca530 073c02c1926b4d22bc8ac73f97b26305 3d0752c33c384e34be6feb41020c9e83 - default default] Failed to run task cinder.scheduler.flows.create_volume.ScheduleCreateVolumeTask;volume:create: No valid backend was found. No weighed backends available: NoValidBackend: No valid backend was found. No weighed backends available
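The filter log above (every filter starting and ending at 0 hosts) means the scheduler sees no backends reporting stats at all, which matches cinder-volume backends being down rather than merely filtered out. A quick way to confirm which backends are up (a sketch; assumes the standard python-openstackclient run against the live cloud):

```shell
# Each backend appears as host@backend_name; a "down" State for
# e.g. infra1@rbd_hdd would explain "No weighed backends available".
openstack volume service list --service cinder-volume
```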

Ceph-logs:
root@ubuntu-control1:~# lxc-attach -n infra1_ceph-mon_container-6d1a7907
root@infra1-ceph-mon-container-6d1a7907:/# ceph status || ceph -w
  cluster:
    id: 88990417-2535-4b03-bee6-8b398ce234d3
    health: HEALTH_WARN
            Reduced data availability: 13 pgs inactive
            Degraded data redundancy: 32 pgs undersized
            too few PGs per OSD (13 < min 30)

  services:
    mon: 1 daemons, quorum infra1-ceph-mon-container-6d1a7907
    mgr: infra1-ceph-mon-container-6d1a7907(active)
    osd: 3 osds: 3 up, 3 in; 27 remapped pgs

  data:
    pools: 5 pools, 40 pgs
    objects: 0 objects, 0 bytes
    usage: 322 MB used, 2762 GB / 2763 GB avail
    pgs: 32.500% pgs not active
             19 active+undersized+remapped
             13 undersized+peered
             8 active+clean

root@infra1-ceph-mon-container-6d1a7907:/#

Is there any workaround to proceed?

Thanks!

Revision history for this message
Shivashankar C R (shivashankarcr) wrote :

I have hit this issue on a clean installation of openstack-ansible with Ceph.

Thanks

Revision history for this message
Stuart Grace (stuartgrace) wrote :

As a workaround, manually restarting the cinder-volume service inside the cinder containers on the Infra nodes worked for us.
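For anyone else hitting this, the workaround above can be scripted roughly as follows (a sketch; the container name pattern and service unit name are deployment-specific and may differ in your environment):

```shell
# On each infra node, restart cinder-volume inside its LXC container.
# Adjust the grep pattern to match your cinder container names.
for c in $(lxc-ls | grep cinder_volumes_container); do
    lxc-attach -n "$c" -- systemctl restart cinder-volume
done
```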

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible-os_cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/588126

Changed in openstack-ansible:
assignee: Mohammed Naser (mnaser) → Kevin Carter (kevin-carter)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on openstack-ansible-os_cinder (master)

Change abandoned by Kevin Carter (cloudnull) (<email address hidden>) on branch: master
Review: https://review.openstack.org/588126

Changed in openstack-ansible:
assignee: Kevin Carter (kevin-carter) → nobody
Revision history for this message
Chenjun Shen (cshen) wrote :

We hit this problem as well.

Here is the ERROR log.

Jul 26 09:20:37 ctr0002-cinder-api-container-b8542159 cinder-scheduler[26728]: 2019-07-26 09:20:37.932 26728 ERROR cinder.scheduler.flows.create_volume [req-2d421101-3734-44f2-a8b1-b0a9331385f7 1cc90cddb1074743b80411de2ffe831c 29a71daf56b840bc83192ea562787b58 - default default] Failed to run task cinder.scheduler.flows.create_volume.ScheduleCreateVolumeTask;volume:create: No valid backend was found. No weighed backends available: NoValidBackend: No valid backend was found. No weighed backends available

Revision history for this message
Russell Holloway (russell-holloway) wrote :

Same error here. I'm not entirely sure what fixed it. I don't think it was a simple restart, but I also renamed the volume_backend_name in one of the backends during this last run of os-cinder-install.yml, and now it seems to work. Very strange.

Changed from rbddriver and rbddriver_nvme to rbddriver and nvmedriver (I thought maybe the _ was causing an issue).

It could have been the extra restart, the volume_backend_name change, or a miracle, but it's oddly working now.
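One reason renaming a backend can appear to fix or break scheduling: requests only reach a backend when the volume_backend_name in cinder.conf matches the extra spec on the volume type being used. A sketch of the mapping (the type name "nvme" is hypothetical; run against the live cloud):

```shell
# Create a volume type and point it at the backend by its
# volume_backend_name from cinder.conf.
openstack volume type create nvme
openstack volume type set --property volume_backend_name=rbd_nvme nvme
```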

Revision history for this message
VinceLe (legoll) wrote :

Looks like this issue is no longer "status: In Progress" but "status: abandoned". Any update?
