Create snapshot, create volume from snapshot, or clone a volume can still succeed even when the cinder-volume service is disabled

Bug #1555938 reported by YuZhang on 2016-03-11
This bug affects 2 people
Affects: Cinder
Status: Won't Fix
Importance: Low
Assigned to: Unassigned

Bug Description

Create snapshot, create volume from snapshot, or clone a volume can still succeed even when the cinder-volume service is disabled.

Repro steps:
1. Create a volume: cinder create 1
2. Disable the related backend's cinder-volume service: cinder service-disable host-name cinder-volume
3. Create a snapshot: cinder snapshot-create vol-id
4. Clone the volume: cinder create --source-volid vol-id
5. Create a volume from the snapshot: cinder create --snapshot-id snapshot-id

Steps 3, 4, and 5 should fail, but they actually succeed.

YuZhang (ivysdu) on 2016-03-11
summary: Create snapshot, create volume from snapshot or clone a volume still
- can success even the cinder-volume service is disabled
+ can succeed even the cinder-volume service is disabled
Changed in cinder:
status: New → Confirmed
Changed in cinder:
importance: Undecided → Low
Changed in cinder:
assignee: nobody → aditi sharma (adi-sky17)
aditi sharma (adi-sky17) wrote :

Yes, all three operations succeed even if the volume service is disabled. The reason is that the cinder service-disable command only marks the service as disabled in the database; it does not actually stop the service, so all of the driver-specific commands still run successfully.

If you create a new volume after the service is disabled, the volume goes to error, because cinder-scheduler removes disabled services from its cache. But for snapshot creation, volume cloning, and creating a volume from a snapshot, the request does not go through cinder-scheduler; by default it is placed on the same backend as the source, so no check applies to these operations.

We need to add a check of the volume service status in cinder-api or cinder-volume for the requests that do not go through the scheduler; see the sketch below.
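
A minimal sketch of what such a guard could look like. It assumes a hypothetical DB accessor service_get_by_host_and_binary() that returns the cinder-volume service record for a host, including the boolean "disabled" column that cinder service-list shows; the names are illustrative, not the exact Cinder interfaces.

# Illustrative sketch only -- not actual Cinder code.

class ServiceUnavailable(Exception):
    """Raised when the backing cinder-volume service is disabled."""


def check_volume_service_enabled(db_api, context, host):
    """Reject snapshot/clone requests whose source lives on a disabled backend.

    db_api.service_get_by_host_and_binary is a hypothetical accessor; the
    real Cinder DB/objects layer exposes a similar lookup.
    """
    service = db_api.service_get_by_host_and_binary(context, host,
                                                    'cinder-volume')
    if service is None or service.get('disabled'):
        raise ServiceUnavailable(
            "cinder-volume on host %s is disabled; refusing requests that "
            "would be placed there" % host)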

zhangguoqing (474751729-o) wrote :

@adi-sky17
      Yes, the root cause is whether the request goes through the scheduler or not. IMO, checking only the volume service is not enough, because other factors can still make the operation fail (e.g. the remaining capacity of the specified/source-volume host is smaller than the size of the volume being created). So defaulting the new volume onto the same backend/host is not elegant and is somewhat reckless; the request should always go through the scheduler. We can then give the same backend/host a higher priority when the scheduler makes its selection.

Sheel Rana (ranasheel2000) wrote :

zhangguoqing,

There could be two approaches to this:
1. Check the volume service status during these operations.
    I would not take this approach, as it adds extra CPU cycles (an extra DB query and so on) every time one of these operations runs.
2. Pass these requests through cinder-scheduler, as is done for other operations.
    Since these three operations must create the new resource on the same host where the source exists, we can pass that host to the scheduler so it skips the scheduling work and forwards the request directly to that host (see the sketch after this comment).
This way it would solve other problems as well.
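
A rough sketch of what approach 2 could look like. The scheduler_rpcapi.create_volume call and the 'same_host' hint are placeholder names that mirror the general OpenStack request-spec / scheduler-hint pattern, not the exact Cinder code:

# Illustrative sketch only -- the idea is to keep sending clone and
# create-from-snapshot requests through the scheduler, with the source
# volume's backend pinned, so the scheduler's checks (service enabled,
# free capacity) still run before the cast reaches cinder-volume.

def schedule_create_from_source(scheduler_rpcapi, context, volume,
                                source_volume):
    """Route a clone / create-from-snapshot request through the scheduler."""
    request_spec = {
        'volume_id': volume['id'],
        'volume_properties': {'size': volume['size']},
        'source_volid': source_volume['id'],
    }
    # Pin placement to the source volume's backend, but let the scheduler
    # verify that the backend is enabled and has enough capacity.
    filter_properties = {
        'scheduler_hints': {'same_host': source_volume['host']},
    }
    scheduler_rpcapi.create_volume(context, volume,
                                   request_spec=request_spec,
                                   filter_properties=filter_properties)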

zhangguoqing (474751729-o) wrote :

Hi Sheel Rana,

For 1: The extra CPU cycles and the extra DB query are acceptable. My point is that even when the volume service is fine, other factors can make the operation fail when the new volume is created on the same host by default; for example, that host may have too little remaining capacity for the new volume. We could then either
   (1) report a failure, or
   (2) let the scheduler create the new volume on another host.
So checking only the volume service status during these operations is not enough; taking action (2) is the safer choice.

For 2: That is the approach I recommend (at least for me), and I have already implemented it on my test bed. Why do we require the new volume to be created on the same host where the source exists? That host can be given higher priority, but it is not essential.

For 3 (passing the host ID to the scheduler): that would be ideal, but it is difficult for me to implement.

Pardon my low English level. Thank you. :)
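
One way to realize "always go through the scheduler, but give the source host higher priority" is a weigher. The standalone sketch below mirrors the _weigh_object(host_state, weight_properties) shape used by OpenStack schedulers; the class name and the 'source_host' key are assumptions, not existing Cinder code:

# Illustrative, standalone sketch: every enabled backend stays schedulable,
# but the backend that already holds the source volume/snapshot gets the
# highest weight, so it is preferred without being forced.

class SourceHostAffinityWeigher(object):
    """Prefer the backend that holds the source volume, without requiring it."""

    def _weigh_object(self, host_state, weight_properties):
        # weight_properties is assumed to carry the source volume's host,
        # filled in by the API layer for clone / create-from-snapshot.
        source_host = weight_properties.get('source_host')
        return 1.0 if host_state.host == source_host else 0.0


if __name__ == '__main__':
    from collections import namedtuple
    HostState = namedtuple('HostState', 'host')
    weigher = SourceHostAffinityWeigher()
    props = {'source_host': 'node1@lvm'}
    print(weigher._weigh_object(HostState('node1@lvm'), props))   # 1.0
    print(weigher._weigh_object(HostState('node2@ceph'), props))  # 0.0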

Lisa Li (lisali) wrote :

I think these calls return successfully even when the volume service is disabled:
https://github.com/openstack/cinder/blob/master/cinder/backup/manager.py#L334

And they do not go through the scheduler.

Can we make a disabled volume service return a failure for volume/snapshot-related RPC interface calls?

I think option 2 by Zhangguoqing is reasonable, and it can be done in a future release.
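
A small sketch of that idea: wrap the volume manager's RPC handlers so they fail fast while the service record is disabled. The is_service_disabled callable and the decorator are hypothetical, for illustration only:

# Illustrative sketch only -- not an existing Cinder interface.

import functools


class ServiceDisabled(Exception):
    """Raised when an RPC call reaches a disabled cinder-volume service."""


def reject_when_disabled(is_service_disabled):
    """Decorator factory: refuse RPC work while the service is disabled.

    is_service_disabled(context, host) is a hypothetical callable that reads
    the service's disabled flag from the database.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, context, *args, **kwargs):
            if is_service_disabled(context, self.host):
                raise ServiceDisabled(
                    "cinder-volume on %s is disabled" % self.host)
            return func(self, context, *args, **kwargs)
        return wrapper
    return decorator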

Sheel Rana (ranasheel2000) wrote :

Another possible solution could be to disable the RPC queue when the service is disabled.

zhangguoqing (474751729-o) wrote :

Hi Sheel Rana,
So do you agree with always going through the scheduler?

Sheel Rana (ranasheel2000) wrote :

Dear zhangguoqing,

For now,"hitting scheduler always" will solve this specific issue only.

>I think these calls return successfully when the volume service is disabled:
>https://github.com/openstack/cinder/blob/master/cinder/backup/manager.py#L334
As pointed by LisaLi, there are some other operations as welll which will still need seperate fix.

So, I will be discussing this with our smart(cinder) guys and will be keeping this topic as cinder meeting point in next week.
There could be some better way to fix all issues in one go than hitting scheduler always.

So, lets discuss different fix perspectives in meeting on 30th of March, if possible.

zhangguoqing (474751729-o) wrote :

Yes, it's true. Thank you.

Michal Dulko (michal-dulko-f) wrote :

I think we need to take one step back and think of why "disable" was introduced in the first place. We've inherited it when Cinder was forked out of Nova. In Nova the use case is to implement a pattern for host maintenance:

1. Set host to disabled. No new VMs will be scheduled there.
2. Start to live-migrate VMs out of host.
3. When finished, patch the host, reboot it, and enable it again.

Now that's useful also in Cinder - when running an LVM driver.

Having that in mind - should cloning a volume or creating a snapshot go through the scheduler? I don't think so, as you cannot create a snapshot from a different host. Moreover, it's better to give the user feedback that the service is disabled directly from the API.

On the other hand, implementing a check in the API that blocks disabled services would decrease the availability of volumes placed on the disabled host during the maintenance window. I also wonder, for example, what happens with snapshots when a volume is migrated - are they migrated to the other backend as well?

Michael Dovgal (mdovgal) on 2016-09-26
Changed in cinder:
status: Confirmed → Fix Committed
status: Fix Committed → Confirmed
Michael Dovgal (mdovgal) on 2016-09-26
Changed in cinder:
assignee: aditi sharma (adi-sky17) → Michael Dovgal (mdovgal)

Fix proposed to branch: master
Review: https://review.openstack.org/377886

Changed in cinder:
status: Confirmed → In Progress

Change abandoned by michaeldovgal (<email address hidden>) on branch: master
Review: https://review.openstack.org/377886
Reason: It needs some rework due to merged patch with changes in _get_service_by_host https://review.openstack.org/#/c/344226/

Unassigning due to no activity for > 6 months.

Changed in cinder:
assignee: Michael Dovgal (mdovgal) → nobody
Changed in cinder:
status: In Progress → New
Sean McGinnis (sean-mcginnis) wrote :

More operations are now sent through the scheduler. If there are things still missing, specific bugs for those operations explaining why they need to be changed should be opened.

Changed in cinder:
status: New → Won't Fix