[rbd] When cinder-volume start, it take hours before state become 'up'

Bug #1707936 reported by Mehdi Abaakouk on 2017-08-01
This bug affects 2 people
Affects: Cinder · Importance: Undecided · Assigned to: Mehdi Abaakouk

Bug Description

Hi,

Since this revert: https://review.openstack.org/#/c/483298/1

Each time cinder-volume restarts, it takes hours before its state becomes 'up'.

During this warmup, the API throws a ton of MessagingTimeout errors.

Cinder is just unusable.

Regards,

Mehdi Abaakouk (sileht) wrote :

The rbd pool has only ~200 volumes.

rbd.Image().diff_iterate() looks at every object of every volume across all Ceph OSDs.

This shouldn't be used to get volume stats at startup and periodically.
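To illustrate why the scan is so expensive: diff_iterate() invokes a callback once per extent, so its cost grows with the number of RADOS objects in the image. A minimal sketch of the accumulation pattern follows; the helper name and the synthetic extent list are hypothetical, and with the real rbd bindings the callback would be driven by rbd.Image.diff_iterate() as shown in the comment.

```python
# Hypothetical sketch of the accumulation that diff_iterate() drives:
# librbd calls the callback once per extent, and the caller sums the
# lengths of the allocated ones.

def make_extent_accumulator():
    """Return (callback, total); the callback matches the signature
    rbd.Image.diff_iterate passes: (offset, length, exists)."""
    total = [0]

    def cb(offset, length, exists):
        if exists:  # extent is allocated in the image
            total[0] += length

    return cb, total

# With the real bindings this would be driven by:
#   with rbd.Image(ioctx, name, read_only=True) as image:
#       image.diff_iterate(0, image.size(), None, cb)
# Here we feed the callback a synthetic extent stream instead.
cb, total = make_extent_accumulator()
for offset, length, exists in [(0, 4 << 20, True),
                               (8 << 20, 4 << 20, False),
                               (16 << 20, 4 << 20, True)]:
    cb(offset, length, exists)
print(total[0])  # 8388608 bytes allocated across two 4 MiB extents
```

Without fast-diff, librbd has to issue a read for every object just to learn whether that extent exists, which is where the hours go.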

Mehdi Abaakouk (sileht) wrote :

The debugging experience for this issue is also terrible: nothing appears in the logs during the volume_stats update.

The state is 'up'. You make an API call, like attaching a volume. The state becomes 'down'. The API gets a MessagingTimeout and nothing else.

I only understood the issue after enabling debug logging and seeing that cinder-volume was logging nothing but "connecting to ceph (timeout=-1). _connect_to_rados".

The message appears every 1 to 10 seconds (I'm guessing the interval depends on the size of the volume).

I manually added some extra logging and found that cinder-volume was stuck inside the _get_usage_info() loop.

Eric Harney (eharney) on 2017-08-01
tags: added: drivers rbd
summary: - When cinder-volume start, it take hours before state become 'up'
+ [rbd] When cinder-volume start, it take hours before state become 'up'

Fix proposed to branch: master
Review: https://review.openstack.org/489718

Changed in cinder:
assignee: nobody → Mehdi Abaakouk (sileht)
status: New → In Progress
Mehdi Abaakouk (sileht) wrote :

v.diff_iterate() without the fast-diff and object-map features is really slow, because it has to dig into every RADOS object to get the information.

So on old deployments, using diff_iterate just makes cinder-volume unusable.
The process is stuck in this loop just to compute a size...

If you continue to use this, you should document somewhere that these features are "required". Maybe cinder could even enable them itself, but rebuilding the object map can be slow on big volumes.
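A sketch of the feature check being suggested: the bit values below match the librbd constants (rbd.RBD_FEATURE_OBJECT_MAP, rbd.RBD_FEATURE_FAST_DIFF), and with the real bindings the mask would be read with rbd.Image(ioctx, name).features(); the helper name and sample masks are hypothetical.

```python
# Sketch of checking whether diff_iterate() will be fast on an image.
# Bit values match the librbd feature constants; with the real bindings
# the mask comes from rbd.Image(ioctx, name).features().
RBD_FEATURE_OBJECT_MAP = 1 << 3  # 8
RBD_FEATURE_FAST_DIFF = 1 << 4   # 16

def has_fast_diff(feature_mask):
    """True if both object-map and fast-diff are enabled on the image."""
    needed = RBD_FEATURE_OBJECT_MAP | RBD_FEATURE_FAST_DIFF
    return feature_mask & needed == needed

# 61 is the usual default feature set on recent Ceph releases
# (layering + exclusive-lock + object-map + fast-diff + deep-flatten).
print(has_fast_diff(61))  # True
print(has_fast_diff(1))   # False: layering only, diff_iterate is slow

# Enabling the features afterwards is possible, but rebuilding the
# object map is itself slow on big volumes:
#   rbd feature enable <pool>/<image> exclusive-lock object-map fast-diff
#   rbd object-map rebuild <pool>/<image>
```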

Gorka Eguileor (gorka) on 2017-08-04
tags: added: ceph
Gorka Eguileor (gorka) wrote :

After looking into the matter of the stats, I've realized that the RBD stats are wrong: we shouldn't be returning physical used space in the first place, as I explain in the improve-provisioning spec [1].

I have proposed a patch on master [2] to fix our incorrect stats report, which will also take care of this issue.

[1] https://review.openstack.org/490116
[2] https://review.openstack.org/486734
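A minimal sketch of the cheaper direction described above, assuming the stats report provisioned (virtual) rather than physical capacity: summing image sizes is one metadata read per image, while physical usage via diff_iterate() walks every object. The helper name and sizes are hypothetical; with the real bindings each size would come from rbd.Image(ioctx, name).size().

```python
# Hypothetical sketch: provisioned capacity from per-image virtual
# sizes, one cheap metadata read per image instead of a full object
# scan per image.

def provisioned_gib(image_sizes_bytes):
    """Total provisioned capacity in GiB from per-image virtual sizes."""
    return sum(image_sizes_bytes) / float(1 << 30)

# Three thin-provisioned 10 GiB volumes: provisioned capacity is 30 GiB
# no matter how little data has actually been written to them.
print(provisioned_gib([10 << 30, 10 << 30, 10 << 30]))  # 30.0
```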

Change abandoned by Mehdi Abaakouk (sileht) (<email address hidden>) on branch: stable/ocata
Review: https://review.openstack.org/489630
Reason: better solution: https://review.openstack.org/#/c/486734/

Change abandoned by Mehdi Abaakouk (sileht) (<email address hidden>) on branch: master
Review: https://review.openstack.org/489718
Reason: better solution: https://review.openstack.org/#/c/486734/

Mohammed Naser (mnaser) wrote :

I'd like to follow up on this issue. With the new changes it gets better, but it is still problematic.

_get_usage_info() loops over all RBD volumes, which can be very slow for a large cloud. Because of this, the cinder-volume process pretty much stops responding, and it also consumes a lot of CPU.

I think it would be ideal if the usage calculation were done in a separate thread so it doesn't block everything. Alternatively, perhaps add an option to make it opt-in for operators who manage capacity externally.
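The background-thread suggestion can be sketched as follows, using plain threading rather than cinder's eventlet machinery (the class name is hypothetical): the slow scan runs on a worker thread, and the stats reader only ever returns the last completed result, so the reporting loop never blocks on the scan.

```python
# Sketch of the background-refresh idea with plain threading (cinder
# itself uses eventlet; UsageReporter is a hypothetical name). The slow
# per-volume scan runs on a worker thread; stats() returns the last
# completed result without waiting.
import threading
import time

class UsageReporter:
    def __init__(self, compute_usage):
        self._compute = compute_usage  # the slow per-volume scan
        self._lock = threading.Lock()
        self._last = None              # last completed result

    def refresh_async(self):
        threading.Thread(target=self._refresh, daemon=True).start()

    def _refresh(self):
        result = self._compute()       # may take minutes on a big pool
        with self._lock:
            self._last = result

    def stats(self):
        with self._lock:               # cheap: never waits for the scan
            return self._last

reporter = UsageReporter(lambda: {"volumes": 200})
reporter.refresh_async()
time.sleep(0.2)  # give the worker time to finish in this toy example
print(reporter.stats())  # {'volumes': 200}
```

Until the first refresh completes, stats() returns None (or stale data after later restarts), which is exactly the trade-off of decoupling reporting from scanning.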
