get_volume_stats is called twice during volume driver initialization

Bug #1793676 reported by Chhavi Agarwal
Affects: Cinder · Status: New · Importance: Undecided · Assigned to: Unassigned

Bug Description

During cinder-volume service initialization, get_volume_stats is called twice with refresh=True, which impacts performance.

During service initialization, init_host_with_rpc checks the driver's replication capability via get_volume_stats and updates the service replication status.

The first call happens in init_host_with_rpc; it is made only to fetch the replication capability of the volume driver and set it on the service:

        stats = self.driver.get_volume_stats(refresh=True)
        try:
            service = self._get_service()
        except exception.ServiceNotFound:
            with excutils.save_and_reraise_exception():
                LOG.error("Service not found for updating replication_status.")

        if service.replication_status != (
                fields.ReplicationStatus.FAILED_OVER):
            if stats and stats.get('replication_enabled', False):
                service.replication_status = fields.ReplicationStatus.ENABLED
            else:
                service.replication_status = fields.ReplicationStatus.DISABLED

The second call happens in the periodic task _report_driver_status(), which is required to refresh service capabilities and update volume stats.

We can avoid the first get_volume_stats call in init_host_with_rpc(), since it is only used to check the driver's replication capability.
Instead of using get_volume_stats, we could expose replication_enabled as a driver property that reports whether the driver is capable of performing replication; it is a static property determined during driver initialization. get_volume_stats would still be used for updating the current replication status.
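A minimal sketch of this proposal (the class and attribute names here are illustrative, not actual Cinder code): the driver exposes replication_enabled as a property set once at initialization, and the manager reads it instead of making the full stats call.

```python
# Hypothetical sketch of the proposed driver property; VolumeDriver and
# VolumeManager here are simplified stand-ins, not the real Cinder classes.

class VolumeDriver:
    def __init__(self, replication_configured=False):
        # Static capability, determined once during driver initialization.
        self._replication_enabled = replication_configured

    @property
    def replication_enabled(self):
        """Whether this driver is capable of replication."""
        return self._replication_enabled


class VolumeManager:
    def __init__(self, driver):
        self.driver = driver

    def init_host_with_rpc(self):
        # Read the cheap static property instead of calling
        # get_volume_stats(refresh=True) just for one capability flag.
        if self.driver.replication_enabled:
            return 'enabled'
        return 'disabled'
```

The full get_volume_stats call would then run only in the periodic _report_driver_status() task.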

Would like to know the thoughts of the core on the same.

Revision history for this message
John Griffith (john-griffith) wrote :

Hi Chhavi, thanks for filing this. There's a catch with that check, particularly with replication: we didn't want to do the replication checks until *after* we were certain that all services were up and had reported their capabilities (including replication). Note this also ties into the cluster replication/HA initialization, IIRC.

I don't really consider this a performance hit; get_stats is a pretty cheap call. That said, if you see a way to improve this while maintaining HA and volume replication functionality, that'd be great. I think it's more of an enhancement than a bug, though.

Revision history for this message
Chhavi Agarwal (chhagarw) wrote :

Thanks John for the response. I agree that the replication check needs all the services to be up.
What I am trying to propose is this: just to get the replication status, do we really need to make a get_volume_stats() call, which fetches the complete volume stats? Instead, we could have a new method, get_replication_status(), used to fetch only the replication capabilities.
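A sketch of what such a method might look like (get_replication_status() is hypothetical here; it is not part of the actual Cinder driver interface):

```python
# Hypothetical get_replication_status() alongside the existing full stats
# call; this is an illustrative sketch, not actual Cinder code.

class VolumeDriver:
    def __init__(self, replication_configured=False):
        self._replication_configured = replication_configured

    def get_replication_status(self):
        """Return only the replication capabilities, skipping the full
        (and potentially expensive) volume stats collection."""
        return {'replication_enabled': self._replication_configured}

    def get_volume_stats(self, refresh=False):
        # The full stats call: capacity, provisioning info, replication, etc.
        return {
            'total_capacity_gb': 100,
            'free_capacity_gb': 80,
            'replication_enabled': self._replication_configured,
        }
```

init_host_with_rpc() could then call get_replication_status() and leave the full get_volume_stats() to the periodic task.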

Revision history for this message
Chhavi Agarwal (chhagarw) wrote :

Or we can expose replication_enabled as a driver property, which most storage drivers currently do have, to report the replication capabilities.

Revision history for this message
John Griffith (john-griffith) wrote :

@Chhavi
Yeah, the only thing is that just having it configured doesn't mean it's actually set up properly and working. A smart driver would actually verify that it has connectivity and that the connection settings work properly.

I get that the stats call might be expensive for *some* backends, but I'm not completely sure how serious that is. Also, given that this is a periodic task that will just run again in 60 seconds (or whatever your period is set to), I'm not sure how much this really saves you.

I wouldn't object to a replication-specific init, I guess, but I'm still having a hard time seeing the real value. The fact is, we can't do anything with that backend until a full stats update call happens anyway, to get all the capacity info etc.

The property is fine, but that comes with the stats, so I'm not sure what you're thinking there, unless you just want to limit the response? One other thing to keep in mind: you could have the second call set refresh=False if you're worried about taxing the backend.
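The refresh=False idea can be sketched as a simple cache in the driver (illustrative only, not the actual Cinder implementation): refresh=True queries the backend, while refresh=False returns the cached copy without touching the backend again.

```python
import time


class CachingStatsDriver:
    """Illustrative driver that caches its stats; get_volume_stats with
    refresh=False returns the cached copy instead of re-querying the
    backend. This is a sketch, not actual Cinder code."""

    def __init__(self):
        self._stats = None

    def _query_backend(self):
        # Stand-in for an expensive call to the storage backend.
        return {'replication_enabled': True, 'collected_at': time.time()}

    def get_volume_stats(self, refresh=False):
        if refresh or self._stats is None:
            self._stats = self._query_backend()
        return self._stats
```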

