NetApp utilization metrics slow

Bug #1804239 reported by Maurice Escher on 2018-11-20
This bug affects 1 person
Affects: Manila
Status: Triaged
Importance: Medium
Assigned to: Lucio Seki
Milestone: —

Bug Description

Hi,

the NetApp cDOT driver's cluster node utilization metrics slow down my requests so much, and/or run into timeout errors, that the service is unusable for highly utilized back ends.

I suggest implementing a switch that allows operators to disable this feature.

Regards,
Maurice

Tom Barron (tpb) on 2018-11-21
tags: added: driver
Goutham Pacha Ravi (gouthamr) wrote:

Hey Maurice,

What are the utilization metrics that you're talking about? Performance metrics?

Lucio Seki (lseki) on 2019-03-07
Changed in manila:
assignee: nobody → Lucio Seki (lseki)
Jason Grosso (jgrosso) on 2019-03-07
Changed in manila:
status: New → Triaged
importance: Undecided → Medium

Hello guys, I've found a possible scenario that may be slowing down requests in a dhss=true environment.

When running with dhss=true, the share manager tries to fetch the pools for each share server [0] by calling the driver function get_share_server_pools(share_server) [1], which is responsible for gathering only the pools associated with that specific share server.

The NetApp cDOT driver always associates all pools to all vservers, so it just ignores the share server parameter and instead returns information about all pools. If your environment has many share servers, the driver will repeatedly request the same information from the storage, which looks completely unnecessary.

Periodically caching the pool data and returning just the cached information to the manager would probably improve driver performance in a multi-share-server environment (see the sketch below the references).

[0] https://github.com/openstack/manila/blob/10bb9e8efc3b2c8c8c2d6168d1e215fb354b355e/manila/share/manager.py#L3571
[1] https://github.com/openstack/manila/blob/10bb9e8efc3b2c8c8c2d6168d1e215fb354b355e/manila/share/drivers/netapp/dataontap/cluster_mode/lib_base.py#L300
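For what it's worth, here is a minimal sketch of the caching idea. The PoolCache helper, the fetch_pools_fn callable (standing in for the ONTAP query), and the 60-second interval are assumptions for illustration, not the actual driver code:

import time


class PoolCache(object):
    """Hypothetical cache for pool stats, refreshed at most once per interval."""

    def __init__(self, fetch_pools_fn, interval=60):
        self._fetch_pools = fetch_pools_fn  # placeholder for the ONTAP query
        self._interval = interval
        self._cached_pools = None
        self._last_refresh = 0.0

    def get_pools(self):
        now = time.monotonic()
        if self._cached_pools is None or now - self._last_refresh > self._interval:
            # Hit the storage back end only when the cache is stale.
            self._cached_pools = self._fetch_pools()
            self._last_refresh = now
        return self._cached_pools

Since the cDOT driver returns the same pool set regardless of the share server, get_share_server_pools(share_server) could simply ignore the parameter and return cache.get_pools(), turning one ONTAP query per share server into one query per interval.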

Hello Fernando,

Yeah, you are right. That's a really great finding.

Caching these values with a periodic task will overcome the issue for back ends deployed with DHSS=True.

But if the issue is also seen in DHSS=False environments, we need to do something else.
One fix we already have for this scenario is to provide an extra parameter (netapp_enable_perf_polling) to disable performance polling.
https://review.openstack.netapp.com/#/c/1053/

But as discussed, this extra parameter (netapp_enable_perf_polling) has a limitation when used to disable performance polling.
If we need to disable performance polling, it needs to be disabled on all back ends, because the goodness and filter functions depend on the utilization value (which is derived from the node's performance metrics).
When performance polling is disabled, we set the utilization value to 50 (see the sketch after this comment).
Hence there will be a problem during provisioning when only a few back ends have performance polling disabled, as the utilization values are not derived using the same criteria across all back ends.
Customers need to be aware of this limitation.

Thanks,
Naresh
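For illustration, a minimal sketch of what such an option and its fallback could look like with oslo.config. The option name netapp_enable_perf_polling and the fixed value 50 come from the comment above; the get_node_utilization wrapper and the perf_library object are assumed names, and the real change lives in the linked review:

from oslo_config import cfg

# Hypothetical registration of the option discussed above.
netapp_perf_opts = [
    cfg.BoolOpt('netapp_enable_perf_polling',
                default=True,
                help='Poll ONTAP performance counters to derive the node '
                     'utilization reported to the scheduler.'),
]

CONF = cfg.CONF
CONF.register_opts(netapp_perf_opts)

DEFAULT_UTILIZATION = 50  # fixed value used when polling is disabled


def get_node_utilization(perf_library, node_name):
    """Return polled utilization, or the fixed default when polling is off."""
    if not CONF.netapp_enable_perf_polling:
        # Mixing polled and fixed values across back ends skews the
        # goodness/filter function comparisons in the scheduler, hence the
        # advice to disable polling on all back ends or on none.
        return DEFAULT_UTILIZATION
    return perf_library.get_node_utilization(node_name)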
