NetApp utilization metrics slow

Bug #1804239 reported by Maurice Escher
This bug affects 1 person
Affects: OpenStack Shared File Systems Service (Manila)
Status: Triaged
Importance: Medium
Assigned to: Felipe Rodrigues

Bug Description

Hi,

the NetApp cDOT driver's cluster node utilization metrics slow down my requests so much, and/or run into timeout errors so often, that the service is unusable for highly utilized back ends.

I suggest implementing a switch that allows operators to disable this feature.

Regards,
Maurice

Tags: driver netapp
Tom Barron (tpb)
tags: added: driver
Goutham Pacha Ravi (gouthamr) wrote :

Hey Maurice,

What are the utilization metrics that you're talking about? Performance metrics?

Maurice Escher (maurice-escher) wrote :
Lucio Seki (lseki)
Changed in manila:
assignee: nobody → Lucio Seki (lseki)
Jason Grosso (jgrosso)
Changed in manila:
status: New → Triaged
importance: Undecided → Medium
Fernando Ferraz (fernando-ferraz) wrote :

Hello guys, I've found a possible scenario that may be slowing down requests in a dhss=true environment.

When running with dhss=true, the share manager tries to fetch the pools for each share server [0] by calling the driver function get_share_server_pools(share_server) [1], which is responsible for gathering only the pools associated with that specific share server.

The NetApp cDOT driver always associates all pools with all vservers, so it just ignores the share server parameter and returns information about all pools instead. If your environment has many share servers, the driver will repeatedly request the same information from the storage, which looks totally unnecessary.

Caching the pool data periodically and returning just the cached information to the manager would probably improve driver performance in an environment with many share servers.

[0] https://github.com/openstack/manila/blob/10bb9e8efc3b2c8c8c2d6168d1e215fb354b355e/manila/share/manager.py#L3571
[1] https://github.com/openstack/manila/blob/10bb9e8efc3b2c8c8c2d6168d1e215fb354b355e/manila/share/drivers/netapp/dataontap/cluster_mode/lib_base.py#L300
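
A minimal sketch of the caching idea described above, not the actual driver code. The class name PoolStatsCache, the fetch callable, and the 60-second lifetime are assumptions made for illustration; the real driver would plug in its own pool query.

```python
import time
from typing import Callable, Dict, List, Optional


class PoolStatsCache:
    """Time-based cache around an expensive pool query (hypothetical helper)."""

    def __init__(self, fetch: Callable[[], List[Dict]],
                 ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch          # function that really queries the storage
        self._ttl = ttl_seconds      # how long a cached result stays valid
        self._clock = clock          # injectable clock, handy for testing
        self._pools: Optional[List[Dict]] = None
        self._fetched_at = 0.0

    def get_pools(self) -> List[Dict]:
        now = self._clock()
        # Query the storage only when there is no data yet or it has expired.
        if self._pools is None or now - self._fetched_at >= self._ttl:
            self._pools = self._fetch()
            self._fetched_at = now
        return self._pools


calls = {"count": 0}


def query_storage() -> List[Dict]:
    # Stand-in for the real (slow) request to the storage back end.
    calls["count"] += 1
    return [{"pool_name": "aggr1"}, {"pool_name": "aggr2"}]


cache = PoolStatsCache(query_storage, ttl_seconds=60.0)
for _share_server in range(200):
    cache.get_pools()       # same answer for every share server
print(calls["count"])       # the storage was queried only once
```

Because the driver returns the same pools for every share server, one cached result can serve all of them until it expires.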

Naresh Kumar Gunjalli (nareshkumarg) wrote :

Hello Fernando,

Yeah, you are right. That's a really great finding.

Caching these values with a periodic task will overcome this issue if the backends are deployed with DHSS=True.

But if the issue is seen in DHSS=False environments, we need to do something else.
One fix for this scenario, which we already have, is to provide an extra param (netapp_enable_perf_polling) to disable performance polling.
https://review.openstack.netapp.com/#/c/1053/

But as discussed, there is a limitation when we use this extra param (netapp_enable_perf_polling) to disable performance polling.
If performance polling needs to be disabled, it has to be disabled on all the backends, because the goodness and filter functions depend on the utilization value (which is derived from the perf metrics of the node).
When perf polling is disabled, we set the utilization value to 50.
Hence there will be a problem while provisioning when polling is disabled on only a few backends, as the utilization values are then not derived using the same criteria across all the backends.
The customers need to be aware of this limitation.
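
To illustrate the limitation, here is a small sketch under stated assumptions. The goodness formula (100 minus utilization) and the backend names are hypothetical; only the default value of 50 comes from this thread.

```python
DEFAULT_UTILIZATION = 50  # value reported when perf polling is disabled


def reported_utilization(real_utilization: int, perf_polling_enabled: bool) -> int:
    # With polling off, the real load is never measured and the default is used.
    return real_utilization if perf_polling_enabled else DEFAULT_UTILIZATION


def goodness(utilization: int) -> int:
    # Hypothetical goodness function: a less utilized backend scores higher.
    return 100 - utilization


backends = {
    # real load 90, but polling is disabled, so it reports 50
    "busy_polling_off": reported_utilization(90, perf_polling_enabled=False),
    # real load 60, polling enabled, so it reports 60
    "moderate_polling_on": reported_utilization(60, perf_polling_enabled=True),
}

best = max(backends, key=lambda name: goodness(backends[name]))
print(best)  # busy_polling_off
```

The busier backend wins here only because its utilization is a placeholder, which is why mixing backends with and without polling skews placement.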

Thanks,
Naresh

Felipe Rodrigues (felipefutty) wrote :

Hi all,

I am working on this bug and could not reproduce it. I tested with an overloaded environment: 200 share servers (each with one share).

I ran a script to extend the size of each share (200 in total). It worked fine (no timeout error). I also ran that script without the utilization metrics (using the same approach to skip them as [1]), and it took the same time as with the metrics.

Maurice, could you give more details?

1. Has the workaround provided by [1] fixed this bug? If so, we may improve the fix to cache instead of skipping.

2. Which operation is returning the timeout error?

3. Is it an API, scheduler or share error? If possible, provide the log for us.

4. What does the environment look like: number of shares, servers and so on?

Thanks, Felipe.

[1] https://github.com/sapcc/manila/commit/cfeef704fc2921471848fb1002678a8b264887f5

Changed in manila:
assignee: Lucio Seki (lseki) → Felipe Rodrigues (felipefutty)
Maurice Escher (maurice-escher) wrote :

Hi Felipe,

thanks for working on this.

1. Yes, this is a dumb workaround that just disables the feature, but it does the job ;)
And yes, I agree this can be improved.

2. & 3. I can't easily reproduce it either; it happened only in production and I don't dare to turn it back on to get fresh errors. I think they were mostly share service errors.
I found this snippet in my error mails: http://paste.openstack.org/show/798855/

4. an affected example environment is a FAS8080 with about 850 shares, 80 share servers, 50% used space

Hope this helps,
Maurice

Vida Haririan (vhariria) wrote :
Felipe Rodrigues (felipefutty) wrote :

Thank you, Maurice.

Regarding the issue that Fernando pointed out, we are working on a patch to fix it.
We created a cache that is shared by the pool status updates of all share servers.
Below is the measured time for collecting the pools for the share servers (no cache):

Servers | Pool update time (s)
--------|---------------------
1       | 0.44
10      | 4.01
50      | 20.02
100     | 40.40
200     | 80.27

With the cache, the time is always around 1 second, no matter the number of servers.
We are not sure if the patch will fix the bug, though, because we are still struggling
to reproduce the Timeout error.
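
As a quick check of how the uncached numbers scale, the measurements from the table above can be divided by the number of share servers. This is only arithmetic on the reported values; the extrapolation to 80 share servers (the size of the affected environment mentioned earlier) assumes the same per-server cost and is not a measurement.

```python
# Measurements copied from the table above: share servers -> seconds (no cache).
measured = {1: 0.44, 10: 4.01, 50: 20.02, 100: 40.40, 200: 80.27}

# Cost per share server for each measurement.
per_server = {n: t / n for n, t in measured.items()}
for n, cost in per_server.items():
    print(f"{n:>3} servers: {cost:.3f} s per server")

# Rough extrapolation to 80 share servers, assuming the same per-server cost
# (an assumption, not a measured value).
average_cost = sum(per_server.values()) / len(per_server)
estimate_80 = 80 * average_cost
print(f"estimated time for 80 servers: {estimate_80:.1f} s")
```

The per-server cost stays at roughly 0.4 seconds in every row, so without the cache the update time grows linearly with the number of share servers.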

Maurice, in order to reproduce it, could you provide some extra information:

1. How many nodes are in the cluster?

2. How many backends are you using?

3. How many aggregates (pools)?

4. What does the cluster respond when you send it this XML http://paste.openstack.org/show/799078/ ?

Best regards, Felipe.

Maurice Escher (maurice-escher) wrote :

Hi Felipe,

the cache solution sounds good :)

Here comes the extra info:
1. 2 nodes (named stnpa3-01, stnpa3-02)
2. We have one manila-share process (in our case: container) running per backend. In this installation we have 5 backends
3. 2 aggregates (1 per node)
4. http://paste.openstack.org/show/799206/ though I used version="1.150" because the system is currently running ONTAP NetApp Release 9.5P10D2.

BR,
Maurice

Felipe Rodrigues (felipefutty) wrote :

Hi Maurice,

I have tried to reproduce the bug with the environment that you reported:

-> almost a thousand shares
-> more than a hundred share servers
-> 2 backends
-> 30 mounted shares with users writing and erasing data constantly
-> a shorter interval for updating metrics (10 seconds instead of 60)

Looking at the log over a long period of time, I still could not get the reported "Timeout" error.

There might be something in your production manila that we cannot reproduce in our lab.

Given that it is an ONTAP error, we are working on a workaround that catches the error and falls back to the default utilization value. Some questions:

-> Are you using the 'utilization' in filter and goodness functions?

-> How often does the error occur?

Best regards, Felipe.

Douglas Viroel (dviroel) wrote :

Hi Felipe and Maurice,

Even without reproducing the reported error, we saw that the NetApp performance component already handles any storage error and sets the 'utilization' value to the default (50) [1][2].
Depending on the real utilization of the system, it is expected that the storage will sometimes answer with a timeout error, but this won't cause any trouble for the function that calls the performance library.

Felipe also proposed an improvement to the pool status update [3] that should reduce the number of calls made to the storage when working in DHSS=True mode, thus also avoiding many performance counter calls.

Now, I also believe that if an operator isn't using the 'utilization' value provided in the pool info, there is no need to keep the driver calling the performance library to retrieve such info. In this case, we could add a new back end config option that disables the 'utilization' metrics update.

So @Maurice, let us know if the proposed fix [3] mitigates your timeout issues, or if you want a back end config option that disables the 'utilization' metrics update, in case you don't use it at all.

Thanks

[1] https://opendev.org/openstack/manila/src/branch/master/manila/share/drivers/netapp/dataontap/cluster_mode/performance.py#L338-L348
[2] https://opendev.org/openstack/manila/src/branch/master/manila/share/drivers/netapp/dataontap/cluster_mode/performance.py#L107-L114
[3] https://review.opendev.org/#/c/760696/

Maurice Escher (maurice-escher) wrote :

Hi Felipe and Douglas,

thanks for the insights.

Ideally I want to use the utilization for the goodness functions, as you correctly guessed. Therefore, ultimately I don't need a config option to disable the metrics collection.
And the cache is a really good outcome!

Re [1]:
The storage error is not raised and it falls back to a default, therefore logging an "exception" seems wrong to me - imho it should not be higher than a "warning" in this case.
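
A minimal sketch of what that could look like, not the code in performance.py. StorageError, get_node_utilization and the fetch callable are hypothetical names used for illustration; only the default of 50 and the node name come from this thread.

```python
import logging

LOG = logging.getLogger(__name__)
DEFAULT_UTILIZATION = 50  # fallback value mentioned earlier in this thread


class StorageError(Exception):
    """Stand-in for the storage API error raised by the driver (hypothetical)."""


def get_node_utilization(fetch_utilization, node_name):
    """Return the node utilization, or the default if the storage call fails."""
    try:
        return fetch_utilization(node_name)
    except StorageError as exc:
        # The error is handled and a default is used, so a warning is enough;
        # LOG.exception would add a full traceback to the log.
        LOG.warning("Could not get utilization for node %s, using default %s: %s",
                    node_name, DEFAULT_UTILIZATION, exc)
        return DEFAULT_UTILIZATION


def timing_out_fetch(node_name):
    # Simulates a storage request that times out.
    raise StorageError("timeout")


print(get_node_utilization(timing_out_fetch, "stnpa3-01"))  # 50
```

The caller still gets a usable value, and the log records the failure at a level that matches its impact.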

Best regards,
Maurice

[1] https://opendev.org/openstack/manila/src/branch/master/manila/share/drivers/netapp/dataontap/cluster_mode/performance.py#L338-L348
