scheduler falsely reports share service down

Bug #1804208 reported by Maurice Escher on 2018-11-20
Affects: OpenStack Shared File Systems Service (Manila)
Importance: Medium
Assigned to: Douglas Viroel

Bug Description

Hi,

With a low/default service_down_time in the config and a high number of manila-share services (I've seen it with 5), it can happen that a service gets wrongly reported as down during scheduling.

I believe this is because

https://github.com/openstack/manila/blob/stable/rocky/manila/scheduler/host_manager.py#L573-L582

collects the heartbeat data first and then loops over the services.
For example, the last service in the loop may only be reached after service_down_time has already elapsed; the service has normally reported a new heartbeat in the meantime, but the loop operates on the old snapshot and does not see it.

I propose to let service_is_up do a live check against the database each time, or at least make it configurable for the caller.
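
To make the race concrete, here is a minimal sketch of the two patterns, reusing the names that appear in host_manager.py (db.service_get_all_by_topic, db.service_get, CONF.share_topic); the body of service_is_up below, a comparison of the heartbeat age against CONF.service_down_time, is only an approximation for illustration, not a copy of the real helper:

    # Sketch of the race described above. Module paths mirror manila's
    # layout, but the body of service_is_up is an approximation for
    # illustration, not a copy of the real helper.
    from oslo_config import cfg
    from oslo_log import log as logging
    from oslo_utils import timeutils

    from manila import db

    CONF = cfg.CONF
    LOG = logging.getLogger(__name__)

    def service_is_up(service):
        # Heartbeat counts as fresh if it is newer than service_down_time.
        last = service['updated_at'] or service['created_at']
        elapsed = (timeutils.utcnow() - last).total_seconds()
        return abs(elapsed) <= CONF.service_down_time

    def update_host_state_map_snapshot(context):
        # Current behaviour: one snapshot, then a possibly slow loop.
        services = db.service_get_all_by_topic(context, CONF.share_topic)
        for service in services:
            # If processing the earlier hosts took longer than
            # service_down_time, this row is stale: the service has sent a
            # new heartbeat in the meantime, but we never see it.
            if not service_is_up(service):
                LOG.warning("Share service is down. (host: %s).",
                            service['host'])

    def update_host_state_map_live(context):
        # Proposed behaviour: re-read each row right before checking it.
        services = db.service_get_all_by_topic(context, CONF.share_topic)
        for service in services:
            fresh = db.service_get(context, service['id'])
            if not service_is_up(fresh):
                LOG.warning("Share service is down. (host: %s).",
                            fresh['host'])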

I hope my explanation is understandable.

Cheers,
Maurice

Tom Barron (tpb) on 2018-11-21
tags: added: edge scale
Changed in manila:
importance: Undecided → High
tags: added: backport-potential
Jason Grosso (jgrosso) on 2019-03-22
Changed in manila:
status: New → Triaged
Jason Grosso (jgrosso) on 2019-03-27
Changed in manila:
status: Triaged → New
Jason Grosso (jgrosso) wrote :

This bug will be discussed under the Edge PTG topic

Changed in manila:
status: New → Opinion
status: Opinion → Triaged
Lucio Seki (lseki) wrote :

Hi Maurice,

I started investigating this bug. I tried to reproduce the issue, but without success.

I deployed a DevStack, then configured 8 LVMShareDriver backends and 7 NetAppDriver backends.
Running `manila service-list` a couple of seconds after restarting m-shr, I can see all the services up.
The service_down_time is not set, so it uses the default value.

Could you add some details about the environment and the steps you used to find this bug?

Regards,
Lucio

Maurice Escher (maurice-escher) wrote :

Hi Lucio,

thanks for investigating.

Maybe it needs an actual payload to become visible: I've seen it with 2000 shares, and reporting those takes a while.

Now that I think about it again, I remember in particular the server_pools_mapping being too large. I disabled it as an additional workaround. I don't use the PoolWeigher anyhow, which is the only consumer of these share statistics, afaik.

BR,
Maurice

Lucio Seki (lseki) wrote :

Thanks for the details, Maurice.

I managed to reproduce the issue with 500 shares using the Dummy driver.
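
For anyone trying the same thing, here is a rough sketch of a loop that drives the manila CLI and polls the service list while shares are being created; the protocol, size, share names, and polling interval are placeholders, and a share type suitable for the Dummy driver must already be configured:

    #!/usr/bin/env python3
    # Rough reproduction helper (sketch, not taken from this report):
    # create many shares via the manila CLI while polling service state.
    import subprocess
    import time

    NUM_SHARES = 500          # matches the count mentioned above
    PROTO, SIZE = "NFS", "1"  # placeholders; adjust for your backend

    for i in range(NUM_SHARES):
        subprocess.run(["manila", "create", PROTO, SIZE,
                        "--name", "repro-%d" % i], check=False)
        # Check the service list while the scheduler is busy placing shares.
        out = subprocess.run(["manila", "service-list"],
                             capture_output=True, text=True).stdout
        if " down " in out:
            print("iteration %d: a service is reported down" % i)
            print(out)
        time.sleep(0.5)  # arbitrary polling interval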

Now I'll verify whether your approach of doing a live check would fix the issue.

Cheers,
Lucio

Lucio Seki (lseki) on 2019-04-04
Changed in manila:
assignee: nobody → Lucio Seki (lseki)
Lucio Seki (lseki) wrote :

It seems that `_update_host_state_map` is not the only place to fix.

While creating 500 shares, the manila-scheduler log starts printing "Share service is down." several times. However, running `manila service-list` still shows the manila-share service as `up`.

But when I restart the manila-share service, `manila service-list` shows the manila-share service as `down` while it is exporting its 500 shares.

Lucio Seki (lseki) wrote :

Actually, even after modifying `_update_host_state_map` as Maurice suggested, it still shows "Share service is down." while creating the 500 shares:

    def _update_host_state_map(self, context):
        # Get resource usage across the available share nodes:
        topic = CONF.share_topic
        share_services = db.service_get_all_by_topic(context, topic)

        active_hosts = set()
        for service in share_services:
            # Get an updated state of the service
            updated_service = db.service_get(context, service['id'])
            host = updated_service['host']

            # Warn about down services and remove them from host_state_map
            if (not utils.service_is_up(updated_service) or
                    updated_service['disabled']):
                LOG.warning("Share service is down. (host: %s).", host)
                continue
            ...

But despite the warning message, the shares are being created successfully, and `manila service-list` shows the manila-share service as `up`.
It is only shown as `down` while exporting the shares, after a manila-share service restart.

Lucio Seki (lseki) wrote :

Sorry, please ignore comments #4-#6.
It's normal for the manila-share service to be shown as `down` for a while until it has re-exported all the shares.
If it still remains `down` long after restarting, that would be a different issue to be addressed in a new bug report.

So I didn't manage to reproduce the issue yet.


Hello Team,

I believe I was able to reproduce this issue in my env.
- I added 9 NetApp share backends to Manila (backend details are shown in the 5th session output below).
- I am trying to create manila shares continuously in 3 different sessions.
- In another parallel session I am grepping the logs to see if any backend is reported as down because of the above bug. -- 4th session
- In another session I am running "manila service-list | grep ontap | grep down" continuously. -- 5th session

Note : The services were not restarted during this time frame.
       The services were restarted at least 20 mins before running this activity.

I have captured the data for a 10-second timeframe, from Fri Apr 12 06:20:39 EDT 2019 to Fri Apr 12 06:20:48 EDT 2019.
Here is the session output from the 5th session.

################################# Session 5 output #################################
root@25-nareshtwo:/home/stack# date
Fri Apr 12 06:20:39 EDT 2019
root@25-nareshtwo:/home/stack# manila service-list
+----+------------------+----------------------------+---------------+---------+-------+----------------------------+
| Id | Binary           | Host                       | Zone          | Status  | State | Updated_at                 |
+----+------------------+----------------------------+---------------+---------+-------+----------------------------+
| 1  | manila-share     | 25-nareshtwo@london        | manila-zone-0 | enabled | down  | 2019-04-12T09:58:12.000000 |
| 2  | manila-share     | 25-nareshtwo@paris         | manila-zone-1 | enabled | down  | 2019-04-12T09:58:12.000000 |
| 3  | manila-scheduler | 25-nareshtwo               | nova          | enabled | up    | 2019-04-12T10:20:36.000000 |
| 4  | manila-data      | 25-nareshtwo               | nova          | enabled | up    | 2019-04-12T10:20:43.000000 |
| 5  | manila-share     | 25-nareshtwo@ontap2        | nova          | enabled | up    | 2019-04-12T10:20:41.000000 |
| 6  | manila-share     | 25-nareshtwo@ontap6        | nova          | enabled | up    | 2019-04-12T10:20:41.000000 |
| 7  | manila-share     | 25-nareshtwo@ontapreplica6 | nova          | enabled | up    | 2019-04-12T10:20:41.000000 |
| 8  | manila-share     | 25-nareshtwo@ontapreplica2 | nova          | enabled | up    | 2019-04-12T10:20:41.000000 |
| 9  | manila-share     | 25-nareshtwo@ontap33       | nova          | enabled | up    | 2019-04-12T10:20:41.000000 |
| 10 | manila-share     | 25-nareshtwo@ontap3        | nova          | enabled | up    | 2019-04-12T10:20:42.000000 |
| 11 | manila-share     | 25-nareshtwo@ontapreplica3 | nova          | enabled | up    | 2019-04-12T10:20:42.000000 |
| 12 | manila-share     | 25-nareshtwo@ontap4        | nova          | enabled | up    | 2019-04-12T10:20:41.000000 |
| 13 | manila-share     | 25-nareshtwo@ontapreplica4 | nova          | enabled | up    | 2019-04-12T10:20:41.000000 |
+----+------------------+----------------------------+---------------+---------+-------+----------------------------+
root@25-nareshtwo:/home/stack# manila service-list | grep ontap | grep down
root@25-nareshtwo:/home/stack# manila service-list | grep ontap | grep down
root@25-nareshtwo:/home/stack# manila...

Jason Grosso (jgrosso) wrote :

Do we have any updates on this issue?

Jason Grosso (jgrosso) on 2019-11-14
Changed in manila:
importance: High → Medium
Lucio Seki (lseki) wrote :

Not yet. Not sure if we'll be able to work on this issue during Ussuri.

Vida Haririan (vhariria) wrote :

Do we have any updates on this issue?

Douglas Viroel (dviroel) on 2020-07-10
Changed in manila:
assignee: Lucio Seki (lseki) → Douglas Viroel (dviroel)
kiran pawar (kiranpawar89) wrote :

Lucio, Naresh, Douglas,
What are the exact steps to reproduce this bug?

Douglas Viroel (dviroel) wrote :

Hi Kiran,

I'm not working on reproducing or fixing this issue at the moment.
We don't know exactly how to reproduce it, but you might start by adding many backends to your environment to see if you can get the scheduler overloaded.

Not sure if Maurice already fixed this in his environment, or if he has more information to share with us.

Thanks

kiran pawar (kiranpawar89) wrote :

I created 6 generic backends in devstack and created around 350 shares (sizes between 1 and 6). During the share creation process, I kept checking `manila service-list` and found that all services were up the whole time. Then I restarted the m-shr service, checked the service list, and again found that all services were up. It seems I am not able to reproduce this bug for now.
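
Since a transient `down` state is easy to miss when running `manila service-list` by hand, here is a small monitoring sketch that tails the scheduler log for the warning quoted earlier in this report and captures the service list at that moment; the log path is deployment-specific and just an assumption:

    #!/usr/bin/env python3
    # Monitoring sketch (log path is an assumption): watch the scheduler log
    # for the "Share service is down" warning and capture `manila
    # service-list` at that moment for comparison.
    import subprocess
    import time

    LOG_PATH = "/var/log/manila/manila-scheduler.log"  # deployment-specific

    def service_list():
        return subprocess.run(["manila", "service-list"],
                              capture_output=True, text=True).stdout

    with open(LOG_PATH) as log_file:
        log_file.seek(0, 2)  # start at the end of the file, like `tail -f`
        while True:
            line = log_file.readline()
            if not line:
                time.sleep(0.2)
                continue
            if "Share service is down" in line:
                print(line.rstrip())
                # Compare with what the API reports at the same instant.
                print(service_list())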

Maurice,
Can you give it a try and let us know if you can reproduce it? If yes, please share detailed steps.
