scheduler falsely reports share service down

Bug #1804208 reported by Maurice Escher on 2018-11-20
This bug affects 1 person
Affects: Manila
Importance: High
Assigned to: Lucio Seki

Bug Description

Hi,

With a low or default service_down_time in the config and a large number of manila-share services (I've seen it with 5), the scheduler can wrongly report a service as down.

I believe this is because

https://github.com/openstack/manila/blob/stable/rocky/manila/scheduler/host_manager.py#L573-L582

collects the heartbeat data first and then loops over the services.
For example, the last service in the loop may be reached after service_down_time has already elapsed. That service would normally have sent a new heartbeat in the meantime, but the loop operates on the old data and never sees it.

I propose to let service_is_up do a live check against the database each time, or at least make it configurable for the caller.
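
As a rough illustration of the timing problem described above, here is a minimal, self-contained sketch. It is not Manila code: `check_services`, `fetch_live` and `seconds_per_service` are hypothetical names, `service_is_up` is a simplified stand-in for the real helper, and the 60-second `SERVICE_DOWN_TIME` is an assumed value.

```python
from datetime import datetime, timedelta

SERVICE_DOWN_TIME = 60  # seconds; assumed value for illustration


def service_is_up(service, now):
    """Simplified heartbeat check: up if the last heartbeat is recent."""
    last_heartbeat = service["updated_at"] or service["created_at"]
    return (now - last_heartbeat).total_seconds() <= SERVICE_DOWN_TIME


def check_services(snapshot, start, seconds_per_service, fetch_live=None):
    """Walk the services, spending `seconds_per_service` on each one.

    Without fetch_live the row from the initial snapshot is checked.
    With fetch_live the row is re-read just before the check.
    """
    results = {}
    now = start
    for row in snapshot:
        service = fetch_live(row["id"], now) if fetch_live else row
        results[row["host"]] = service_is_up(service, now)
        now += timedelta(seconds=seconds_per_service)
    return results


start = datetime(2019, 4, 12, 10, 20, 0)
# Five services whose last heartbeat was 5 seconds before the snapshot.
snapshot = [
    {"id": i, "host": "host@backend%d" % i,
     "updated_at": start - timedelta(seconds=5), "created_at": start}
    for i in range(5)
]


def fetch_live(service_id, now):
    # Stand-in for a fresh DB read: the service kept sending heartbeats,
    # so its newest one is only a few seconds old.
    row = dict(snapshot[service_id])
    row["updated_at"] = now - timedelta(seconds=5)
    return row


stale = check_services(snapshot, start, seconds_per_service=20)
live = check_services(snapshot, start, 20, fetch_live=fetch_live)
print(stale)
print(live)
```

With 20 seconds of simulated work per service, the snapshot-based run marks the last two services as down even though they kept sending heartbeats, while the run that re-reads each row reports all five as up.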

I hope my explanation is understandable.

Cheers,
Maurice

Tom Barron (tpb) on 2018-11-21
tags: added: edge scale
Changed in manila:
importance: Undecided → High
tags: added: backport-potential
Jason Grosso (jgrosso) on 2019-03-22
Changed in manila:
status: New → Triaged
Jason Grosso (jgrosso) on 2019-03-27
Changed in manila:
status: Triaged → New
Jason Grosso (jgrosso) wrote :

This bug will be discussed under the Edge PTG topic

Changed in manila:
status: New → Opinion
status: Opinion → Triaged
Lucio Seki (lseki) wrote :

Hi Maurice,

I started investigating this bug. I tried to reproduce the issue, but without success.

I deployed a DevStack, then configured 8 LVMShareDriver backends and 7 NetAppDriver backends.
Running `manila service-list` a couple seconds after restarting m-shr, I can see all the services up.
The service_down_time is not set, so it's using the default value.

Could you add some details about the environment and the steps you used to find this bug?

Regards,
Lucio

Maurice Escher (maurice-escher) wrote :

Hi Lucio,

thanks for investigating.

Maybe it needs actual payload to be visible - I've seen it with 2000 shares and reporting those takes a while.

Thinking about it again, I remember the server_pools_mapping in particular being too large. I disabled it as an additional workaround. I don't use the PoolWeigher anyway, which is the only consumer of these share statistics afaik.

BR,
Maurice

Lucio Seki (lseki) wrote :

Thanks for the details, Maurice.

I managed to reproduce the issue with 500 shares using the Dummy driver.

Next I'll verify whether your proposed live check fixes the issue.

Cheers,
Lucio

Lucio Seki (lseki) on 2019-04-04
Changed in manila:
assignee: nobody → Lucio Seki (lseki)
Lucio Seki (lseki) wrote :

It seems that `_update_host_state_map` is not the only place to fix.

While creating 500 shares, the manila-scheduler log prints "Share service is down." several times. However, running `manila service-list` still shows the manila-share service as `up`.

But when I restart the manila-share service, `manila service-list` shows it as `down` while it's exporting its 500 shares.

Lucio Seki (lseki) wrote :

Actually, even after modifying `_update_host_state_map` as Maurice suggested, it still logs "Share service is down." while creating the 500 shares:

    def _update_host_state_map(self, context):
        # Get resource usage across the available share nodes:
        topic = CONF.share_topic
        share_services = db.service_get_all_by_topic(context, topic)

        active_hosts = set()
        for service in share_services:
            # Get an updated state of the service
            updated_service = db.service_get(context, service['id'])
            host = updated_service['host']

            # Warn about down services and remove them from host_state_map
            if (not utils.service_is_up(updated_service) or
                    updated_service['disabled']):
                LOG.warning("Share service is down. (host: %s).", host)
                continue
            ...

But despite the warning message, the shares are created successfully, and `manila service-list` shows the manila-share service as `up`.
It's only shown as `down` while exporting the shares, after a manila-share service restart.

Lucio Seki (lseki) wrote :

Sorry, please ignore comments #4-#6.
It's normal for the manila-share service to be shown as `down` for a while, until it has re-exported all the shares.
If it still remains `down` long after restarting, that would be a separate issue to address in a new bug report.

So I haven't managed to reproduce the issue yet.

Hello Team,

I believe I was able to reproduce this issue in my env.
- I added 9 NetApp share backends to Manila (backend details are shown in the Session 5 output).
- I am creating manila shares continuously in 3 different sessions.
- In another parallel session (the 4th), I am grepping the logs to see whether any backend is reported as down because of the above bug.
- In another session (the 5th), I am running "manila service-list | grep ontap | grep down" continuously.

Note : The services were not restarted during this time frame.
       The services were restarted at least 20 mins before running this activity.

I captured the data for a 10-second time frame (Fri Apr 12 06:20:39 EDT 2019 to Fri Apr 12 06:20:48 EDT 2019).
Here is the session output from the 5th session.

################################# Session 5 output #################################
root@25-nareshtwo:/home/stack# date
Fri Apr 12 06:20:39 EDT 2019
root@25-nareshtwo:/home/stack# manila service-list
+----+------------------+----------------------------+---------------+---------+-------+----------------------------+
| Id | Binary           | Host                       | Zone          | Status  | State | Updated_at                 |
+----+------------------+----------------------------+---------------+---------+-------+----------------------------+
| 1  | manila-share     | 25-nareshtwo@london        | manila-zone-0 | enabled | down  | 2019-04-12T09:58:12.000000 |
| 2  | manila-share     | 25-nareshtwo@paris         | manila-zone-1 | enabled | down  | 2019-04-12T09:58:12.000000 |
| 3  | manila-scheduler | 25-nareshtwo               | nova          | enabled | up    | 2019-04-12T10:20:36.000000 |
| 4  | manila-data      | 25-nareshtwo               | nova          | enabled | up    | 2019-04-12T10:20:43.000000 |
| 5  | manila-share     | 25-nareshtwo@ontap2        | nova          | enabled | up    | 2019-04-12T10:20:41.000000 |
| 6  | manila-share     | 25-nareshtwo@ontap6        | nova          | enabled | up    | 2019-04-12T10:20:41.000000 |
| 7  | manila-share     | 25-nareshtwo@ontapreplica6 | nova          | enabled | up    | 2019-04-12T10:20:41.000000 |
| 8  | manila-share     | 25-nareshtwo@ontapreplica2 | nova          | enabled | up    | 2019-04-12T10:20:41.000000 |
| 9  | manila-share     | 25-nareshtwo@ontap33       | nova          | enabled | up    | 2019-04-12T10:20:41.000000 |
| 10 | manila-share     | 25-nareshtwo@ontap3        | nova          | enabled | up    | 2019-04-12T10:20:42.000000 |
| 11 | manila-share     | 25-nareshtwo@ontapreplica3 | nova          | enabled | up    | 2019-04-12T10:20:42.000000 |
| 12 | manila-share     | 25-nareshtwo@ontap4        | nova          | enabled | up    | 2019-04-12T10:20:41.000000 |
| 13 | manila-share     | 25-nareshtwo@ontapreplica4 | nova          | enabled | up    | 2019-04-12T10:20:41.000000 |
+----+------------------+----------------------------+---------------+---------+-------+----------------------------+
root@25-nareshtwo:/home/stack# manila service-list | grep ontap | grep down
root@25-nareshtwo:/home/stack# manila service-list | grep ontap | grep down
root@25-nareshtwo:/home/stack# manila...
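
To relate the State column in the listing above to heartbeat age, the sketch below computes how old two of the Updated_at values were when `date` was run. It rests on assumptions not stated in the report: that Updated_at is in UTC, that EDT is UTC-4, and that service_down_time is 60 seconds. `heartbeat_age` is a hypothetical helper, not part of Manila.

```python
from datetime import datetime, timedelta, timezone

SERVICE_DOWN_TIME = 60  # seconds; assumed, check your own manila.conf

EDT = timezone(timedelta(hours=-4))
# `date` printed "Fri Apr 12 06:20:39 EDT 2019" just before the listing.
observed_at = datetime(2019, 4, 12, 6, 20, 39, tzinfo=EDT)

# (host, Updated_at) pairs copied from the service-list output.
rows = [
    ("25-nareshtwo@london", "2019-04-12T09:58:12.000000"),
    ("25-nareshtwo@ontap2", "2019-04-12T10:20:41.000000"),
]


def heartbeat_age(updated_at, now):
    """Seconds between the last heartbeat (assumed UTC) and `now`."""
    last = datetime.strptime(updated_at, "%Y-%m-%dT%H:%M:%S.%f")
    return (now - last.replace(tzinfo=timezone.utc)).total_seconds()


ages = {host: heartbeat_age(ts, observed_at) for host, ts in rows}
states = {
    host: "up" if age <= SERVICE_DOWN_TIME else "down"
    for host, age in ages.items()
}
for host in ages:
    print(host, ages[host], states[host])
```

Under these assumptions the london backend's last heartbeat is about 22 minutes old, which matches its `down` state, while the ontap2 heartbeat is only seconds away from the `date` timestamp. The slightly negative age for ontap2 just reflects that `date` ran shortly before `manila service-list`.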
