scheduler falsely reports share service down

Bug #1804208 reported by Maurice Escher
This bug affects 2 people
Affects: OpenStack Shared File Systems Service (Manila)
Status: Expired
Importance: Medium
Assigned to: kiran pawar
Milestone: none

Bug Description

Hi,

With a low/default service_down_time in the config and a high number of manila-share services (I've seen it with 5), it can happen that a service is wrongly reported as down during scheduling.

I believe this is because

https://github.com/openstack/manila/blob/stable/rocky/manila/scheduler/host_manager.py#L573-L582

collects the heartbeat data first and then loops over the services.
For example, the last service in the loop may only be reached after service_down_time has already passed; that service has normally sent a new heartbeat in the meantime, but the loop operates on the old data and does not see it.

I propose letting service_is_up do a live check against the database each time, or at least making that configurable for the caller.
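Roughly, the current pattern and the proposed change look like this (a simplified sketch only, not the actual stable/rocky code; db, utils, CONF and LOG are the module-level names from host_manager.py, and the refresh flag is just an illustration of making the behaviour configurable):

    # Current pattern (simplified): the service rows are loaded once,
    # then checked one by one in a loop.
    share_services = db.service_get_all_by_topic(context, CONF.share_topic)
    for service in share_services:
        # service_is_up() compares the snapshot's updated_at against
        # CONF.service_down_time. If the loop body is slow (many backends,
        # large stats payloads), the last services are evaluated long after
        # the snapshot was taken and can look stale even though they have
        # sent fresh heartbeats since.
        if not utils.service_is_up(service) or service['disabled']:
            LOG.warning("Share service is down. (host: %s).", service['host'])
            continue

    # Proposed (sketch): optionally re-read the row right before the check,
    # so the decision is based on the freshest heartbeat.
    def service_is_up_live(context, service, refresh=True):
        if refresh:
            service = db.service_get(context, service['id'])
        return utils.service_is_up(service)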

I hope my explanation is understandable.

Cheers,
Maurice

Tom Barron (tpb)
tags: added: edge scale
Changed in manila:
importance: Undecided → High
tags: added: backport-potential
Jason Grosso (jgrosso)
Changed in manila:
status: New → Triaged
Jason Grosso (jgrosso)
Changed in manila:
status: Triaged → New
Revision history for this message
Jason Grosso (jgrosso) wrote :

This bug will be discussed under the Edge PTG topic

Changed in manila:
status: New → Opinion
status: Opinion → Triaged
Revision history for this message
Lucio Seki (lseki) wrote :

Hi Maurice,

I started investigating this bug and tried to reproduce the issue, but without success.

I deployed a DevStack, then configured 8 LVMShareDriver backends and 7 NetAppDriver backends.
Running `manila service-list` a couple seconds after restarting m-shr, I can see all the services up.
The service_down_time is not set, so it's using the default value.

Could you add some details about the environment and the steps you used to find this bug?

Regards,
Lucio

Revision history for this message
Maurice Escher (maurice-escher) wrote :

Hi Lucio,

thanks for investigating.

Maybe it needs an actual payload to be visible - I've seen it with 2000 shares, and reporting those takes a while.

Now that I think about it again, I especially remember the server_pools_mapping being too large. I disabled it as an additional workaround. I don't use the PoolWeigher anyway, which is the only consumer of these share statistics afaik.

BR,
Maurice

Revision history for this message
Lucio Seki (lseki) wrote :

Thanks for the details, Maurice.

I managed to reproduce the issue with 500 shares using Dummy driver.

Now I'll verify if your approach doing a live check would fix the issue.

Cheers,
Lucio

Lucio Seki (lseki)
Changed in manila:
assignee: nobody → Lucio Seki (lseki)
Revision history for this message
Lucio Seki (lseki) wrote :

Seems that `_update_host_state_map` is not the only place to fix.

While creating 500 shares, the manila-scheduler log starts printing "Share service is down." several times. However, running `manila service-list` still shows the manila-share service as `up`.

But when I restart the manila-share service, `manila service-list` shows it as `down` while it is exporting its 500 shares.

Revision history for this message
Lucio Seki (lseki) wrote :

Actually, even after modifying `_update_host_state_map` as Maurice suggested, it still shows "Share service is down." while creating the 500 shares:

    def _update_host_state_map(self, context):
        # Get resource usage across the available share nodes:
        topic = CONF.share_topic
        share_services = db.service_get_all_by_topic(context, topic)

        active_hosts = set()
        for service in share_services:
            # Get an updated state of the service
            updated_service = db.service_get(context, service['id'])
            host = updated_service['host']

            # Warn about down services and remove them from host_state_map
            if (not utils.service_is_up(updated_service) or
                    updated_service['disabled']):
                LOG.warning("Share service is down. (host: %s).", host)
                continue
            ...

But despite the warning message, the shares are created successfully, and `manila service-list` shows the manila-share service as `up`.
It is only shown as `down` while it is re-exporting the shares after a manila-share service restart.

Revision history for this message
Lucio Seki (lseki) wrote :

Sorry, please ignore the comments #4-#6.
It's normal for the manila-share service to be shown as `down` for a while until it has re-exported all the shares.
If it still remains `down` long after restarting, that would be a separate issue to address in a new bug report.

So I didn't manage to reproduce the issue yet.

Revision history for this message
Naresh Kumar Gunjalli (nareshkumarg) wrote :

Hello Team,

I believe I was able to reproduce this issue in my env.
- I added 9 NetApp share backends to Manila (backend details are shown in the session 5 output below).
- I am creating manila shares continuously in 3 different sessions.
- In another parallel session (session 4), I am grepping the logs to see if any backend is reported as down because of the above bug.
- In another session (session 5), I am running "manila service-list | grep ontap | grep down" continuously.

Note: The services were not restarted during this time frame.
      They were restarted at least 20 minutes before running this activity.

I have captured the data for a 10-second timeframe, from Fri Apr 12 06:20:39 EDT 2019 to Fri Apr 12 06:20:48 EDT 2019.
Here is the session output from the 5th session.

################################# Session 5 output #################################
root@25-nareshtwo:/home/stack# date
Fri Apr 12 06:20:39 EDT 2019
root@25-nareshtwo:/home/stack# manila service-list
+----+------------------+----------------------------+---------------+---------+-------+----------------------------+
| Id | Binary | Host | Zone | Status | State | Updated_at |
+----+------------------+----------------------------+---------------+---------+-------+----------------------------+
| 1 | manila-share | 25-nareshtwo@london | manila-zone-0 | enabled | down | 2019-04-12T09:58:12.000000 |
| 2 | manila-share | 25-nareshtwo@paris | manila-zone-1 | enabled | down | 2019-04-12T09:58:12.000000 |
| 3 | manila-scheduler | 25-nareshtwo | nova | enabled | up | 2019-04-12T10:20:36.000000 |
| 4 | manila-data | 25-nareshtwo | nova | enabled | up | 2019-04-12T10:20:43.000000 |
| 5 | manila-share | 25-nareshtwo@ontap2 | nova | enabled | up | 2019-04-12T10:20:41.000000 |
| 6 | manila-share | 25-nareshtwo@ontap6 | nova | enabled | up | 2019-04-12T10:20:41.000000 |
| 7 | manila-share | 25-nareshtwo@ontapreplica6 | nova | enabled | up | 2019-04-12T10:20:41.000000 |
| 8 | manila-share | 25-nareshtwo@ontapreplica2 | nova | enabled | up | 2019-04-12T10:20:41.000000 |
| 9 | manila-share | 25-nareshtwo@ontap33 | nova | enabled | up | 2019-04-12T10:20:41.000000 |
| 10 | manila-share | 25-nareshtwo@ontap3 | nova | enabled | up | 2019-04-12T10:20:42.000000 |
| 11 | manila-share | 25-nareshtwo@ontapreplica3 | nova | enabled | up | 2019-04-12T10:20:42.000000 |
| 12 | manila-share | 25-nareshtwo@ontap4 | nova | enabled | up | 2019-04-12T10:20:41.000000 |
| 13 | manila-share | 25-nareshtwo@ontapreplica4 | nova | enabled | up | 2019-04-12T10:20:41.000000 |
+----+------------------+----------------------------+---------------+---------+-------+----------------------------+
root@25-nareshtwo:/home/stack# manila service-list | grep ontap | grep down
root@25-nareshtwo:/home/stack# manila service-list | grep ontap | grep down
root@25-nareshtwo:/home/stack# manila...

Revision history for this message
Jason Grosso (jgrosso) wrote :

Do we have any updates on this issue?

Jason Grosso (jgrosso)
Changed in manila:
importance: High → Medium
Revision history for this message
Lucio Seki (lseki) wrote :

Not yet. Not sure if we'll be able to work on this issue during Ussuri.

Revision history for this message
Vida Haririan (vhariria) wrote :

Do we have any updates on this issue?

Douglas Viroel (dviroel)
Changed in manila:
assignee: Lucio Seki (lseki) → Douglas Viroel (dviroel)
Revision history for this message
kiran pawar (kpdev) wrote :

Lucio, Naresh, Douglas,
What are the exact steps to reproduce this bug?

Revision history for this message
Douglas Viroel (dviroel) wrote :

Hi Kiran,

I'm not working on reproducing or fixing this issue at the moment.
We don't know exactly how to reproduce it, but you might start by adding many backends to your environment to see if you can overload the scheduler.

Not sure if Maurice already fixed this in his environment, or if he has more information to share with us.

Thanks

Revision history for this message
kiran pawar (kpdev) wrote :

I created 6 generic backends in devstack and created around 350 shares (with sizes between 1 and 6). During the share creation process, I kept checking `manila service-list` and found that all services were up the whole time. Then I restarted the m-shr service, checked the service list again, and all services were still up. It seems I am not able to reproduce this bug for now.

Maurice,
Can you give it a try and let us know if you can reproduce it? If yes, please share detailed steps.

Douglas Viroel (dviroel)
Changed in manila:
assignee: Douglas Viroel (dviroel) → Carlos Eduardo (silvacarlose)
Vida Haririan (vhariria)
Changed in manila:
status: Triaged → Incomplete
assignee: Carlos Eduardo (silvacarlose) → Maurice Escher (maurice-escher)
Revision history for this message
Vida Haririan (vhariria) wrote :

This bug was discussed at https://meetings.opendev.org/meetings/manila/2022/manila.2022-06-09-15.00.log.html.

As discussed, the status was set to Incomplete pending feedback from the new assignee.

Hi Maurice,
Can you give it a try and let us know if you can reproduce it? If yes, please share detailed steps.

Vida Haririan (vhariria)
Changed in manila:
milestone: none → zed-3
Vida Haririan (vhariria)
Changed in manila:
assignee: Maurice Escher (maurice-escher) → nobody
assignee: nobody → Maurice Escher (maurice-escher)
milestone: zed-3 → none
assignee: Maurice Escher (maurice-escher) → nobody
Revision history for this message
kiran pawar (kpdev) wrote :

Can we consider a fix along the following lines?
1. Check whether the service is up using the difference of timestamps.
2. If true, return True.
3. Otherwise, fetch the latest updated_at/created_at from the DB again and check the difference of timestamps once more.
4. Return the value from step 3.

Reading from the DB one additional time, only for services that are not up as per step 1, would be a good tradeoff I believe. Basically it is a live check for a service that fails to report as 'up'. A rough sketch of this idea is below.
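A minimal sketch of that idea, reusing the existing utils.service_is_up() and db.service_get() helpers (the wrapper name and signature are only illustrative, not an existing API):

    from manila import db
    from manila import utils


    def service_is_up_with_recheck(context, service):
        # Steps 1 and 2: cheap check against the timestamps we already have.
        if utils.service_is_up(service):
            return True
        # Step 3: only for services that look down, re-read updated_at /
        # created_at from the database and repeat the timestamp comparison.
        fresh_service = db.service_get(context, service['id'])
        # Step 4: the result of the fresh comparison is the final answer.
        return utils.service_is_up(fresh_service)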

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Shared File Systems Service (Manila) because there has been no activity for 60 days.]

Changed in manila:
status: Incomplete → Expired
kiran pawar (kpdev)
Changed in manila:
assignee: nobody → kiran pawar (kpdev)