Distributed Cloud is very slow to indicate unmanaged subcloud has gone offline

Bug #1854894 reported by Gerry Kopec on 2019-12-03
Affects: StarlingX
Importance: Medium
Assigned to: Bart Wensley

Bug Description

Brief Description
-----------------
Running a system with 10 subclouds in unmanaged/online/complete status. When powering subclouds off prior to deletion, it took many minutes for the availability status to change from online to offline.

For instance, subcloud8 was powered down at Mon Dec 2 21:27:06, but it took 25 minutes to go offline:
2019-12-02 21:52:22.310 886413 INFO dcmanager.manager.subcloud_audit_manager [-] Setting new availability status: offline on subcloud: subcloud8

Then 7 subclouds were powered down simultaneously at Mon Dec 2 22:02:58 UTC 2019, but it took over 80 minutes for all of them to go offline:
2019-12-02 22:53:08.006 886413 INFO dcmanager.manager.subcloud_audit_manager [-] Setting new availability status: offline on subcloud: subcloud3
2019-12-02 22:58:23.654 886413 INFO dcmanager.manager.subcloud_audit_manager [-] Setting new availability status: offline on subcloud: subcloud4
2019-12-02 23:03:39.302 886413 INFO dcmanager.manager.subcloud_audit_manager [-] Setting new availability status: offline on subcloud: subcloud1
2019-12-02 23:08:54.950 886413 INFO dcmanager.manager.subcloud_audit_manager [-] Setting new availability status: offline on subcloud: subcloud5
2019-12-02 23:14:10.598 886413 INFO dcmanager.manager.subcloud_audit_manager [-] Setting new availability status: offline on subcloud: subcloud2
2019-12-02 23:19:26.246 886413 INFO dcmanager.manager.subcloud_audit_manager [-] Setting new availability status: offline on subcloud: subcloud7
2019-12-02 23:24:41.894 886413 INFO dcmanager.manager.subcloud_audit_manager [-] Setting new availability status: offline on subcloud: subcloud6

Severity
--------
Minor

Steps to Reproduce
------------------
Set up a DC system with a number of subclouds and get them into the managed/online/complete/in-sync state.
Then unmanage all subclouds and power them down.

Expected Behavior
------------------
Expecting a change in state from online to offline within a few minutes. Note that the audit timeout is left at its default; it is not overridden in any local conf files:
controller-0:/scratch/dimensioning# grep subcloud_audit /etc/dcmanager/*
controller-0:/scratch/dimensioning# grep subcloud_audit /etc/dcorch/*
controller-0:/scratch/dimensioning#

Actual Behavior
----------------
Takes 25-80 minutes to change state.

Reproducibility
---------------
first attempt

System Configuration
--------------------
All-in-one duplex plus worker, DC system controller

Branch/Pull Time/Commit
-----------------------
2019-11-06_10-52-51

Last Pass
---------
first attempt

Timestamp/Logs
--------------
see Description

Test Activity
-------------
DC engineering

tags: added: stx.distcloud
Ghada Khalil (gkhalil) wrote:

Marking for stx.4.0 -- this should be looked at as a performance improvement

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.4.0
Kevin Smith (kevin.smith.wrs) wrote:

The subcloud audit runs every 3 minutes by default, looping over all subclouds in a single audit interval and running its checks to decide whether each subcloud is online or offline. The checks must fail twice over consecutive audit intervals for a subcloud to be declared offline, so on average a subcloud should take 4.5 minutes to go offline, with a maximum of 6 minutes.
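The expected timing above can be sketched as a small back-of-the-envelope calculation (the interval and failure-count constants reflect the defaults described in this comment):

```python
AUDIT_INTERVAL_MIN = 3.0  # subcloud audit period (default)
FAILURES_REQUIRED = 2     # consecutive failing audits before "offline"

def detection_delay(offset_min: float) -> float:
    """Minutes from power-off to the offline declaration, given that
    the power-off happens offset_min minutes after the last audit."""
    time_to_first_failing_audit = AUDIT_INTERVAL_MIN - offset_min
    return time_to_first_failing_audit + (FAILURES_REQUIRED - 1) * AUDIT_INTERVAL_MIN

# Powered off just before an audit runs: best case.
best = detection_delay(AUDIT_INTERVAL_MIN)   # 3.0 minutes
# Powered off just after an audit runs: worst case.
worst = detection_delay(0.0)                 # 6.0 minutes
avg = (best + worst) / 2                     # 4.5 minutes
```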

The problem, I believe, is that the initial retrieval of the keystone client for a subcloud takes a long time to time out (~4.5 minutes per subcloud, according to the logs below). When all the subclouds are powered down at the same time, this causes a single subcloud audit pass to take > 40 minutes, and the next one > 50 minutes, as the second pass also hits timeouts while informing dcorch of the subcloud state changes when setting each subcloud offline.
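Under that theory, the serial audit loop pays each unreachable subcloud's keystone timeout in full, which lines up with the observed pass durations. A rough sketch (the ~4.5-minute timeout is the figure estimated from the logs in this comment; 10 matches the subcloud count in this lab):

```python
# Rough model of one serial audit pass when every subcloud is down.
KEYSTONE_TIMEOUT_MIN = 4.5  # per-subcloud keystone connect timeout (observed)
SUBCLOUDS_DOWN = 10

# The audit visits subclouds one at a time, so each unreachable
# subcloud's keystone timeout is paid in full before the next check.
audit_pass_minutes = SUBCLOUDS_DOWN * KEYSTONE_TIMEOUT_MIN  # ~45 minutes
```

A 45-minute pass matches the "> 40 minutes" estimate above, and explains why offline declarations trickled out at roughly 5-minute spacing in the original report.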

First pass of audit logs (snipped):
2019-12-04 18:39:18.647 728803 INFO dcmanager.manager.subcloud_audit_manager [-] Triggered subcloud audit.
2019-12-04 18:39:59.172 729366 INFO eventlet.wsgi.server [req-6131930f-560b-448e-a1ea-5536d9e0de13 a7a1ad96b9a341e28cd1b92f3b836a6b - - default default] fd01:6::2 "GET /v1.0/subclouds/ HTTP/1.1" status: 200 len: 7380 time: 0.0553961
2019-12-04 18:41:26.310 728803 WARNING keystoneauth.identity.generic.base [-] Failed to discover available identity versions when contacting http://[fd01:7::2]:5000/v3. Attempting to parse version from URL.: ConnectFailure: Unable to establish connection to http://[fd01:7::2]:5000/v3: HTTPConnectionPool(host='fd01:7::2', port=5000): Max retries exceeded with url: /v3 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fbeb647fa50>: Failed to establish a new connection: [Errno 110] ETIMEDOUT',))
2019-12-04 18:42:45.896 728803 INFO dcmanager.manager.patch_audit_manager [-] Triggered patch audit.
2019-12-04 18:43:33.542 728803 ERROR dcmanager.manager.subcloud_audit_manager [-] Identity or Platform endpoint for online subcloud: subcloud1 not found.: ConnectFailure: Unable to establish connection to http://[fd01:7::2]:5000/v3/auth/tokens: HTTPConnectionPool(host='fd01:7::2', port=5000): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fbeb647f790>: Failed to establish a new connection: [Errno 110] ETIMEDOUT',))
2019-12-04 18:45:41.158 728803 WARNING keystoneauth.identity.generic.base [-] Failed to discover available identity versions when contacting http://[fd01:8::2]:5000/v3. Attempting to parse version from URL.: ConnectFailure: Unable to establish connection to http://[fd01:8::2]:5000/v3: HTTPConnectionPool(host='fd01:8::2', port=5000): Max retries exceeded with url: /v3 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fbeb6991a90>: Failed to establish a new connection: [Errno 110] ETIMEDOUT',))
2019-12-04 18:47:45.917 728803 INFO dcmanager.manager.patch_audit_manager [-] Triggered patch audit.
2019-12-04 18:47:48.390 728803 ERROR dcmanager.manager.subcloud_audit_manager [-] Identity or Platform endpoint for online subcloud: subcloud2 not found.: ConnectFailure: Unable to establish connection to http://[f...

Changed in starlingx:
assignee: nobody → Bart Wensley (bartwensley)

Fix proposed to branch: master
Review: https://review.opendev.org/703478

Changed in starlingx:
status: Triaged → In Progress

Reviewed: https://review.opendev.org/703478
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=1b5372b52cd58042e9623d6e9e076fc2d430f312
Submitter: Zuul
Branch: master

commit 1b5372b52cd58042e9623d6e9e076fc2d430f312
Author: Bart Wensley <email address hidden>
Date: Mon Jan 20 12:28:20 2020 -0600

    Configure keystone http timeouts for dcmanager

    The dcmanager currently has no http_connect_timeout set for
    keystone connections. That can result in an attempt to contact
    keystone (e.g. to get a token) taking several minutes to
    timeout if the keystone api is not reachable (e.g. if a subcloud
    is powered down).

    Changing the http_connect_timeout to 10s and configuring
    http_request_max_retries as 3 (that is also the default but
    adding this to the puppet module allows for easy changes in the
    future).

    Change-Id: I6a62846e7f4e75e9b2f0705f59818243ea909e41
    Partial-Bug: 1854894
    Signed-off-by: Bart Wensley <email address hidden>
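A minimal sketch of the resulting dcmanager.conf fragment (the section name and comments are assumptions for illustration; only the two option names and values come from the commit message above):

```ini
# Hypothetical dcmanager.conf fragment -- section name is illustrative.
[keystone_authtoken]
# Fail fast when a subcloud's keystone API is unreachable
# (previously unset, so a TCP ETIMEDOUT could take minutes).
http_connect_timeout = 10
# 3 is also the keystoneauth default; set explicitly so it can be
# tuned easily later. Worst case per unreachable subcloud drops to
# tens of seconds instead of several minutes.
http_request_max_retries = 3
```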

