Distributed Cloud - Unable to unmanage a subcloud that has gone offline

Bug #1860999 reported by Tee Ngo
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Bart Wensley

Bug Description

Brief Description
-----------------
Occasionally, a subcloud that has been shut down cannot be unmanaged and deleted.

Severity
--------
Critical - with this bug, one is unable to unmanage and delete an offline subcloud

Steps to Reproduce
------------------
Shut down the subcloud.
Run the dcmanager subcloud unmanage <subcloud-name> command to unmanage the subcloud.

Expected Behavior
------------------
The offline subcloud can be unmanaged and deleted.

Actual Behavior
----------------
[root@controller-0 ~(keystone_admin)]# dcmanager subcloud unmanage 8
Unable to update subcloud
ERROR (app) Unable to unmanage subcloud 8

The issue is that the dcmanager is attempting to tell dcorch that the subcloud is offline, but the RPC times out (after 60 seconds):

2020-01-27 15:14:11.982 1860648 INFO dcmanager.manager.service [req-c5070101-aa47-443f-839a-a2ae1534d27c a981317518794800b4623fa6914d66bc - - default default] Handling update_subcloud request for: 8
2020-01-27 15:14:11.982 1860648 INFO dcmanager.manager.subcloud_manager [req-c5070101-aa47-443f-839a-a2ae1534d27c a981317518794800b4623fa6914d66bc - - default default] Updating subcloud 8.
2020-01-27 15:15:11.981 1860944 ERROR dcmanager.api.controllers.v1.subclouds [req-c5070101-aa47-443f-839a-a2ae1534d27c a981317518794800b4623fa6914d66bc - - default default] Timed out waiting for a reply to message ID b7acc016d13e4c48a4a523043ed1a0e3: MessagingTimeout: Timed out waiting for a reply to message ID b7acc016d13e4c48a4a523043ed1a0e3
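
For context, the dcmanager side of this exchange is a blocking oslo.messaging call with a 60-second timeout; a minimal sketch of the pattern (topic, method parameters and context are illustrative, not the actual dcmanager RPC client code):

    import oslo_messaging
    from oslo_config import cfg

    # Sketch only: shows how a blocking RPC call with a 60s timeout produces
    # the MessagingTimeout seen above. Topic and parameter names are assumptions.
    transport = oslo_messaging.get_rpc_transport(cfg.CONF)
    target = oslo_messaging.Target(topic='dcorch-engine', version='1.0')
    client = oslo_messaging.RPCClient(transport, target, timeout=60)

    def update_subcloud_states(ctxt, subcloud_name, management_state,
                               availability_status):
        # call() blocks until dcorch replies; if dcorch is busy with a slow
        # identity sync, no reply arrives within 60s and MessagingTimeout
        # is raised back up to the dcmanager API.
        return client.call(ctxt, 'update_subcloud_states',
                           subcloud_name=subcloud_name,
                           management_state=management_state,
                           availability_status=availability_status)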

The problem was introduced into dcorch by the fernet key syncing code. When the subcloud is unmanaged, the dcorch attempts to reset the fernet keys in the subcloud (see update_subcloud_states in dcorch/engine/service.py). This attempts to get a keystone client for the subcloud:

2020-01-27 15:14:12.000 3352435 INFO dcorch.engine.generic_sync_manager [req-c5070101-aa47-443f-839a-a2ae1534d27c a981317518794800b4623fa6914d66bc - - default default] disabling subcloud subcloud10
2020-01-27 15:14:12.004 3352435 INFO dcorch.drivers.openstack.sdk_platform [req-c5070101-aa47-443f-839a-a2ae1534d27c a981317518794800b4623fa6914d66bc - - default default] get new keystone client for subcloud subcloud10

However, since the subcloud is unreachable, the attempt to connect to keystone eventually fails with a timeout, but that takes more than four minutes:

2020-01-27 15:18:26.769 3352435 INFO dcorch.engine.fernet_key_manager [req-c5070101-aa47-443f-839a-a2ae1534d27c a981317518794800b4623fa6914d66bc - - default default] Fail to update fernet repo subcloud: subcloud10, Unable to establish connection to http://[fd01:10::2]:5000/v3/auth/tokens: HTTPConnectionPool(host='fd01:10::2', port=5000): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe08239fe50>: Failed to establish a new connection: [Errno 110] ETIMEDOUT',))
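
The four-minute delay is just the default TCP connect behaviour; keystoneauth1 only gives up sooner if an explicit timeout is set when the session is built. A rough sketch of the idea (endpoint and credentials are placeholders, not the actual dcorch sdk_platform code):

    from keystoneauth1 import session
    from keystoneauth1.identity import v3
    from keystoneclient.v3 import client as ks_client

    # Placeholder credentials/endpoint; real values come from the subcloud
    # endpoint cache.
    auth = v3.Password(auth_url='http://[fd01:10::2]:5000/v3',
                       username='admin', password='secret',
                       project_name='admin',
                       user_domain_name='Default',
                       project_domain_name='Default')

    # timeout (seconds) bounds the connect/read attempts; without it an
    # unreachable subcloud only fails once the kernel's TCP retries are
    # exhausted (the >4 minute ETIMEDOUT seen above).
    sess = session.Session(auth=auth, timeout=10)
    keystone = ks_client.Client(session=sess)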

This probably needs to be fixed by configuring the timeout here so that the connection times out much more quickly. With the recent fix Barton Wensley made (https://opendev.org/starlingx/distcloud/commit/86d536ac52efb3cdbb5430ac88ccc38384194a9c), this may be as simple as setting the http_connect_timeout and http_request_max_retries options in /etc/dcorch/dcorch.conf as Bart did in /etc/dcmanager/dcmanager.conf.
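
Something along these lines in /etc/dcorch/dcorch.conf (the config group and the values shown are assumptions for illustration; the group and settings should mirror the ones used by the dcmanager change above):

    [endpoint_cache]
    # Illustrative values: fail the keystone connection quickly rather than
    # waiting out kernel TCP retries, and don't retry an unreachable subcloud.
    http_connect_timeout = 15
    http_request_max_retries = 0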

Reproducibility
---------------
Intermittent

System Configuration
--------------------
IPv6 Distributed Cloud

Branch/Pull Time/Commit
-----------------------
Jan. 21st master

Last Pass
---------
This is an intermittent issue.

Timestamp/Logs
--------------
Attach the logs for debugging (use attachments in Launchpad or for large collect files use: https://files.starlingx.kube.cengn.ca/)
Provide a snippet of logs here and the timestamp when issue was seen.
Please indicate the unique identifier in the logs to highlight the problem

Test Activity
-------------
Evaluation

Workaround
----------
Setting the http_connect_timeout and http_request_max_retries in /etc/dcorch/dcorch.conf as done in /etc/dcmanager/dcmanager.conf

Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → High
tags: added: stx.4.0 stx.distcloud
Changed in starlingx:
status: New → Triaged
assignee: nobody → Dariush Eslimi (deslimi)
Changed in starlingx:
assignee: Dariush Eslimi (deslimi) → Bart Wensley (bartwensley)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/707258

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/707258
Committed: https://git.openstack.org/cgit/starlingx/distcloud/commit/?id=0389c7fbb1630988acd385140c9fc16835aae090
Submitter: Zuul
Branch: master

commit 0389c7fbb1630988acd385140c9fc16835aae090
Author: Bart Wensley <email address hidden>
Date: Tue Feb 11 15:21:09 2020 -0600

    Fix subcloud manage/unmanage issues caused by identity sync

    Recently identity (keystone) sync functionality was added to the
    dcorch. This changed the behaviour of the update_subcloud_states
    RPC. The dcmanager expects this RPC to be handled quickly and
    a reply sent almost immediately (timeout is 60s). Instead, the
    dcorch is now performing an identity sync when handling this
    RPC, which involves sending multiple messages to a subcloud and
    waiting for replies. This causes the update_subcloud_states RPC
    to time out sometimes (especially if a subcloud is unreachable)
    and the dcmanager/dcorch states to get out of sync, with no
    recovery mechanism in place.

    To fix this, I have created a new initial sync manager in the
    dcorch. When the dcorch handles the update_subcloud_states RPC,
    it will now just update the subcloud to indicate that an initial
    sync is required and then reply to the RPC immediately. The
    initial sync manager will perform the initial sync in the
    background (separate greenthreads) and enable the subcloud when
    it has completed. I also enhanced the dcmanager subcloud audit
    to periodically send a state update for each subcloud to the
    dcorch, which will correct any state mismatches that might
    occur.

    Change-Id: I70b98d432c3ed56b9532117f69f02d4a0cff5742
    Closes-Bug: 1860999
    Closes-Bug: 1861157
    Signed-off-by: Bart Wensley <email address hidden>
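
A rough sketch of the decoupling described in the commit message above (class, method and db_api names are illustrative, not the actual dcorch code):

    import eventlet

    class InitialSyncManager(object):
        """Flag subclouds on state updates; do the slow sync in the background."""

        def __init__(self, db_api):
            self.db_api = db_api                  # hypothetical persistence layer
            self.pool = eventlet.GreenPool(size=10)

        def subcloud_state_updated(self, ctxt, subcloud_name):
            # Called from the update_subcloud_states RPC handler: just record
            # that an initial sync is required and return, so the RPC reply
            # goes back to dcmanager immediately (well under the 60s timeout).
            self.db_api.set_initial_sync_state(subcloud_name, 'required')

        def run(self, ctxt):
            # Background loop: pick up flagged subclouds and sync each one in
            # its own greenthread so slow or unreachable subclouds don't block
            # RPC handling or each other.
            while True:
                for name in self.db_api.subclouds_requiring_initial_sync():
                    self.db_api.set_initial_sync_state(name, 'in-progress')
                    self.pool.spawn_n(self._initial_sync, ctxt, name)
                eventlet.sleep(10)

        def _initial_sync(self, ctxt, subcloud_name):
            # Slow work (identity sync, fernet key distribution) happens here;
            # the subcloud is enabled once it completes.
            self.db_api.set_initial_sync_state(subcloud_name, 'completed')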

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/716140

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (f/centos8)

Reviewed: https://review.opendev.org/716140
Committed: https://git.openstack.org/cgit/starlingx/distcloud/commit/?id=04b49dd093ab850f4520cdb85638221120dd7568
Submitter: Zuul
Branch: f/centos8

commit 25c9d6ed3861f2d783404fcf84b186441ab9cd4d
Author: albailey <email address hidden>
Date: Wed Mar 25 15:43:32 2020 -0500

    Removing ddt from unit tests

    This cleanup should assist in transitioning to
    stestr and fixtures, as well as py3 support.

    The ddt data is primarily unused, only subcloud, route
    and endpoints were being loaded.

    The information in the data files was out of date,
    and not necessarily matching the current product model.

    Story: 2004515
    Task: 39160
    Change-Id: Iddd7ed4664b0d59dbc58aae5c3fedd74c9a138c0
    Signed-off-by: albailey <email address hidden>

commit 7f3827f24d2fb3cb546d3caf71d505d23187b0dc
Author: Tao Liu <email address hidden>
Date: Thu Mar 12 09:46:29 2020 -0400

    Keystone token and resource caching

    Add the following misc. changes to dcorch and dcmanager components:
    - Cache the master resource in dcorch audit
    - Consolidate the openstack drivers to common module, combine the
      dcmanager and dcorch sysinv client. (Note: the sdk driver that
      used by nova, neutron and cinder will be cleaned as part of
      story 2006588).
    - Update the common sdk driver:
      . in order to avoid creating new keystone client multiple times
      . to add a option for caching region clients, in addition to the
        keystone client
      . finally, to randomize the token early renewal duration
    - Change subcloud audit manager, patch audit manager,
      and sw update manager to:
      utilize the sdk driver which caches the keystone client and token

    Test cases:
    1. Manage/unmanage subclouds
    2. Platform resources sync and audit
    3. Verify the keystone token is cached until the token is
       expired
    4. Add/delete subclouds
    5. Managed subcloud goes offline/online (power off/on)
    6. Managed subcloud goes offline/online (delete/add a static route)
    7. Apply a patch to all subclouds via patch Orchestration

    Story: 2007267
    Task: 38865

    Change-Id: I75e0cf66a797a65faf75e7c64dafb07f54c2df06
    Signed-off-by: Tao Liu <email address hidden>

commit 3a1bf60caddfa2e807d4f5996ff94fea7dde5477
Author: Jessica Castelino <email address hidden>
Date: Wed Mar 11 16:23:21 2020 -0400

    Cleanup subcloud details when subcloud add fails

    Failure during add subcloud prevents subcloud from being added again
    with the same name as the subcloud details are not cleaned up
    properly. Fixes have been added for proper cleanup of dcorch database
    tables, ansible subcloud inventory files, keystone endpoints, keystone
    region, and addn_hosts_dc file when failure is encountered.

    Test cases:
    1. Add subcloud
    2. Add subcloud with "--deploy-playbook"
    3. Delete subcloud
    4. Raise explicit exception in dcorch/objects/subcloud.py
    5. Raise explicit exception in dcmanager/manager/subcloud_manager.py

    Change-Id: Iedf172c3e9c3c4bdb9b9482dc5d46f072b3ccf61
    ...

tags: added: in-f-centos8