DC Subcloud stayed out-of-synch for long period of time

Bug #1877584 reported by Nimalini Rasa
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Gerry Kopec

Bug Description

Brief Description
-----------------
After the subcloud was bootstrapped and managed, it stayed out of sync for a long time: almost 2 hours after the subcloud was added, and 1 hour after it was managed.

Severity
--------
Major

Steps to Reproduce
------------------
Bringing up DC

Expected Behavior
------------------
Expected the subcloud to become in-sync within a short time.

Actual Behavior
----------------
The subcloud took an unusually long time to become in-sync.

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Two-node system, IPv6, DC subcloud

Branch/Pull Time/Commit
-----------------------
2020-05-07

Last Pass
---------
N/A

Timestamp/Logs
--------------
2020-05-08 11:14:05.466 subcloud added

2020-05-08 11:44:40.652 116306 ERROR dccommon.drivers.openstack.sdk_platform [-] keystone_client region subcloud5 error: Unable to establish connection to https://[fd01:4::2]:5001/v3/auth/tokens: HTTPSConnectionPool(host='fd01:4::2', port=5001): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f56db5c31d0>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',)): ConnectFailure: Unable to establish connection to https://[fd01:4::2]:5001/v3/auth/tokens: HTTPSConnectionPool(host='fd01:4::2', port=5001): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f56db5c31d0>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))
2020-05-08 11:44:40.652 116306 INFO dcmanager.manager.subcloud_audit_manager [-] Identity or Platform endpoint for subcloud5 not found, ignoring for offline subcloud.
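
To put the two timestamps above in perspective, the gap between the subcloud being added and the refused keystone connection can be computed directly. This is only a small sketch using the values copied from the log excerpt; the helper name `elapsed_minutes` is made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the log excerpt above.
FMT = "%Y-%m-%d %H:%M:%S.%f"
added = datetime.strptime("2020-05-08 11:14:05.466", FMT)
refused = datetime.strptime("2020-05-08 11:44:40.652", FMT)


def elapsed_minutes(start, end):
    """Return the time between two datetimes, in minutes."""
    return (end - start).total_seconds() / 60.0


# Roughly half an hour after the add, the connection was still refused.
print(round(elapsed_minutes(added, refused), 1))
```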

Test Activity
-------------
Regression Testing

Nimalini Rasa (nrasa) wrote :
description: updated
Bart Wensley (bartwensley) wrote :

This is the same issue recently reported in https://bugs.launchpad.net/starlingx/+bug/1877419. See that bug for some analysis.

tags: added: stx.distcloud
Changed in starlingx:
assignee: nobody → Gerry Kopec (gerry-kopec)
Ghada Khalil (gkhalil) wrote :

stx.4.0 / high priority - based on the description, the subclouds are not online for more than an hour

Changed in starlingx:
importance: Undecided → High
status: New → Triaged
tags: added: stx.4.0
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/730613

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/730615

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/730616

OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/730613
Committed: https://git.openstack.org/cgit/starlingx/distcloud/commit/?id=566076a0a8808ee958f5edb4fb9333f426bf1490
Submitter: Zuul
Branch: master

commit 566076a0a8808ee958f5edb4fb9333f426bf1490
Author: Gerry Kopec <email address hidden>
Date: Wed May 20 02:30:03 2020 -0400

    Fix slow subcloud manage due to sysinv auth

    Synchronize keystone services project and sysinv user ids between the
    system controller and subcloud during subcloud bootstrap. This fixes
    issue where the subcloud would go offline on initial manage for 1 hour
    until keystone token expires.

    Change-Id: I2b36861df0858920eed308bd8e9a12b49b68a191
    Closes-Bug: 1877419
    Closes-Bug: 1877584
    Signed-off-by: Gerry Kopec <email address hidden>
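
The commit message above describes keeping the keystone services project and sysinv user IDs identical on the system controller and the subcloud. As a rough illustration only (not the code from the review), a mismatch check over two name-to-ID mappings could look like the sketch below. The function name `find_id_mismatches` and the sample IDs are hypothetical, and no real keystone API is called.

```python
def find_id_mismatches(controller_ids, subcloud_ids):
    """Return {name: (controller_id, subcloud_id)} for entries that differ.

    Both arguments are plain dicts mapping a keystone project or user
    name to its ID. A name missing on the subcloud is reported with None.
    """
    mismatches = {}
    for name, controller_id in controller_ids.items():
        subcloud_id = subcloud_ids.get(name)
        if subcloud_id != controller_id:
            mismatches[name] = (controller_id, subcloud_id)
    return mismatches


# Made-up IDs for illustration only.
controller = {"services": "aaa111", "sysinv": "bbb222"}
subcloud = {"services": "aaa111", "sysinv": "ccc333"}
print(find_id_mismatches(controller, subcloud))
```

An empty result would mean the two sides already agree on every listed ID.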

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/730615
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=d9e622e3cd5009c235bdc88dfb5de1cfd8580591
Submitter: Zuul
Branch: master

commit d9e622e3cd5009c235bdc88dfb5de1cfd8580591
Author: Gerry Kopec <email address hidden>
Date: Mon May 25 02:39:56 2020 -0400

    Support sync of services and sysinv id for subcloud

    Update keystone and sysinv bootstrap manifests to update services
    project and sysinv user id and associated assignments in keystone
    database on subclouds to match system controller. This prevents
    subcloud sysinv keystone tokens from being invalidated during initial
    subcloud sync causing long delays in subcloud going in sync with system
    controller.

    Change-Id: I4e0a8efea7d197d6963623f05fc865f47d02f033
    Partial-Bug: 1877419
    Partial-Bug: 1877584
    Depends-On: https://review.opendev.org/#/c/730613
    Signed-off-by: Gerry Kopec <email address hidden>

OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/730616
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=c4f56624a5fd93377442c0b879eb556028bcb39a
Submitter: Zuul
Branch: master

commit c4f56624a5fd93377442c0b879eb556028bcb39a
Author: Gerry Kopec <email address hidden>
Date: Mon May 25 07:21:04 2020 -0400

    Support sync of services and sysinv id for subcloud

    Add services project and sysinv user ids to subcloud static hieradata.

    Change-Id: If17c6dc8c3f2c776eb539f6b168b926425b6bef0
    Partial-Bug: 1877419
    Partial-Bug: 1877584
    Depends-On: https://review.opendev.org/#/c/730613
    Signed-off-by: Gerry Kopec <email address hidden>

Difu Hu (difuhu) wrote :

This issue is not seen with the latest fix.
Verified on: load 2020-05-29_20-00-00, lab DC-1 wcp_80_91

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/762919

OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/792298

OpenStack Infra (hudson-openstack) wrote : Change abandoned on distcloud (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/792298
Reason: Updated merge soon

OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/793405

OpenStack Infra (hudson-openstack) wrote : Change abandoned on distcloud (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/793405

OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/796528

OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/796528
Committed: https://opendev.org/starlingx/distcloud/commit/4c5344f8765b372cb84d2b1181589c16db2ae6e4
Submitter: "Zuul (22348)"
Branch: f/centos8

commit cb979811017bd193fc1f06e53bb7830fd3184859
Author: Yuxing Jiang <email address hidden>
Date: Wed Jun 9 11:11:27 2021 -0400

    Format the IP addresses in payload before adding a subcloud

    The IPv6 addresses can be represented in multiple formats. As IP
    addresses are stored as text in database, ansible inventory and
    overrides, this commit converts the IP addresses in payload to
    standard text format of IPv6 address during adding a new subcloud.

    Tested with installing and bootstrapping a new subcloud(RVMC
    configured) with the correct IPv6 address values, but with
    unrecommended upper case letters and '0'. The addresses are
    converted to standard format in database, ansible inventory and
    overrides files.

    Partial-Bug: 1931459
    Signed-off-by: Yuxing Jiang <email address hidden>
    Change-Id: I6c26e749941f1ea2597f91886ad8f7da64521f0d
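
The normalization described in the commit above can be illustrated with Python's standard `ipaddress` module, which renders an IPv6 address in compressed, lower-case form. This is a sketch of the general idea, not the code from the distcloud change; the helper name `normalize_ip` is an assumption.

```python
import ipaddress


def normalize_ip(text):
    """Return the canonical text form of an IPv4 or IPv6 address.

    Raises ValueError if the input is not a valid IP address.
    """
    return str(ipaddress.ip_address(text))


# Upper-case letters and redundant zeros collapse to one representation.
print(normalize_ip("FD01:0004:0000:0000:0000:0000:0000:0002"))  # fd01:4::2
```

Storing only the normalized form means that later text comparisons in a database or inventory file do not depend on how the operator typed the address.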

commit 2cf5d6d5cef0808c354f7575336aec34253993b3
Author: albailey <email address hidden>
Date: Thu May 20 14:19:24 2021 -0500

    Delete existing vim strategy from subcloud during patch orch

    When dcmanager creates a patch strategy, if a subcloud has an
    existing vim patch strategy, it will attempt to re-use
    that strategy during its patching phase, which may result in an
    error.

    This commit deletes the existing vim patch strategy in
    a subcloud, if it exists, so it can be re-created.
    If the strategy cannot be deleted, orchestration fails.

    Change-Id: Id35ef26ed3ddae6d71874fc6bac11df147f72323
    Closes-Bug: 1929221
    Signed-off-by: albailey <email address hidden>

commit 9e14c83f0162549a2a94cb8bc1e73dbc4f4d4887
Author: albailey <email address hidden>
Date: Tue Jun 1 14:37:14 2021 -0500

    Adding activation retry to upgrade orchestration

    When performing an activation, the keystone endpoints may not
    be accessible in the subcloud due to the asyncronous way that
    cert-mon can trigger a restart of keystone.

    This would have occasionally resulted in the upgrade activation
    failing to be initiated, and orchestration needing to be invoked
    again to resume.

    This 'hack' adds retries and sleeps to the initial
    activation action.

    Change-Id: Ic757521dec7bdc248a51a70b5463caafe7927360
    Partial-Bug: 1927550
    Signed-off-by: albailey <email address hidden>
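
The retry-and-sleep approach mentioned in the commit above follows a common pattern, sketched here in generic form. The name `retry_with_sleep` and its parameters are hypothetical and are not taken from the distcloud code; a real caller would likely catch a narrower exception type than `Exception`.

```python
import time


def retry_with_sleep(action, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call action() until it succeeds or attempts run out.

    Sleeps delay_seconds between failed attempts and re-raises the last
    error if every attempt fails. The sleep function is injectable so
    the helper can be exercised without real delays.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:  # assumption: any failure is retryable
            last_error = exc
            if attempt < attempts:
                sleep(delay_seconds)
    raise last_error
```

A caller would pass the operation as a zero-argument callable, for example `retry_with_sleep(lambda: client.activate(), attempts=5, delay_seconds=10)`, where `client.activate` stands in for whatever call may fail while keystone restarts.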

commit bb604c0a9b872efd65fa45f1e2269995818c6262
Author: Tee Ngo <email address hidden>
Date: Thu May 27 22:17:16 2021 -0400

    Fix subcloud show --detail command related issues

    If the subcloud is offline, the command stalls and eventually returns
    the "ERROR (app)" output. If the subcloud is online, the oam_floating_ip
    info is excluded from the output when the subcloud id instead of subcloud
    name is specified.

    This commit fixes both of the above issues.

    Closes-Bug: 1929893
    Change-Id: I995591368564539b0e6af185b1adba2db73e0e46
    Sign...

tags: added: in-f-centos8