DC subcloud always showing online after subcloud powered off

Bug #1884560 reported by Peng Peng
This bug affects 2 people

Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Gerry Kopec

Bug Description

Brief Description
-----------------
An SX subcloud showed "managed/online/in-sync". The subcloud was unmanaged and its controller was then powered off. More than 10 minutes later, "dcmanager subcloud list" still showed the subcloud as online.

Severity
--------
Major

Steps to Reproduce
------------------
Unmanage the subcloud
Power off the subcloud controller
Wait for 10 minutes
Check "dcmanager subcloud list"

Expected Behavior
------------------
Subcloud shows offline

Actual Behavior
----------------
Subcloud shows online

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Distributed Cloud

Lab-name: DC-3

Branch/Pull Time/Commit
-----------------------
BUILD_ID="2020-06-20_14-11-05"

Last Pass
---------
2020-06-18_20-00-00

Timestamp/Logs
--------------
[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud list
+----+-----------+------------+--------------+---------------+-------------+
| id | name      | management | availability | deploy status | sync        |
+----+-----------+------------+--------------+---------------+-------------+
| 1  | subcloud1 | managed    | online       | complete      | in-sync     |
| 6  | subcloud2 | managed    | online       | complete      | out-of-sync |
+----+-----------+------------+--------------+---------------+-------------+
[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud unmanage 1
+-----------------------------+----------------------------+
| Field                       | Value                      |
+-----------------------------+----------------------------+
| id                          | 1                          |
| name                        | subcloud1                  |
| description                 | None                       |
| location                    | None                       |
| software_version            | 20.06                      |
| management                  | unmanaged                  |
| availability                | online                     |
| deploy_status               | complete                   |
| management_subnet           | fd01:12::0/64              |
| management_start_ip         | fd01:12::2                 |
| management_end_ip           | fd01:12::11                |
| management_gateway_ip       | fd01:12::1                 |
| systemcontroller_gateway_ip | fd01:11::1                 |
| group_id                    | 1                          |
| created_at                  | 2020-06-21T05:02:44.550936 |
| updated_at                  | 2020-06-22T14:40:37.355855 |
+-----------------------------+----------------------------+

[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud list
+----+-----------+------------+--------------+---------------+-------------+
| id | name      | management | availability | deploy status | sync        |
+----+-----------+------------+--------------+---------------+-------------+
| 1  | subcloud1 | unmanaged  | online       | complete      | unknown     |
| 6  | subcloud2 | managed    | online       | complete      | out-of-sync |
+----+-----------+------------+--------------+---------------+-------------+
[sysadmin@controller-0 ~(keystone_admin)]$ watch dcmanager subcloud list
[sysadmin@controller-0 ~(keystone_admin)]$ watch dcmanager subcloud list
[sysadmin@controller-0 ~(keystone_admin)]$ ssh subcloud1
ssh: connect to host subcloud1 port 22: No route to host
[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud list
+----+-----------+------------+--------------+---------------+-------------+
| id | name      | management | availability | deploy status | sync        |
+----+-----------+------------+--------------+---------------+-------------+
| 1  | subcloud1 | unmanaged  | online       | complete      | unknown     |
| 6  | subcloud2 | managed    | online       | complete      | out-of-sync |
+----+-----------+------------+--------------+---------------+-------------+

Test Activity
-------------
Regression Testing

Revision history for this message
Peng Peng (ppeng) wrote :
tags: added: stx.retestneeded
tags: added: stx.distcloud
Revision history for this message
Nimalini Rasa (nrasa) wrote :

A similar issue was seen when the mgmt VLAN was brought down on the switch side: all subclouds stayed online and in-sync.
Load: 2020-06-11_20-00-00

[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud list
+-----+-------------+------------+--------------+---------------+---------+
| id  | name        | management | availability | deploy status | sync    |
+-----+-------------+------------+--------------+---------------+---------+
| 361 | subcloud31  | managed    | online       | complete      | in-sync |
| 362 | subcloud32  | managed    | online       | complete      | in-sync |
| 363 | subcloud33  | managed    | online       | complete      | in-sync |
| 364 | subcloud34  | managed    | online       | complete      | in-sync |
| 367 | subcloud37  | managed    | online       | complete      | in-sync |
| 368 | subcloud38  | managed    | online       | complete      | in-sync |
| 369 | subcloud39  | managed    | online       | complete      | in-sync |
| 370 | subcloud40  | managed    | online       | complete      | in-sync |
| 372 | subcloud42  | managed    | online       | complete      | in-sync |
| 374 | subcloud44  | managed    | online       | complete      | in-sync |
| 376 | subcloud46  | managed    | online       | complete      | in-sync |
| 377 | subcloud47  | managed    | online       | complete      | in-sync |
| 378 | subcloud48  | managed    | online       | complete      | in-sync |
| 431 | subcloud1   | managed    | online       | complete      | in-sync |

[sysadmin@controller-0 ~(keystone_admin)]$ tail -f /var/log/dcmanager/dcmanager.log
2020-06-22 15:56:47.511 281916 ERROR dccommon.drivers.openstack.sdk_platform [-] keystone_client region subcloud16 error: Unable to establish connection to https://[fd01:110::2]:5001/v3/auth/tokens: HTTPSConnectionPool(host='fd01:110::2', port=5001): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f4af67e1a50>: Failed to establish a new connection: [Errno 113] EHOSTUNREACH',)): ConnectFailure: Unable to establish connection to https://[fd01:110::2]:5001/v3/auth/tokens: HTTPSConnectionPool(host='fd01:110::2', port=5001): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f4af67e1a50>: Failed to establish a new connection: [Errno 113] EHOSTUNREACH',))
2020-06-22 15:56:47.511 281916 ERROR dcmanager.audit.subcloud_audit_manager [-] Identity or Platform endpoint for online subcloud: subcloud16 not found.: ConnectFailure: Unable to establish connection to https://[fd01:110::2]:5001/v3/auth/tokens: HTTPSConnectionPool(host='fd01:110::2', port=5001): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f4af67e1a50>: Failed to establish a new connection: [Errno 113] EHOSTUNREACH',))
2020-06-22 15:56:47.515 281503 INFO dcmanager.manager.service [req-c3a5246c-f971-4c09-9ed2-a6ea0a1a10d5 - - - - -] Handling update_subcloud_availability request for: subcloud16
2020-06-22 15:56:47.538 281916 INFO dcmanager.audit.subcloud_audit_manager [-] Notifying dcmanag...


Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / high priority - issue appears to be newly introduced. Last pass was 2 days ago. Needs further investigation.

tags: added: stx.4.0
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Gerry Kopec (gerry-kopec)
Revision history for this message
Nimalini Rasa (nrasa) wrote :
Revision history for this message
Bart Wensley (bartwensley) wrote :

The subcloud audit has stalled completely. For 1884560 I see the last logs from the subcloud audit here:
2020-06-21 20:39:44.917 126591 INFO dcmanager.audit.subcloud_audit_manager [-] Triggered subcloud audit.
2020-06-21 20:39:44.924 126591 INFO dcmanager.audit.subcloud_audit_manager [-] Waiting for subcloud audits to complete.

Immediately before this, the subcloud was unmanaged:
2020-06-21 20:39:25.642 126274 INFO dcmanager.manager.service [req-b2be186a-ace3-4e0f-a7a5-54a85b841008 6f8a527c496342c9a3ab6f3fbc5b12ee - - default default] Handling update_subcloud request for: 6
2020-06-21 20:39:25.642 126274 INFO dcmanager.manager.subcloud_manager [req-b2be186a-ace3-4e0f-a7a5-54a85b841008 6f8a527c496342c9a3ab6f3fbc5b12ee - - default default] Updating subcloud 6.
2020-06-21 20:39:25.668 126274 INFO dcmanager.manager.subcloud_manager [req-b2be186a-ace3-4e0f-a7a5-54a85b841008 6f8a527c496342c9a3ab6f3fbc5b12ee - - default default] Notifying dcorch, subcloud:subcloud2 management: unmanaged, availability:online
2020-06-21 20:39:25.680 126274 INFO dcmanager.manager.subcloud_manager [req-b2be186a-ace3-4e0f-a7a5-54a85b841008 6f8a527c496342c9a3ab6f3fbc5b12ee - - default default] Updating all subclouds, endpoint: None sync: unknown

The logs indicate the subcloud audit is in the following loop waiting for all audit greenthreads to complete:
        # Wait for all greenthreads to complete
        LOG.info('Waiting for subcloud audits to complete.')
        for thread in self.subcloud_workers.values():
            thread.wait()

This means one of the audit threads has not returned (or has raised an exception). I did some digging and I think the problem is the sysinv API – we are using the cgtsclient Client class and we are creating it without specifying a timeout. This class creates an HTTPClient (from httplib2) without a timeout, and the httplib2 code indicates this results in the default socket timeout being used, which is "no timeout".
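
For context on that httplib2 fallback, here is a small standalone illustration (the IPv6 address is just the unreachable subcloud endpoint from the logs above; any dead host behaves the same):

    import socket

    # Python's default socket timeout is None, i.e. block forever; an HTTP
    # client created without an explicit timeout inherits this behaviour.
    print(socket.getdefaulttimeout())  # -> None

    # With an explicit timeout, the same connection attempt to a dead host
    # becomes a catchable error instead of a permanent hang.
    sock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    sock.settimeout(30)  # seconds
    try:
        sock.connect(('fd01:110::2', 5001))  # subcloud address from the logs
    except (socket.timeout, OSError) as exc:
        print('connect failed instead of hanging: %s' % exc)
    finally:
        sock.close()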

So what happens is this:
- Subcloud audit is happily running and has cached the subcloud keystone/sysinv clients.
- Subcloud is unmanaged and immediately powered down.
- Subcloud audit runs again and uses the cached clients (this is key because there are timeouts when creating a new keystone client).
- Subcloud audit uses cgtsclient to attempt to get the service groups from the subcloud. This hangs forever.
- Subcloud audit never runs again.

This means the subcloud will never go offline (as reported in the LP). I think this problem only started when the subcloud audit was restructured and moved into a separate process – I don’t think we were caching the subcloud clients before then, so we would have always timed out when trying to create the client.

I think the solution here is to supply a "reasonable" timeout when creating the cgtsclient – this needs to be pretty short (maybe 30s or less) because it blocks the next subcloud audit cycle. We also need to look at all the other clients the subcloud audit is using (e.g. patching) and ensure they all have a timeout implemented.
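
A minimal sketch of that direction (the cgtsclient module path and constructor keywords here are assumptions modeled on other OpenStack-style clients, not the verbatim fix; the actual change is in the review linked below):

    # Sketch only: names marked "hypothetical" are not from the real code.
    from cgtsclient.v1 import client as cgts_client

    SYSINV_REQUEST_TIMEOUT = 30  # seconds; short enough that one dead
                                 # subcloud cannot stall the audit cycle

    def make_sysinv_client(endpoint, token):
        """Build a subcloud sysinv client that fails fast instead of hanging."""
        return cgts_client.Client(
            endpoint=endpoint,               # subcloud sysinv URL (hypothetical)
            token=token,                     # cached keystone token (hypothetical)
            timeout=SYSINV_REQUEST_TIMEOUT,  # the missing piece in this bug
        )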

Finally – it would be good to protect that wait loop above from hanging forever if one of the audit greenthreads gets stuck. Reading through the ev...
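
A hedged sketch of what bounding that wait could look like, as a drop-in variant of the loop quoted above using eventlet's Timeout (the 600-second bound is illustrative; self.subcloud_workers and LOG are the names from the audit code):

        # Assumes "import eventlet" at module level.
        # Wait for all greenthreads to complete, but never forever: bound
        # each wait so one stuck worker cannot stall the next audit cycle.
        LOG.info('Waiting for subcloud audits to complete.')
        for thread in self.subcloud_workers.values():
            try:
                with eventlet.Timeout(600):  # illustrative upper bound
                    thread.wait()
            except eventlet.Timeout:
                LOG.error('A subcloud audit worker timed out; abandoning it '
                          'so the next audit cycle can run.')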


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/738089

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/738089
Committed: https://git.openstack.org/cgit/starlingx/distcloud/commit/?id=8ecba753119da83e80f4933bd24dac311b5979f1
Submitter: Zuul
Branch: master

commit 8ecba753119da83e80f4933bd24dac311b5979f1
Author: Gerry Kopec <email address hidden>
Date: Thu Jun 25 14:16:45 2020 -0400

    Add timeouts to DC sysinv and patching clients

    If a subcloud is powered off or loses communication with the system
    controller while the subcloud audit is waiting for a response to a
    sysinv or patching request, the subcloud audit will hang forever.
    Add some failsafe 10 minute timeouts to ensure the audit can recover.

    Change-Id: I369225119f89ab5a2f05137aba955dda982d51c5
    Closes-Bug: 1884560
    Closes-Bug: 1870413
    Signed-off-by: Gerry Kopec <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

Verified on

BUILD_ID="2020-06-24_22-16-59"

[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud list
+----+------------+------------+--------------+---------------+---------+
| id | name       | management | availability | deploy status | sync    |
+----+------------+------------+--------------+---------------+---------+
| 1  | subcloud2  | managed    | online       | complete      | in-sync |
| 2  | subcloud6  | managed    | online       | complete      | in-sync |
| 3  | subcloud7  | managed    | online       | complete      | in-sync |
| 4  | subcloud8  | managed    | online       | complete      | in-sync |
| 5  | subcloud1  | managed    | online       | complete      | in-sync |
| 6  | subcloud10 | managed    | online       | complete      | in-sync |
| 7  | subcloud3  | managed    | online       | complete      | in-sync |
| 8  | subcloud4  | managed    | online       | complete      | in-sync |
| 9  | subcloud5  | managed    | online       | complete      | in-sync |
| 10 | subcloud9  | unmanaged  | offline      | complete      | unknown |
+----+------------+------------+--------------+---------------+---------+

tags: removed: stx.retestneeded
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/792298

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on distcloud (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/792298
Reason: Updated merge soon

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/793405

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on distcloud (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/793405

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/796528

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/796528
Committed: https://opendev.org/starlingx/distcloud/commit/4c5344f8765b372cb84d2b1181589c16db2ae6e4
Submitter: "Zuul (22348)"
Branch: f/centos8

commit cb979811017bd193fc1f06e53bb7830fd3184859
Author: Yuxing Jiang <email address hidden>
Date: Wed Jun 9 11:11:27 2021 -0400

    Format the IP addresses in payload before adding a subcloud

    The IPv6 addresses can be represented in multiple formats. As IP
    addresses are stored as text in database, ansible inventory and
    overrides, this commit converts the IP addresses in payload to
    standard text format of IPv6 address during adding a new subcloud.

    Tested with installing and bootstrapping a new subcloud(RVMC
    configured) with the correct IPv6 address values, but with
    unrecommended upper case letters and '0'. The addresses are
    converted to standard format in database, ansible inventory and
    overrides files.

    Partial-Bug: 1931459
    Signed-off-by: Yuxing Jiang <email address hidden>
    Change-Id: I6c26e749941f1ea2597f91886ad8f7da64521f0d

commit 2cf5d6d5cef0808c354f7575336aec34253993b3
Author: albailey <email address hidden>
Date: Thu May 20 14:19:24 2021 -0500

    Delete existing vim strategy from subcloud during patch orch

    When dcmanager creates a patch strategy, if a subcloud has an
    existing vim patch strategy, it will attempt to re-use
    that strategy during its patching phase, which may result in an
    error.

    This commit deletes the existing vim patch strategy in
    a subcloud, if it exists, so it can be re-created.
    If the strategy cannot be deleted, orchestration fails.

    Change-Id: Id35ef26ed3ddae6d71874fc6bac11df147f72323
    Closes-Bug: 1929221
    Signed-off-by: albailey <email address hidden>

commit 9e14c83f0162549a2a94cb8bc1e73dbc4f4d4887
Author: albailey <email address hidden>
Date: Tue Jun 1 14:37:14 2021 -0500

    Adding activation retry to upgrade orchestration

    When performing an activation, the keystone endpoints may not
    be accessible in the subcloud due to the asyncronous way that
    cert-mon can trigger a restart of keystone.

    This would have occasionally resulted in the upgrade activation
    failing to be initiated, and orchestration needing to be invoked
    again to resume.

    This 'hack' adds retries and sleeps to the initial
    activation action.

    Change-Id: Ic757521dec7bdc248a51a70b5463caafe7927360
    Partial-Bug: 1927550
    Signed-off-by: albailey <email address hidden>

commit bb604c0a9b872efd65fa45f1e2269995818c6262
Author: Tee Ngo <email address hidden>
Date: Thu May 27 22:17:16 2021 -0400

    Fix subcloud show --detail command related issues

    If the subcloud is offline, the command stalls and eventually returns
    the "ERROR (app)" output. If the subcloud is online, the oam_floating_ip
    info is excluded from the output when the subcloud id instead of subcloud
    name is specified.

    This commit fixes both of the above issues.

    Closes-Bug: 1929893
    Change-Id: I995591368564539b0e6af185b1adba2db73e0e46
    Sign...

tags: added: in-f-centos8