Subcloud is offline for system with k8s certificates at expiry date

Bug #2002452 reported by Carmen Rata
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Carmen Rata
Milestone: (none)

Bug Description

A subcloud with valid certificates (per the show certs output), valid routes (subcloud and systemcontroller can ping each other), and the correct gateway appears offline to the systemcontroller.

Severity

Minor

Steps to Reproduce

Check the subcloud status on a SystemController that has been up for more than a year,
e.g. uptime_days => 382, which exceeds the 1-year validity period of the k8s certificates.
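
The reproduction condition can be written as a simple check. The sketch below is illustrative only: the function name and the 365-day default are assumptions taken from the 1-year validity mentioned above, and nothing here is part of StarlingX.

```python
# Hypothetical helper for illustration; not part of StarlingX.
CERT_VALIDITY_DAYS = 365  # assumed 1-year validity, as stated in this report


def rotation_due_during_uptime(uptime_days, validity_days=CERT_VALIDITY_DAYS):
    """Return True if the host has been up longer than the certificate
    validity period, so at least one rotation must have occurred since boot."""
    return uptime_days > validity_days


if __name__ == "__main__":
    # 382 is the uptime_days value quoted in the report.
    print(rotation_due_during_uptime(382))
```

A system that satisfies this check has long-running services that started before the most recent certificate rotation.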

Expected Behavior

Subcloud is in "online" state. The dcmanager audit should be successful.
K8s certificates have been rotated.

Actual Behavior

dcmanager receives 401 and fails the periodic subcloud audit.

2022-11-04 08:02:51.942 433961 WARNING cgtsclient.common.http [-] Request returned failure status.: HTTPInternalServerError: (401)
2022-11-04 08:02:51.947 433961 ERROR dcmanager.audit.subcloud_audit_manager [-] Error in periodic subcloud audit loop: HTTPInternalServerError: (401)
Reason: Unauthorized

Reproducibility

Happened once.

System Configuration

Distributed cloud.

Logs

Status updates for the subclouds stopped after the following 401 error messages started to appear in audit.log.

First 401 error message:

2022-11-04 03:41:07.685 433961 INFO dcmanager.audit.subcloud_audit_manager [-] Triggered subcloud audit: patch=(True) firmware=(True) kube=(True)
2022-11-04 03:41:12.503 433961 WARNING cgtsclient.common.http [-] Request returned failure status.
2022-11-04 03:41:12.504 433961 ERROR dcmanager.audit.subcloud_audit_manager [-] Error in periodic subcloud audit loop: HTTPInternalServerError: (401)
Reason: Unauthorized
HTTP response headers: HTTPHeaderDict({'Date': 'Fri, 04 Nov 2022 03:41:12 GMT', 'Content-Length': '129', 'Content-Type': 'application/json'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}
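
To locate the first such failure in a large audit.log, a small scan like the one below may help. It is a minimal sketch that assumes the line layout shown in the excerpt above (timestamp, PID, level, logger); the function name and the sample list are hypothetical and not part of dcmanager.

```python
import re

# Matches the timestamp, PID, level and logger at the start of a log line,
# assuming the layout seen in the audit.log excerpt.
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<pid>\d+) (?P<level>[A-Z]+) (?P<logger>\S+) "
)


def first_unauthorized(lines):
    """Return (timestamp, logger) of the first ERROR line containing
    '(401)', or None if no such line exists."""
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("level") == "ERROR" and "(401)" in line:
            return m.group("ts"), m.group("logger")
    return None


# Sample lines copied from the excerpt above.
SAMPLE = [
    "2022-11-04 03:41:07.685 433961 INFO dcmanager.audit.subcloud_audit_manager [-] Triggered subcloud audit: patch=(True) firmware=(True) kube=(True)",
    "2022-11-04 03:41:12.503 433961 WARNING cgtsclient.common.http [-] Request returned failure status.",
    "2022-11-04 03:41:12.504 433961 ERROR dcmanager.audit.subcloud_audit_manager [-] Error in periodic subcloud audit loop: HTTPInternalServerError: (401)",
    "Reason: Unauthorized",
]

if __name__ == "__main__":
    print(first_unauthorized(SAMPLE))
```

Continuation lines such as "Reason: Unauthorized" carry no timestamp, so the scan skips them and reports only the line that opens the error.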

sysinv.log

sysinv 2022-11-04 03:41:12.499 3494410 ERROR wsme.api [-] Server-side error: "(401)
Reason: Unauthorized
HTTP response headers: HTTPHeaderDict({'Date': 'Fri, 04 Nov 2022 03:41:12 GMT', 'Content-Length': '129', 'Content-Type': 'application/json'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}

audit.log

2022-11-04 00:49:49.603 434242 INFO dcmanager.audit.subcloud_audit_worker_manager [-] Identity or Platform endpoint not found, ignoring for offline subcloud.

dcmanager.log

2022-11-05 11:00:59.921 3497389 INFO dcmanager.manager.subcloud_manager
Ignoring subcloud sync_status update for subcloud,
availability:offline management:managed endpoint:dc-cert sync:in-sync

keystone-all.log

2022-11-06 04:41:48.921 2809622 WARNING keystone.server.flask.application
Failed to validate token: TokenNotFound: Failed to validate token
2022-11-06 04:41:48.924 432835 WARNING keystonemiddleware.auth_token [-] Authorization failed for token: InvalidToken: Token authorization failed
2022-11-06 04:41:49.486 2809622 WARNING keystone.server.flask.application
Failed to validate token: TokenNotFound: Failed to validate token

Alarms

Critical alarm: Subcloud is offline.

Workaround

Perform a host swact.

Carmen Rata (crata)
Changed in starlingx:
assignee: nobody → Carmen Rata (crata)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/869782

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
summary: - Subcloud is offline for system with certificates at expiry date
+ Subcloud is offline for system with k8s certificates at expiry date
Ghada Khalil (gkhalil)
tags: added: stx.security
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/869782
Committed: https://opendev.org/starlingx/config/commit/8cd5f76083f21ac0684825bbd9450edda1f9f5ea
Submitter: "Zuul (22348)"
Branch: master

commit 8cd5f76083f21ac0684825bbd9450edda1f9f5ea
Author: Carmen Rata <email address hidden>
Date: Wed Jan 11 03:56:44 2023 +0000

    Fix subcloud going offline if certificates expire

    K8s certificates rotation after they reach the expiry date requires
    restart of sysinv services, both sysinv-conductor and sysinv-inv.
    The sysinv services cache k8s client object and get credentials
    from admin.conf. Restarting only the sysinv-conductor and missing the
    restart of the sysinv api causes the certificates not to be updated,
    which affects subcloud management functionality.
    The fix updates the script "kube-cert-rotation.sh" to restart all
    sysinv services and not only sysinv-conductor.
    The script "kube-cert-rotation.sh" must be installed with
    "700" permissions.

    Tests performed:
    PASS: kube-cert-rotation.sh script gets installed correctly in
    directory /usr/bin and is set with permissions "700".
    PASS: kube-cert-rotation.sh script executes without errors when run
    to renew K8s certificates.
    PASS: After K8s certificates are renewed, all sysinv services get
    restarted.
    PASS: Executed successfully kube-cert-rotation.sh in AIO-SX and DC
    system configurations.

    Closes-Bug: 2002452
    Signed-off-by: Carmen Rata <email address hidden>
    Change-Id: Ie74a47226280b9362558ebfa158a4bf91209e957
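
The root cause described in the commit message, a service caching credentials that later rotate, can be illustrated with a toy model. This is a simplified sketch, not sysinv code: `CachedClient`, `demo`, and the credential strings are hypothetical, and the temporary file merely stands in for admin.conf without reproducing its real format.

```python
import os
import tempfile


class CachedClient:
    """Toy stand-in for a service that reads credentials once and caches them."""

    def __init__(self, conf_path):
        with open(conf_path) as f:
            self._credential = f.read().strip()  # read once, at start-up

    def is_authorized(self, valid_credential):
        # The toy "server" accepts only the currently valid credential.
        return self._credential == valid_credential


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        conf = os.path.join(tmp, "admin.conf")  # stand-in file only
        with open(conf, "w") as f:
            f.write("cert-v1")

        conductor = CachedClient(conf)
        api = CachedClient(conf)

        # Rotation: a new credential is written and the old one stops working.
        with open(conf, "w") as f:
            f.write("cert-v2")

        # Only one of the two services is "restarted" (re-created).
        conductor = CachedClient(conf)

        return conductor.is_authorized("cert-v2"), api.is_authorized("cert-v2")


if __name__ == "__main__":
    print(demo())
```

In this model the restarted client picks up the new credential while the other keeps presenting the old one, which mirrors the 401 responses in the logs and is why the fix restarts all sysinv services.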

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.8.0