platform-integ-apps fails to reach applied state after SX to DX migration

Bug #1970443 reported by Enzo Candotti
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Enzo Candotti

Bug Description

Brief Description

platform-integ-apps fails to reach the applied state after an SX system is migrated to a DX system.

Severity
major

Steps to Reproduce

1) Install the subcloud as SX.
2) Create a "migrate-subcloud-overrides.yaml" file on the central controller.
3) Run migrate_sx_to_dx.yml on the central cloud and verify that there are no errors in the output.

4) When the subcloud is online and managed, log in to the subcloud and verify that SX has been converted to duplex:

[sysadmin@controller-0 ~(keystone_admin)]$ system show
+------------------------+--------------------------------------+
| Property | Value |
+------------------------+--------------------------------------+
| contact | None |
| created_at | 2021-12-05T18:36:08.995159+00:00 |
| description | None |
| distributed_cloud_role | subcloud |
| https_enabled | True |
| latitude | None |
| location | None |
| longitude | None |
| name | dc-subcloud12 |
| region_name | subcloud12 |
| sdn_enabled | False |
| security_feature | spectre_meltdown_v1 |
| service_project_name | services |
| shared_services | [] |
| software_version | 21.12 |
| system_mode | duplex |
| system_type | All-in-one |
| timezone | UTC |
| updated_at | 2021-12-07T19:47:27.308407+00:00 |
| uuid | 0c86a371-c387-4104-9d75-e6948454ffe3 |
| vswitch_type | none |
+------------------------+--------------------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$

5) However, platform-integ-apps fails to reach the applied state and remains stuck at 25% indefinitely:

[sysadmin@controller-0 ~(keystone_admin)]$ system application-list
+--------------------------+---------+-----------------------------------+----------------------------------------+----------+---------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+--------------------------+---------+-----------------------------------+----------------------------------------+----------+---------------------------------------+
| cert-manager | 1.0-25 | cert-manager-manifest | certmanager-manifest.yaml | applied | completed |
| nginx-ingress-controller | 1.1-17 | nginx-ingress-controller-manifest | nginx_ingress_controller_manifest.yaml | applied | completed |
| oidc-auth-apps | 1.0-59 | oidc-auth-manifest | manifest.yaml | applied | completed |
| platform-integ-apps | 1.0-42 | platform-integration-manifest | manifest.yaml | applying | processing chart: stx-rbd-provisioner |
| | | | | | , overall completion: 25.0% |
| | | | | | |
| rook-ceph-apps | 1.0-13 | rook-ceph-manifest | manifest.yaml | uploaded | completed |
| vault | 1.0-22 | vault-manifest | vault_manifest.yaml | applied | completed |
+--------------------------+---------+-----------------------------------+----------------------------------------+----------+---------------------------------------+

When the apply was retried, it failed again.
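For reference, the migration in steps 2 and 3 is launched from the central cloud roughly as follows (the playbook path and overrides file location are illustrative; the actual paths depend on the installation):

```shell
# Run on the central cloud (system controller); paths are illustrative
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/migrate_sx_to_dx.yml \
    -e "@migrate-subcloud-overrides.yaml"
```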

Expected Behavior

After migration, the application should reach the applied state.

Actual Behavior

platform-integ-apps fails to reach the applied state.

Reproducibility

100%

System Configuration
SX subcloud; also seen on standalone AIO-SX.

Branch/Pull Time/Commit
21.12

Last Pass

21.05

Timestamp/Logs

/var/log/armada/platform-integ-apps-apply_2021-12-07-21-12-42.log

2021-12-07 21:42:51.198 178 ERROR armada.handlers.tiller [-] [chart=kube-system-rbd-provisioner]: Error while installing release stx-rbd-provisioner: grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
>---status = StatusCode.UNKNOWN
>---details = "release stx-rbd-provisioner failed: timed out waiting for the condition"
>---debug_error_string = "{"created":"@1638913371.198181944","description":"Error received from peer ipv4:127.0.0.1:24134","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"release stx-rbd-provisioner failed: timed out waiting for the condition","grpc_status":2}"
>
2021-12-07 21:42:51.198 178 ERROR armada.handlers.tiller Traceback (most recent call last):
2021-12-07 21:42:51.198 178 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 465, in install_release
2021-12-07 21:42:51.198 178 ERROR armada.handlers.tiller metadata=self.metadata)
2021-12-07 21:42:51.198 178 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 923, in __call__
2021-12-07 21:42:51.198 178 ERROR armada.handlers.tiller return _end_unary_response_blocking(state, call, False, None)
2021-12-07 21:42:51.198 178 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 826, in _end_unary_response_blocking
2021-12-07 21:42:51.198 178 ERROR armada.handlers.tiller raise _InactiveRpcError(state)
2021-12-07 21:42:51.198 178 ERROR armada.handlers.tiller grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
2021-12-07 21:42:51.198 178 ERROR armada.handlers.tiller >--status = StatusCode.UNKNOWN
2021-12-07 21:42:51.198 178 ERROR armada.handlers.tiller >--details = "release stx-rbd-provisioner failed: timed out waiting for the condition"
2021-12-07 21:42:51.198 178 ERROR armada.handlers.tiller >--debug_error_string = "{"created":"@1638913371.198181944","description":"Error received from peer ipv4:127.0.0.1:24134","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"release stx-rbd-provisioner failed: timed out waiting for the condition","grpc_status":2}"
2021-12-07 21:42:51.198 178 ERROR armada.handlers.tiller >
2021-12-07 21:42:51.198 178 ERROR armada.handlers.tiller
2021-12-07 21:42:51.199 178 DEBUG armada.handlers.tiller [-] [chart=kube-system-rbd-provisioner]: Helm getting release status for release=stx-rbd-provisioner, version=0 get_release_status /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:531

/var/log/pods/kube-system_rbd-provisioner-759dfb8b6b-cfbnf_fedb37bb-ca78-4e48-93b2-9d14d98327da/rbd-provisioner/26.log

2021-12-07T21:44:52.687682317Z stderr F F1207 21:44:52.687211 1 main.go:80] Error getting server version: the server has asked for the client to provide credentials

Alarms
-

Test Activity
Regression

Workaround
-

More info:
The problem might be that the k8s apiserver certificate is missing controller-0's cluster host IP in its SANs.
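One way to check this hypothesis is to dump the certificate's SANs with openssl. On a real controller the certificate to inspect is /etc/kubernetes/pki/apiserver.crt; the sketch below builds a throwaway self-signed certificate with illustrative SAN IPs just to show the inspection command (requires OpenSSL 1.1.1+ for -addext/-ext):

```shell
# Create a throwaway cert with illustrative SAN IPs; it stands in for
# /etc/kubernetes/pki/apiserver.crt on a real controller
openssl req -x509 -newkey rsa:2048 -nodes \
    -keyout /tmp/demo.key -out /tmp/demo.crt -days 1 \
    -subj "/CN=kube-apiserver" \
    -addext "subjectAltName=IP:192.168.206.2,IP:192.168.206.3"

# Print the SANs; in the failure described here, controller-0's
# cluster host IP would be absent from this list
openssl x509 -in /tmp/demo.crt -noout -ext subjectAltName
```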

description: updated
Changed in starlingx:
assignee: nobody → Enzo Candotti (ecandotti)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/839394

Changed in starlingx:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/839394
Committed: https://opendev.org/starlingx/stx-puppet/commit/0709f70b023bad917f594805a1a00ef33faaf10e
Submitter: "Zuul (22348)"
Branch: master

commit 0709f70b023bad917f594805a1a00ef33faaf10e
Author: Enzo Candotti <email address hidden>
Date: Tue Apr 26 11:38:12 2022 -0300

    Upgrade k8s certificates during SX-DX migration

    It was seen that during an SX-DX migration, the k8s certificate is
    missing controller-0's cluster host IP in its SANs. This causes
    problems on controller-0 after it unlocks as duplex.

    This change updates platform::kubernetes::certsans::runtime to add
    controller-0 and controller-1 cluster host IP in the config file
    used when regenerating apiserver cert files.

    Test Plan:
    PASS: run a migration on AIO-SX standalone and AIO-SX subcloud.
    PASS: Swact to make controller-1 active, modify the OAM IP,
    check apiserver.crt on both controllers, and verify that the cluster
    host IPs of both controllers are present in the SANs.

    Closes-bug: 1970443

    Signed-off-by: Enzo Candotti <email address hidden>
    Change-Id: I2e225df2c402f4439da737ade8a5a0e57b96f673
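Conceptually, the fix ensures that the configuration used when regenerating the apiserver certificate carries both controllers' cluster host IPs as extra SANs. A kubeadm-style fragment illustrating the end state (API version and IPs are illustrative placeholders, not taken from the fix itself):

```yaml
# Illustrative kubeadm ClusterConfiguration fragment
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
apiServer:
  certSANs:
    - 192.168.206.2   # controller-0 cluster host IP (placeholder)
    - 192.168.206.3   # controller-1 cluster host IP (placeholder)
```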

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.7.0 stx.config
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to platform-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/platform-armada-app/+/855090
Committed: https://opendev.org/starlingx/platform-armada-app/commit/48859ba190cc931d32c7fa4977bc2b883e59cc26
Submitter: "Zuul (22348)"
Branch: master

commit 48859ba190cc931d32c7fa4977bc2b883e59cc26
Author: Enzo Candotti <email address hidden>
Date: Mon Aug 29 17:54:29 2022 -0300

    Add pre-upgrade helm hooks to StorageClass

    After the introduction of FluxCD, there have been issues with the
    platform-integ-apps auto re-apply on SX to DX migration.

    This change adds the following helm hooks that were missing from
    StorageClass:
    - "helm.sh/hook": "pre-upgrade, pre-install"
    - "helm.sh/hook-delete-policy": "before-hook-creation"

    Test Plan:

    PASS: CentOS - Create platform-integ-apps tgz with these changes and
    test the migration.
    PASS: Debian - Create platform-integ-apps tgz with these changes and
    test the migration.

    Closes-bug: 1970443

    Signed-off-by: Enzo Candotti <email address hidden>
    Change-Id: I3aa2e8f6cbb7504f9a4bf368d90f130dd5131106
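Applied to a StorageClass template, the hooks named in the commit look roughly as follows (the resource name and provisioner are illustrative; only the two annotations come from the commit message):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: general                     # illustrative name
  annotations:
    "helm.sh/hook": "pre-upgrade, pre-install"
    "helm.sh/hook-delete-policy": "before-hook-creation"
provisioner: kubernetes.io/rbd      # illustrative provisioner
```

The before-hook-creation delete policy lets Helm remove the previous hook resource before creating the new one, which avoids ownership conflicts when the app is re-applied during migration.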

Ghada Khalil (gkhalil)
tags: added: stx.8.0