SX configure failed by platform-integ-apps apply-failed

Bug #2000080 reported by Erickson Silva de Oliveira
Affects    Status        Importance  Assigned to                 Milestone
StarlingX  Fix Released  Medium      Erickson Silva de Oliveira

Bug Description

Brief Description
-----------------
SX system installation failed because platform-integ-apps ended in the apply-failed state.

Severity
--------
Minor - The issue stopped the auto-installer from declaring a successful install, but re-applying platform-integ-apps works around it.

Steps to Reproduce
------------------
Install the system.

Expected Behavior
------------------
DM configuration completes and platform-integ-apps is applied successfully.

Actual Behavior
----------------
DM configuration fails because platform-integ-apps ends in apply-failed.

Reproducibility
---------------
This is the first time the issue has been seen.

System Configuration
--------------------
One node system

Branch/Pull Time/Commit
-----------------------

Timestamp/Logs
--------------
[2022-12-18 20:09:47,042] 541 DEBUG MainThread ssh.exec_cmd:: Running command: system --os-endpoint-type internalURL --os-region-name RegionOne application-list
[2022-12-18 20:09:47,042] 351 DEBUG MainThread ssh.send :: Send 'system --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2022-12-18 20:09:47,093] 548 DEBUG MainThread ssh.exec_cmd:: Expecting [.@controller-[01] .(keystone_admin)]\$ in prompt
[2022-12-18 20:09:55,425] 473 DEBUG MainThread ssh.expect :: Output:
+--------------------------+----------+-------------------------------------------+------------------+----------+--------------------------+
| application              | version  | manifest name                             | manifest file    | status   | progress                 |
+--------------------------+----------+-------------------------------------------+------------------+----------+--------------------------+
| cert-manager             | 22.12-1  | cert-manager-fluxcd-manifests             | fluxcd-manifests | applied  | completed                |
| nginx-ingress-controller | 22.12-1  | nginx-ingress-controller-fluxcd-manifests | fluxcd-manifests | applied  | completed                |
| oidc-auth-apps           | 22.12-1  | oidc-auth-apps-fluxcd-manifests           | fluxcd-manifests | uploaded | completed                |
| platform-integ-apps      | 22.12-55 | platform-integ-apps-fluxcd-manifests      | fluxcd-manifests | applying | retrieving docker images |
+--------------------------+----------+-------------------------------------------+------------------+----------+--------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$
[2022-12-18 20:09:55,426] 351 DEBUG MainThread ssh.send :: Send 'echo $?'
[2022-12-18 20:09:55,478] 473 DEBUG MainThread ssh.expect :: Output:
0
[sysadmin@controller-0 ~(keystone_admin)]$
[2022-12-18 20:10:00,484] 534 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2022-12-18 20:10:00,484] 541 DEBUG MainThread ssh.exec_cmd:: Running command: system --os-endpoint-type internalURL --os-region-name RegionOne application-list
[2022-12-18 20:10:00,484] 351 DEBUG MainThread ssh.send :: Send 'system --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2022-12-18 20:10:00,534] 548 DEBUG MainThread ssh.exec_cmd:: Expecting [.@controller-[01] .(keystone_admin)]\$ in prompt
[2022-12-18 20:10:09,264] 473 DEBUG MainThread ssh.expect :: Output:
+--------------------------+----------+-------------------------------------------+------------------+--------------+-----------------------------------------------------------------------------+
| application              | version  | manifest name                             | manifest file    | status       | progress                                                                    |
+--------------------------+----------+-------------------------------------------+------------------+--------------+-----------------------------------------------------------------------------+
| cert-manager             | 22.12-1  | cert-manager-fluxcd-manifests             | fluxcd-manifests | applied      | completed                                                                   |
| nginx-ingress-controller | 22.12-1  | nginx-ingress-controller-fluxcd-manifests | fluxcd-manifests | applied      | completed                                                                   |
| oidc-auth-apps           | 22.12-1  | oidc-auth-apps-fluxcd-manifests           | fluxcd-manifests | uploaded     | completed                                                                   |
| platform-integ-apps      | 22.12-55 | platform-integ-apps-fluxcd-manifests      | fluxcd-manifests | apply-failed | Unexpected process termination while application-apply was in progress. The |
|                          |          |                                           |                  |              | application status has changed from 'applying' to 'apply-failed'.           |
+--------------------------+----------+-------------------------------------------+------------------+--------------+-----------------------------------------------------------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "fb6677e3bb5854c537a578d135dfa0627a399bb3d85278065584b8ea677e69ec": plugin type="multus" name="multus-cni-network" failed (add): Multus: [platform-deployment-manager/platform-deployment-manager-699b9bb75b-gj7xp/]: error getting pod: Get "https://[10.96.0.1]:443/api/v1/namespaces/platform-deployment-manager/pods/platform-deployment-manager-699b9bb75b-gj7xp?timeout=1m0s": dial tcp 10.96.0.1:443: connect: no route to host

Test Activity
-------------
installation

Workaround
-------------
Manually run "system application-apply platform-integ-apps"; the apply then passes.
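The check behind this workaround could be scripted. The following is a minimal sketch, assuming the application-list output arrives as a pipe-delimited table with the six columns seen in the logs (application, version, manifest name, manifest file, status, progress). The helper names parse_application_list and reapply_command are hypothetical, and the sketch only builds the command string from the workaround; it does not run it.

```python
def parse_application_list(output):
    """Map application name -> status from application-list table text."""
    statuses = {}
    for line in output.splitlines():
        stripped = line.strip()
        if not stripped.startswith("|"):
            continue  # skip borders, shell prompts, and blank lines
        cells = [cell.strip() for cell in stripped.strip("|").split("|")]
        # Assumed column order: application, version, manifest name,
        # manifest file, status, progress
        if len(cells) < 6 or cells[0] in ("", "application"):
            continue  # header row, or wrapped continuation of a long progress cell
        statuses[cells[0]] = cells[4]
    return statuses


def reapply_command(statuses, app="platform-integ-apps"):
    """Return the re-apply command if the application failed, else None."""
    if statuses.get(app) == "apply-failed":
        return "system application-apply " + app
    return None


# Shortened sample modeled on the failing output in this report.
SAMPLE = """\
| application | version | manifest name | manifest file | status | progress |
| cert-manager | 22.12-1 | cert-manager-fluxcd-manifests | fluxcd-manifests | applied | completed |
| platform-integ-apps | 22.12-55 | platform-integ-apps-fluxcd-manifests | fluxcd-manifests | apply-failed | Unexpected process termination |
| | | | | | while application-apply was in progress. |
"""

if __name__ == "__main__":
    print(reapply_command(parse_application_list(SAMPLE)))
```

Returning the command instead of executing it keeps the sketch side-effect free; an installer script would still need to decide how many times a re-apply is acceptable.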

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ha (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/ha/+/868118

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/integ/+/868119

OpenStack Infra (hudson-openstack) wrote : Change abandoned on integ (master)

Change abandoned by "Erickson Silva de Oliveira <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/integ/+/868119
Reason: It is not necessary.

Ghada Khalil (gkhalil)
summary: - SX DM configure failed by platform-integ-apps apply-failed
+ SX configure failed by platform-integ-apps apply-failed
tags: added: stx.apps
OpenStack Infra (hudson-openstack) wrote : Fix merged to ha (master)

Reviewed: https://review.opendev.org/c/starlingx/ha/+/868118
Committed: https://opendev.org/starlingx/ha/commit/b4fb57c610e5d0768735845bc7fdb77aebd2138d
Submitter: "Zuul (22348)"
Branch: master

commit b4fb57c610e5d0768735845bc7fdb77aebd2138d
Author: Erickson Silva de Oliveira <email address hidden>
Date: Mon Dec 19 17:44:59 2022 +0000

    Increase retries and timeouts on "audit-enabled" of
    mgr-restful-plugin in SM database

    When the system is unstable, using a lot of CPU, it takes
    more time for the communication between the components to happen.

    So it's necessary to increase the maximum of retries and timeouts
    in the "audit-enable" of the mgr-restful-plugin to prevent errors
    from happening.

    Test Plan:
    PASS: mgr-restful-plugin restarted by SM (AIO-SX)

    Closes-bug: 2000080

    Signed-off-by: Erickson Silva de Oliveira <email address hidden>
    Change-Id: I0f8462fef20196a3bb913fa7d374a86a2c6565f1
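The reasoning in this commit message can be illustrated with a toy model. This is only a sketch: it is not SM's actual audit logic, the function name audit_passes is hypothetical, and the response times and limits are made-up numbers chosen to show the effect of a loaded system.

```python
def audit_passes(response_times, timeout, max_retries):
    """Toy audit: pass if any allowed attempt answers within the timeout.

    response_times: seconds each successive attempt would take to answer.
    max_retries: number of retries allowed after the first attempt.
    """
    attempts = response_times[: max_retries + 1]
    return any(seconds <= timeout for seconds in attempts)


# Hypothetical response times on a CPU-starved system (seconds).
SLOW_SYSTEM = [12.0, 9.0, 4.0]

if __name__ == "__main__":
    # Tight limits: both allowed attempts time out, so the audit fails.
    print(audit_passes(SLOW_SYSTEM, timeout=5.0, max_retries=1))
    # A longer timeout lets the second attempt succeed.
    print(audit_passes(SLOW_SYSTEM, timeout=10.0, max_retries=1))
    # More retries give the third, faster attempt a chance.
    print(audit_passes(SLOW_SYSTEM, timeout=5.0, max_retries=2))
```

In this model either knob alone is enough to turn a failed audit into a passing one, which matches the commit's choice to raise both.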

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ha (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/ha/+/871860

Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low
tags: added: stx.8.0
Changed in starlingx:
assignee: nobody → Erickson Silva de Oliveira (esilvade)
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Low → Medium
description: updated
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ha (master)

Reviewed: https://review.opendev.org/c/starlingx/ha/+/871860
Committed: https://opendev.org/starlingx/ha/commit/da832e0ad655da5b28dd55716bcc74e59c724de0
Submitter: "Zuul (22348)"
Branch: master

commit da832e0ad655da5b28dd55716bcc74e59c724de0
Author: Alyson Deives Pereira <email address hidden>
Date: Thu Jan 26 20:12:04 2023 -0300

    Remove disable dependence between ceph-manager and sysinv-conductor

    When the system is unstable, using a lot of CPU, it takes
    more time for the communication between the components to happen,
    such as the communication between mgr-restful-plugin and ceph-mgr.
    This communication failure may result on a failed audit by SM,
    which then restarts the mgr-restful-plugin.

    Change [1] resolved this issue by increasing the timeout and retries
    of mgr-restful-plugin in SM database.

    However, there is a disable dependence chain between mgr-restful,
    ceph-manager, and sysinv-conductor which results on sysinv-conductor
    being restarted if mgr-restful-plugin or ceph-manager is also disabled
    by SM. This can impact platform-integ-apps apply or any other action
    being executed by sysinv-conductor.

    The ceph manager -> sysinv-conductor dependence is not necessary
    anymore after the changes [2] and [3] were merged, thus this change
    removes this dependence.

    TEST PLAN:
    PASS: AIO-SX: bootstrap, unlock and apply platform-integ-apps
    PASS: Force ceph-manager to be restarted by SM, and verify that
          sysinv-conductor keeps running

    Related-Bug: 2000080

    [1] https://review.opendev.org/c/starlingx/ha/+/868118
    [2] https://review.opendev.org/c/starlingx/utilities/+/856320
    [3] https://review.opendev.org/c/starlingx/utilities/+/860570

    Signed-off-by: Alyson Deives Pereira <email address hidden>
    Change-Id: I949ccebd509b8099870b3dfda252a60b6b423715
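The disable chain described in this commit message can be modeled as a small directed graph. The sketch below is an illustration under assumptions: the edge direction simply follows the wording of the commit (a disable of mgr-restful-plugin or ceph-manager reaches sysinv-conductor), the function name disabled_along_with is hypothetical, and this is not how SM stores or evaluates its dependencies.

```python
from collections import deque


def disabled_along_with(propagates_to, service):
    """Return every service reached when a disable of `service` propagates."""
    seen = {service}
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for dependent in propagates_to.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


# Chain as described before the change.
BEFORE = {
    "mgr-restful-plugin": ["ceph-manager"],
    "ceph-manager": ["sysinv-conductor"],
}

# Same chain with the ceph-manager -> sysinv-conductor edge removed.
AFTER = {
    "mgr-restful-plugin": ["ceph-manager"],
}

if __name__ == "__main__":
    print(sorted(disabled_along_with(BEFORE, "mgr-restful-plugin")))
    print(sorted(disabled_along_with(AFTER, "mgr-restful-plugin")))
```

With the edge removed, a restart of mgr-restful-plugin or ceph-manager no longer reaches sysinv-conductor in this model, so an application apply in progress would not be interrupted by it.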
