Openstack manifest apply hung applying cinder manifest

Bug #1833323 reported by Brent Rowsell
This bug affects 4 people
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Tee Ngo
Milestone: (not set)

Bug Description

Brief Description
-----------------
I was applying the stx-openstack application and the apply got stuck on the cinder chart.
Investigation shows that the armada container was terminated when sysinv restarted after SM detected an audit failure.

Manual database recovery was required, which is not an acceptable solution. The application framework must handle process restarts gracefully.

Severity
--------
Major

Steps to Reproduce
------------------
See above

Expected Behavior
------------------
Application applied without errors

Actual Behavior
----------------
See above

Reproducibility
---------------
Seen once so far

System Configuration
--------------------
Seen on AIO-SX with the low-latency profile, but expected to apply to all configurations

Branch/Pull Time/Commit
-----------------------
"2019-06-03 18:37:28"
Tarball built on June 4th

Last Pass
---------
Same load lineup

Timestamp/Logs
--------------
2019-06-18T18:27:31.000 controller-0 bash: info HISTORY: PID=147344 UID=0 system application-apply stx-openstack
2019-06-18 18:27:32.356 92510 INFO sysinv.conductor.kube_app [-] Application (stx-openstack) apply started.
2019-06-18 18:56:48.881 92510 INFO sysinv.conductor.kube_app [-] processing chart: osh-openstack-cinder, overall completion: 73.0%

| 2019-06-18T18:24:17.497 | 262 | service-group-scn | vim-services | go-active | active |
| 2019-06-18T18:59:44.939 | 263 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2019-06-18T18:59:45.288 | 264 | service-scn | ceph-manager | enabled-active | disabling | disable state requested
| 2019-06-18T18:59:45.290 | 265 | service-scn | sysinv-conductor | enabled-active | disabling | disable state requested
| 2019-06-18T18:59:45.290 | 266 | service-scn | sysinv-inv | enabled-active | disabling | disable state requested

sysinv restarted
2019-06-18 18:59:45.415 92988 INFO oslo_service.service [-] Caught SIGTERM, stopping children
2019-06-18 18:59:45.416 92988 INFO oslo.service.wsgi [-] Stopping WSGI server.
2019-06-18 18:59:45.416 92988 INFO oslo_service.service [-] Waiting on 1 children to exit
2019-06-18 18:59:45.416 99204 INFO oslo.service.wsgi [-] Stopping WSGI server.
2019-06-18 18:59:45.431 92988 INFO oslo_service.service [-] Child 99204 exited with status 0
2019-06-18 18:59:45.432 92988 INFO oslo_service.service [-] Caught SIGTERM, stopping children
2019-06-18 18:59:45.433 92988 INFO oslo.service.wsgi [-] Stopping WSGI server.
2019-06-18 18:59:45.433 92988 INFO oslo_service.service [-] Waiting on 1 children to exit
2019-06-18 18:59:45.433 99229 INFO oslo.service.wsgi [-] Stopping WSGI server.
2019-06-18 18:59:45.437 92988 INFO oslo_service.service [-] Child 99229 exited with status 0
2019-06-18 18:59:45.538 92510 INFO sysinv.conductor.kube_app [-] Exiting progress monitoring thread for app stx-openstack
2019-06-18 18:59:45.539 92510 INFO sysinv.openstack.common.service [-] Caught SIGTERM, exiting
2019-06-18 19:00:44.954 13035 INFO sysinv.agent.manager [-] ilvg_get_nova_ilvg_by_ihost() Timeout.
2019-06-18 19:00:44.961 13035 INFO sysinv.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672

Test Activity
-------------
Other

Revision history for this message
Brent Rowsell (brent-rowsell) wrote :

Logs

summary: - Openstack manifest apply hung applyinf cinder manifest
+ Openstack manifest apply hung applying cinder manifest
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.2.0 release gating; application apply failure.

description: updated
tags: added: stx.containers
tags: added: stx.2.0
Changed in starlingx:
importance: Undecided → Medium
assignee: nobody → Tee Ngo (teewrs)
status: New → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/670180

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/670180
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=e4feb091aa032ff32043c108878970554ee686fe
Submitter: Zuul
Branch: master

commit e4feb091aa032ff32043c108878970554ee686fe
Author: Tee Ngo <email address hidden>
Date: Wed Jul 10 16:03:25 2019 -0400

    Audit application status upon sysinv-conductor startup

    If the sysinv conductor process is abruptly terminated (power
    loss, OOM, missed audit cycle, etc.) while an application upload/
    apply/update/remove is in progress, the status of the application
    is stuck in uploading/applying/updating/removing, preventing any
    subsequent system application command from resuming (or reverting)
    the operation.

    This is the first commit of multi-commit change to improve the
    robustness of the application framework.

    Partial-Bug: 1833323
    Change-Id: I73129d5621c77b50c2e29c078e6b99089244129f
    Signed-off-by: Tee Ngo <email address hidden>
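
The startup audit described in the commit above can be pictured roughly as follows. This is a minimal sketch, not the sysinv code: the function name, the dict-based app records, and the "-failed" target statuses are all assumptions made for illustration. Only the in-progress status names (uploading, applying, updating, removing) come from the commit message.

```python
# Hypothetical mapping: the "-failed" names are assumed, not taken from sysinv.
IN_PROGRESS_TO_FAILED = {
    "uploading": "upload-failed",
    "applying": "apply-failed",
    "updating": "update-failed",
    "removing": "remove-failed",
}


def audit_apps_on_startup(apps):
    """Reset apps left in an in-progress status by a previous conductor run.

    `apps` is a list of dicts with 'name' and 'status' keys (an illustrative
    stand-in for database records). Returns the names of the apps reset.
    """
    reset = []
    for app in apps:
        failed_status = IN_PROGRESS_TO_FAILED.get(app["status"])
        if failed_status is not None:
            # No worker thread survives a restart, so the in-progress
            # status is stale; mark it failed so the user can retry.
            app["status"] = failed_status
            reset.append(app["name"])
    return reset
```

With the status no longer stuck in an in-progress value, a later apply or remove request is not rejected as a concurrent operation.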

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/671552

Revision history for this message
Frank Miller (sensfan22) wrote :

Reviewed by the containers PL (Frank) and TL (Brent) and changed priority to high. Container stability/robustness is required for stx.2.0.

Changed in starlingx:
importance: Medium → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/672607

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/671552
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=3f7bbbd94c71c7b8c615704d7edd8ecd62c2f1c4
Submitter: Zuul
Branch: master

commit 3f7bbbd94c71c7b8c615704d7edd8ecd62c2f1c4
Author: Tee Ngo <email address hidden>
Date: Thu Jul 18 15:01:40 2019 -0400

    Add system application-abort command

    This commit provides a basic capability to abort an application
    operation that is in progress.

    - The new abort command is only applicable to an application
      that is being applied, updated or removed. The abort request
      will be rejected if the application has any other status,
      including "recovering" from a failed update.
    - The abort processing thread sets the "abort" flag for the
      specified app, terminates the Armada request and removes
      dangling locks.
    - Once the "abort" flag is set, the corresponding app processing
      thread will bail out at the next opportunity that is advantageous
      and friendly to a subsequent command to resume or revert the
      operation.
    - Terminating the Armada request entails stopping the Armada
      service. This logic will be revisited in the future when the
      Armada lock restriction is lifted and there is more soak time.
    - In the event the user aborted while the openstack compute-kits
      charts group was being deployed, wait for Tiller to finish the
      "pending install" charts before making a new Armada request to
      reapply the app.

    Tests:
      - upload, apply, and remove of stx-openstack app
      - upload, apply, update of a test app (a scaled down version
        of stx-openstack app)
      - abort during the apply of stx-openstack app during
        a) images download and b) manifest apply
      - abort during the update of the test app during
        a) images download and b) manifest apply which triggers
        recovery with and without Armada request respectively
      - abort during the remove of stx-openstack app
      - abort while compute-kits is being deployed then reapply

    Partial-Bug: 1833323
    Change-Id: I382bd9ce82d504b7d221c8079f9f46c6798eb3b1
    Signed-off-by: Tee Ngo <email address hidden>
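
The abort flag described in the commit above is a cooperative cancellation pattern. The sketch below shows that pattern in isolation, assuming a `threading.Event` as the flag. The class and function names are hypothetical, and the per-chart loop stands in for the real Armada request, which is not modelled here.

```python
import threading

# Statuses for which an abort is accepted, per the commit message.
ABORTABLE_STATUSES = {"applying", "updating", "removing"}


def can_abort(status):
    """Return True if an abort request should be accepted for this status."""
    return status in ABORTABLE_STATUSES


class AppOperation:
    """Hypothetical stand-in for the state of an app processing thread."""

    def __init__(self, charts):
        self.charts = list(charts)
        self.abort_requested = threading.Event()
        self.processed = []

    def request_abort(self):
        # Called from the abort processing thread.
        self.abort_requested.set()

    def run(self):
        for chart in self.charts:
            # Check the flag between charts so the operation stops at a
            # point from which it can later be resumed or reverted.
            if self.abort_requested.is_set():
                return "aborted"
            self.processed.append(chart)  # placeholder for the real work
        return "completed"
```

Checking the flag only at chart boundaries is a design choice of this sketch; the commit says only that the thread bails out "at the next opportunity".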

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fault (master)

Fix proposed to branch: master
Review: https://review.opendev.org/673111

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/673348

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fault (master)

Reviewed: https://review.opendev.org/673111
Committed: https://git.openstack.org/cgit/starlingx/fault/commit/?id=b4c088c6a4215a8aea2da306c4bc7fe616f6bce6
Submitter: Zuul
Branch: master

commit b4c088c6a4215a8aea2da306c4bc7fe616f6bce6
Author: Tee Ngo <email address hidden>
Date: Fri Jul 26 17:53:13 2019 -0400

    Define alarm group, type and ids for application

    The newly introduced fault constants will be used to raise and
    clear application related alarms.

    Closes-Bug: 1833323
    Change-Id: I992ab7a788cfab8d52d2e6a498519c591148f588
    Signed-off-by: Tee Ngo <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fault (master)

Fix proposed to branch: master
Review: https://review.opendev.org/673615

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fault (master)

Reviewed: https://review.opendev.org/673615
Committed: https://git.openstack.org/cgit/starlingx/fault/commit/?id=37589bf657679bfa39994ec316053948b60a289f
Submitter: Zuul
Branch: master

commit 37589bf657679bfa39994ec316053948b60a289f
Author: Tee Ngo <email address hidden>
Date: Tue Jul 30 14:54:28 2019 -0400

    Add missing fields to application event definitions

    This commit is a follow up of the previous commit
    b4c088c6a4215a8aea2da306c4bc7fe616f6bce6 which missed
    some fields for fm-doc compile.

    Closes-Bug: 1833323
    Change-Id: I80e42ea49d1e69ac3d6b199a1b26dfb7c977dd1f
    Signed-off-by: Tee Ngo <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/673348
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=8a797c9864dffcc4fa3530588841122d48d71b03
Submitter: Zuul
Branch: master

commit 8a797c9864dffcc4fa3530588841122d48d71b03
Author: Tee Ngo <email address hidden>
Date: Mon Jul 29 14:15:53 2019 -0400

    Raise/clear alarms based on Kubernetes app status

    This is the final commit for LP1833323.

    A major alarm will be raised when an application upload,
    apply/update or remove fails. A warning alarm will be raised
    when an application apply/update is in progress to discourage
    host maintenance (e.g. unlock) as this could be disruptive to
    the operation. The alarm is cleared once the operation is
    successfully completed.

    During an application update, if a failure occurs, automatic
    recovery will kick in to revert back to the original app
    version. If this is successful, the alarm that was previously
    raised at the point of update failure will be cleared.

    In the event sysinv conductor is unexpectedly restarted (e.g.
    controller swact, failed audit) while an application operation
    is in progress, the application status will be reset and an
    application alarm raised/cleared accordingly.

    This commit also includes the fix for a minor bug introduced
    in commit 3f7bbbd94c71c7b8c615704d7edd8ecd62c2f1c4 to add
    application-abort command which has no user impact but would
    result in a misleading log.

    Tests:
      - Verify a warning alarm is generated at the beginning of
        an application apply/update and is cleared at the end.
      - Induce an upload failure, verify a warning alarm is
        generated.
      - Abort during an application-apply, verify a major alarm
        is generated. Retry the apply, verify the alarm is
        cleared at the end of the apply.
      - Abort during an application-update, verify a major alarm
        is generated at the point of failure. Verify that the
        alarm is cleared once automatic recovery has completed.
      - Abort during an application-remove, verify a major alarm
        is generated. Retry the remove, verify the alarm is
        cleared at the end of the remove.

    Closes-Bug: 1833323
    Depends-On: I992ab7a788cfab8d52d2e6a498519c591148f588
    Change-Id: I9bd2cabe318b75b88768ea493992ffd37fb777b0
    Signed-off-by: Tee Ngo <email address hidden>
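
The raise/clear rules in the commit above can be summarized as a small decision function. This is a sketch under stated assumptions: the function name and the "in-progress", "failed" and "completed" state labels are hypothetical, and it covers only apply, update and remove.

```python
def alarm_severity(operation, state):
    """Return the alarm severity to hold for an app operation, or None.

    `operation` is one of 'apply', 'update', 'remove'; `state` is one of
    'in-progress', 'failed', 'completed'. All labels are illustrative.
    """
    if operation not in ("apply", "update", "remove"):
        raise ValueError("unsupported operation: %s" % operation)
    if state == "failed":
        # A failed operation raises a major alarm.
        return "major"
    if state == "in-progress" and operation in ("apply", "update"):
        # Warn while an apply/update runs to discourage host maintenance.
        return "warning"
    # Successful completion (or a remove in progress) holds no alarm.
    return None
```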

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fault (master)

Fix proposed to branch: master
Review: https://review.opendev.org/673870

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fault (master)

Reviewed: https://review.opendev.org/673870
Committed: https://git.openstack.org/cgit/starlingx/fault/commit/?id=6026848266a4164ed4371e17a77b9b2a50b8d04a
Submitter: Zuul
Branch: master

commit 6026848266a4164ed4371e17a77b9b2a50b8d04a
Author: Tee Ngo <email address hidden>
Date: Wed Jul 31 12:26:39 2019 -0400

    Set management affecting severity for specific app alarms

    In this commit, the Management_Affecting_Severity is
    changed from none to warning for 2 types of k8s app alarms:
      - Application Apply In Progress (alarm id: 750.004) and
      - Application Update In Progress (alarm id: 750.005)

    This change will disallow a patch or upgrade while either
    of these alarms exists.

    The Degrade_Affecting_Severity setting will remain
    'none' for all app alarms, as logically these events
    should not degrade the host to the point of disabling
    filesystem resizing or controller swact.

    Closes-Bug: 1833323
    Change-Id: Ie5a85bb4480f8ce6dcb63dfd2e26dacdbceeb366
    Signed-off-by: Tee Ngo <email address hidden>
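
To illustrate the effect of the commit above, the sketch below gates a patch or upgrade on the management-affecting severity of the active alarms. The two alarm ids and their "warning" setting come from the commit message. The function names and the lookup table are assumptions, and any alarm id not in the table is treated as 'none'.

```python
# Management-affecting severity per alarm id (from the commit message).
MGMT_AFFECTING_SEVERITY = {
    "750.004": "warning",  # Application Apply In Progress
    "750.005": "warning",  # Application Update In Progress
}


def blocking_alarms(active_alarm_ids):
    """Return the active alarm ids that are management affecting."""
    return sorted(
        alarm_id
        for alarm_id in active_alarm_ids
        if MGMT_AFFECTING_SEVERITY.get(alarm_id, "none") != "none"
    )


def patch_or_upgrade_allowed(active_alarm_ids):
    """Allow a patch or upgrade only when no blocking alarm is active."""
    return not blocking_alarms(active_alarm_ids)
```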

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/672607
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=2b1f3797ba422066f4e238b415e53d38eb4876b3
Submitter: Zuul
Branch: master

commit 2b1f3797ba422066f4e238b415e53d38eb4876b3
Author: Tee Ngo <email address hidden>
Date: Wed Jul 24 19:30:11 2019 -0400

    Disallow unlock when an application operation is in progress

    Unless forced by the user, the unlock request will be rejected
    if application apply, update or recovery is still in progress.
    This prevents the app operation from being stuck or timed out
    waiting for the host and its pods to recover from the unlock.

    Closes-Bug: 1833323
    Change-Id: I0ef1f56914c9187dba6a775b72f70cc4babbf90a
    Signed-off-by: Tee Ngo <email address hidden>
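
The unlock check described in the commit above amounts to a guard like the following. It is a minimal sketch: the function name, the `force` parameter name and the status strings are assumed for illustration, on the basis that "applying", "updating" and "recovering" are the statuses named elsewhere in this report.

```python
# App statuses that block a host unlock (assumed status strings).
BLOCKING_STATUSES = {"applying", "updating", "recovering"}


def unlock_allowed(app_statuses, force=False):
    """Return True if a host unlock request should be accepted.

    `app_statuses` is an iterable of current application statuses.
    A forced unlock bypasses the check.
    """
    if force:
        return True
    return not any(status in BLOCKING_STATUSES for status in app_statuses)
```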

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (r/stx.2.0)

Fix proposed to branch: r/stx.2.0
Review: https://review.opendev.org/678126

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (r/stx.2.0)

Reviewed: https://review.opendev.org/678126
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=37f05b3d0c22fd7b0621c9a80bf3ab08f24c15ec
Submitter: Zuul
Branch: r/stx.2.0

commit 37f05b3d0c22fd7b0621c9a80bf3ab08f24c15ec
Author: Tee Ngo <email address hidden>
Date: Wed Jul 24 19:30:11 2019 -0400

    Disallow unlock when an application operation is in progress

    Unless forced by the user, the unlock request will be rejected
    if application apply, update or recovery is still in progress.
    This prevents the app operation from being stuck or timed out
    waiting for the host and its pods to recover from the unlock.

    Closes-Bug: 1833323
    Change-Id: I0ef1f56914c9187dba6a775b72f70cc4babbf90a
    Signed-off-by: Tee Ngo <email address hidden>

Ghada Khalil (gkhalil)
tags: added: in-r-stx20
Revision history for this message
Raviteja naidu Jagalmarri (raviteja0218) wrote :

Tested the change today on a Duplex system; unlock is disallowed while an application operation is in progress.

Revision history for this message
Raviteja naidu Jagalmarri (raviteja0218) wrote :

Build info:

OS="centos"
SW_VERSION="19.08"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="r/stx.2.0"

JOB="STX_BUILD_2.0"
<email address hidden>"
BUILD_NUMBER="40"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-08-26 23:30:00 +0000"
