platform-integration-manifest apply failed to Install release: stx-rbd-provisioner

Bug #1841435 reported by Peng Peng
This bug affects 2 people
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Daniel Badea

Bug Description

Brief Description
-----------------
Applying the platform-integration-manifest application during the initial lab installation period failed. platform-integ-apps-apply.log shows:
Failed to Install release: stx-rbd-provisioner

Severity
--------
Major

Steps to Reproduce
------------------
Apply the platform-integration-manifest application during the initial lab installation period.

TC-name: installation

Expected Behavior
------------------
Apply succeeds.

Actual Behavior
----------------
Apply fails.

Reproducibility
---------------
Seen once

System Configuration
--------------------
Two-node system

Lab-name: IP_5-6

Branch/Pull Time/Commit
-----------------------
stx master as of 2019-08-25_20-59-00

Last Pass
---------
2019-08-20_20-59-00

Timestamp/Logs
--------------
2019-08-26 07:25:11.862 36 ERROR armada.handlers.armada debug_error_string = "{"created":"@1566804311.754052177","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"release stx-rbd-provisioner failed: timed out waiting for the condition","grpc_status":2}"
2019-08-26 07:25:11.862 36 ERROR armada.handlers.armada >
2019-08-26 07:25:11.862 36 ERROR armada.handlers.armada
2019-08-26 07:25:11.862 36 ERROR armada.handlers.armada During handling of the above exception, another exception occurred:
2019-08-26 07:25:11.862 36 ERROR armada.handlers.armada
2019-08-26 07:25:11.862 36 ERROR armada.handlers.armada Traceback (most recent call last):
2019-08-26 07:25:11.862 36 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 223, in handle_result
2019-08-26 07:25:11.862 36 ERROR armada.handlers.armada result = get_result()
2019-08-26 07:25:11.862 36 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 234, in <lambda>
2019-08-26 07:25:11.862 36 ERROR armada.handlers.armada if (handle_result(chart, lambda: deploy_chart(chart))):
2019-08-26 07:25:11.862 36 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 212, in deploy_chart
2019-08-26 07:25:11.862 36 ERROR armada.handlers.armada prefix, known_releases)
2019-08-26 07:25:11.862 36 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/chart_deploy.py", line 234, in execute
2019-08-26 07:25:11.862 36 ERROR armada.handlers.armada timeout=timer)
2019-08-26 07:25:11.862 36 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 478, in install_release
2019-08-26 07:25:11.862 36 ERROR armada.handlers.armada raise ex.ReleaseException(release, status, 'Install')
2019-08-26 07:25:11.862 36 ERROR armada.handlers.armada armada.exceptions.tiller_exceptions.ReleaseException: Failed to Install release: stx-rbd-provisioner - Tiller Message: b'Release "stx-rbd-provisioner" failed: timed out waiting for the condition'

Test Activity
-------------
installation

Revision history for this message
Frank Miller (sensfan22) wrote :

Issue seen twice in the past week, only on the same AIO-DX lab. The workaround is to apply the application manually. Marking as stx.3.0 gating.
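For reference, the manual workaround amounts to re-applying the failed platform application once the condition clears, along the lines of the following (exact invocation is an assumption based on the standard StarlingX application commands; confirm the application name with system application-list on the target load):

    system application-apply platform-integ-apps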

tags: added: stx.3.0 stx.containers
Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Daniel Badea (daniel.badea)
Revision history for this message
Frank Miller (sensfan22) wrote :

Assigning to Daniel to investigate as he has made a recent change in this area.

Revision history for this message
Daniel Badea (daniel.badea) wrote :

Root cause:
- controller-1 is assigned a personality 20 seconds before generating overrides for platform-integ-apps
- rbd-provisioner replica set size is 2 but controller-1 is locked at that time
- platform-integ-apps fails because a 2nd rbd-provisioner pod doesn't come up (controller-1 is locked)
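To illustrate, a minimal sketch of how the override could end up at 2 replicas under these conditions (hypothetical names, not the actual sysinv override plugin code):

    # Hypothetical sketch of the pre-fix replica derivation for rbd-provisioner.
    # Counting every configured controller, regardless of administrative state,
    # yields replicas=2 even though controller-1 is still locked, so the second
    # pod never schedules and the Helm install times out.
    def rbd_provisioner_replicas(hosts):
        controllers = [h for h in hosts if h['personality'] == 'controller']
        return len(controllers)  # includes the locked controller-1

    hosts = [
        {'hostname': 'controller-0', 'personality': 'controller', 'administrative': 'unlocked'},
        {'hostname': 'controller-1', 'personality': 'controller', 'administrative': 'locked'},
    ]
    print(rbd_provisioner_replicas(hosts))  # -> 2, but only one pod can come up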

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Experiencing this also on hw AIO-DX lab (WP 13-14)
2019-08-26_20-59-00

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

2019-08-29 16:54:04.007 36 DEBUG armada.handlers.lock [-] Updating lock update_lock /usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py:173
2019-08-29 16:55:02.361 36 ERROR armada.handlers.tiller [-] [chart=kube-system-rbd-provisioner]: Error while installing release stx-rbd-provisioner: grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "release stx-rbd-provisioner failed: timed out waiting for the condition"
        debug_error_string = "{"created":"@1567097702.361180542","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"release stx-rbd-provisioner failed: timed out waiting for the condition","grpc_status":2}"
>
2019-08-29 16:55:02.361 36 ERROR armada.handlers.tiller Traceback (most recent call last):
2019-08-29 16:55:02.361 36 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 465, in install_release
2019-08-29 16:55:02.361 36 ERROR armada.handlers.tiller metadata=self.metadata)
2019-08-29 16:55:02.361 36 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 533, in __call__
2019-08-29 16:55:02.361 36 ERROR armada.handlers.tiller return _end_unary_response_blocking(state, call, False, None)
2019-08-29 16:55:02.361 36 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
2019-08-29 16:55:02.361 36 ERROR armada.handlers.tiller raise _Rendezvous(state, None, None, deadline)
2019-08-29 16:55:02.361 36 ERROR armada.handlers.tiller grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
2019-08-29 16:55:02.361 36 ERROR armada.handlers.tiller status = StatusCode.UNKNOWN
2019-08-29 16:55:02.361 36 ERROR armada.handlers.tiller details = "release stx-rbd-provisioner failed: timed out waiting for the condition"
2019-08-29 16:55:02.361 36 ERROR armada.handlers.tiller debug_error_string = "{"created":"@1567097702.361180542","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"release stx-rbd-provisioner failed: timed out waiting for the condition","grpc_status":2}"
2019-08-29 16:55:02.361 36 ERROR armada.handlers.tiller >
2019-08-29 16:55:02.361 36 ERROR armada.handlers.tiller
2019-08-29 16:55:02.365 36 DEBUG armada.handlers.tiller [-] [chart=kube-system-rbd-provisioner]: Helm getting release status for release=stx-rbd-provisioner, version=0 get_release_status /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:531
2019-08-29 16:55:02.447 36 DEBUG armada.handlers.tiller [-] [chart=kube-system-rbd-provisioner]: GetReleaseStatus= name: "stx-rbd-provisioner"
info {
  status {
    code: FAILED
  }
  first_deployed {
    seconds: 1567095902
    nanos: 74506647
  }
  last_deployed {
    seconds: 1567095902
    nanos: 74506647
  }
  Description: "Release \"stx-rbd-provisioner\" failed: timed out waiting for the condition"
}
namespace: "kube-system"
 get_release_status /usr/local/lib/python3.6/dist-packages/armada/...


Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

The workaround listed above does not seem to help; the apply still failed on re-apply.

a. Deployment manager continues to attempt to unlock controller-1
2019-08-29 18:18:24.700 112975 INFO sysinv.api.controllers.v1.host [-] controller-1 ihost check_unlock_application
2019-08-29 18:18:24.703 112975 INFO sysinv.api.controllers.v1.host [-] host unlock check didn't pass, so set the ihost_action back to None and re-raise the exception
2019-08-29 18:18:24.713 112975 WARNING wsme.api [-] Client-side error: Rejected: Can not unlock host controller-1 while an application is being applied, updated or recovered. Please try again later.

ceph -s shows the Ceph cluster is HEALTH_OK

platform-integ-apps-apply.log
2019-08-29 18:18:02.612 7736 DEBUG armada.handlers.lock [-] Updating lock update_lock /usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py:173
2019-08-29 18:19:02.681 7736 DEBUG armada.handlers.lock [-] Updating lock update_lock /usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py:173
2019-08-29 18:20:01.091 7736 ERROR armada.handlers.tiller [-] [chart=kube-system-rbd-provisioner]: Error while installing release stx-rbd-provisioner: grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "release stx-rbd-provisioner failed: timed out waiting for the condition"
        debug_error_string = "{"created":"@1567102801.090806895","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"release stx-rbd-provisioner failed: timed out waiting for the condition","grpc_status":2}"
>
2019-08-29 18:20:01.091 7736 ERROR armada.handlers.tiller Traceback (most recent call last):
2019-08-29 18:20:01.091 7736 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 465, in install_release
2019-08-29 18:20:01.091 7736 ERROR armada.handlers.tiller metadata=self.metadata)
2019-08-29 18:20:01.091 7736 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 533, in __call__
2019-08-29 18:20:01.091 7736 ERROR armada.handlers.tiller return _end_unary_response_blocking(state, call, False, None)
2019-08-29 18:20:01.091 7736 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
2019-08-29 18:20:01.091 7736 ERROR armada.handlers.tiller raise _Rendezvous(state, None, None, deadline)
2019-08-29 18:20:01.091 7736 ERROR armada.handlers.tiller grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
2019-08-29 18:20:01.091 7736 ERROR armada.handlers.tiller status = StatusCode.UNKNOWN
2019-08-29 18:20:01.091 7736 ERROR armada.handlers.tiller details = "release stx-rbd-provisioner failed: timed out waiting for the condition"
2019-08-29 18:20:01.091 7736 ERROR armada.handlers.tiller debug_error_string = "{"created":"@1567102801.090806895","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"release stx-rbd-provisioner failed: timed out waiting for the condition","grpc_status":2}"

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/679457

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/679458

Yang Liu (yliu12)
tags: added: stx.retestneeded
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/679457
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=eab24ecf80bba5ccc0177a7c2571e493675c9792
Submitter: Zuul
Branch: master

commit eab24ecf80bba5ccc0177a7c2571e493675c9792
Author: Daniel Badea <email address hidden>
Date: Fri Aug 30 12:39:19 2019 +0000

    re-apply platform apps when controller state changes

    platform-integ-apps needs to be re-applied when hosts
    are locked/unlocked as some charts expect a specific
    number of pod replicas to be available.

    This extends support for restarting applications added
    for stx-openstack by commit "Changes to stx-openstack
    application automatic re-apply behaviour"
    ( b1895200a44986b1a447c28262bf22edeba5f652 )

    Notes:
    1. platform apps re-apply is evaluated when:
       - host is removed
       - host services are enabled (after unlock)
       unless system restore is in progress
    2. log messages updated to factor in application name
    3. fault and re-apply flag management moved to kube_app.py
    4. each app has its own re-apply flag

    Change-Id: I6328a9f883a4807446ea4a56a9d14c844dabdf72
    Closes-bug: 1841435
    Signed-off-by: Daniel Badea <email address hidden>
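As a rough illustration of the evaluation described in the commit notes above, a minimal sketch (hypothetical names; the actual logic lives in sysinv host handling and kube_app.py):

    # Hypothetical sketch: decide whether platform apps should be re-applied
    # after a host event, per the commit notes (host removed, or host services
    # enabled after an unlock, and never while a system restore is in progress).
    REAPPLY_EVENTS = {'host-removed', 'host-services-enabled'}

    def should_reapply_platform_apps(event, restore_in_progress):
        if restore_in_progress:
            return False
        return event in REAPPLY_EVENTS

    # controller-1 services come up after unlock -> re-apply is triggered
    print(should_reapply_platform_apps('host-services-enabled', restore_in_progress=False))  # True
    print(should_reapply_platform_apps('host-removed', restore_in_progress=True))            # False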

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/679458
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=6801c4b294649c1035bf1e625a89f5956bb35562
Submitter: Zuul
Branch: master

commit 6801c4b294649c1035bf1e625a89f5956bb35562
Author: Daniel Badea <email address hidden>
Date: Fri Aug 30 14:01:27 2019 +0000

    rbd-provisioner replicas match unlocked controllers

    rbd-provisioner replica set size should match the
    number of unlocked controllers. When controller-1
    is unlocked and platform-integ-apps is applied then
    overrides are generated with 2 replicas and app
    apply times out waiting for the pod that should be
    scheduled on controller-1.

    Change-Id: Ibdeab100929e9161bf7872b826e94de4fc4fa33e
    Depends-On: I6328a9f883a4807446ea4a56a9d14c844dabdf72
    Closes-bug: 1841435
    Signed-off-by: Daniel Badea <email address hidden>
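As a companion to the root-cause sketch earlier, a minimal illustration of the behaviour this change describes, i.e. sizing the replica set from the unlocked controllers only (hypothetical names, not the actual override plugin code):

    # Hypothetical sketch of the post-fix replica derivation: count only
    # unlocked controllers, so the apply does not wait for a pod that cannot
    # be scheduled on a locked controller-1.
    def rbd_provisioner_replicas(hosts):
        unlocked_controllers = [
            h for h in hosts
            if h['personality'] == 'controller' and h['administrative'] == 'unlocked'
        ]
        return len(unlocked_controllers)

    hosts = [
        {'hostname': 'controller-0', 'personality': 'controller', 'administrative': 'unlocked'},
        {'hostname': 'controller-1', 'personality': 'controller', 'administrative': 'locked'},
    ]
    print(rbd_provisioner_replicas(hosts))  # -> 1 while controller-1 is still locked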

Revision history for this message
Peng Peng (ppeng) wrote :

Not seeing this issue recently.

tags: removed: stx.retestneeded