Intermittent application (platform-integ-apps) apply failure when locking and unlocking the standby controller

Bug #1856230 reported by Anujeyan Manokeran on 2019-12-12
Affects: StarlingX
Importance: Medium
Assigned to: Stefan Dinescu

Bug Description

Brief Description
-----------------
System application apply was in a failed state after the standby controller was locked and unlocked. This was observed on a storage system.

system application-list
+---------------------+---------+-------------------------------+---------------+--------------+------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+---------+-------------------------------+---------------+--------------+------------------------------------------+
| platform-integ-apps | 1.0-10 | platform-integration-manifest | manifest.yaml | apply-failed | operation aborted, check logs for detail |
+---------------------+---------+-------------------------------+---------------+--------------+------------------------------------------+

2019-12-12 16:48:54.400 13 INFO armada.handlers.armada [-] Downloading tarball from: http://controller:8080/helm_charts/stx-platform/helm-toolkit-0.1.0.tgz
2019-12-12 16:48:54.400 13 WARNING armada.handlers.armada [-] Disabling server validation certs to extract charts
2019-12-12 16:48:54.422 13 INFO armada.handlers.armada [-] Downloading tarball from: http://controller:8080/helm_charts/stx-platform/ceph-pools-audit-0.1.0.tgz
2019-12-12 16:48:54.422 13 WARNING armada.handlers.armada [-] Disabling server validation certs to extract charts
2019-12-12 16:48:54.446 13 DEBUG armada.handlers.tiller [-] Tiller ListReleases() with timeout=300, request=limit: 32
status_codes: UNKNOWN
status_codes: DEPLOYED
status_codes: DELETED
status_codes: DELETING
status_codes: FAILED
status_codes: PENDING_INSTALL
status_codes: PENDING_UPGRADE
status_codes: PENDING_ROLLBACK
 get_results /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:215
2019-12-12 16:49:15.376 13 INFO armada.handlers.lock [-] Releasing lock
2019-12-12 16:49:24.727 13 ERROR armada.cli [-] Caught unexpected exception: grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Socket closed"
        debug_error_string = "{"created":"@1576169354.703282535","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"Socket closed","grpc_status":14}"
>
2019-12-12 16:49:24.727 13 ERROR armada.cli Traceback (most recent call last):
2019-12-12 16:49:24.727 13 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/__init__.py", line 38, in safe_invoke
2019-12-12 16:49:24.727 13 ERROR armada.cli self.invoke()
2019-12-12 16:49:24.727 13 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/apply.py", line 213, in invoke
2019-12-12 16:49:24.727 13 ERROR armada.cli resp = self.handle(documents, tiller)
2019-12-12 16:49:24.727 13 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py", line 81, in func_wrapper
2019-12-12 16:49:24.727 13 ERROR armada.cli return future.result()
2019-12-12 16:49:24.727 13 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
2019-12-12 16:49:24.727 13 ERROR armada.cli return self.__get_result()
2019-12-12 16:49:24.727 13 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
2019-12-12 16:49:24.727 13 ERROR armada.cli raise self._exception
2019-12-12 16:49:24.727 13 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
2019-12-12 16:49:24.727 13 ERROR armada.cli result = self.fn(*self.args, **self.kwargs)
2019-12-12 16:49:24.727 13 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/apply.py", line 256, in handle
2019-12-12 16:49:24.727 13 ERROR armada.cli return armada.sync()
2019-12-12 16:49:24.727 13 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 189, in sync
2019-12-12 16:49:24.727 13 ERROR armada.cli known_releases = self.tiller.list_releases()
2019-12-12 16:49:24.727 13 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 252, in list_releases
2019-12-12 16:49:24.727 13 ERROR armada.cli releases = get_results()
2019-12-12 16:49:24.727 13 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 220, in get_results
2019-12-12 16:49:24.727 13 ERROR armada.cli for message in response:
2019-12-12 16:49:24.727 13 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 364, in __next__
2019-12-12 16:49:24.727 13 ERROR armada.cli return self._next()
2019-12-12 16:49:24.727 13 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 358, in _next
2019-12-12 16:49:24.727 13 ERROR armada.cli raise self
2019-12-12 16:49:24.727 13 ERROR armada.cli grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
2019-12-12 16:49:24.727 13 ERROR armada.cli status = StatusCode.UNAVAILABLE
2019-12-12 16:49:24.727 13 ERROR armada.cli details = "Socket closed"
2019-12-12 16:49:24.727 13 ERROR armada.cli debug_error_string = "{"created":"@1576169354.703282535","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"Socket closed","grpc_status":14}"
2019-12-12 16:49:24.727 13 ERROR armada.cli >
2019-12-12 16:49:24.727 13 ERROR armada.cli
"/var/log/armada/platform-integ-apps-apply_2019-12-12-16-48-52.log" [readonly] 72L, 8939C

Severity
--------
Major

Steps to Reproduce
------------------

1. Lock and unlock the standby controller (controller-1)
2. Verify the application apply status using system application-list (CLI sequence sketched below)
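
For reference, the rough CLI sequence (assuming controller-1 is the standby controller, as in this report) is:

system host-lock controller-1
system host-unlock controller-1
# after controller-1 returns to unlocked/enabled/available:
system application-list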

Expected Behavior
------------------
Application apply succeeds every time the standby controller is locked and unlocked.

Actual Behavior
----------------
Application apply failure

Reproducibility
---------------
Intermittent issue. Seen once out of two lock/unlock attempts on the standby controller.

System Configuration
--------------------
Storage system

Branch/Pull Time/Commit
-----------------------
2019-12-11_10-44-34

Last Pass
---------
Intermittent issue

Timestamp/Logs
--------------
2019-11-21T18:52:54.000

Test Activity
-------------
Regression

Ghada Khalil (gkhalil) wrote :

Jeyan, did a manual re-apply work?

tags: added: stx.containers
Bob Church (rchurch) wrote :

Looks like we are allowing a lock to occur while an application apply is in progress. The lock sets the NoExecute taint on controller-1, which prevents the required application pods from running on controller-1.

2019-12-12 16:48:51.761 106194 INFO sysinv.conductor.kube_app [-] Application platform-integ-apps (1.0-10) apply started.
2019-12-12 16:48:52.972 106194 INFO sysinv.conductor.kube_app [-] Armada apply command = /bin/bash -c 'set -o pipefail; armada apply --enable-chart-cleanup --debug /manifests/platform-integ-apps/1.0-10/platform-integ-apps-manifest.yaml --values /overrides/platform-integ-apps/1.0-10/kube-system-rbd-provisioner.yaml --values /overrides/platform-integ-apps/1.0-10/kube-system-ceph-pools-audit.yaml --values /overrides/platform-integ-apps/1.0-10/helm-toolkit-helm-toolkit.yaml --tiller-host tiller-deploy.kube-system.svc.cluster.local | tee /logs/platform-integ-apps-apply_2019-12-12-16-48-52.log'
2019-12-12 16:49:13.002 107210 INFO sysinv.api.controllers.v1.host [-] controller-1 ihost check_lock
2019-12-12 16:49:13.003 107210 INFO sysinv.api.controllers.v1.host [-] controller-1 ihost check_lock_controller
2019-12-12 16:49:14.399 107210 INFO sysinv.api.controllers.v1.host [-] controller-1 action=lock ihost_val_prenotify: {'ihost_action': 'lock'} ihost_val: {'ihost_action': 'lock', 'task': u'Locking'}
2019-12-12 16:49:14.400 107210 INFO sysinv.api.controllers.v1.host [-] controller-1 host.update.ihost_val_prenotify {'ihost_action': 'lock'}
2019-12-12T16:49:14.654 controller-0 VIM_Infrastructure-Worker-0_Thread[107171] INFO kubernetes_client.py.68 Adding services=disabled:NoExecute taint to node controller-1
2019-12-12 16:49:24.841 106194 INFO sysinv.conductor.kube_app [-] platform-integ-apps app failed applying. Retrying.
2019-12-12 16:49:24.839 106194 ERROR sysinv.conductor.kube_app [-] Failed to apply application manifest /manifests/platform-integ-apps/1.0-10/platform-integ-apps-manifest.yaml. See /var/log/armada/platform-integ-apps-apply_2019-12-12-16-48-52.log for details.
2019-12-12T17:18:25.509 controller-0 VIM_Infrastructure-Worker-0_Thread[107171] INFO kubernetes_client.py.103 Removing services:NoExecute taint from node controller-1
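
If the same failure is hit again, the taint set by the lock can be confirmed directly on the node with standard kubectl (node name taken from the logs above):

kubectl describe node controller-1 | grep -i taints
# expected while controller-1 is locked: services=disabled:NoExecute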

Ghada Khalil (gkhalil) on 2019-12-13
Changed in starlingx:
assignee: nobody → Bob Church (rchurch)
Ghada Khalil (gkhalil) wrote :

A similar issue is also reported in https://bugs.launchpad.net/starlingx/+bug/1856078

Ghada Khalil (gkhalil) wrote :

As per Frank, the above LP reports a different issue

Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - intermittent issue. Workaround is to re-issue the application apply
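
In practice the workaround amounts to re-issuing the apply and watching its status, roughly:

system application-apply platform-integ-apps
system application-list
# repeat the apply if the status returns to apply-failed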

tags: added: stx.4.0
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: Bob Church (rchurch) → Stefan Dinescu (stefandinescu)
Yang Liu (yliu12) on 2020-01-16
tags: added: stx.retestneeded
Stefan Dinescu (stefandinescu) wrote :

It is possible that this issue has been addressed by the fix for https://bugs.launchpad.net/starlingx/+bug/1850189

The fix for that issue was to implement a retry mechanism for platform-integ-apps in case it fails to apply. Since this issue was also intermittent, it is very possible that the same fix catches this error as well.
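
Purely as an illustration of that retry idea (the real retry lives inside the sysinv conductor, not in a script), the behaviour is roughly equivalent to this hypothetical loop:

# hypothetical sketch only: re-issue the apply a few times if it ends in apply-failed
for attempt in 1 2 3; do
    system application-apply platform-integ-apps
    # wait until the apply is no longer in progress
    while system application-list | grep platform-integ-apps | grep -q applying; do
        sleep 30
    done
    # stop once the status column shows applied
    system application-list | grep platform-integ-apps | grep -q applied && break
done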

Changed in starlingx:
status: Triaged → Fix Released