Intermittent application (platform-integ-apps) apply failure when locking and unlocking the standby controller

Bug #1856230 reported by Anujeyan Manokeran
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Stefan Dinescu
Milestone: (none)

Bug Description

Brief Description
-----------------
System application apply was in a failed state after the standby controller was locked and unlocked. This was observed on a storage system.

system application-list
+---------------------+---------+-------------------------------+---------------+--------------+------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+---------+-------------------------------+---------------+--------------+------------------------------------------+
| platform-integ-apps | 1.0-10 | platform-integration-manifest | manifest.yaml | apply-failed | operation aborted, check logs for detail |
+---------------------+---------+-------------------------------+---------------+--------------+------------------------------------------+

2019-12-12 16:48:54.400 13 INFO armada.handlers.armada [-] Downloading tarball from: http://controller:8080/helm_charts/stx-platform/helm-toolkit-0.1.0.tgz
2019-12-12 16:48:54.400 13 WARNING armada.handlers.armada [-] Disabling server validation certs to extract charts
2019-12-12 16:48:54.422 13 INFO armada.handlers.armada [-] Downloading tarball from: http://controller:8080/helm_charts/stx-platform/ceph-pools-audit-0.1.0.tgz
2019-12-12 16:48:54.422 13 WARNING armada.handlers.armada [-] Disabling server validation certs to extract charts
2019-12-12 16:48:54.446 13 DEBUG armada.handlers.tiller [-] Tiller ListReleases() with timeout=300, request=limit: 32
status_codes: UNKNOWN
status_codes: DEPLOYED
status_codes: DELETED
status_codes: DELETING
status_codes: FAILED
status_codes: PENDING_INSTALL
status_codes: PENDING_UPGRADE
status_codes: PENDING_ROLLBACK
 get_results /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:215
2019-12-12 16:49:15.376 13 INFO armada.handlers.lock [-] Releasing lock
2019-12-12 16:49:24.727 13 ERROR armada.cli [-] Caught unexpected exception: grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Socket closed"
        debug_error_string = "{"created":"@1576169354.703282535","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"Socket closed","grpc_status":14}"
>
2019-12-12 16:49:24.727 13 ERROR armada.cli Traceback (most recent call last):
2019-12-12 16:49:24.727 13 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/__init__.py", line 38, in safe_invoke
2019-12-12 16:49:24.727 13 ERROR armada.cli self.invoke()
2019-12-12 16:49:24.727 13 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/apply.py", line 213, in invoke
2019-12-12 16:49:24.727 13 ERROR armada.cli resp = self.handle(documents, tiller)
2019-12-12 16:49:24.727 13 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py", line 81, in func_wrapper
2019-12-12 16:49:24.727 13 ERROR armada.cli return future.result()
2019-12-12 16:49:24.727 13 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
2019-12-12 16:49:24.727 13 ERROR armada.cli return self.__get_result()
2019-12-12 16:49:24.727 13 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
2019-12-12 16:49:24.727 13 ERROR armada.cli raise self._exception
2019-12-12 16:49:24.727 13 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
2019-12-12 16:49:24.727 13 ERROR armada.cli result = self.fn(*self.args, **self.kwargs)
2019-12-12 16:49:24.727 13 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/apply.py", line 256, in handle
2019-12-12 16:49:24.727 13 ERROR armada.cli return armada.sync()
2019-12-12 16:49:24.727 13 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 189, in sync
2019-12-12 16:49:24.727 13 ERROR armada.cli known_releases = self.tiller.list_releases()
2019-12-12 16:49:24.727 13 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 252, in list_releases
2019-12-12 16:49:24.727 13 ERROR armada.cli releases = get_results()
2019-12-12 16:49:24.727 13 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 220, in get_results
2019-12-12 16:49:24.727 13 ERROR armada.cli for message in response:
2019-12-12 16:49:24.727 13 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 364, in __next__
2019-12-12 16:49:24.727 13 ERROR armada.cli return self._next()
2019-12-12 16:49:24.727 13 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 358, in _next
2019-12-12 16:49:24.727 13 ERROR armada.cli raise self
2019-12-12 16:49:24.727 13 ERROR armada.cli grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
2019-12-12 16:49:24.727 13 ERROR armada.cli status = StatusCode.UNAVAILABLE
2019-12-12 16:49:24.727 13 ERROR armada.cli details = "Socket closed"
2019-12-12 16:49:24.727 13 ERROR armada.cli debug_error_string = "{"created":"@1576169354.703282535","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"Socket closed","grpc_status":14}"
2019-12-12 16:49:24.727 13 ERROR armada.cli >
2019-12-12 16:49:24.727 13 ERROR armada.cli
"/var/log/armada/platform-integ-apps-apply_2019-12-12-16-48-52.log" [readonly] 72L, 8939C

Severity
--------
Major

Steps to Reproduce
------------------

1. Lock and unlock the standby controller (controller-1)
2. Verify the application apply status using system application-list

Expected Behavior
------------------
Application apply succeeds every time the standby controller is locked and unlocked.

Actual Behavior
----------------
Application apply failed.

Reproducibility
---------------
Intermittent issue. Seen once out of two lock/unlock attempts on the standby controller.

System Configuration
--------------------
Storage system

Branch/Pull Time/Commit
-----------------------
2019-12-11_10-44-34

Last Pass
---------
Intermittent issue

Timestamp/Logs
--------------
2019-11-21T18:52:54.000

Test Activity
-------------
Regression

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Jeyan, did a manual re-apply work?

tags: added: stx.containers
Revision history for this message
Bob Church (rchurch) wrote :

It looks like we are allowing a lock to occur while an application apply is in progress. The lock sets the NoExecute taint on controller-1, which prevents the required application pods from running on controller-1.

2019-12-12 16:48:51.761 106194 INFO sysinv.conductor.kube_app [-] Application platform-integ-apps (1.0-10) apply started.
2019-12-12 16:48:52.972 106194 INFO sysinv.conductor.kube_app [-] Armada apply command = /bin/bash -c 'set -o pipefail; armada apply --enable-chart-cleanup --debug /manifests/platform-integ-apps/1.0-10/platform-integ-apps-manifest.yaml --values /overrides/platform-integ-apps/1.0-10/kube-system-rbd-provisioner.yaml --values /overrides/platform-integ-apps/1.0-10/kube-system-ceph-pools-audit.yaml --values /overrides/platform-integ-apps/1.0-10/helm-toolkit-helm-toolkit.yaml --tiller-host tiller-deploy.kube-system.svc.cluster.local | tee /logs/platform-integ-apps-apply_2019-12-12-16-48-52.log'
2019-12-12 16:49:13.002 107210 INFO sysinv.api.controllers.v1.host [-] controller-1 ihost check_lock
2019-12-12 16:49:13.003 107210 INFO sysinv.api.controllers.v1.host [-] controller-1 ihost check_lock_controller
2019-12-12 16:49:14.399 107210 INFO sysinv.api.controllers.v1.host [-] controller-1 action=lock ihost_val_prenotify: {'ihost_action': 'lock'} ihost_val: {'ihost_action': 'lock', 'task': u'Locking'}
2019-12-12 16:49:14.400 107210 INFO sysinv.api.controllers.v1.host [-] controller-1 host.update.ihost_val_prenotify {'ihost_action': 'lock'}
2019-12-12T16:49:14.654 controller-0 VIM_Infrastructure-Worker-0_Thread[107171] INFO kubernetes_client.py.68 Adding services=disabled:NoExecute taint to node controller-1
2019-12-12 16:49:24.841 106194 INFO sysinv.conductor.kube_app [-] platform-integ-apps app failed applying. Retrying.
2019-12-12 16:49:24.839 106194 ERROR sysinv.conductor.kube_app [-] Failed to apply application manifest /manifests/platform-integ-apps/1.0-10/platform-integ-apps-manifest.yaml. See /var/log/armada/platform-integ-apps-apply_2019-12-12-16-48-52.log for details.
2019-12-12T17:18:25.509 controller-0 VIM_Infrastructure-Worker-0_Thread[107171] INFO kubernetes_client.py.103 Removing services:NoExecute taint from node controller-1
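
For reference, the VIM sets that taint through the Kubernetes API. The following is a minimal sketch of adding the same services=disabled:NoExecute taint with the Kubernetes Python client; it is an illustration, not the actual kubernetes_client.py implementation.

from kubernetes import client, config

def add_noexecute_taint(node_name):
    # Append a services=disabled:NoExecute taint to the node, mirroring
    # the "Adding services=disabled:NoExecute taint" log line above.
    # Illustrative sketch only.
    config.load_kube_config()
    v1 = client.CoreV1Api()
    node = v1.read_node(node_name)
    taints = node.spec.taints or []
    taints.append(client.V1Taint(key="services", value="disabled",
                                 effect="NoExecute"))
    # Patching spec.taints replaces the list, so send back the merged list.
    v1.patch_node(node_name, {"spec": {"taints": taints}})

add_noexecute_taint("controller-1")

Once a NoExecute taint is in place, pods without a matching toleration are evicted from the node, which is how an in-flight apply can lose its connection.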

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Bob Church (rchurch)
Revision history for this message
Ghada Khalil (gkhalil) wrote :

A similar issue is also reported in https://bugs.launchpad.net/starlingx/+bug/1856078

Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per Frank, the above LP reports a different issue.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - intermittent issue. The workaround is to re-issue the application apply.

tags: added: stx.4.0
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: Bob Church (rchurch) → Stefan Dinescu (stefandinescu)
Yang Liu (yliu12)
tags: added: stx.retestneeded
Revision history for this message
Stefan Dinescu (stefandinescu) wrote :

It is possible that the issue has been addressed by the fix for https://bugs.launchpad.net/starlingx/+bug/1850189

The fix for that issue was to implement a retry mechanism for platform-integ-apps in case it fails to apply. Since this issue was also intermittent, it is very possible for that fix to catch this error as well.
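
For illustration, a retry loop along those lines might look like the sketch below; the names and retry counts are hypothetical, not the actual sysinv.conductor.kube_app code.

import time

APPLY_MAX_RETRIES = 5   # hypothetical value; the real fix defines its own
APPLY_RETRY_DELAY = 30  # seconds, also hypothetical

def apply_with_retry(app_name, apply_manifest):
    # Re-run a failed application apply, mirroring the
    # "platform-integ-apps app failed applying. Retrying." log line.
    for attempt in range(1, APPLY_MAX_RETRIES + 1):
        if apply_manifest():
            return True
        print("%s app failed applying. Retrying (%d/%d)."
              % (app_name, attempt, APPLY_MAX_RETRIES))
        time.sleep(APPLY_RETRY_DELAY)
    return False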

Changed in starlingx:
status: Triaged → Fix Released
Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

This issue is not seen anymore as of build 2020-01-28_18-49-15.

tags: removed: stx.retestneeded