Cert-manager app failed to apply after controller-1 of the system controller is upgraded

Bug #1883953 reported by Tee Ngo
This bug affects 1 person

Affects    Status     Importance  Assigned to     Milestone
StarlingX  Won't Fix  Medium      Dan Voiculeasa

Bug Description

Brief Description
-----------------
After controller-1 of the system controller is upgraded to 20.06, the cert-manager app is in the apply-failed state.

Severity
--------
Major

Steps to Reproduce
------------------
- import 20.06 load
- apply upgrade patch to enable upgrade to 20.06
- Execute the following commands to upgrade the system controller
    - system upgrade-start --force
         Note: the --force option allows upgrade while the system has non-service impacting alarms
    - system host-lock controller-1
    - system host-upgrade controller-1
         Note: to monitor the upgrade, run command "system upgrade-show"
    - system host-unlock controller-1
    - system host-swact controller-0
    - system host-lock controller-0
    - system host-upgrade controller-0
    - system host-unlock controller-0

Expected Behavior
------------------
All applied apps remain applied after the system controller is upgraded to 20.06.

Actual Behavior
----------------
Both the platform-integ-apps and cert-manager apps were in the apply-failed state after the upgrade. However, only platform-integ-apps could be reapplied successfully.

Armada apply logs:
~~~~~~~~~~~~~~~~~~
2020-06-12 00:05:01.158 16 ERROR armada.cli [-] Caught unexpected exception: grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Connect Failed"
        debug_error_string = "{"created":"@1591920300.173834933","description":"Failed to create subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":2721,"referenced_errors":[{"created":"@1591920300.173832262","description":"Pick Cancelled","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":241,"referenced_errors":[{"created":"@1591920300.173807252","description":"Connect Failed","file":"src/core/ext/filters/client_channel/subchannel.cc","file_line":689,"grpc_status":14,"referenced_errors":[{"created":"@1591920300.173749536","description":"Failed to connect to remote host: OS Error","errno":101,"file":"src/core/lib/iomgr/tcp_client_posix.cc","file_line":210,"os_error":"Network is unreachable","syscall":"getsockopt(SO_ERROR)","target_address":"ipv6:[fd04::e662]:44134"}]}]}]}"
>
2020-06-12 00:05:01.158 16 ERROR armada.cli Traceback (most recent call last):
2020-06-12 00:05:01.158 16 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/__init__.py", line 38, in safe_invoke
2020-06-12 00:05:01.158 16 ERROR armada.cli self.invoke()
2020-06-12 00:05:01.158 16 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/apply.py", line 213, in invoke
2020-06-12 00:05:01.158 16 ERROR armada.cli resp = self.handle(documents, tiller)
2020-06-12 00:05:01.158 16 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py", line 81, in func_wrapper
2020-06-12 00:05:01.158 16 ERROR armada.cli return future.result()
2020-06-12 00:05:01.158 16 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
2020-06-12 00:05:01.158 16 ERROR armada.cli return self.__get_result()
2020-06-12 00:05:01.158 16 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
2020-06-12 00:05:01.158 16 ERROR armada.cli raise self._exception
2020-06-12 00:05:01.158 16 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
2020-06-12 00:05:01.158 16 ERROR armada.cli result = self.fn(*self.args, **self.kwargs)
2020-06-12 00:05:01.158 16 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/apply.py", line 256, in handle
2020-06-12 00:05:01.158 16 ERROR armada.cli return armada.sync()
2020-06-12 00:05:01.158 16 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 189, in sync
2020-06-12 00:05:01.158 16 ERROR armada.cli known_releases = self.tiller.list_releases()
2020-06-12 00:05:01.158 16 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 252, in list_releases
2020-06-12 00:05:01.158 16 ERROR armada.cli releases = get_results()
2020-06-12 00:05:01.158 16 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 220, in get_results
2020-06-12 00:05:01.158 16 ERROR armada.cli for message in response:
2020-06-12 00:05:01.158 16 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 364, in __next__
2020-06-12 00:05:01.158 16 ERROR armada.cli return self._next()
2020-06-12 00:05:01.158 16 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 358, in _next
2020-06-12 00:05:01.158 16 ERROR armada.cli raise self
2020-06-12 00:05:01.158 16 ERROR armada.cli grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
2020-06-12 00:05:01.158 16 ERROR armada.cli status = StatusCode.UNAVAILABLE
2020-06-12 00:05:01.158 16 ERROR armada.cli details = "Connect Failed"
2020-06-12 00:05:01.158 16 ERROR armada.cli debug_error_string = "{"created":"@1591920300.173834933","description":"Failed to create subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":2721,"referenced_errors":[{"created":"@1591920300.173832262","description":"Pick Cancelled","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":241,"referenced_errors":[{"created":"@1591920300.173807252","description":"Connect Failed","file":"src/core/ext/filters/client_channel/subchannel.cc","file_line":689,"grpc_status":14,"referenced_errors":[{"created":"@1591920300.173749536","description":"Failed to connect to remote host: OS Error","errno":101,"file":"src/core/lib/iomgr/tcp_client_posix.cc","file_line":210,"os_error":"Network is unreachable","syscall":"getsockopt(SO_ERROR)","target_address":"ipv6:[fd04::e662]:44134"}]}]}]}"
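
The debug_error_string above is nested JSON, and the root cause sits in the innermost referenced_errors entry. The following is a minimal sketch for pulling it out; the function name innermost_error is made up for illustration, and the string is copied from the log above. Port 44134 is Tiller's default gRPC port, and errno 101 is ENETUNREACH on Linux, so the error suggests that Armada could not reach Tiller over IPv6 at that moment.

```python
import json

# debug_error_string copied verbatim from the 2020-06-12 Armada log above.
DEBUG_ERROR_STRING = (
    '{"created":"@1591920300.173834933","description":"Failed to create subchannel",'
    '"file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":2721,'
    '"referenced_errors":[{"created":"@1591920300.173832262","description":"Pick Cancelled",'
    '"file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc",'
    '"file_line":241,"referenced_errors":[{"created":"@1591920300.173807252",'
    '"description":"Connect Failed","file":"src/core/ext/filters/client_channel/subchannel.cc",'
    '"file_line":689,"grpc_status":14,"referenced_errors":[{"created":"@1591920300.173749536",'
    '"description":"Failed to connect to remote host: OS Error","errno":101,'
    '"file":"src/core/lib/iomgr/tcp_client_posix.cc","file_line":210,'
    '"os_error":"Network is unreachable","syscall":"getsockopt(SO_ERROR)",'
    '"target_address":"ipv6:[fd04::e662]:44134"}]}]}]}'
)


def innermost_error(debug_error_string):
    """Follow the first referenced_errors entry down to the deepest error."""
    error = json.loads(debug_error_string)
    while error.get("referenced_errors"):
        error = error["referenced_errors"][0]
    return error


root = innermost_error(DEBUG_ERROR_STRING)
print(root["description"])
print(root["os_error"], root["errno"], root["target_address"])
```

The sketch only follows the first entry at each level, which is enough here because every level of this particular error has exactly one referenced error.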

After several manual reapply attempts failed, the app was reapplied once more after removing the armada_service container (sudo docker rm armada_service). This time the apply went a bit further but still failed:

2020-06-15 15:54:59.588 16 ERROR armada.handlers.tiller [-] [chart=cert-manager]: Error while updating release cm-cert-manager: grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "timed out waiting for the condition"
        debug_error_string = "{"created":"@1592236499.587783833","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"timed out waiting for the condition","grpc_status":2}"
>
2020-06-15 15:54:59.588 16 ERROR armada.handlers.tiller Traceback (most recent call last):
2020-06-15 15:54:59.588 16 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 426, in update_release
2020-06-15 15:54:59.588 16 ERROR armada.handlers.tiller metadata=self.metadata)
2020-06-15 15:54:59.588 16 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 533, in __call__
2020-06-15 15:54:59.588 16 ERROR armada.handlers.tiller return _end_unary_response_blocking(state, call, False, None)
2020-06-15 15:54:59.588 16 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
2020-06-15 15:54:59.588 16 ERROR armada.handlers.tiller raise _Rendezvous(state, None, None, deadline)
2020-06-15 15:54:59.588 16 ERROR armada.handlers.tiller grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
2020-06-15 15:54:59.588 16 ERROR armada.handlers.tiller status = StatusCode.UNKNOWN
2020-06-15 15:54:59.588 16 ERROR armada.handlers.tiller details = "timed out waiting for the condition"
2020-06-15 15:54:59.588 16 ERROR armada.handlers.tiller debug_error_string = "{"created":"@1592236499.587783833","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"timed out waiting for the condition","grpc_status":2}"
2020-06-15 15:54:59.588 16 ERROR armada.handlers.tiller >
2020-06-15 15:54:59.588 16 ERROR armada.handlers.tiller
2020-06-15 15:54:59.589 16 DEBUG armada.handlers.tiller [-] [chart=cert-manager]: Helm getting release status for release=cm-cert-manager, version=0 get_release_status /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:539
2020-06-15 15:54:59.825 16 DEBUG armada.handlers.tiller [-] [chart=cert-manager]: GetReleaseStatus= name: "cm-cert-manager"
info {
  status {
    code: FAILED
  }
  first_deployed {
    seconds: 1591892571
    nanos: 383032032
  }
  last_deployed {
    seconds: 1592234698
    nanos: 398297163
  }
  Description: "Upgrade \"cm-cert-manager\" failed: timed out waiting for the condition"
}
namespace: "cert-manager"
 get_release_status /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:547
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada [-] Chart deploy [cert-manager] failed: armada.exceptions.tiller_exceptions.ReleaseException: Failed to Upgrade release: cm-cert-manager - Tiller Message: b'Upgrade "cm-cert-manager" failed: timed out waiting for the condition'
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada Traceback (most recent call last):
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 426, in update_release
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada metadata=self.metadata)
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 533, in __call__
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada return _end_unary_response_blocking(state, call, False, None)
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada raise _Rendezvous(state, None, None, deadline)
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada status = StatusCode.UNKNOWN
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada details = "timed out waiting for the condition"
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada debug_error_string = "{"created":"@1592236499.587783833","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"timed out waiting for the condition","grpc_status":2}"
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada >
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada During handling of the above exception, another exception occurred:
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada Traceback (most recent call last):
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 225, in handle_result
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada result = get_result()
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 236, in <lambda>
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada if (handle_result(chart, lambda: deploy_chart(chart))):
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 214, in deploy_chart
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada chart, cg_test_all_charts, prefix, known_releases)
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/chart_deploy.py", line 162, in execute
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada recreate_pods=recreate_pods)
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 431, in update_release
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada raise ex.ReleaseException(release, status, 'Upgrade')
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada armada.exceptions.tiller_exceptions.ReleaseException: Failed to Upgrade release: cm-cert-manager - Tiller Message: b'Upgrade "cm-cert-manager" failed: timed out waiting for the condition'
2020-06-15 15:54:59.826 16 ERROR armada.handlers.armada
2020-06-15 15:54:59.827 16 ERROR armada.handlers.armada [-] Chart deploy(s) failed: ['cert-manager']
2020-06-15 15:55:00.180 16 INFO armada.handlers.lock [-] Releasing lock
2020-06-15 15:55:00.186 16 ERROR armada.cli [-] Caught internal exception: armada.exceptions.armada_exceptions.ChartDeployException: Exception deploying charts: ['cert-manager']
2020-06-15 15:55:00.186 16 ERROR armada.cli Traceback (most recent call last):
2020-06-15 15:55:00.186 16 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/__init__.py", line 38, in safe_invoke
2020-06-15 15:55:00.186 16 ERROR armada.cli self.invoke()
2020-06-15 15:55:00.186 16 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/apply.py", line 213, in invoke
2020-06-15 15:55:00.186 16 ERROR armada.cli resp = self.handle(documents, tiller)
2020-06-15 15:55:00.186 16 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py", line 81, in func_wrapper
2020-06-15 15:55:00.186 16 ERROR armada.cli return future.result()
2020-06-15 15:55:00.186 16 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
2020-06-15 15:55:00.186 16 ERROR armada.cli return self.__get_result()
2020-06-15 15:55:00.186 16 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
2020-06-15 15:55:00.186 16 ERROR armada.cli raise self._exception
2020-06-15 15:55:00.186 16 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
2020-06-15 15:55:00.186 16 ERROR armada.cli result = self.fn(*self.args, **self.kwargs)
2020-06-15 15:55:00.186 16 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/apply.py", line 256, in handle
2020-06-15 15:55:00.186 16 ERROR armada.cli return armada.sync()
2020-06-15 15:55:00.186 16 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 252, in sync
2020-06-15 15:55:00.186 16 ERROR armada.cli raise armada_exceptions.ChartDeployException(failures)
2020-06-15 15:55:00.186 16 ERROR armada.cli armada.exceptions.armada_exceptions.ChartDeployException: Exception deploying charts: ['cert-manager']
2020-06-15 15:55:00.186 16 ERROR armada.cli
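
The epoch values in the GetReleaseStatus output can be lined up with the log timestamps. The sketch below assumes that last_deployed marks the start of the failed upgrade attempt and that the "created" field of the gRPC error marks its end; both values are copied from the log above, and the helper name to_utc is made up for illustration.

```python
from datetime import datetime, timezone

# Values copied from the 2020-06-15 log above.
LAST_DEPLOYED_SECONDS = 1592234698  # last_deployed { seconds: ... }
ERROR_CREATED_SECONDS = 1592236499  # integer part of "created":"@1592236499.587783833"


def to_utc(seconds):
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


started = to_utc(LAST_DEPLOYED_SECONDS)
failed = to_utc(ERROR_CREATED_SECONDS)
elapsed = (failed - started).total_seconds()

print("upgrade started:", started.isoformat())
print("upgrade failed: ", failed.isoformat())
print("elapsed seconds:", elapsed)
```

Under those assumptions the attempt ran for roughly 30 minutes before Tiller reported "timed out waiting for the condition". That is consistent with a wait timeout of about 1800 seconds, though the log does not state the configured value.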

Reproducibility
---------------
Seen once

System Configuration
--------------------
IPv6 distributed cloud

Branch/Pull Time/Commit
-----------------------
Jun 6th load

Last Pass
---------
N/A. This is the first time a distributed cloud upgrade has been performed.

Timestamp/Logs
--------------

Test Activity
-------------
Developer Testing

Anujeyan Manokeran (anujeyan) wrote :

This was seen in load 2020-06-15_20-00-00 in ip-1-4.
 fm alarm-list
+----------+------------------------------------------------------------------------------------------+-------------------------+----------+------------------+
| Alarm ID | Reason Text                                                                              | Entity ID               | Severity | Time Stamp       |
+----------+------------------------------------------------------------------------------------------+-------------------------+----------+------------------+
| 400.003  | Evaluation license key will expire on 30-dec-2020; there are 196 days remaining in this  | host=controller-0       | minor    | 2020-06-17T14:00 |
|          | evaluation                                                                               |                         |          | :30.211280       |
|          |                                                                                          |                         |          |                  |
| 400.003  | Evaluation license key will expire on 30-dec-2020; there are 196 days remaining in this  | host=controller-1       | minor    | 2020-06-17T14:00 |
|          | evaluation                                                                               |                         |          | :25.009131       |
|          |                                                                                          |                         |          |                  |
| 750.002  | Application Apply Failure                                                                | k8s_application=cert-   | major    | 2020-06-16T20:08 |
|          |                                                                                          | manager                 |          | :15.049402       |
|          |                                                                                          |                         |          |                  |
| 500.101  | Developer patch certificate is enabled                                                   | host=controller         | critical | 2020-06-16T19:37 |
|          |                                                                                          |                         |          | :54.333533       |
|          |                                                                                          |                         |          |                  |
| 900.005  | System Upgrade in progress.                                                              | host=controller         | minor    | 2020-06-16T19:37 |
|          |                                                                                          |                         |          | :04.041417       |
|          |                                                                                          |                         |          |                  |
+----------+------------------------------------------------------------------------------------------+-------------------------+----------+------------------+
[sysadmin@controller-1 ~(keystone_admin)]$ timed out waiting for input: auto-logout
Connect...

Ghada Khalil (gkhalil)
tags: added: stx.containers
tags: added: stx.5.0
Ghada Khalil (gkhalil)
Changed in starlingx:
status: New → Triaged
assignee: nobody → Dan Voiculeasa (dvoicule)
importance: Undecided → Medium
Ghada Khalil (gkhalil) wrote :

There have been multiple occurrences of this issue where application apply fails with an error: Failed to create subchannel. It is not specific to the cert-mgr and appears to be a tiller/armada communication issue. This issue is intermittent and is not easily reproducible. However, it has been seen multiple times so far.
(Note: A few LPs were marked as Invalid since they were not reproducible.) However, the multiple reports point to an issue that needs to be investigated further.

https://bugs.launchpad.net/starlingx/+bug/1883555
https://bugs.launchpad.net/starlingx/+bug/1882485
https://bugs.launchpad.net/starlingx/+bug/1879990
https://bugs.launchpad.net/starlingx/+bug/1854224
https://bugs.launchpad.net/starlingx/+bug/1843803

Frank Miller (sensfan22)
Changed in starlingx:
status: Triaged → Won't Fix