cert-manager and platform-integ-apps alarm 750.006 after controller-0 unlock
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
StarlingX | Fix Released | Low | Andrei Grosu | |
Bug Description
Brief Description
-----------------
Applying applications intermittently fails because the postgres db cannot be reached.
Severity
--------
Minor
Expected Behavior
------------------
Apply should succeed, and the logic should check/wait for the database service to be up, running, and accepting connections.
Reproducibility
---------------
Intermittent, very low reproducibility.
System Configuration
-------
2 controllers, 2 storage nodes, 1 worker node.
Logs
----
Armada apply for cert-manager at 18:19:16 fails
sysinv 2021-03-20 18:19:16.729 728680 INFO sysinv.
sysinv 2021-03-20 18:19:16.881 728680 INFO sysinv.
sysinv 2021-03-20 18:19:18.679 728680 ERROR sysinv.
Armada logs
get_results /usr/local/
2021-03-20 18:19:18.581 69 INFO armada.
2021-03-20 18:19:18.587 69 ERROR armada.cli [-] Caught unexpected exception: grpc._channel.
status = StatusCode.UNKNOWN
details = "write tcp [abcd:206:
debug_error_string = "{"created"
>
2021-03-20 18:19:18.587 69 ERROR armada.cli Traceback (most recent call last):
2021-03-20 18:19:18.587 69 ERROR armada.cli File "/usr/local/
2021-03-20 18:19:18.587 69 ERROR armada.cli self.invoke()
2021-03-20 18:19:18.587 69 ERROR armada.cli File "/usr/local/
2021-03-20 18:19:18.587 69 ERROR armada.cli resp = self.handle(
2021-03-20 18:19:18.587 69 ERROR armada.cli File "/usr/local/
2021-03-20 18:19:18.587 69 ERROR armada.cli return future.result()
2021-03-20 18:19:18.587 69 ERROR armada.cli File "/usr/lib/
2021-03-20 18:19:18.587 69 ERROR armada.cli return self.__get_result()
2021-03-20 18:19:18.587 69 ERROR armada.cli File "/usr/lib/
2021-03-20 18:19:18.587 69 ERROR armada.cli raise self._exception
2021-03-20 18:19:18.587 69 ERROR armada.cli File "/usr/lib/
2021-03-20 18:19:18.587 69 ERROR armada.cli result = self.fn(*self.args, **self.kwargs)
2021-03-20 18:19:18.587 69 ERROR armada.cli File "/usr/local/
2021-03-20 18:19:18.587 69 ERROR armada.cli return armada.sync()
2021-03-20 18:19:18.587 69 ERROR armada.cli File "/usr/local/
2021-03-20 18:19:18.587 69 ERROR armada.cli known_releases = self.tiller.
2021-03-20 18:19:18.587 69 ERROR armada.cli File "/usr/local/
2021-03-20 18:19:18.587 69 ERROR armada.cli releases = get_results()
2021-03-20 18:19:18.587 69 ERROR armada.cli File "/usr/local/
2021-03-20 18:19:18.587 69 ERROR armada.cli for message in response:
2021-03-20 18:19:18.587 69 ERROR armada.cli File "/usr/local/
2021-03-20 18:19:18.587 69 ERROR armada.cli return self._next()
2021-03-20 18:19:18.587 69 ERROR armada.cli File "/usr/local/
2021-03-20 18:19:18.587 69 ERROR armada.cli raise self
2021-03-20 18:19:18.587 69 ERROR armada.cli grpc._channel.
2021-03-20 18:19:18.587 69 ERROR armada.cli status = StatusCode.UNKNOWN
2021-03-20 18:19:18.587 69 ERROR armada.cli details = "write tcp [abcd:206:
2021-03-20 18:19:18.587 69 ERROR armada.cli debug_error_string = "{"created"
2021-03-20 18:19:18.587 69 ERROR armada.cli >
command terminated with exit code 1
Comments
--------
It seems that the postgres db on the active controller takes too long to start accepting requests.
In the logs, subsequent apply operations succeed, so the db eventually accepts connections.
The existing code simply checks that the pod is up and running, which might not mean that the postgres service in the pod is accepting connections.
The proposed fix is to add an extra explicit check for db connectivity.
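The idea behind the proposed check can be sketched as below. This is only an illustration, not the actual sysinv change: the function names `tcp_port_accepts_connections` and `wait_for_db`, the retry count, and the delay are all assumptions made for the example. The sketch probes the database endpoint directly instead of relying on pod status, and retries for a bounded time before giving up.

```python
import socket
import time


def tcp_port_accepts_connections(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be opened.

    Hypothetical helper: an open TCP port is only a proxy for
    "postgres accepts connections"; a real check would likely use a
    database-level probe instead.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_db(check, retries=10, delay=3.0, sleep=time.sleep):
    """Call check() until it returns True or the retries run out.

    Returns True as soon as one check succeeds, False otherwise.
    The retry count and delay are illustrative values, not tuned ones.
    """
    for attempt in range(1, retries + 1):
        if check():
            return True
        if attempt < retries:
            sleep(delay)  # back off before probing again
    return False
```

A caller would gate the application apply on the result, for example `wait_for_db(lambda: tcp_port_accepts_connections(db_host, 5432))`, where `db_host` stands for whatever address the database service is exposed on (5432 is the default PostgreSQL port). A stronger probe, such as running the PostgreSQL `pg_isready` utility or opening a real client connection, could be passed as `check` without changing the retry loop.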
Changed in starlingx:
status: Triaged → In Progress
Lower priority as the issue is intermittent, but it would be nice to fix.