cert-manager and platform-integ-apps alarm 750.006 after controller-0 unlock

Bug #1923587 reported by Andrei Grosu
This bug affects 2 people
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Andrei Grosu
Milestone: (none listed)

Bug Description

Brief Description
-----------------

Applying applications intermittently fails because the postgres db cannot be reached.

Severity
--------
Minor

Expected Behavior
------------------

Apply should succeed, and the logic should check and wait for the database service to be up, running, and accepting connections.

Reproducibility
---------------
Intermittent, very low reproducibility.

System Configuration
--------------------

2 controllers, 2 storage nodes, 1 worker node.

Logs
----

Armada apply for cert-manager at 18:19:16 fails

sysinv 2021-03-20 18:19:16.729 728680 INFO sysinv.conductor.kube_app [-] Armada apply command: 'armada apply --debug --enable-chart-cleanup /tmp/manifests/cert-manager/1.0-13/cert-manager-certmanager-manifest.yaml --values /tmp/overrides/cert-manager/1.0-13/cert-manager-cert-manager.yaml --values /tmp/overrides/cert-manager/1.0-13/cert-manager-psp-rolebinding.yaml '
sysinv 2021-03-20 18:19:16.881 728680 INFO sysinv.conductor.kube_app [-] Starting progress monitoring thread for app cert-manager
sysinv 2021-03-20 18:19:18.679 728680 ERROR sysinv.conductor.kube_app [-] Failed to apply application manifest /manifests/cert-manager/1.0-13/cert-manager-certmanager-manifest.yaml with exit code 1. See /var/log/armada/cert-manager-apply_2021-03-20-18-19-15.log for details.

Armada logs

get_results /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:215
2021-03-20 18:19:18.581 69 INFO armada.handlers.lock [-] Releasing lock
2021-03-20 18:19:18.587 69 ERROR armada.cli [-] Caught unexpected exception: grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
status = StatusCode.UNKNOWN
details = "write tcp [abcd:206::a4ce:fec1:5423:e306]:37896->[abcd:204::1]:5432: write: connection timed out"
debug_error_string = "{"created":"@1616264357.608286155","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"write tcp [abcd:206::a4ce:fec1:5423:e306]:37896->[abcd:204::1]:5432: write: connection timed out","grpc_status":2}"
>
2021-03-20 18:19:18.587 69 ERROR armada.cli Traceback (most recent call last):
2021-03-20 18:19:18.587 69 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/__init__.py", line 38, in safe_invoke
2021-03-20 18:19:18.587 69 ERROR armada.cli self.invoke()
2021-03-20 18:19:18.587 69 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/apply.py", line 213, in invoke
2021-03-20 18:19:18.587 69 ERROR armada.cli resp = self.handle(documents, tiller)
2021-03-20 18:19:18.587 69 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py", line 81, in func_wrapper
2021-03-20 18:19:18.587 69 ERROR armada.cli return future.result()
2021-03-20 18:19:18.587 69 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
2021-03-20 18:19:18.587 69 ERROR armada.cli return self.__get_result()
2021-03-20 18:19:18.587 69 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
2021-03-20 18:19:18.587 69 ERROR armada.cli raise self._exception
2021-03-20 18:19:18.587 69 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
2021-03-20 18:19:18.587 69 ERROR armada.cli result = self.fn(*self.args, **self.kwargs)
2021-03-20 18:19:18.587 69 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/apply.py", line 256, in handle
2021-03-20 18:19:18.587 69 ERROR armada.cli return armada.sync()
2021-03-20 18:19:18.587 69 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 189, in sync
2021-03-20 18:19:18.587 69 ERROR armada.cli known_releases = self.tiller.list_releases()
2021-03-20 18:19:18.587 69 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 252, in list_releases
2021-03-20 18:19:18.587 69 ERROR armada.cli releases = get_results()
2021-03-20 18:19:18.587 69 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 220, in get_results
2021-03-20 18:19:18.587 69 ERROR armada.cli for message in response:
2021-03-20 18:19:18.587 69 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 364, in __next__
2021-03-20 18:19:18.587 69 ERROR armada.cli return self._next()
2021-03-20 18:19:18.587 69 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 358, in _next
2021-03-20 18:19:18.587 69 ERROR armada.cli raise self
2021-03-20 18:19:18.587 69 ERROR armada.cli grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
2021-03-20 18:19:18.587 69 ERROR armada.cli status = StatusCode.UNKNOWN
2021-03-20 18:19:18.587 69 ERROR armada.cli details = "write tcp [abcd:206::a4ce:fec1:5423:e306]:37896->[abcd:204::1]:5432: write: connection timed out"
2021-03-20 18:19:18.587 69 ERROR armada.cli debug_error_string = "{"created":"@1616264357.608286155","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"write tcp [abcd:206::a4ce:fec1:5423:e306]:37896->[abcd:204::1]:5432: write: connection timed out","grpc_status":2}"
2021-03-20 18:19:18.587 69 ERROR armada.cli >
2021-03-20 18:19:18.587 69 ERROR armada.cli
command terminated with exit code 1

Comments
--------

It seems that the postgres db on the active controller takes too long to start accepting requests.
In the logs, subsequent apply operations succeed, so the db eventually accepts connections.
The existing code only checks that the pod is up and running, which does not guarantee that the postgres service inside the pod is accepting connections.
The proposed fix is to add an explicit check for db connectivity.
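
A minimal sketch of the kind of explicit connectivity check proposed above; it is not the merged fix. It polls a TCP endpoint until a connection succeeds or a deadline passes. The function name `wait_for_tcp` and its parameters are hypothetical. A successful TCP connect only shows that something is listening on the port, so a real check might additionally open a database session.

```python
import socket
import time


def wait_for_tcp(host, port, timeout=30.0, interval=1.0):
    """Return True once host:port accepts a TCP connection, False on timeout.

    Hypothetical helper for illustration; not taken from the sysinv code.
    """
    deadline = time.monotonic() + timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        try:
            # create_connection handles IPv4 and IPv6 addresses alike.
            with socket.create_connection(
                    (host, port), timeout=min(interval, remaining)):
                return True
        except OSError:
            # Refused, unreachable or timed out: wait briefly and retry.
            time.sleep(min(interval, max(0.0, deadline - time.monotonic())))
```

With the address from the log excerpt, a caller could run `wait_for_tcp("abcd:204::1", 5432, timeout=60)` and start the apply only if it returns True.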

Ghada Khalil (gkhalil) wrote :

lower priority as issue is intermittent, but would be nice to fix

Changed in starlingx:
assignee: nobody → Andrei Grosu (agrosu1)
importance: Undecided → Low
status: New → Triaged
tags: added: stx.containers
Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/786021
Committed: https://opendev.org/starlingx/config/commit/5edd3bdbe588e2c2e7a58cb839f030305613c30f
Submitter: "Zuul (22348)"
Branch: master

commit 5edd3bdbe588e2c2e7a58cb839f030305613c30f
Author: Andrei Grosu <email address hidden>
Date: Tue Apr 13 08:52:40 2021 +0000

    Check for connectivity to the tiller postgres backend.

    The existing code checks that the pod(s) are 'Running' but that
    might not be enough as the service inside the pod (postgres)
    might not be able to accept connections.

    Closes-Bug: 1923587
    Signed-off-by: Andrei Grosu <email address hidden>
    Change-Id: Ide49e4a38b805d5fc41d9f06d94393c69c6ed9d2

Changed in starlingx:
status: In Progress → Fix Released
Frank Miller (sensfan22) wrote :

Re-opening this LP as the original commit needed to be reverted:
https://review.opendev.org/c/starlingx/config/+/789011

Some re-work is required before a new commit can be proposed and this LP moved back to Fix Released.

Changed in starlingx:
status: Fix Released → Confirmed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/789828

Changed in starlingx:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/790011

OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (master)

Change abandoned by "Andrei Grosu <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/config/+/790011

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/789828
Committed: https://opendev.org/starlingx/config/commit/12fff41d7803c7cea2b34e356ac65d361ca57789
Submitter: "Zuul (22348)"
Branch: master

commit 12fff41d7803c7cea2b34e356ac65d361ca57789
Author: Andrei Grosu <email address hidden>
Date: Wed May 5 13:03:50 2021 +0000

    Handle empty 'helm list' result when there is nothing deployed

    The existing code assumes that there are always applications deployed
    and the result is never an empty list.
    The previous implementation ignored the return code when the subprocess
    was killed by the timeout handler.
    Split the method in two submethods for helm v2 and v3 implementations.

    Closes-Bug: 1923587
    Signed-off-by: Andrei Grosu <email address hidden>
    Signed-off-by: Angie Wang <email address hidden>
    Change-Id: Ib547bdb20c39e35c1538e3abb90108f7e3cad228
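
The commit message above names two failure modes: an empty result when nothing is deployed, and a subprocess killed by a timeout whose return code was ignored. The sketch below illustrates handling both in a generic way. It is not the sysinv implementation; `run_list_command` and `ListCommandError` are hypothetical names, and the command is supplied by the caller, so no particular helm flags or output format are assumed.

```python
import subprocess


class ListCommandError(Exception):
    """Raised when the listing command times out or exits non-zero."""


def run_list_command(cmd, timeout=10.0):
    """Run cmd and return its non-empty output lines as a list.

    An empty list is a valid result (nothing deployed); a timeout or a
    non-zero exit status is reported as an error instead of being ignored.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child before raising TimeoutExpired.
        raise ListCommandError("timed out after %ss" % timeout) from exc
    if proc.returncode != 0:
        raise ListCommandError(
            "exit code %d: %s" % (proc.returncode, proc.stderr.strip()))
    return [line.strip() for line in proc.stdout.splitlines() if line.strip()]
```

Treating "no output" and "command failed" as distinct outcomes lets the caller tell an empty deployment apart from an unreachable backend.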

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793460

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793696

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794611

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794906

OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794611

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/config/+/794906
Committed: https://opendev.org/starlingx/config/commit/75758b37a5a23c8811355b67e2a430a1713cd85b
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 9e420d9513e5fafb1df4d29567bc299a9e04d58d
Author: Bin Qian <email address hidden>
Date: Mon May 31 14:45:52 2021 -0400

    Add more logging to run docker login

    Add error log for running docker login. The new log could
    help identify docker login failure.

    Closes-Bug: 1930310
    Change-Id: I8a709fb6665de8301fbe3022563499a92b2a0211
    Signed-off-by: Bin Qian <email address hidden>

commit 31c77439d2cea590dfcca13cfa646522665f8686
Author: albailey <email address hidden>
Date: Fri May 28 13:42:42 2021 -0500

    Fix controller-0 downgrade failing to kill ceph

    kill_ceph_storage_monitor tried to manipulate a pmon
    file that does not exist in an AIO-DX environment.

    We no longer invoke kill_ceph_storage_monitor in an
    AIO SX or DX env.

    This allows: "system host-downgrade controller-0"
    to proceed in an AIO-DX environment where that second
    controller (controller-0) was upgraded.

    Partial-Bug: 1929884
    Signed-off-by: albailey <email address hidden>
    Change-Id: I633853f75317736084feae96b5b849c601204c13

commit 0dc99eee608336fe01b58821ea404286371f1408
Author: albailey <email address hidden>
Date: Fri May 28 11:05:43 2021 -0500

    Fix file permissions failure during duplex upgrade abort

    When issuing a downgrade for controller-0 in a duplex upgrade
    abort and rollback scenario, the downgrade command was failing
    because the sysinv API does not have root permissions to set
    a file flag.
    The fix is to use RPC so the conductor can create the flag
    and allow the downgrade for controller-0 to get further.

    Partial-Bug: 1929884
    Signed-off-by: albailey <email address hidden>
    Change-Id: I913bcad73309fe887a12cbb016a518da93327947

commit 7ef3724dad173754e40b45538b1cc726a458cc1c
Author: Chen, Haochuan Z <email address hidden>
Date: Tue May 25 16:16:29 2021 +0800

    Fix bug rook-ceph provision with multi osd on one host

    Test case:
    1, deploy simplex system
    2, apply rook-ceph with below override value
    value.yaml
    cluster:
      storage:
        nodes:
        - name: controller-0
          devices:
          - name: sdb
          - name: sdc
    3, reboot

    Without this fix, only osd pod could launch successfully after boot
    as vg start with ceph could not correctly add in sysinv-database

    Closes-bug: 1929511

    Change-Id: Ia5be599cd168d13d2aab7b5e5890376c3c8a0019
    Signed-off-by: Chen, Haochuan Z <email address hidden>

commit 23505ba77d76114cf8a0bf833f9a5bcd05bc1dd1
Author: Angie Wang <email address hidden>
Date: Tue May 25 18:49:21 2021 -0400

    Fix issue in partition data migration script

    The created partition dictionary partition_map is not
    an ordered dict so we need to sort it by its key -
    device node when iterating it to adjust the device
    nodes/paths for user created extra partitions to ensure
    the number of device node...

tags: added: in-f-centos8
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793696

OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793460
