Stx-openstack apply-fail after swact standby controller, lock, unlock standby controller

Bug #1917308 reported by Alexandru Dimofte
This bug affects 3 people
Affects: StarlingX
Status: Fix Released
Importance: Critical
Assigned to: Gustavo Santos

Bug Description

Brief Description
-----------------
The stx-openstack application goes to apply-failed after a swact to the standby controller followed by a lock and unlock of the standby controller. This is visible on Standard and Standard-EXT configurations on baremetal.

Severity
--------
<Critical: System/Feature is not usable due to the defect>

Steps to Reproduce
------------------
Swact standby controller
Lock standby controller
Unlock standby controller
check: system application-list | grep openstack
normally it should be applied, but it fails

Expected Behavior
------------------
stx-openstack should apply fine, without any error

Actual Behavior
----------------
stx-openstack apply fails

Reproducibility
---------------
reproduced 3 times in a row.

System Configuration
--------------------
Multi-node system, Dedicated storage on baremetal

Branch/Pull Time/Commit
-----------------------
master

Last Pass
---------
20210226T024233Z

Timestamp/Logs
--------------
will be attached

Test Activity
--------------
Sanity

Workaround
----------
-

Alexandru Dimofte (adimofte) wrote :
Dan Voiculeasa (dvoicule) wrote :

Did a short investigation since https://review.opendev.org/c/starlingx/config/+/773451 landed.

There is a small error observed in the logs, introduced by that commit, but it is not the cause of the issue observed here. This will be the fix for that error:
diff --git a/sysinv/sysinv/sysinv/sysinv/conductor/manager.py b/sysinv/sysinv/sysinv/sysinv/conductor/manager.py
index b5189f65..6fb2616e 100644
--- a/sysinv/sysinv/sysinv/sysinv/conductor/manager.py
+++ b/sysinv/sysinv/sysinv/sysinv/conductor/manager.py
@@ -11908,8 +11908,8 @@ class ConductorManager(service.PeriodicService):
                 LOG.exception("Failed to regenerate the overrides for app %s. %s" %
                               (app.name, e))
         else:
-            LOG.info("{} app active:{} status:{} does not warrant re-apply",
-                     app.name, app.active, app.status)
+            LOG.info("{} app active:{} status:{} does not warrant re-apply"
+                     "".format(app.name, app.active, app.status))

     def app_lifecycle_actions(self, context, rpc_app, hook_info):
         """Perform any lifecycle actions for the operation and timing supplied.
--
2.30.0
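
For context on why the original call produces an error: Python's standard logging machinery renders a message as `msg % args`, so `{}` placeholders combined with extra positional arguments cannot be formatted. The following is a minimal sketch, assuming sysinv's LOG follows the standard library formatting rules; `render`, `broken_call_fails` and `fixed_message` are hypothetical helpers used only for illustration, and the sample values are made up.

```python
import logging

TEMPLATE = "{} app active:{} status:{} does not warrant re-apply"


def render(msg, *args):
    """Render a message the way the logging module does (msg % args)."""
    record = logging.LogRecord("demo", logging.INFO, "demo.py", 0, msg, args, None)
    return record.getMessage()


def broken_call_fails():
    """Return True if the pre-fix call style cannot be formatted."""
    try:
        # '{}' is not a %-style placeholder, so the three extra arguments
        # have nowhere to go and the % operator raises TypeError.
        render(TEMPLATE, "stx-openstack", True, "applied")
    except TypeError:
        return True
    return False


def fixed_message():
    """Post-fix style: format the string first, then log it without args."""
    return render(TEMPLATE.format("stx-openstack", True, "applied"))
```

With a real logger the handler usually reports this formatting failure itself rather than letting it propagate, which would be consistent with it showing up only as a small error in the logs.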

Back to the issue:

Seems armada/kubernetes related.

sysinv 2021-03-01 11:36:32.372 2356122 INFO sysinv.conductor.kube_app [-] lifecycle hook for application stx-openstack (1.0-78-centos-stable-versioned) started {'lifecycle_type': u'manifest', 'relative_timing': u'pre', 'mode': u'auto', 'operation': u'apply', 'extra': {'was_applied': True}}.
sysinv 2021-03-01 11:36:32.372 2356122 INFO k8sapp_openstack.lifecycle.lifecycle_openstack [-] Wait if there are openstack charts in pending install...
sysinv 2021-03-01 11:36:32.781 2356122 ERROR sysinv.conductor.kube_app [-] Helm operation failure: Failed to obtain pending charts list: Helm operation failure: Error: write tcp 172.16.192.176:45960->10.10.59.10:5432: write: broken pipe
command terminated with exit code 1
: HelmTillerFailure: Helm operation failure: Failed to obtain pending charts list: Helm operation failure: Error: write tcp 172.16.192.176:45960->10.10.59.10:5432: write: broken pipe
command terminated with exit code 1
2021-03-01 11:36:32.781 2356122 ERROR sysinv.conductor.kube_app Traceback (most recent call last):

var/log/containers$ grep -R "10.10.59.10" | grep armada-api
armada-api-b86d46465-xdbjt_armada_tiller-a00cf66fa21b19f28771a99a2aa85643c1fbfd2ed9d19d0f10c2a8ac7925cc1b.log:2021-03-01T10:44:38.71962272Z stderr F [storage/driver] 2021/03/01 10:44:38 list: failed to list: write tcp 172.16.192.176:60758->10.10.59.10:5432: write: broken pipe
armada-api-b86d46465-xdbjt_armada_tiller-a00cf66fa21b19f28771a99a2aa85643c1fbfd2ed9d19d0f10c2a8ac7925cc1b.log:2021-03-01T11:36:32.776510152Z stderr F [storage/driver] 2021/03/01 11:36:32 list: failed to list: write tcp 172.16.192.176:45960->10.10.59.10:5432: write: broken pipe
armada-api-b86d46465-xdbjt_armada_tiller-a00cf66fa21b19f28771a99a2aa85643c1fbfd2ed9d19d0f10c2a8ac7925cc1b.log:2021-03-01T11:38:56.600564874Z stderr F [storage/driver] 2021/03/01 11:38:56 list: failed to list: write tcp 172.16.192.176:35854->10.10.59.10:5432: write: broken pipe
armada-api-b86d46465-xdbjt_armada_til...


Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Critical
status: New → Triaged
tags: added: stx.5.0 stx.apps
Ghada Khalil (gkhalil) wrote :

stx.5.0 / critical - sanity issue introduced by recent commit

Bob Church (rchurch) wrote :

Attaching key logs from this.

But I don't see any evidence of a network disconnect that would cause this. Right before this happens, controller-1 has just come online and finished DRBD syncing. It's possible we have some stale TCP connection to the helm postgres DB in the tiller container. Postgres logs report nothing off. Maybe running out of connections? Looking at the tiller process running in the container, there are quite a few threads running. I'm not sure if this is normal behavior.

I could not reproduce this in my local setup

Bob Church (rchurch) wrote :

It is possible that the 5 second timeout on the tiller command is not long enough based on the current responsiveness of the system

2021-03-01T11:36:32.683 controller-0 containerd[78918]: info time="2021-03-01T11:36:32.683511061Z" level=info msg="Exec for \"a00cf66fa21b19f28771a99a2aa85643c1fbfd2ed9d19d0f10c2a8ac7925cc1b\" with command [/bin/sh -c PATH=/bin:/usr/bin:/usr/local/bin:/tmp HELM_HOST=:24134 /bin/sh -c 'helm list --namespace openstack --pending --tiller-connection-timeout 5'], tty false and stdin false"
2021-03-01T11:36:32.683 controller-0 containerd[78918]: info time="2021-03-01T11:36:32.683554708Z" level=info msg="Exec for \"a00cf66fa21b19f28771a99a2aa85643c1fbfd2ed9d19d0f10c2a8ac7925cc1b\" returns URL \"http://127.0.0.1:45461/exec/efyQ6dTd\""

and that the command is being prematurely terminated before it could complete.

Ghada Khalil (gkhalil)
tags: added: stx.containers
chen haochuan (martin1982) wrote :

containers/armada-api-b86d46465-xdbjt_armada_tiller-a00cf66fa21b19f28771a99a2aa85643c1fbfd2ed9d19d0f10c2a8ac7925cc1b.log

2021-03-01T11:36:32.776492394Z stderr F [storage] 2021/03/01 11:36:32 listing all releases with filter
2021-03-01T11:36:32.776510152Z stderr F [storage/driver] 2021/03/01 11:36:32 list: failed to list: write tcp 172.16.192.176:45960->10.10.59.10:5432: write: broken pipe
2021-03-01T11:38:56.600546417Z stderr F [storage] 2021/03/01 11:38:56 listing all releases with filter
2021-03-01T11:38:56.600564874Z stderr F [storage/driver] 2021/03/01 11:38:56 list: failed to list: write tcp 172.16.192.176:35854->10.10.59.10:5432: write: broken pipe

./containers/kube-apiserver-controller-0_kube-system_kube-apiserver-cc15817b97d46e764dfc637ef85f5aa1ba44d41e5558c003e9c0eb536eb5ace5.log

2021-03-01T11:36:00.149582424Z stderr F Trace[275197162]: [560.289631ms] [558.719327ms] About to apply patch
2021-03-01T11:38:57.196163721Z stderr F E0301 11:38:57.196073 1 upgradeaware.go:357] Error proxying data from client to backend: write tcp 10.10.59.11:39104->10.10.59.11:10250: write: broken pipe
~

chen haochuan (martin1982) wrote :

I could reproduce this issue.

Deploy a duplex system with the latest ISO and apply stx-openstack successfully.

on controller-0
$ system host-swact 1

on controller-1
$ system host-swact 2

then on controller-0(active controller)
$ system application-apply stx-openstack

The application apply fails.

chen haochuan (martin1982) wrote :

On my reproduction system:

./sysinv.log:9002:sysinv 2021-03-19 14:37:47.208 1359833 ERROR sysinv.conductor.kube_app [-] Application apply aborted!.: HelmTillerFailure: Helm operation failure: Failed to obtain pending charts list: Helm operation failure: Error: write tcp 172.16.192.72:52144->192.188.204.2:5432: write: broken pipe

Postgres is listening on 192.188.204.2:5432:
[sysadmin@controller-0 log(keystone_admin)]$ sudo netstat -ltnp | grep 5432
tcp 0 0 0.0.0.0:5432 0.0.0.0:* LISTEN 1357087/postgres
tcp6 0 0 :::5432 :::* LISTEN 1357087/postgres

[sysadmin@controller-0 log(keystone_admin)]$ ps -aux | grep postgres
nfsnobo+ 130497 0.1 0.4 213352 126384 ? Ssl 03:35 0:57 /tiller --storage=sql --sql-dialect=postgres --sql-connection-string=postgresql://admin-helmv2:X1=_Vx3F7T-GGu6L@192.188.204.2:5432/helmv2?sslmode=disable -listen :24134 -probe-listen :24135 -logtostderr -v 5
postgres 1357087 0.0 0.1 312924 34880 ? S< 14:17 0:00 /usr/bin/postgres -D /var/lib/postgresql/20.12 -c config_file=/etc/postgresql/postgresql.conf

The cluster IP 172.16.192.72 is the address of the armada-api pod:
./pods/kube-system_calico-node-tq7k8_5e123081-7506-414c-a679-58a88c7e2795/calico-node/1.log:1731:2021-03-19T03:35:49.722858338Z stdout F 2021-03-19 03:35:49.720 [INFO][41] int_dataplane.go 825: Received *proto.WorkloadEndpointUpdate update from calculation graph msg=id:<orchestrator_id:"k8s" workload_id:"armada/armada-api-5fc6fb496c-qqdkm" endpoint_id:"eth0" > endpoint:<state:"active" name:"cali7815eef50b5" profile_ids:"kns.armada" profile_ids:"ksa.armada.armada-api" ipv4_nets:"172.16.192.72/32" >

[sysadmin@controller-0 log(keystone_admin)]$ kubectl get po -n armada -o wide | grep "172.16.192.72"
armada-api-5fc6fb496c-qqdkm 2/2 Running 2 11h 172.16.192.72 controller-0 <none> <none>

So after the swact, tiller in the armada-api pod cannot access the postgresql service.

chen haochuan (martin1982) wrote :

After waiting for a while, it recovers.

Changed in starlingx:
assignee: nobody → Gustavo Santos (gooshtavow)
Gustavo Santos (gooshtavow) wrote :

The armada-api pod, which runs helm 2, goes up with the following command when starting the tiller container:

tiller --storage=sql --sql-dialect=postgres --sql-connection-string=postgresql://admin-helmv2:PASSWORD@192.168.204.1:5432/helmv2?sslmode=disable -listen :24134 -probe-listen :24135 -logtostderr -v 5

Where 192.168.204.1 is the active controller's floating IP address. This creates a socket connecting the pod to the currently active controller. After performing a swact, this socket becomes invalid, because it still points to the now inactive controller, and that is why the broken pipe error happens.
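
The failure mode described above can be illustrated on a single machine. This is only an analogy, not a reproduction of the swact: a loopback listener stands in for postgres, and closing it stands in for the endpoint disappearing. The function name and the retry count are hypothetical.

```python
import socket
import time


def write_to_stale_connection(max_attempts=50):
    """Connect over loopback, let the server side go away, then keep
    writing on the client side until the OS reports an error.

    Returns the name of the exception raised, or None if none occurred.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # ephemeral port stands in for postgres:5432
    server.listen(1)

    client = socket.create_connection(server.getsockname())
    accepted, _ = server.accept()

    # Simulated swact: the endpoint the client is connected to disappears,
    # but the client keeps holding its old socket.
    accepted.close()
    server.close()

    try:
        for _ in range(max_attempts):
            try:
                client.sendall(b"SELECT 1;")
            except ConnectionError as exc:
                # Typically BrokenPipeError or ConnectionResetError.
                return type(exc).__name__
            time.sleep(0.05)  # the first write may still appear to succeed
        return None
    finally:
        client.close()
```

The client only learns that the connection is dead when it next writes to it, which matches the errors appearing at the next `helm list` rather than at the moment of the swact.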

Yvonne Ding (yding) wrote :

The issue can be reproduced on AIO-SX after lock/unlock controller with "20210401T032802Z" load.

2021-04-01 17:11:01.794 95532 ERROR sysinv.conductor.kube_app Traceback (most recent call last):
2021-04-01 17:11:01.794 95532 ERROR sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 2294, in perform_app_apply
2021-04-01 17:11:01.794 95532 ERROR sysinv.conductor.kube_app self.app_lifecycle_actions(None, None, rpc_app, lifecycle_hook_info_app_apply)
2021-04-01 17:11:01.794 95532 ERROR sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 1891, in app_lifecycle_actions
2021-04-01 17:11:01.794 95532 ERROR sysinv.conductor.kube_app lifecycle_op.app_lifecycle_actions(context, conductor_obj, self, app, hook_info)
2021-04-01 17:11:01.794 95532 ERROR sysinv.conductor.kube_app File "/opt/platform/helm/21.05/stx-openstack/1.0-78-centos-stable-versioned/plugins/k8sapp_openstack/lifecycle/lifecycle_openstack.py", line 56, in app_lifecycle_actions
2021-04-01 17:11:01.794 95532 ERROR sysinv.conductor.kube_app return self.pre_manifest_apply(app, app_op, hook_info)
2021-04-01 17:11:01.794 95532 ERROR sysinv.conductor.kube_app File "/opt/platform/helm/21.05/stx-openstack/1.0-78-centos-stable-versioned/plugins/k8sapp_openstack/lifecycle/lifecycle_openstack.py", line 144, in pre_manifest_apply
2021-04-01 17:11:01.794 95532 ERROR sysinv.conductor.kube_app result = helm_utils.get_openstack_pending_install_charts()
2021-04-01 17:11:01.794 95532 ERROR sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/helm/utils.py", line 205, in get_openstack_pending_install_charts
2021-04-01 17:11:01.794 95532 ERROR sysinv.conductor.kube_app reason="Failed to obtain pending charts list: %s" % e)
2021-04-01 17:11:01.794 95532 ERROR sysinv.conductor.kube_app HelmTillerFailure: Helm operation failure: Failed to obtain pending charts list: Helm operation failure: Error: write tcp 172.16.192.111:33372->192.168.204.1:5432: write: connection reset by peer
2021-04-01 17:11:01.794 95532 ERROR sysinv.conductor.kube_app command terminated with exit code 1

Gustavo Santos (gooshtavow) wrote :

A code review with a possible fix has been opened for this bug: https://review.opendev.org/c/starlingx/config/+/783472

Changed in starlingx:
status: Triaged → Fix Released
Alexandru Dimofte (adimofte) wrote :

I checked again today (20210408T015657Z) and I still see the issue:
sysinv 2021-04-08 17:40:06.274 920503 INFO sysinv.helm.utils [-] Caught HelmTillerFailure exception. Retrying... Exception: Helm operation failure: Failed to obtain pending charts list: Helm operation failure: Error: write tcp 172.16.166.148:52966->10.10.59.10:5432: write: broken pipe
command terminated with exit code 1
sysinv 2021-04-08 17:40:06.691 920503 INFO sysinv.helm.utils [-] Caught HelmTillerFailure exception. Retrying... Exception: Helm operation failure: Failed to obtain pending charts list: Helm operation failure: Error: write tcp 172.16.166.148:55046->10.10.59.10:5432: write: broken pipe
command terminated with exit code 1
sysinv 2021-04-08 17:40:06.692 920503 ERROR sysinv.conductor.kube_app [-] Helm operation failure: Failed to obtain pending charts list: Helm operation failure: Error: write tcp 172.16.166.148:55046->10.10.59.10:5432: write: broken pipe
command terminated with exit code 1
: HelmTillerFailure: Helm operation failure: Failed to obtain pending charts list: Helm operation failure: Error: write tcp 172.16.166.148:55046->10.10.59.10:5432: write: broken pipe
command terminated with exit code 1
2021-04-08 17:40:06.692 920503 ERROR sysinv.conductor.kube_app Traceback (most recent call last):
2021-04-08 17:40:06.692 920503 ERROR sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 2294, in perform_app_apply
2021-04-08 17:40:06.692 920503 ERROR sysinv.conductor.kube_app self.app_lifecycle_actions(None, None, rpc_app, lifecycle_hook_info_app_apply)
2021-04-08 17:40:06.692 920503 ERROR sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 1891, in app_lifecycle_actions
2021-04-08 17:40:06.692 920503 ERROR sysinv.conductor.kube_app lifecycle_op.app_lifecycle_actions(context, conductor_obj, self, app, hook_info)
2021-04-08 17:40:06.692 920503 ERROR sysinv.conductor.kube_app File "/opt/platform/helm/21.05/stx-openstack/1.0-78-centos-stable-versioned/plugins/k8sapp_openstack/lifecycle/lifecycle_openstack.py", line 56, in app_lifecycle_actions
2021-04-08 17:40:06.692 920503 ERROR sysinv.conductor.kube_app return self.pre_manifest_apply(app, app_op, hook_info)
2021-04-08 17:40:06.692 920503 ERROR sysinv.conductor.kube_app File "/opt/platform/helm/21.05/stx-openstack/1.0-78-centos-stable-versioned/plugins/k8sapp_openstack/lifecycle/lifecycle_openstack.py", line 144, in pre_manifest_apply
2021-04-08 17:40:06.692 920503 ERROR sysinv.conductor.kube_app result = helm_utils.get_openstack_pending_install_charts()
2021-04-08 17:40:06.692 920503 ERROR sysinv.conductor.kube_app File "/usr/lib/python2.7/site-packages/retrying.py", line 68, in wrapped_f
2021-04-08 17:40:06.692 920503 ERROR sysinv.conductor.kube_app return Retrying(*dargs, **dkw).call(f, *args, **kw)
2021-04-08 17:40:06.692 920503 ERROR sysinv.conductor.kube_app File "/usr/lib/python2.7/site-packages/retrying.py", line 229, in call
2021-04-08 17:40:06.692 920503 ERROR sysinv.conductor.kube_app raise attempt.get()
2021-04-08 17:40:06.692 920503 ERROR sysinv.conductor.kube_app F...


Gustavo Santos (gooshtavow) wrote :

Alexandru, can you provide a little more information about the system you've tested this on and if you got the error more than once? I wasn't able to reproduce the issue in several attempts on two different systems and I'm wondering why you're still getting the error.

Alexandru Dimofte (adimofte) wrote :

Today I checked again if this issue is still there and I tested using a baremetal Standard configuration.
The steps were:
system host-swact controller-0
ssh controller-1
system host-lock controller-0
system host-unlock controller-0
watch system application-list (in 5-6 minutes stx-openstack will try a reapply but will fail)

Alexandru Dimofte (adimofte) wrote :
Alexandru Dimofte (adimofte) wrote :

Today I manually re-checked this bug on baremetal: Duplex, Standard and Standard External. I reproduced it only on Standard External.

Alexandru Dimofte (adimofte) wrote :
Gustavo Santos (gooshtavow) wrote :

Alexandru, I have opened a code review (https://review.opendev.org/c/starlingx/config/+/786092) for a new fix to this problem. Since I wasn't able to reproduce the issue even after the first fix, can you give this one a try before it gets merged? I also couldn't get the broken pipe error while testing this one several times.

Ghada Khalil (gkhalil) wrote :

Re-opening as there seem to be more code reviews required to address this issue.
Once a fix is merged in stx master, it will also need to be cherrypicked to the r/stx.5.0 release.

Changed in starlingx:
status: Fix Released → In Progress
Ghada Khalil (gkhalil) wrote :

Note:
This seems to be a generic issue with the containerized application framework after a swact.
https://bugs.launchpad.net/starlingx/+bug/1920650 reports the same issue with the oidc application.

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/786092
Committed: https://opendev.org/starlingx/config/commit/ad8567f06485a10edf3857fbc87ae7d3058a1dfc
Submitter: "Zuul (22348)"
Branch: master

commit ad8567f06485a10edf3857fbc87ae7d3058a1dfc
Author: Gustavo Santos <email address hidden>
Date: Tue Apr 13 16:09:21 2021 -0300

    Restart tiller on openstack pending install check

    This is another attempt at fixing the same bug as the merged review
    https://review.opendev.org/c/starlingx/config/+/783472 had tried, since
    there were reports indicating that the bug would still occur on certain
    setups.

    This patch explicitly forces a tiller restart when catching the first
    HelmTillerFailure exception caused by the broken pipe error, instead of
    only trying to rerun the 'helm list' command, which was believed to be
    a reliable workaround to the problem, but didn't solve it in every
    possible scenario.

    Closes-Bug: #1917308
    Signed-off-by: Gustavo Santos <email address hidden>
    Change-Id: I38667609173ca5c6fed028f75742ae99efedf149
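
The difference between a plain retry and the approach described in this commit message can be sketched as follows. This is not the code from the merged review; `HelmTillerFailure` here is a local stand-in for the sysinv exception, and `call_with_tiller_restart` and `FakeTiller` are hypothetical names used only for illustration.

```python
class HelmTillerFailure(Exception):
    """Local stand-in for sysinv's exception of the same name."""


def call_with_tiller_restart(operation, restart_tiller, max_attempts=2):
    """Run operation(); on the first HelmTillerFailure, call
    restart_tiller() before trying again. Re-raises if the final
    attempt also fails."""
    restarted = False
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except HelmTillerFailure:
            if attempt == max_attempts:
                raise
            if not restarted:
                restart_tiller()  # force a fresh connection to the backend
                restarted = True


class FakeTiller:
    """Hypothetical tiller whose DB connection is stale until restarted."""

    def __init__(self):
        self.stale = True
        self.restarts = 0

    def list_pending(self):
        if self.stale:
            raise HelmTillerFailure("write: broken pipe")
        return []

    def restart(self):
        self.restarts += 1
        self.stale = False
```

In this model, retrying `list_pending` alone never succeeds because the stale connection is reused, while a single restart clears it.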

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil) wrote :

@Gustavo, please cherrypick your changes to the r/stx.5.0 release asap.

Bill Zvonar (billzvonar)
tags: added: stx.cherrypickneeded
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (r/stx.5.0)

Fix proposed to branch: r/stx.5.0
Review: https://review.opendev.org/c/starlingx/config/+/788294

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (r/stx.5.0)

Reviewed: https://review.opendev.org/c/starlingx/config/+/788294
Committed: https://opendev.org/starlingx/config/commit/70df83f1f949f7652300b7b26ed0b28d9b095cff
Submitter: "Zuul (22348)"
Branch: r/stx.5.0

commit 70df83f1f949f7652300b7b26ed0b28d9b095cff
Author: Gustavo Santos <email address hidden>
Date: Tue Apr 13 16:09:21 2021 -0300

    Restart tiller on openstack pending install check

    This is another attempt at fixing the same bug as the merged review
    https://review.opendev.org/c/starlingx/config/+/783472 had tried, since
    there were reports indicating that the bug would still occur on certain
    setups.

    This patch explicitly forces a tiller restart when catching the first
    HelmTillerFailure exception caused by the broken pipe error, instead of
    only trying to rerun the 'helm list' command, which was believed to be
    a reliable workaround to the problem, but didn't solve it in every
    possible scenario.

    Closes-Bug: #1917308
    Signed-off-by: Gustavo Santos <email address hidden>
    Change-Id: I38667609173ca5c6fed028f75742ae99efedf149
    (cherry picked from commit ad8567f06485a10edf3857fbc87ae7d3058a1dfc)

Ghada Khalil (gkhalil)
tags: added: in-r-stx50
removed: stx.cherrypickneeded
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to integ (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/integ/+/791092

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ansible-playbooks (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/791093

Angie Wang (angiewang) wrote :

Just a note: helm uses the sqlx package to establish its connection with the postgres backend, and sqlx uses the Golang postgres driver. The "broken pipe" issue is an issue in the Golang Postgres driver (https://github.com/lib/pq/issues/870), which was only fixed at the end of last year in https://github.com/lib/pq/pull/1013. The updated driver has not been merged to sqlx yet: https://github.com/jmoiron/sqlx/pull/715

OpenStack Infra (hudson-openstack) wrote : Related fix merged to integ (master)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/791092
Committed: https://opendev.org/starlingx/integ/commit/b3540ccfdfa6956fb20c62e5e5bb76af56d2ab63
Submitter: "Zuul (22348)"
Branch: master

commit b3540ccfdfa6956fb20c62e5e5bb76af56d2ab63
Author: Robert Church <email address hidden>
Date: Wed May 12 22:36:23 2021 -0400

    Update the liveness probe to verify postgres connectivity

    Change the tillerLivenessProbeTemplate to test the connectivity to the
    postgres backend. We will override the periodSeconds and
    failureThreshold when installing the helm chart to trigger a restart of
    the tiller pod over a swact when the postgres DB/server moves from one
    controller to the other.

    This will help guarantee that the tiller connection is always
    re-established if the connectivity to the postgres backend fails.

    Change-Id: I7fbed33a8c821f6c9254f58d5953e2115cf4141a
    Related-Bug: #1917308
    Signed-off-by: Robert Church <email address hidden>

OpenStack Infra (hudson-openstack) wrote : Related fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/791093
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/d5460198dc0310a80580537fd8df76ae00e17f02
Submitter: "Zuul (22348)"
Branch: master

commit d5460198dc0310a80580537fd8df76ae00e17f02
Author: Robert Church <email address hidden>
Date: Wed May 12 22:45:38 2021 -0400

    Adjust armada's tiller container liveness probe

    With the liveness probe update in the armada helm chart to test the
    connectivity to the postgres backend, adjust the periodSeconds and
    failureThreshold to align with the minimum swact time to be expected for
    postgres switching from one controller to another.

    Reviewing logs from various H/W labs it appears that average postgres
    swact time ranges from 9s-20s, with the mean ~15s.

    Times can be observed with:
    2021-05-09T13:32:24.475 controller-1 OCF_pgsql(postgres)[396293]: info
                                         INFO: server shutting down
    2021-05-09T13:32:33.423 controller-0 OCF_pgsql(postgres)[147541]: info
                                         INFO: server starting

    Set the periodSeconds to 4 and the failureThreshold to 2 so that if the
    postgres server is not accessible, the tiller container will be
    restarted within the 9s minimum swact time. This will ensure that the
    next time tiller is required by Armada or used by the helmv2-cli that
    the connection to postgres backend has been re-established.

    Change-Id: I7454a737771d9a608d2fe69c5136d37da022007e
    Depends-On: https://review.opendev.org/c/starlingx/integ/+/791092
    Related-Bug: #1917308
    Signed-off-by: Robert Church <email address hidden>
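
The timing argument in this commit message can be checked with a little arithmetic. This is a rough sketch that assumes the backend may fail just after a successful probe and that ignores probe timeouts and the container restart time itself; the function name is hypothetical.

```python
def worst_case_detection_seconds(period_seconds, failure_threshold):
    """Upper bound on the time from backend loss to the probe giving up.

    Up to one full period passes before the first failed probe; each of
    the remaining (failure_threshold - 1) failures takes one more period.
    """
    return period_seconds + (failure_threshold - 1) * period_seconds


MIN_SWACT_SECONDS = 9  # minimum postgres swact time quoted in the commit message

detection = worst_case_detection_seconds(period_seconds=4, failure_threshold=2)
print(detection, detection < MIN_SWACT_SECONDS)  # prints: 8 True
```

Under these assumptions the probe reaches its failure threshold within 8s, just inside the 9s minimum swact time quoted above.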

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to integ (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/integ/+/791599

OpenStack Infra (hudson-openstack) wrote : Related fix merged to integ (master)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/791599
Committed: https://opendev.org/starlingx/integ/commit/4e1aa82e96d9b4caeff7e7b31632733c395c6ad0
Submitter: "Zuul (22348)"
Branch: master

commit 4e1aa82e96d9b4caeff7e7b31632733c395c6ad0
Author: Robert Church <email address hidden>
Date: Sat May 15 16:24:29 2021 -0400

    Update postgres liveness check to support IPv6 addresses

    Templating will add square brackets for IPv6 addresses which are
    interpreted as an array vs. a string. Quote this so that it interpreted
    correctly.

    Change-Id: I2b705015a74ea2e4e914b7a83cdceed37d49b766
    Related-Bug: #1917308
    Signed-off-by: Robert Church <email address hidden>

Ghada Khalil (gkhalil) wrote :

The additional commits above will need to be merged in the r/stx.5.0 branch

tags: added: stx.cherrypickneeded
removed: in-r-stx50
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ansible-playbooks (r/stx.5.0)

Related fix proposed to branch: r/stx.5.0
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/791777

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to integ (r/stx.5.0)

Related fix proposed to branch: r/stx.5.0
Review: https://review.opendev.org/c/starlingx/integ/+/791785

OpenStack Infra (hudson-openstack) wrote : Related fix merged to integ (r/stx.5.0)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/791785
Committed: https://opendev.org/starlingx/integ/commit/106331ecec1a77f3a04a3a15efcd4886d9104ea9
Submitter: "Zuul (22348)"
Branch: r/stx.5.0

commit 106331ecec1a77f3a04a3a15efcd4886d9104ea9
Author: Robert Church <email address hidden>
Date: Wed May 12 22:36:23 2021 -0400

    Update the liveness probe to verify postgres connectivity

    Change the tillerLivenessProbeTemplate to test the connectivity to the
    postgres backend. We will override the periodSeconds and
    failureThreshold when installing the helm chart to trigger a restart of
    the tiller pod over a swact when the postgres DB/server moves from one
    controller to the other.

    This will help guarantee that the tiller connection is always
    re-established if the connectivity to the postgres backend fails.

    Change-Id: I7fbed33a8c821f6c9254f58d5953e2115cf4141a
    Related-Bug: #1917308
    Signed-off-by: Robert Church <email address hidden>
    (cherry picked from commit b3540ccfdfa6956fb20c62e5e5bb76af56d2ab63)

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to integ (r/stx.5.0)

Related fix proposed to branch: r/stx.5.0
Review: https://review.opendev.org/c/starlingx/integ/+/791943

OpenStack Infra (hudson-openstack) wrote : Related fix merged to ansible-playbooks (r/stx.5.0)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/791777
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/4555715323b25613768214d891e414959ac7b5d6
Submitter: "Zuul (22348)"
Branch: r/stx.5.0

commit 4555715323b25613768214d891e414959ac7b5d6
Author: Robert Church <email address hidden>
Date: Wed May 12 22:45:38 2021 -0400

    Adjust armada's tiller container liveness probe

    With the liveness probe update in the armada helm chart to test the
    connectivity to the postgres backend, adjust the periodSeconds and
    failureThreshold to align with the minimum swact time to be expected for
    postgres switching from one controller to another.

    Reviewing logs from various H/W labs it appears that average postgres
    swact time ranges from 9s-20s, with the mean ~15s.

    Times can be observed with:
    2021-05-09T13:32:24.475 controller-1 OCF_pgsql(postgres)[396293]: info
                                         INFO: server shutting down
    2021-05-09T13:32:33.423 controller-0 OCF_pgsql(postgres)[147541]: info
                                         INFO: server starting

    Set the periodSeconds to 4 and the failureThreshold to 2 so that if the
    postgres server is not accessible, the tiller container will be
    restarted within the 9s minimum swact time. This will ensure that the
    next time tiller is required by Armada or used by the helmv2-cli that
    the connection to postgres backend has been re-established.

    Change-Id: I7454a737771d9a608d2fe69c5136d37da022007e
    Depends-On: https://review.opendev.org/c/starlingx/integ/+/791092
    Related-Bug: #1917308
    Signed-off-by: Robert Church <email address hidden>
    (cherry picked from commit d5460198dc0310a80580537fd8df76ae00e17f02)

OpenStack Infra (hudson-openstack) wrote : Related fix merged to integ (r/stx.5.0)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/791943
Committed: https://opendev.org/starlingx/integ/commit/821de96615cb6f93fbc39f4baaa769029328d34d
Submitter: "Zuul (22348)"
Branch: r/stx.5.0

commit 821de96615cb6f93fbc39f4baaa769029328d34d
Author: Robert Church <email address hidden>
Date: Sat May 15 16:24:29 2021 -0400

    Update postgres liveness check to support IPv6 addresses

    Templating will add square brackets for IPv6 addresses which are
    interpreted as an array vs. a string. Quote this so that it interpreted
    correctly.

    Change-Id: I2b705015a74ea2e4e914b7a83cdceed37d49b766
    Related-Bug: #1917308
    Signed-off-by: Robert Church <email address hidden>
    (cherry picked from commit 4e1aa82e96d9b4caeff7e7b31632733c395c6ad0)

Ghada Khalil (gkhalil) wrote :

Adding in-r-stx50 as the latest commits have been merged in the r/stx.5.0 release branch

tags: added: in-r-stx50
removed: stx.cherrypickneeded
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ansible-playbooks (f/centos8)

Related fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/792195

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793460

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793696

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to integ (f/centos8)

Related fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/integ/+/793754

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ansible-playbooks (f/centos8)

Related fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/794324

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ansible-playbooks (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/792195

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ansible-playbooks (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/794324
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/163ec9989cc7360dba4c572b2c43effd10306048
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 4e96b762f549aadb0291cc9bcf3352ae923e94eb
Author: Mihnea Saracin <email address hidden>
Date: Sat May 22 15:48:19 2021 +0000

    Revert "Restore host filesystems with collected sizes"

    This reverts commit 255488739efa4ac072424b19f2dbb7a3adb0254e.

    Reason for revert: Did a rework to fix https://bugs.launchpad.net/starlingx/+bug/1926591. The original problem was in puppet, and this fix in ansible was not good enough; it generated some other problems.

    Change-Id: Iea79701a874effecb7fe995ac468d50081d1a84f
    Depends-On: I55ae6954d24ba32e40c2e5e276ec17015d9bba44

commit c064aacc377c8bd5336ceab825d4bcbf5af0b5e8
Author: Angie Wang <email address hidden>
Date: Fri May 21 21:28:02 2021 -0400

    Ensure apiserver keys are present before extract from tarball

    This is to fix the upgrade playbook issue that happens during
    AIO-SX upgrade from stx4.0 to stx5.0, which was introduced by
    https://review.opendev.org/c/starlingx/ansible-playbooks/+/792093.
    The apiserver keys are not available on the stx4.0 side, so we need
    to ensure the keys under /etc/kubernetes/pki are present in the
    backed-up tarball before extracting; otherwise the playbook fails
    because the keys are not found in the archive.

    Change-Id: I8602f07d1b1041a7fd3fff21e6f9a422b9784ab5
    Closes-Bug: 1928925
    Signed-off-by: Angie Wang <email address hidden>
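
The "check before extract" idea in the commit above could be sketched as follows. The real change is an Ansible task; this Python version, the helper name `missing_members`, and the member names in `REQUIRED` are assumptions for illustration only.

```python
import io
import tarfile


def missing_members(archive, required):
    """Return the required member names that are absent from an open tarball."""
    names = set(archive.getnames())
    return [name for name in required if name not in names]


# Build a small in-memory tarball that holds only one of the two files.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    payload = b"dummy"
    info = tarfile.TarInfo("etc/kubernetes/pki/apiserver-etcd-client.crt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
buf.seek(0)

# Hypothetical list of files the restore step would want to extract.
REQUIRED = [
    "etc/kubernetes/pki/apiserver-etcd-client.crt",
    "etc/kubernetes/pki/apiserver-etcd-client.key",
]

with tarfile.open(fileobj=buf, mode="r") as tar:
    absent = missing_members(tar, REQUIRED)

if absent:
    print("skipping extraction, not in archive:", absent)
else:
    print("all required members present")
```

Listing the archive first turns a hard extraction failure into a condition the caller can branch on.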

commit 0261f22ff7c23d2a8608fe3b51725c9f29931281
Author: Don Penney <email address hidden>
Date: Thu May 20 23:09:07 2021 -0400

    Update SX to DX migration to wait for coredns config

    This commit updates the SX to DX migration playbook to wait after
    modifying the system mode to duplex until the runtime manifest that
    updates coredns config has completed. The playbook will wait for up to
    20 minutes to allow for the possibility that sysinv has multiple
    runtime manifests queued up, each of which could take several minutes.

    Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/792494
    Depends-On: https://review.opendev.org/c/starlingx/config/+/792496
    Change-Id: I3bf94d3493ae20eeb16b3fdcb27576ee18c0dc4d
    Closes-Bug: 1929148
    Signed-off-by: Don Penney <email address hidden>
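
The bounded wait described in the commit above follows a common poll-until-timeout pattern. The sketch below is not the playbook's implementation; `wait_for` and `manifest_applied` are hypothetical names, and the fake condition stands in for whatever check the playbook performs.

```python
import time


def wait_for(condition, timeout_s, interval_s=1.0):
    """Poll condition() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Stand-in for "the runtime manifest has completed": true on the third poll.
polls = {"count": 0}


def manifest_applied():
    polls["count"] += 1
    return polls["count"] >= 3


# 20 minutes mirrors the upper bound named in the commit message.
result = wait_for(manifest_applied, timeout_s=20 * 60, interval_s=0)
print(result, polls["count"])
```

A generous timeout with a short interval lets the wait finish early in the common case while still tolerating several queued manifests.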

commit 7c4f17bd0d92fc1122823211e1c9787829d206a9
Author: Daniel Safta <email address hidden>
Date: Wed May 19 09:08:16 2021 +0000

    Fixed missing apiserver-etcd-client certs

    When controller-1 is the active controller
    the backup archive does not contain
    /etc/etcd/apiserver-etcd-client.{crt, key}

    This change adds a new task which brings
    the certs from /etc/kubernetes/pki

    Closes-bug: 1928925
    Signed-off-by: Daniel Safta <email address hidden>
    Change-Id: I3c68377603e1af9a71d104e5b1108e9582497a09

commit e221ef8fbe51aa6ca229b584fb5632fe512ad5cb
Author: David Sullivan <email address hidden>
Date: Wed May 19 16:01:27 2021 -0500

    Support boo...

tags: added: in-f-centos8
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794611

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794906

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794611

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to integ (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/793754
Committed: https://opendev.org/starlingx/integ/commit/a13966754d4e19423874ca31bf1533f057380c52
Submitter: "Zuul (22348)"
Branch: f/centos8

commit b310077093fd567944c6a46b7d0adcabe1f2b4b9
Author: Mihnea Saracin <email address hidden>
Date: Sat May 22 18:19:54 2021 +0300

    Fix resize of filesystems in puppet logical_volume

    After system reinstalls there is stale data on the disk
    and puppet fails when resizing, reporting some wrong filesystem
    types. In our case docker-lv was reported as drbd when
    it should have been xfs.

    This problem was already solved in some cases, e.g.
    when doing a live fs resize we wipe the last 10MB
    at the end of the partition:
    https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/manifests/filesystem.pp#L146

    Our issue happened here:
    https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/manifests/filesystem.pp#L65
    Resize can happen at unlock when a bigger size is detected for the
    filesystem and the 'logical_volume' will resize it.
    To fix this we have to wipe the last 10MB of the partition after the
    'lvextend' cmd in the 'logical_volume' module.

    Tested the following scenarios:

    B&R on SX with default sizes of filesystems and cgts-vg.

    B&R on SX with docker-lv of size 50G, backup-lv also 50G and
    cgts-vg with additional physical volumes:

    - name: cgts-vg
      physicalVolumes:
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 50
        type: partition
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 30
        type: partition
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-3.0
        type: disk

    B&R on DX system with backup of size 70G and cgts-vg
    with additional physical volumes:

    physicalVolumes:
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
      size: 50
      type: partition
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
      size: 30
      type: partition
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-3.0
      type: disk

    Closes-Bug: 1926591
    Change-Id: I55ae6954d24ba32e40c2e5e276ec17015d9bba44
    Signed-off-by: Mihnea Saracin <email address hidden>
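
The "wipe the last 10MB" step from the commit above can be sketched as below. The real fix is in the puppet 'logical_volume' module; this Python version, the helper name `wipe_tail`, and the reading of 10MB as 10 MiB are assumptions, and the demo runs against a temporary file rather than a device.

```python
import os
import tempfile

WIPE_BYTES = 10 * 1024 * 1024  # assumed interpretation of "10MB"


def wipe_tail(path, nbytes=WIPE_BYTES):
    """Overwrite the last nbytes of path with zeros and return the start offset."""
    with open(path, "r+b") as dev:
        size = dev.seek(0, os.SEEK_END)
        start = max(size - nbytes, 0)
        dev.seek(start)
        dev.write(b"\0" * (size - start))
    return start


# Demo on a 12 MiB scratch file filled with 0xFF, standing in for stale data.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\xff" * (12 * 1024 * 1024))

start = wipe_tail(tmp.name)
with open(tmp.name, "rb") as f:
    data = f.read()
os.unlink(tmp.name)

print("wiped from offset", start)
```

Zeroing only the tail removes leftover signatures there without touching the rest of the volume.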

commit 3225570530458956fd642fa06b83360a7e4e2e61
Author: Mihnea Saracin <email address hidden>
Date: Thu May 20 14:33:58 2021 +0300

    Execute once the ceph services script on AIO

    The MTC client manages ceph services via ceph.sh which
    is installed on all node types in
    /etc/service.d/{controller,worker,storage}/ceph.sh

    Since the AIO controllers have both controller and worker
    personalities, the MTC client will execute the ceph script
    twice (/etc/service.d/worker/ceph.sh,
    /etc/service.d/controller/ceph.sh).
    This behavior will generate some issues.

    We fix this by exiting the ceph script if it is the one from
    /etc/services.d/worker on AIO systems.

    Closes-Bug: 1928934
    Change-Id: I3e4dc313cc3764f870b8f6c640a60338...
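
The guard described in the ceph services script commit above might look like the following. The real script is a shell script run by the MTC client; this Python sketch, the function name `should_skip`, and the way the AIO flag is passed in are assumptions for illustration.

```python
from pathlib import PurePosixPath


def should_skip(script_path, is_aio):
    """Return True when the worker copy of the script runs on an AIO system."""
    # The personality is taken from the directory the script was started from.
    personality = PurePosixPath(script_path).parent.name
    return is_aio and personality == "worker"


for path in ("/etc/service.d/controller/ceph.sh", "/etc/service.d/worker/ceph.sh"):
    action = "exit early" if should_skip(path, is_aio=True) else "run"
    print(path, "->", action)
```

On an AIO controller only the controller copy does any work, so the services are handled once.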

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/config/+/794906
Committed: https://opendev.org/starlingx/config/commit/75758b37a5a23c8811355b67e2a430a1713cd85b
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 9e420d9513e5fafb1df4d29567bc299a9e04d58d
Author: Bin Qian <email address hidden>
Date: Mon May 31 14:45:52 2021 -0400

    Add more logging to run docker login

    Add error log for running docker login. The new log could
    help identify docker login failure.

    Closes-Bug: 1930310
    Change-Id: I8a709fb6665de8301fbe3022563499a92b2a0211
    Signed-off-by: Bin Qian <email address hidden>

commit 31c77439d2cea590dfcca13cfa646522665f8686
Author: albailey <email address hidden>
Date: Fri May 28 13:42:42 2021 -0500

    Fix controller-0 downgrade failing to kill ceph

    kill_ceph_storage_monitor tried to manipulate a pmon
    file that does not exist in an AIO-DX environment.

    We no longer invoke kill_ceph_storage_monitor in an
    AIO SX or DX env.

    This allows: "system host-downgrade controller-0"
    to proceed in an AIO-DX environment where that second
    controller (controller-0) was upgraded.

    Partial-Bug: 1929884
    Signed-off-by: albailey <email address hidden>
    Change-Id: I633853f75317736084feae96b5b849c601204c13

commit 0dc99eee608336fe01b58821ea404286371f1408
Author: albailey <email address hidden>
Date: Fri May 28 11:05:43 2021 -0500

    Fix file permissions failure during duplex upgrade abort

    When issuing a downgrade for controller-0 in a duplex upgrade
    abort and rollback scenario, the downgrade command was failing
    because the sysinv API does not have root permissions to set
    a file flag.
    The fix is to use RPC so the conductor can create the flag
    and allow the downgrade for controller-0 to get further.

    Partial-Bug: 1929884
    Signed-off-by: albailey <email address hidden>
    Change-Id: I913bcad73309fe887a12cbb016a518da93327947

commit 7ef3724dad173754e40b45538b1cc726a458cc1c
Author: Chen, Haochuan Z <email address hidden>
Date: Tue May 25 16:16:29 2021 +0800

    Fix bug rook-ceph provision with multi osd on one host

    Test case:
    1, deploy simplex system
    2, apply rook-ceph with below override value
    value.yaml
    cluster:
      storage:
        nodes:
        - name: controller-0
          devices:
          - name: sdb
          - name: sdc
    3, reboot

    Without this fix, only one osd pod could launch successfully after boot,
    as vgs whose names start with ceph could not be correctly added to the sysinv database

    Closes-bug: 1929511

    Change-Id: Ia5be599cd168d13d2aab7b5e5890376c3c8a0019
    Signed-off-by: Chen, Haochuan Z <email address hidden>

commit 23505ba77d76114cf8a0bf833f9a5bcd05bc1dd1
Author: Angie Wang <email address hidden>
Date: Tue May 25 18:49:21 2021 -0400

    Fix issue in partition data migration script

    The created partition dictionary partition_map is not
    an ordered dict so we need to sort it by its key -
    device node when iterating it to adjust the device
    nodes/paths for user created extra partitions to ensure
    the number of device node...
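
Iterating a plain dict in key order, as the partition data migration fix above describes, can be sketched like this. The contents of `partition_map` and the helper name `in_device_order` are hypothetical; only the idea of sorting by device node comes from the commit message.

```python
# Hypothetical map of device node -> partition details, built in arbitrary order.
partition_map = {
    "/dev/sda7": {"size_mib": 2048},
    "/dev/sda5": {"size_mib": 1024},
    "/dev/sda6": {"size_mib": 512},
}


def in_device_order(mapping):
    """Return (device_node, details) pairs sorted by device node."""
    return [(node, mapping[node]) for node in sorted(mapping)]


ordered_nodes = [node for node, _ in in_device_order(partition_map)]
print(ordered_nodes)
```

Note that `sorted` compares strings lexicographically, so a node such as /dev/sda10 would sort before /dev/sda5; a numeric sort key would be needed if partition numbers reach two digits.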

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793696

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793460
