cert-manager failed to override,apply on post-install

Bug #1876328 reported by ayyappa
Affects     Status         Importance   Assigned to     Milestone
StarlingX   Fix Released   Medium       Sabeel Ansari

Bug Description

Brief Description
-----------------
After installation, overriding the cert-manager (cm) Helm values with the following on a standard or duplex system fails:

replicaCount: 2
podLabels:
 test: pv

An override with podLabels alone also fails on simplex, duplex, and standard systems.

Severity
--------
Major

Steps to Reproduce
------------------
1) After installation, override the cm with the following values:
[sysadmin@controller-0 ~(keystone_admin)]$ system helm-override-update --values cm_values.yaml cert-manager cert-manager cert-manager
+----------------+-----------------+
| Property       | Value           |
+----------------+-----------------+
| name           | cert-manager    |
| namespace      | cert-manager    |
| user_overrides | podLabels:      |
|                |   test: pv      |
|                | replicaCount: 2 |
|                |                 |
+----------------+-----------------+

2) Apply the application:
[sysadmin@controller-0 ~(keystone_admin)]$ system application-apply cert-manager
+---------------+----------------------------------+
| Property      | Value                            |
+---------------+----------------------------------+
| active        | True                             |
| app_version   | 1.0-0                            |
| created_at    | 2020-05-01T14:17:40.817460+00:00 |
| manifest_file | certmanager-manifest.yaml        |
| manifest_name | cert-manager-manifest            |
| name          | cert-manager                     |
| progress      | None                             |
| status        | applying                         |
| updated_at    | 2020-05-01T14:18:31.900064+00:00 |
+---------------+----------------------------------+
Please use 'system application-list' or 'system application-show cert-manager' to view the current progress.

3) The controller pod is scaled out to the standby controller, but an extra pod stays in the Pending state with the following error, which eventually causes the application apply to fail:

[sysadmin@controller-0 ~(keystone_admin)]$ kubectl get pods -n cert-manager -o wide
NAME                                          READY   STATUS    RESTARTS   AGE   IP               NODE           NOMINATED NODE   READINESS GATES
cm-cert-manager-5ff78759f-5qm4j               0/1     Pending   0          84s   <none>           <none>         <none>           <none>
cm-cert-manager-7b8b94bf9f-27425              1/1     Running   0          84s   172.16.166.132   controller-1   <none>           <none>
cm-cert-manager-7b8b94bf9f-cfwr8              1/1     Running   1          57m   172.16.192.75    controller-0   <none>           <none>
cm-cert-manager-cainjector-56b68989b5-xx4ln   1/1     Running   1          57m   172.16.192.77    controller-0   <none>           <none>
cm-cert-manager-webhook-7d5c897795-p6d64      1/1     Running   1          57m   172.16.192.76    controller-0   <none>           <none>

[sysadmin@controller-0 ~(keystone_admin)]$ kubectl describe pod cm-cert-manager-5ff78759f-5qm4j -n cert-manager
Name: cm-cert-manager-5ff78759f-5qm4j
Namespace: cert-manager
Priority: 0
Node: <none>
Labels: app=cert-manager
                app.kubernetes.io/component=controller
                app.kubernetes.io/instance=cm-cert-manager
                app.kubernetes.io/managed-by=Tiller
                app.kubernetes.io/name=cert-manager
                helm.sh/chart=cert-manager-v0.1.0
                pod-template-hash=5ff78759f
                test=pv
Annotations: prometheus.io/path: /metrics
                prometheus.io/port: 9402
                prometheus.io/scrape: true
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/cm-cert-manager-5ff78759f
Containers:
  cert-manager:
    Image: registry.local:9001/quay.io/jetstack/cert-manager-controller:v0.15.0-alpha.1
    Port: 9402/TCP
    Host Port: 0/TCP
    Args:
      --v=2
      --cluster-resource-namespace=$(POD_NAMESPACE)
      --leader-election-namespace=kube-system
      --acme-http01-solver-image=registry.local:9001/quay.io/jetstack/cert-manager-acmesolver:v0.15.0-alpha.1
    Environment:
      POD_NAMESPACE: cert-manager (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from cm-cert-manager-token-hfhnn (ro)
Conditions:
  Type Status
  PodScheduled False
Volumes:
  cm-cert-manager-token-hfhnn:
    Type: Secret (a volume populated by a Secret)
    SecretName: cm-cert-manager-token-hfhnn
    Optional: false
QoS Class: BestEffort
Node-Selectors: node-role.kubernetes.io/master=
Tolerations: node.kubernetes.io/not-ready:NoExecute for 30s
                 node.kubernetes.io/unreachable:NoExecute for 30s
Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Warning FailedScheduling 27s (x10 over 8m3s) default-scheduler 0/2 nodes are available: 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules.
[sysadmin@controller-0 ~(keystone_admin)]$

4) An override with only podLabels (no replicaCount) was also tried on a simplex subcloud on DC. The new pod gets stuck in the Pending state, which eventually causes the override apply to fail.

cat cm_values_override.yaml
podLabels:
 test: pv

[sysadmin@controller-0 ~(keystone_admin)]$ kubectl describe pod cm-cert-manager-7c74dcf76b-m4zf6 -n cert-manager
Name: cm-cert-manager-7c74dcf76b-m4zf6
Namespace: cert-manager
Priority: 0
Node: <none>
Labels: app=cert-manager
                app.kubernetes.io/component=controller
                app.kubernetes.io/instance=cm-cert-manager
                app.kubernetes.io/managed-by=Tiller
                app.kubernetes.io/name=cert-manager
                helm.sh/chart=cert-manager-v0.1.0
                pod-template-hash=7c74dcf76b
                test=pv1
Annotations: prometheus.io/path: /metrics
                prometheus.io/port: 9402
                prometheus.io/scrape: true
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/cm-cert-manager-7c74dcf76b
Containers:
  cert-manager:
    Image: registry.local:9001/quay.io/jetstack/cert-manager-controller:v0.15.0-alpha.1
    Port: 9402/TCP
    Host Port: 0/TCP
    Args:
      --v=2
      --cluster-resource-namespace=$(POD_NAMESPACE)
      --leader-election-namespace=kube-system
      --acme-http01-solver-image=registry.local:9001/quay.io/jetstack/cert-manager-acmesolver:v0.15.0-alpha.1
    Environment:
      POD_NAMESPACE: cert-manager (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from cm-cert-manager-token-2rmln (ro)
Conditions:
  Type Status
  PodScheduled False
Volumes:
  cm-cert-manager-token-2rmln:
    Type: Secret (a volume populated by a Secret)
    SecretName: cm-cert-manager-token-2rmln
    Optional: false
QoS Class: BestEffort
Node-Selectors: node-role.kubernetes.io/master=
Tolerations: node.kubernetes.io/not-ready:NoExecute for 30s
                 node.kubernetes.io/unreachable:NoExecute for 30s
Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Warning FailedScheduling <unknown> default-scheduler 0/1 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules.
  Warning FailedScheduling <unknown> default-scheduler 0/1 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules.
[sysadmin@controller-0 ~(keystone_admin)]$ kubectl get nodes
NAME           STATUS   ROLES    AGE    VERSION
controller-0   Ready    master   149m   v1.18.1
[sysadmin@controller-0 ~(keystone_admin)]$ kubectl get pods -n cert-manager
NAME                                          READY   STATUS    RESTARTS   AGE
cm-cert-manager-5ff78759f-n6v4j               1/1     Running   1          146m
cm-cert-manager-7c74dcf76b-m4zf6              0/1     Pending   0          4m9s
cm-cert-manager-cainjector-56b68989b5-mzbgj   1/1     Running   2          146m
cm-cert-manager-webhook-7d5c897795-9qqgm      1/1     Running   1          146m
[sysadmin@controller-0 ~(keystone_admin)]$ kubectl get pods -n cert-manager -o wide
NAME                                          READY   STATUS    RESTARTS   AGE     IP                               NODE           NOMINATED NODE   READINESS GATES
cm-cert-manager-5ff78759f-n6v4j               1/1     Running   1          146m    dead:beef::8e22:765f:6121:eb48   controller-0   <none>           <none>
cm-cert-manager-7c74dcf76b-m4zf6              0/1     Pending   0          4m14s   <none>                           <none>         <none>           <none>
cm-cert-manager-cainjector-56b68989b5-mzbgj   1/1     Running   2          146m    dead:beef::8e22:765f:6121:eb4d   controller-0   <none>           <none>
cm-cert-manager-webhook-7d5c897795-9qqgm      1/1     Running   1          146m    dead:beef::8e22:765f:6121:eb47   controller-0   <none>           <none>
[sysadmin@controller-0 ~(keystone_admin)]$

Expected Behavior
------------------
The cm controller pods should be scaled across both controller nodes without any errors.

Actual Behavior
----------------
An extra pod stays in the Pending state with a FailedScheduling error.
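
One plausible reason the new pod is created before any old pod is removed: assuming the Deployment falls back to the Kubernetes default RollingUpdate limits of 25% maxSurge and 25% maxUnavailable (the report does not show the Deployment spec), Kubernetes rounds maxSurge up and maxUnavailable down when converting percentages to pod counts. The sketch below works through that arithmetic for 1 and 2 replicas; the function name is made up for illustration.

```python
import math


def rolling_update_bounds(replicas, max_surge_pct=25, max_unavailable_pct=25):
    """Convert percentage limits to absolute pod counts.

    Kubernetes rounds maxSurge up and maxUnavailable down.
    The 25% defaults here are an assumption about the chart in use.
    """
    max_surge = math.ceil(replicas * max_surge_pct / 100)
    max_unavailable = math.floor(replicas * max_unavailable_pct / 100)
    return max_surge, max_unavailable


for replicas in (1, 2):
    surge, unavailable = rolling_update_bounds(replicas)
    print(f"replicas={replicas}: maxSurge={surge}, maxUnavailable={unavailable}")
```

Under these assumptions both cases resolve to one surge pod and zero pods allowed to be unavailable, so the update must start an extra pod first, which matches the extra Pending pod seen on the simplex and duplex systems.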

Reproducibility
---------------
100%

System Configuration
--------------------

duplex system, wc_61_62_ipv4, ironpass_5_6

Branch/Pull Time/Commit
-----------------------
2020-04-28

Last Pass
---------
NA

Timestamp/Logs
--------------
2020-05-01T14:33:27.157256735Z

Test Activity
-------------
Feature testing

Workaround
----------
Remove and delete the application, then apply it again with the chart's default values.

ayyappa (mantri425) wrote :
description: updated
Ghada Khalil (gkhalil) wrote :

Marking as stx.4.0 for now until further investigation. Issue related to the stx.4.0 cert-mgr feature.

Need clarification from Greg Waines about which user scenario would require this override. I had assumed the replicaCount would be adjusted by the system based on the deployment config (SX vs DX)

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.4.0 stx.apps stx.security
Changed in starlingx:
assignee: nobody → Ghada Khalil (gkhalil)
ayyappa (mantri425)
description: updated
ayyappa (mantri425)
summary: - cert-manager failed to override,apply on duplex,standard system with
- replicaCount 2
+ cert-manager failed to override,apply on post-install
description: updated
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: Ghada Khalil (gkhalil) → Sabeel Ansari (sansariwr)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/730123

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/730123
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=66c60dd2250811f2154dbb2d5a672e3d14a8d20b
Submitter: Zuul
Branch: master

commit 66c60dd2250811f2154dbb2d5a672e3d14a8d20b
Author: Sabeel Ansari <email address hidden>
Date: Thu May 21 16:44:46 2020 -0400

    Helm overrides for cert-manager

    Adding helm override plugin for cert-manager. The changes here will
    set the number of pods value to max(1, number_of_controllers) depending
    on the system configuration.

    Closes-Bug: 1876328

    Change-Id: I3a36ea667678374deefa35e4454f944a739bcf18
    Signed-off-by: Sabeel Ansari <email address hidden>
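
The max(1, number_of_controllers) rule described in the commit message can be sketched in Python as follows. This is an illustration only: the function names and the shape of the returned dictionary are hypothetical and are not taken from the merged plugin. Only the replicaCount key and the max(...) rule come from this report.

```python
def cert_manager_replica_count(number_of_controllers):
    """Replica count rule from the commit message: max(1, number_of_controllers)."""
    return max(1, number_of_controllers)


def build_overrides(number_of_controllers):
    # Hypothetical helper: wraps the computed count in the chart's replicaCount key.
    return {"replicaCount": cert_manager_replica_count(number_of_controllers)}


# 1 controller corresponds to simplex, 2 to duplex or standard.
for controllers in (0, 1, 2):
    print(controllers, build_overrides(controllers))
```

The lower bound of 1 keeps at least one pod even if the controller count is reported as 0, for example before the hosts are fully inventoried (an assumption about why the bound exists).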

Changed in starlingx:
status: In Progress → Fix Released
ayyappa (mantri425) wrote :

Fix failed on
Lab: WCP_78_79
Load: 2020-05-29_20-00-00

Attached the logs

Changed in starlingx:
status: Fix Released → Confirmed
ayyappa (mantri425) wrote :

The fix addresses pod deployment on duplex and standard systems by creating cm pods on both controllers, but an override with only a pod label still fails: a new pod is created and stays in the "Pending" state indefinitely.

[sysadmin@controller-1 ~(keystone_admin)]$ cat cm_override_values.yaml
podLabels:
 test: pv

[sysadmin@controller-1 ~(keystone_admin)]$ kubectl get pods -n cert-manager
NAME                                         READY   STATUS    RESTARTS   AGE
cm-cert-manager-7c8dd5d8bc-5lxwh             0/1     Pending   0          2m49s
cm-cert-manager-856678cfb7-9h2lr             1/1     Running   4          22h
cm-cert-manager-856678cfb7-zg5f7             1/1     Running   0          5h59m
cm-cert-manager-cainjector-85849bd97-2vqdl   1/1     Running   3          22h
cm-cert-manager-cainjector-85849bd97-vnb8j   1/1     Running   0          5h59m
cm-cert-manager-webhook-5745478cbc-hfff8     1/1     Running   0          5h59m
cm-cert-manager-webhook-5745478cbc-kl4g4     1/1     Running   1          22h

OpenStack Infra (hudson-openstack) wrote : Fix merged to cert-manager-armada-app (master)

Reviewed: https://review.opendev.org/737124
Committed: https://git.openstack.org/cgit/starlingx/cert-manager-armada-app/commit/?id=e4d81aef48b663fd14f723373270eb1dc4dcbd7e
Submitter: Zuul
Branch: master

commit e4d81aef48b663fd14f723373270eb1dc4dcbd7e
Author: Robert Church <email address hidden>
Date: Thu Jun 18 23:55:43 2020 -0400

    Provide an update strategy to allow application updates

    Default chart behavior with anti-affinity policies set leave pods in a
    Pending state after an application update. Setting a RollingUpdate
    strategy allows the new application version to be applied.

    Change-Id: I0d3a84708283a198ef9534ca99f69453f56e01bb
    Partial-Bug: #1876328
    Signed-off-by: Robert Church <email address hidden>
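
The interaction between anti-affinity and the update strategy can be reasoned about with a small model. The sketch below rests on assumptions that this report does not confirm: that the chart's required anti-affinity allows at most one controller pod per node and matches pods from both the old and the new ReplicaSet, and that the fix corresponds to maxSurge 0 with maxUnavailable 1 (the commit message only says a RollingUpdate strategy was set). The function name is made up for illustration.

```python
def rollout_can_progress(nodes, replicas, max_surge, max_unavailable):
    """Simplified check of whether a rolling update can place its first new pod.

    Assumes required anti-affinity permits one controller pod per node and
    that all old pods are running before the update starts.
    """
    occupied_nodes = min(nodes, replicas)
    if max_unavailable > 0:
        # An old pod may be terminated first, which frees its node.
        return nodes > 0
    # No old pod may go away, so a surge pod needs a node with no old pod on it.
    return max_surge > 0 and nodes > occupied_nodes


cases = [
    ("simplex, surge first", 1, 1, 1, 0),
    ("duplex, surge first", 2, 2, 1, 0),
    ("simplex, terminate first", 1, 1, 0, 1),
    ("duplex, terminate first", 2, 2, 0, 1),
]
for label, nodes, replicas, surge, unavailable in cases:
    ok = rollout_can_progress(nodes, replicas, surge, unavailable)
    print(f"{label}: {'progresses' if ok else 'new pod stays Pending'}")
```

In this model the "surge first" cases reproduce the Pending pod from the description, while letting one old pod terminate before its replacement starts frees a node for the new pod.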

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/737125
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=fee3970e134d1823d1cd2cd6f9b06dc2a9f422fc
Submitter: Zuul
Branch: master

commit fee3970e134d1823d1cd2cd6f9b06dc2a9f422fc
Author: Robert Church <email address hidden>
Date: Sat Jun 20 03:33:20 2020 -0400

    Bump cert-mgr version

    New change to cert-manager to support a RollingUpdate strategy requires
    a version change to the playbooks.

    Change-Id: I826705b0ddd3f3b081f6777b3d6f6faa304c96c9
    Depends-On: https://review.opendev.org/#/c/737124
    Closes-Bug: #1876328
    Signed-off-by: Robert Church <email address hidden>

ayyappa (mantri425) wrote :

The fix works in build 2020-06-24_22-16-59; tested on labs wp_8_12 (IPv6) and ip_18_19 (IPv4).

tags: removed: stx.retestneeded