System remove/reapply fails to start the vault pods with TLS handshake error

Bug #1888900 reported by ayyappa
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Michel Thebeau [WIND]

Bug Description

Brief Description
-----------------
System remove, reapply fails to start the Vault pods with "http: TLS handshake error from 172.16.154.5:40060: remote error: tls: bad certificate"

Severity
--------
Major

Steps to Reproduce
------------------
1) Upload and apply the app
2) Perform the following steps to enable the kubernetes auth method and the kv engine, and inject the secret
Get the auth methods
********************
curl --insecure --header "X-Vault-Token:s.wi1Jab7k24PpNpieFmJwBfK1" https://10.102.120.240:8200/v1/sys/auth

Enable kubernetes auth
********************************
curl --insecure --header "X-Vault-Token:s.wi1Jab7k24PpNpieFmJwBfK1" \
--request POST \
--data '{"type":"kubernetes","description":"kubernetes auth"}' \
  https://10.102.120.240:8200/v1/sys/auth/kubernetes

Configure kubernetes auth
***********************************
Get the CA cert and token from the vault pod

kubectl exec -n vault sva-vault-0 -- cat /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
kubectl exec -n vault sva-vault-0 -- cat /var/run/secrets/kubernetes.io/serviceaccount/token

curl \
    --insecure \
    --header "X-Vault-Token:s.wi1Jab7k24PpNpieFmJwBfK1" \
    --request POST \
    --data '{"kubernetes_host": "https://10.96.0.1:443", "kubernetes_ca_cert":"-----BEGIN CERTIFICATE-----\nMIICyDCCAbCgAwIBAgIBADANBgkqhkiG9w0BAQsFADAVMRMwEQYDVQQDEwprdWJlcm5ldGVzMB4XDTIwMDcyNDEzMTY1M1oXDTMwMDcyMjEzMTY1M1owFTETMBEGA1UEAxMKa3ViZXJuZXRlczCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEBAKMUzG64PKNDGSrKREkG5WaLasWI/SDmJ5oyxhUYythuYAWgFfzbTUfkMaAYlWFjfqj5AcYzQuA8J6LkJbKBjZVNKL7pGSXBf1NkRdp90tqGH34FmuwPv7A7tpz3sfP25mRldb76DYk8bDP9qFzqaC5Z1Uqp30MP04wKFlHoxJ+zZ0xLsE5Yh8J/9pjlKr3LaQUrG3iXbxDYMPP3KWf6HtrUDqtZH/p1vpuUCP1QE4A2ZuA+Krt/AeGvXYUyr8Q1uYLV8WpoIZsCUVS88FGZRUF4pCE0v7Btvm1v2A8An6gGdIoJLMTUmDC5ZUrQIfsEsOVzy1I3nw9XGN2ktOATN2ECAwEAAaMjMCEwDgYDVR0PAQH/BAQDAgKkMA8GA1UdEwEB/wQFMAMBAf8wDQYJKoZIhvcNAQELBQADggEBAFEr9KBCWWNsS1vO+xV1bXYfvE9ajJMI3zG3sZ/55AnjBmQkJTIcRrhd6DxbZGor0400yvteRCGWTTQKvv1Dbpa0Q0xdm2pllGlSXSI4DRdv2V0Doa2l28AYZjemVoOXXfLYVSWY90CxTRsLvVYVEnm2ogkG7moTiTmZJh/cTmUZpyrR0xYniaOOLVsGSwgvzSPa7kt06DM4HFWqlfUVdDMt/9Resr4HjX5PIDtP5sBKEmuviUdiTxO5d6xmhhcJ1IDmZZ4meKo7fElKmpZTAkaBSv5V3ct2Hx31loO1xhfLOjAyp8cDHqMIWcNxt+Qlm2bGQZNA6Hs+xnLnk3stWw8=\n-----END CERTIFICATE-----", 
"token_reviewer_jwt":"eyJhbGciOiJSUzI1NiIsImtpZCI6IjBvMWpjTmc0Q0FmT2Z6LTh0UzN6bV9MVWVmQWZLVVJ1OU9ma1I4TzliNXcifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJ2YXVsdCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJzdmEtdmF1bHQtdG9rZW4tbGZkdGoiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoic3ZhLXZhdWx0Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiNGYwOTZmY2YtOWU5NC00NjM3LWI5YWYtOWZjZDViZGM2YjE2Iiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50OnZhdWx0OnN2YS12YXVsdCJ9.OxPAVADb2_5mXSJDZpbBLToQlph3e4b3QGut8d0pGS2w-qKSfbJ_ksBIsabtH4s7xOlN9lUdLGANixnebrS-9PmkKA6tcPFi36ghvHT7GAZvRy4gK02c2pb91BvsGAGwWMt_egzseIxsIXG5o6uPqLiH1vEDDRQvDRF_CKNg2S79CtHKl2gyfZAI97YET4NKL7kaGWjnOQui8n5KMhsQ-CENprMa7eH5BGFXR-VX4g5f3zURmo4tXG5KKBqFoqNN2mapUjONH2SnzA70wfYWtx5XZPd4TKBj-eGHLfk-xQV9wk51etGuE_pjzx1FKq8WY53LOfLEFtPPTig-bmIIGQ"}' \
https://10.102.120.240:8200/v1/auth/kubernetes/config
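As a side note (not part of the original steps), the `token_reviewer_jwt` posted above is a standard Kubernetes service-account JWT, so its claims can be inspected locally before configuring Vault with it. A minimal sketch, using a hypothetical stand-in token rather than the real reviewer JWT:

```shell
# Sketch (hypothetical sample token, not the real reviewer JWT): a service
# account JWT is three base64url segments, so its claims can be decoded
# locally with base64 as a quick sanity check.
CLAIMS='{"iss":"kubernetes/serviceaccount","sub":"system:serviceaccount:vault:sva-vault"}'
JWT="eyJhbGciOiJSUzI1NiJ9.$(printf '%s' "$CLAIMS" | base64 | tr -d '=\n' | tr '/+' '_-').signature"

# Extract the payload (second dot-separated segment) and undo base64url encoding
PAYLOAD=$(printf '%s' "$JWT" | cut -d. -f2 | tr '_-' '/+')
# Restore padding to a multiple of 4 characters before decoding
while [ $(( ${#PAYLOAD} % 4 )) -ne 0 ]; do PAYLOAD="${PAYLOAD}="; done
DECODED=$(printf '%s' "$PAYLOAD" | base64 -d)
printf '%s\n' "$DECODED"
```

The `iss` and `sub` claims should match the service account Vault is expected to use for token review.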

Read the config
****************
curl --insecure --header "X-Vault-Token:s.wi1Jab7k24PpNpieFmJwBfK1" https://10.102.120.240:8200/v1/auth/kubernetes/config

Create the policy
************
curl --insecure \
    -H "X-Vault-Token: s.wi1Jab7k24PpNpieFmJwBfK1" \
    -H "Content-Type: application/json" \
    -X PUT \
    -d '{"policy":"path \"secret/basic-secret/*\" {capabilities = [\"read\"]}"}' \
    https://10.102.120.240:8200/v1/sys/policy/basic-secret-policy
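For readability, the JSON-escaped `policy` string in the request body above corresponds to this plain Vault policy HCL (written to a temp file here just to show it unescaped):

```shell
# The escaped "policy" value from the curl request above, as plain HCL:
# read-only access to everything under secret/basic-secret/
cat > /tmp/basic-secret-policy.hcl <<'EOF'
path "secret/basic-secret/*" {
  capabilities = ["read"]
}
EOF
cat /tmp/basic-secret-policy.hcl
```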

Create the role with policy and namespace
***************************************
curl --insecure \
    --header "X-Vault-Token:s.wi1Jab7k24PpNpieFmJwBfK1" \
    --request POST \
    --data '{ "bound_service_account_names": "basic-secret", "bound_service_account_namespaces": "pvtest", "policies": "basic-secret-policy", "max_ttl": "1800000"}' \
    https://10.102.120.240:8200/v1/auth/kubernetes/role/basic-secret-role

Read the role
************
curl --insecure --header "X-Vault-Token:s.wi1Jab7k24PpNpieFmJwBfK1" https://10.102.120.240:8200/v1/auth/kubernetes/role/basic-secret-role

Enable the secret engine
*********************
curl --insecure --header "X-Vault-Token:s.wi1Jab7k24PpNpieFmJwBfK1" \
--request POST \
--data '{"type": "kv","version":"2"}' \
  https://10.102.120.240:8200/v1/sys/mounts/secret

Create the secrets
*************************
curl --insecure \
 -H "X-Vault-Token: s.wi1Jab7k24PpNpieFmJwBfK1" \
 -H "Content-Type: application/json" \
 -X POST -d '{"username":"pvtest","password":"Li69nux*"}' \
  https://10.102.120.240:8200/v1/secret/basic-secret/helloworld

Check the secret
***********************
curl --insecure --header "X-Vault-Token:s.wi1Jab7k24PpNpieFmJwBfK1" https://10.102.120.240:8200/v1/secret/basic-secret/helloworld

cat helloworld.yaml

apiVersion: v1
kind: Namespace
metadata:
  name: pvtest
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: basic-secret
  namespace: pvtest
  labels:
    app: basic-secret
spec:
  selector:
    matchLabels:
      app: basic-secret
  replicas: 1
  template:
    metadata:
      annotations:
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/tls-skip-verify: "true"
        vault.hashicorp.com/agent-inject-secret-helloworld: "secret/basic-secret/helloworld"
        vault.hashicorp.com/agent-inject-template-helloworld: |
          {{- with secret "secret/basic-secret/helloworld" -}}
          {
            "username" : "{{ .Data.username }}",
            "password" : "{{ .Data.password }}"
          }
          {{- end }}
        vault.hashicorp.com/role: "basic-secret-role"
      labels:
        app: basic-secret
    spec:
      serviceAccountName: basic-secret
      containers:
      - name: app
        image: jweissig/app:0.0.1
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: basic-secret
  namespace: pvtest
  labels:
    app: basic-secret

Apply the manifest and verify the pod is running
********************************
kubectl create -f helloworld.yaml

Verify secrets injected into the pod
******************************************
 kubectl exec -n pvtest basic-secret-55d6c9bb6f-4whbp -- cat /vault/secrets/helloworld
Defaulting container name to app.
Use 'kubectl describe pod/basic-secret-55d6c9bb6f-4whbp -n pvtest' to see all of the containers in this pod.
{
  "username" : "pvtest",
  "password" : "Li69nux*"
}
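The file above is rendered by the `agent-inject-template-helloworld` annotation. Since the file only exists inside the pod, a local sketch of the same check (simulating the expected rendered content, then verifying it parses as JSON):

```shell
# Simulate the rendered /vault/secrets/helloworld content locally and
# verify the template output is valid JSON.
cat > /tmp/helloworld.json <<'EOF'
{
  "username" : "pvtest",
  "password" : "Li69nux*"
}
EOF
python3 -m json.tool /tmp/helloworld.json
```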

3) Remove the vault app without deleting the PVC
system application-remove vault
4) Reapply the app
system application-apply vault

5) Wait for the vault pods to initialize; they never reach the ready state
[sysadmin@controller-1 ~(keystone_admin)]$ kubectl get pods -n vault
NAME                                       READY   STATUS    RESTARTS   AGE
sva-vault-0                                0/1     Running   0          2m25s
sva-vault-1                                0/1     Running   0          2m25s
sva-vault-2                                0/1     Running   0          2m25s
sva-vault-agent-injector-db6878c69-z5rtg   1/1     Running   0          2m25s
sva-vault-manager-0                        1/1     Running   0          2m25s

6) The pod log shows the following error
[sysadmin@controller-1 ~(keystone_admin)]$ tail -f /var/log/pods/vault_sva-vault-2_5747c4fe-446b-4ff7-8ce2-cd40df2a9342/vault/0.log
2020-07-24T20:24:35.666429779Z stderr F 2020-07-24T20:24:35.666Z [INFO] http: TLS handshake error from 172.16.154.5:40060: remote error: tls: bad certificate
2020-07-24T20:24:41.049837853Z stderr F 2020-07-24T20:24:41.049Z [INFO] http: TLS handshake error from 172.16.154.5:40088: remote error: tls: bad certificate
2020-07-24T20:24:46.484364442Z stderr F 2020-07-24T20:24:46.484Z [INFO] http: TLS handshake error from 172.16.154.5:40118: remote error: tls: bad certificate
2020-07-24T20:24:51.88965778Z stderr F 2020-07-24T20:24:51.889Z [INFO] http: TLS handshake error from 172.16.154.5:40150: remote error: tls: bad certificate
2020-07-24T20:24:57.315645712Z stderr F 2020-07-24T20:24:57.315Z [INFO] http: TLS handshake error from 172.16.154.5:40184: remote error: tls: bad certificate
2020-07-24T20:25:02.741276992Z stderr F 2020-07-24T20:25:02.741Z [INFO] http: TLS handshake error from 172.16.154.5:40212: remote error: tls: bad certificate
2020-07-24T20:25:08.161556419Z stderr F 2020-07-24T20:25:08.161Z [INFO] http: TLS handshake error from 172.16.154.5:40242: remote error: tls: bad certificate
2020-07-24T20:25:13.597918351Z stderr F 2020-07-24T20:25:13.597Z [INFO] http: TLS handshake error from 172.16.154.5:40270: remote error: tls: bad certificate
2020-07-24T20:25:19.011553862Z stderr F 2020-07-24T20:25:19.011Z [INFO] http: TLS handshake error from 172.16.154.5:40302: remote error: tls: bad certificate
2020-07-24T20:25:24.456478768Z stderr F 2020-07-24T20:25:24.456Z [INFO] http: TLS handshake error from 172.16.154.5:40338: remote error: tls: bad certificate
2020-07-24T20:25:29.862866032Z stderr F 2020-07-24T20:25:29.862Z [INFO] http: TLS handshake error from 172.16.154.5:40366: remote error: tls: bad certificate
2020-07-24T20:25:35.284795945Z stderr F 2020-07-24T20:25:35.284Z [INFO] http: TLS handshake error from 172.16.154.5:40398: remote error: tls: bad certificate
2020-07-24T20:25:40.731831023Z stderr F 2020-07-24T20:25:40.731Z [INFO] http: TLS handshake error from 172.16.154.5:40426: remote error: tls: bad certificate
2020-07-24T20:25:46.181830461Z stderr F 2020-07-24T20:25:46.181Z [INFO] http: TLS handshake error from 172.16.154.5:40458: remote error: tls: bad certificate
2020-07-24T20:25:51.617320245Z stderr F 2020-07-24T20:25:51.617Z [INFO] http: TLS handshake error from 172.16.154.5:40480: remote error: tls: bad certificate
2020-07-24T20:25:57.074625326Z stderr F 2020-07-24T20:25:57.074Z [INFO] http: TLS handshake error from 172.16.154.5:40522: remote error: tls: bad certificate
2020-07-24T20:26:02.515011819Z stderr F 2020-07-24T20:26:02.514Z [INFO] http: TLS handshake error from 172.16.154.5:40552: remote error: tls: bad certificate

Server logs are attached; please check.

Expected Behavior
-----------------
The pods should run successfully, and the secrets should persist on the system.

Actual Behavior
---------------
The pods do not reach the ready state.

Reproducibility
---------------
100%

System Configuration
--------------------
standard wcp_3_6 ipv4

Branch/Pull Time/Commit
-----------------------
2020-07-24_00-00-00

Last Pass
---------
This is a new test scenario

Timestamp/Logs
--------------
2020-07-24T15:43:40.922783884Z

Test Activity
-------------
Feature Testing

Workaround
----------
Haven't found any

Revision history for this message
ayyappa (mantri425) wrote :
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Cole Walker (cwalops)
tags: added: stx.apps
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.5.0 / medium - failure scenario test-case; should be investigated

tags: added: stx.5.0
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Cole Walker (cwalops) wrote :

This appears to be caused by a combination of how vault is bootstrapping tls and how cert-manager is currently configured to manage the k8s secrets it creates. I'll provide a bit of background on how vault TLS is set up and propose a solution.

Vault currently bootstraps its TLS by generating a key pair during the helm template stage, and then using that key pair to create a cert-manager issuer resource. See:
kubectl get issuers.cert-manager.io -n vault

Vault then provisions a certificate from that cert-manager issuer, which creates a certificate resource and a secret containing the certificate data. This secret is what is consumed by the vault components to run TLS.
[sysadmin@controller-0 ~(keystone_admin)]$ kubectl get certificate -n vault
NAME               READY   SECRET             AGE
vault-server-tls   True    vault-server-tls   39m

[sysadmin@controller-0 ~(keystone_admin)]$ kubectl get secrets -n vault vault-server-tls
NAME               TYPE                DATA   AGE
vault-server-tls   kubernetes.io/tls   3      39m

When vault is deleted with system application-remove vault, the issuer and certificate resources are deleted, but the secret is left behind because it is not owned by the certificate resource. This is intended behaviour for how cert-manager is configured. This orphaned secret does not get updated on a subsequent reapply and is then invalid when vault attempts to use it.

Cert-manager can be configured to clean up secrets when the corresponding certificate resource is removed by enabling this option via its helm chart:
extraArgs: []
  # When this flag is enabled, secrets will be automatically removed when the certificate resource is deleted
  # - --enable-certificate-owner-ref=true

This can be tested using an override like this:
[sysadmin@controller-0 ~(keystone_admin)]$ cat cm.yaml
extraArgs:
  - --enable-certificate-owner-ref=true

system helm-override-update --values cm.yaml cert-manager cert-manager cert-manager
system application-apply cert-manager

With this enabled, vault can be removed and reapplied properly.

The alternative to this is to manually delete the orphaned secret before applying vault.
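That manual workaround can be sketched as follows (the secret name is taken from the `kubectl get secrets -n vault` output above; this requires a live cluster, so it is an untested ops fragment rather than a verified recipe):

```shell
# Workaround sketch: delete the orphaned TLS secret left behind by the
# previous apply, then reapply vault so cert-manager issues a fresh one.
kubectl delete secret -n vault vault-server-tls
system application-apply vault
```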

Cole Walker (cwalops)
Changed in starlingx:
status: Triaged → Confirmed
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Based on the investigation above, this is now lower priority given there is a workaround. We'll still target fixing this in the stx.5.0 time-frame.

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: Cole Walker (cwalops) → Michel Thebeau [WIND] (mthebeau)
Revision history for this message
Michel Thebeau [WIND] (mthebeau) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per review with Michel Thebeau and Greg Waines, this is a risky change and requires lots of testing, so it's too late to introduce it for stx.5.0. There is a workaround noted in the notes from Cole Walker above. Moving to stx.6.0

tags: added: stx.6.0
removed: stx.5.0
Changed in starlingx:
status: Confirmed → In Progress
Revision history for this message
Michel Thebeau [WIND] (mthebeau) wrote :

Testing includes:

 - AIO-SX. Boot from unfixed starlingx ISO, updating cert-manager from private build
 - AIO-SX. Boot from private build ISO that contains new version of cert-manager
 - AIO-SX. apply vault before updating cert-manager (Result: requires vault reapply for cert-manager's enable-certificate-owner-ref option to take effect for vault)
 - AIO-DX. From private build ISO that contains new version of cert-manager
 - Sanity on H/W (2+2+4). From private build with updated cert-manager
 - Provision distributed cloud from private build ISO with cert-manager change, AIO-DX system controller and AIO-SX subcloud; delete subcloud.
 - backup and restore AIO-SX
 - backup/restore standard 2+1
 - backup/restore H/W standard 2+2
 - platform upgrade on H/W, standard 2+2: with cert-manager change only on the new load; apply vault after upgrade completes (apply, remove, delete, repeat, exercise vault, etc.)
 - platform upgrade standard 2+1: with cert-manager change on both loads; apply vault after upgrade completes (apply, remove, delete, repeat, etc.)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cert-manager-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/cert-manager-armada-app/+/784812
Committed: https://opendev.org/starlingx/cert-manager-armada-app/commit/d72a3d49bc7a5bf55e965cc741e27122f8982cb6
Submitter: "Zuul (22348)"
Branch: master

commit d72a3d49bc7a5bf55e965cc741e27122f8982cb6
Author: Michel Thebeau <email address hidden>
Date: Wed Mar 24 19:12:30 2021 -0400

    add extraArgs enable-certificate-owner-ref

    When removing an application (e.g., vault) that had provisioned a
    certificate with cert-manager, "the issuer and certificate resources are
    deleted, but the secret is left behind because it is not owned by the
    certificate resource... This orphaned secret does not get updated on a
    subsequent reapply and is then invalid when vault attempts to use it."

    The workaround was to remove the secret manually before reapply.

    Enable the option by default for StarlingX. Secrets will be
    automatically removed when the certificate resource is deleted.

    Closes-Bug: 1888900
    Change-Id: I2b057a71da8dd761a891fc879ad9860c9822cba0
    Signed-off-by: Michel Thebeau <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
tags: removed: stx.retestneeded