cert requests not handled when the original leader vault is not available

Bug #1836348 reported by Yoshi Kadokawa
This bug affects 1 person
Affects      Status        Importance  Assigned to  Milestone
vault-charm  Fix Released  High        Cory Johns

Bug Description

After the leader unit of vault becomes unavailable or is removed for any reason,
a unit added to kubernetes-master gets stuck in the "Waiting for master components to start" status.

The steps to reproduce are as follows.

1. Deploy CDK with Vault HA
I used this bundle.
2. Remove the leader unit of vault or take it down
$ juju run -a vault is-leader
- Stdout: |
    False
  UnitId: vault/0
- Stdout: |
    True
  UnitId: vault/1
- Stdout: |
    False
  UnitId: vault/2
$ juju remove-unit --force vault/1
3. Add a kubernetes-master unit
$ juju add-unit kubernetes-master
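
Step 2 depends on reading the leader out of the `juju run -a vault is-leader` output. A minimal sketch of that lookup, assuming the output keeps the layout shown above (a `True`/`False` line followed by a `UnitId:` line); the function name `find_leader` is hypothetical and the text is parsed by hand rather than with a YAML library:

```python
def find_leader(juju_run_output):
    """Return the UnitId whose is-leader result was True, or None."""
    last_value = None
    for raw in juju_run_output.splitlines():
        line = raw.strip()
        if line in ("True", "False"):
            # Remember the most recent is-leader result.
            last_value = (line == "True")
        elif line.startswith("UnitId:"):
            unit = line.split(":", 1)[1].strip()
            if last_value:
                return unit
            last_value = None
    return None


# Sample copied from the output in step 2.
SAMPLE = """\
- Stdout: |
    False
  UnitId: vault/0
- Stdout: |
    True
  UnitId: vault/1
- Stdout: |
    False
  UnitId: vault/2
"""

print(find_leader(SAMPLE))  # prints vault/1
```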

After a while, the added kubernetes-master unit gets stuck in the "Waiting for master components to start" status.
Since the "tls_client.certs.saved" flag is not set and no certificates are found under /root/cdk/,
I believe the update_certs() function[0] somehow fails to retrieve the certificates from Vault when the original leader unit is gone.

$ juju run -a kubernetes-master -- "charms.reactive -p get_flags | grep tls_client.certs.saved"
- Stdout: |2
     'tls_client.certs.saved',
  UnitId: kubernetes-master/0
- ReturnCode: 1
  Stdout: ""
  UnitId: kubernetes-master/1
$ juju run -a kubernetes-master -- sudo ls -al /root/cdk
- Stdout: |
    total 52
    drwxrwx--- 4 root root 4096 Jul 12 09:22 .
    drwx------ 7 root root 4096 Jul 12 09:29 ..
    drwxr-xr-x 2 root root 4096 Jul 12 09:23 audit
    -rw-r--r-- 1 root root 61 Jul 12 09:05 basic_auth.csv
    -r--r----- 1 root root 1245 Jul 12 09:22 ca.crt
    -rw-r--r-- 1 root root 1406 Jul 12 09:22 client.crt
    -rw-r--r-- 1 root root 1678 Jul 12 09:22 client.key
    drwxr-xr-x 2 root root 4096 Jul 12 09:22 etcd
    -rw-r--r-- 1 root root 385 Jul 12 09:10 known_tokens.csv
    -rw------- 1 root root 2014 Jul 12 09:33 kubeproxyconfig
    -rw-r--r-- 1 root root 1670 Jul 12 09:22 server.crt
    -rw-r--r-- 1 root root 1674 Jul 12 09:22 server.key
    -rw------- 1 root root 1675 Jul 12 09:05 serviceaccount.key
  UnitId: kubernetes-master/0
- Stdout: |
    total 20
    drwxr-xr-x 2 root root 4096 Jul 12 09:40 .
    drwx------ 6 root root 4096 Jul 12 09:46 ..
    -rw-r--r-- 1 root root 60 Jul 12 10:11 basic_auth.csv
    -rw-r--r-- 1 root root 385 Jul 12 10:11 known_tokens.csv
    -rw-r--r-- 1 root root 1675 Jul 12 10:11 serviceaccount.key
  UnitId: kubernetes-master/1

[0] https://github.com/juju-solutions/layer-tls-client/blob/master/reactive/tls_client.py#L93
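
The comparison of the two /root/cdk listings above can be sketched as a small check. This is only an illustration: the set of expected file names is an assumption taken from what the healthy unit (kubernetes-master/0) has on disk, and `missing_cert_files` is a hypothetical helper, not part of any charm.

```python
# Assumption: these are the certificate files a healthy unit ends up with,
# taken from the kubernetes-master/0 listing in this report.
EXPECTED_CERT_FILES = {
    "ca.crt", "client.crt", "client.key", "server.crt", "server.key",
}


def missing_cert_files(ls_output):
    """Return the expected cert files that do not appear in `ls -al` output."""
    present = set()
    for line in ls_output.splitlines():
        fields = line.split()
        # Long-format entries have at least 9 fields; the last is the name.
        if len(fields) >= 9:
            present.add(fields[-1])
    return EXPECTED_CERT_FILES - present


# Listing copied from kubernetes-master/1 (the stuck unit).
STUCK_UNIT = """\
total 20
drwxr-xr-x 2 root root 4096 Jul 12 09:40 .
drwx------ 6 root root 4096 Jul 12 09:46 ..
-rw-r--r-- 1 root root 60 Jul 12 10:11 basic_auth.csv
-rw-r--r-- 1 root root 385 Jul 12 10:11 known_tokens.csv
-rw-r--r-- 1 root root 1675 Jul 12 10:11 serviceaccount.key
"""

print(sorted(missing_cert_files(STUCK_UNIT)))
```

For the stuck unit all five expected files are reported missing, which matches the unset "tls_client.certs.saved" flag.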

Yoshi Kadokawa (yoshikadokawa) wrote :

Here is the bundle that I have used.

Yoshi Kadokawa (yoshikadokawa) wrote :

Subscribing this to field-critical, since this is the last item blocking completion of a project.

Yoshi Kadokawa (yoshikadokawa) wrote :

To unseal the vault cluster, I followed exactly the steps described here:
https://ubuntu.com/kubernetes/docs/using-vault

I could also reproduce this issue with totally-unsecure-auto-unlock=true.

George Kraft (cynerva)
Changed in charm-kubernetes-master:
importance: Undecided → Critical
Cory Johns (johnsca)
Changed in vault-charm:
assignee: nobody → Cory Johns (johnsca)
status: New → In Progress
Cory Johns (johnsca) wrote :

This is actually being caused by a bug in the vault charm. When leadership changes, the flag indicating that the CA has been configured doesn't get updated on the new leader.
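
The failure mode described here can be illustrated with a toy model. This is not the vault charm's code: the `Unit` class, the helper names, and the gating rule (only a leader holding the flag serves cert requests) are assumptions inferred from this comment, with each unit keeping its own local flag set.

```python
CA_READY = "charm.vault.ca.ready"


class Unit:
    """Toy stand-in for a vault unit with its own local flag store."""

    def __init__(self, name):
        self.name = name
        self.flags = set()


def configure_ca(unit):
    # Only the unit that performs the CA setup records the flag locally.
    unit.flags.add(CA_READY)


def handles_cert_requests(unit, leader):
    # Assumed gate: the unit must be the leader AND hold the CA-ready flag.
    return unit is leader and CA_READY in unit.flags


units = {name: Unit(name) for name in ("vault/0", "vault/1", "vault/2")}

# vault/1 is the original leader and configures the CA.
leader = units["vault/1"]
configure_ca(leader)
before = handles_cert_requests(leader, leader)

# The leader is removed; vault/2 takes over but never had the flag set.
del units["vault/1"]
leader = units["vault/2"]
after_failover = handles_cert_requests(leader, leader)

# Setting the flag by hand on the new leader restores request handling.
leader.flags.add(CA_READY)
after_flag_set = handles_cert_requests(leader, leader)

print(before, after_failover, after_flag_set)  # prints True False True
```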

Until a fix is available in the vault charm, you can recover from this bad state by running:

juju run --unit vault/2 -- 'charms.reactive set_flag charm.vault.ca.ready ; hooks/update-status'

(assuming vault/2 is the new leader)
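
Since the unit name depends on which unit currently holds leadership, the command above can be parameterized. A small sketch, assuming the same `juju run --unit` invocation quoted in this comment; `workaround_command` is a hypothetical helper and nothing is executed here, the command is only built and printed:

```python
import shlex

# Remote command copied from the workaround in this comment.
REMOTE = "charms.reactive set_flag charm.vault.ca.ready ; hooks/update-status"


def workaround_command(leader_unit):
    """Build the argv for the workaround against the given leader unit."""
    return ["juju", "run", "--unit", leader_unit, "--", REMOTE]


argv = workaround_command("vault/2")
# Quote each argument so the result can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in argv))
```

The list form can be handed to `subprocess.run` as-is, which avoids a second layer of shell quoting.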

Cory Johns (johnsca) wrote :
no longer affects: charm-kubernetes-master
summary: - add-unit kubernetes-master will stuck in "Waiting for master components
- to start" when the original leader vault is not available
+ cert requests not handled when the original leader vault is not
+ available
Yoshi Kadokawa (yoshikadokawa) wrote :

Thank you for the workaround.
I can confirm that setting the charm.vault.ca.ready flag mitigates the issue.

Ryan Beisner (1chb1n) wrote :

FYI, the fix has landed in charm-vault @ master, and I've cherry-picked it back to the stable/19.04 branch as a backport. Track status at:

https://review.opendev.org/#/q/topic:bug/1836348+(status:open+OR+status:merged)

Ryan Beisner (1chb1n)
Changed in vault-charm:
status: In Progress → Fix Committed
milestone: none → 19.07
importance: Undecided → High
status: Fix Committed → Fix Released