The second leader doesn't take over the previous leader's CA cert/key then initiates its own CA

Bug #1835258 reported by Nobuto Murata
This bug affects 2 people
Affects        Status        Importance  Assigned to  Milestone
EasyRSA Charm  Fix Released  High        Joseph Borg

Bug Description

When the first leader unit is dead (by hardware failure, etc.) and a second unit is added as the new leader, the second leader initiates its own CA and issues server certificates to newly deployed units of other applications. Those certificates cannot be verified against the original CA, so application deployment fails.

The first leader already saved the CA cert and secret key into Juju's leader storage, so the second leader should take over those files and should not start its own CA.

How to reproduce:
$ juju deploy ./etcd.yaml
$ juju add-unit -n2 etcd ## and verify new etcd units join the cluster with healthy state

Take down the original easyrsa unit.
$ lxc stop -f juju-796b78-0 ## machine of easyrsa/0

$ juju add-unit easyrsa ## deploy the next leader

$ juju add-unit etcd

Expected:
The last etcd unit joins the cluster.

Actual:
The last unit will have an unverifiable server cert and will be stuck on "Waiting to retry etcd registration"

$ juju run --application etcd 'openssl verify /var/snap/etcd/common/{ca,server}.crt'
- Stdout: |
    /var/snap/etcd/common/ca.crt: OK
    /var/snap/etcd/common/server.crt: OK
  UnitId: etcd/0
- Stdout: |
    /var/snap/etcd/common/ca.crt: OK
    /var/snap/etcd/common/server.crt: OK
  UnitId: etcd/1
- Stdout: |
    /var/snap/etcd/common/ca.crt: OK
    /var/snap/etcd/common/server.crt: OK
  UnitId: etcd/2
- ReturnCode: 2
  Stderr: |
    CN = 10.0.9.157
    error 20 at 0 depth lookup: unable to get local issuer certificate
  Stdout: |
    /var/snap/etcd/common/ca.crt: OK
    error /var/snap/etcd/common/server.crt: verification failed
  UnitId: etcd/3

$ juju status
Model  Controller           Cloud/Region         Version  SLA          Timestamp
etcd   localhost-localhost  localhost/localhost  2.6.4    unsupported  14:53:02Z

App      Version  Status   Scale  Charm    Store       Rev  OS      Notes
easyrsa  3.0.1    active     1/2  easyrsa  jujucharms  254  ubuntu
etcd     3.2.10   waiting      4  etcd     jujucharms  434  ubuntu

Unit        Workload  Agent  Machine  Public address  Ports     Message
easyrsa/0   unknown   lost   0        10.0.9.125                agent lost, see 'juju show-status-log easyrsa/0'
easyrsa/1*  active    idle   5        10.0.9.180                Certificate Authority connected.
etcd/0*     active    idle   1        10.0.9.78       2379/tcp  Healthy with 3 known peers
etcd/1      active    idle   2        10.0.9.147      2379/tcp  Healthy with 3 known peers
etcd/2      active    idle   3        10.0.9.92       2379/tcp  Healthy with 3 known peers
etcd/3      waiting   idle   4        10.0.9.157                Waiting to retry etcd registration

Tags: cpe-onsite
Nobuto Murata (nobuto) wrote :
summary: The second leader doesn't take over the previous leader's CA cert/key
- then initiate its own CA
+ then initiates its own CA
Nobuto Murata (nobuto) wrote :

output of `juju run --unit easyrsa/1 -- leader-get`

Nobuto Murata (nobuto) wrote :

For example, certificate_authority has already been overwritten by the second leader, even though Juju leader storage held a proper certificate_authority before.
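
One way to confirm that the CA really was replaced is to compare the fingerprint of the certificate_authority value from leader storage with the ca.crt that an older etcd unit still holds. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and it assumes each input is a single PEM-encoded certificate that has already been fetched as text.

```python
import hashlib
import ssl


def cert_fingerprint(pem_text: str) -> str:
    """Return the SHA-256 fingerprint (hex) of a single PEM certificate."""
    # PEM_cert_to_DER_cert expects the text to start with the PEM header,
    # so surrounding whitespace is stripped first.
    der = ssl.PEM_cert_to_DER_cert(pem_text.strip())
    return hashlib.sha256(der).hexdigest()


def ca_was_replaced(leader_ca_pem: str, unit_ca_pem: str) -> bool:
    """True when leader storage and a unit hold different CA certificates."""
    return cert_fingerprint(leader_ca_pem) != cert_fingerprint(unit_ca_pem)
```

If ca_was_replaced() returns True for the leader's value and the ca.crt of a unit deployed before the failover, the new leader has published a different CA than the one those units trust.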

Nobuto Murata (nobuto) wrote :

Subscribing ~field-high.

We don't need easyrsa to be HA in an active-active sense, but we do need to keep the original CA cert/key so that further server certs can be issued for other applications. The current behavior, in which the second unit overwrites and deletes the original CA in Juju leader storage once the first unit is dead, is therefore not appropriate.

We are still using easyrsa for etcd to bootstrap Vault HA in existing customer deployments. Until the following bug is addressed as a new feature, this issue needs a hotfix. Otherwise, from an operational point of view, a single physical host failure will force us to recover etcd-vault clusters.
https://bugs.launchpad.net/vault-charm/+bug/1835356

Nobuto Murata (nobuto) wrote :

Basically, this part needs a condition that decides whether to download the existing CA cert/key from Juju leader storage or to create a new one.
https://github.com/charmed-kubernetes/layer-easyrsa/blob/eb064667bc052a123a0e04b8d5545e87a0265ff8/reactive/easyrsa.py#L155-L158
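
The condition described above could look roughly like the sketch below. This is not the charm's actual code or the fix that was released: ensure_ca, create_ca, and restore_ca are hypothetical names, leader_data is a plain dict standing in for what leader-get returns, and only the certificate_authority key is named in this report (the key name for the private key is an assumption).

```python
from typing import Callable, Dict

# "certificate_authority" appears in this report; the private-key entry
# name below is an assumption made for illustration only.
CA_CERT_KEY = "certificate_authority"
CA_KEY_KEY = "certificate_authority_key"


def ensure_ca(
    leader_data: Dict[str, str],
    create_ca: Callable[[], Dict[str, str]],
    restore_ca: Callable[[str, str], None],
) -> str:
    """Reuse the CA held in leader storage if present, else create one.

    Returns "restored" or "created"; leader_data is only modified when a
    new CA is created.
    """
    cert = leader_data.get(CA_CERT_KEY)
    key = leader_data.get(CA_KEY_KEY)
    if cert and key:
        # A previous leader already published a CA: write it locally
        # instead of generating a new one, so existing certs keep verifying.
        restore_ca(cert, key)
        return "restored"
    # No CA published yet: generate one and publish it for future leaders.
    new_ca = create_ca()
    leader_data[CA_CERT_KEY] = new_ca[CA_CERT_KEY]
    leader_data[CA_KEY_KEY] = new_ca[CA_KEY_KEY]
    return "created"
```

In a real charm the two callables would wrap the easyrsa file handling, and the writes to leader_data would go through leader-set. Passing them in as parameters keeps the branch itself easy to test without a Juju environment.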

Joseph Borg (joeborg)
Changed in charm-easyrsa:
assignee: nobody → Joseph Borg (joeborg)
importance: Undecided → High
status: New → In Progress
Joseph Borg (joeborg) wrote :

Reproduced on AWS.

Joseph Borg (joeborg) wrote :

For me, though, the original etcd units also go into an error state:

etcd/0*  active   idle  1  52.200.51.51   2379/tcp  Errored with 0 known peers
etcd/1   active   idle  2  100.27.33.107  2379/tcp  Errored with 0 known peers
etcd/2   active   idle  3  107.21.70.193  2379/tcp  Errored with 0 known peers
etcd/3   waiting  idle  5  3.83.44.198              Waiting to retry etcd registration

Joseph Borg (joeborg)
Changed in charm-easyrsa:
status: In Progress → Fix Committed
Cory Johns (johnsca)
Changed in charm-easyrsa:
assignee: Joseph Borg (joeborg) → Cory Johns (johnsca)
assignee: Cory Johns (johnsca) → Joseph Borg (joeborg)
Changed in charm-easyrsa:
milestone: none → 1.15+ck1
George Kraft (cynerva) wrote :
Changed in charm-easyrsa:
status: Fix Committed → Fix Released