EasyRSA Charm

The second leader doesn't take over the previous leader's CA cert/key then initiates its own CA

Bug #1835258 reported by Nobuto Murata on 2019-07-03

This bug affects 2 people

Affects        Status        Importance  Assigned to  Milestone
EasyRSA Charm  Fix Released  High        Joseph Borg  EasyRSA Charm 1.15+ck1

Bug Description

When the first leader unit dies (hardware failure, etc.) and a second unit is added and becomes the new leader, the second leader initiates its own CA and issues server certificates to newly deployed units of other applications. Those certificates cannot be verified against the original CA, so the application deployment fails.

The first leader has already saved the CA cert and secret key into Juju's leader storage, so the second leader should take over those files instead of starting its own CA.

How to reproduce:
$ juju deploy ./etcd.yaml
$ juju add-unit -n2 etcd ## and verify new etcd units join the cluster with healthy state

Take down the original easyrsa unit.
$ lxc stop -f juju-796b78-0 ## machine of easyrsa/0

$ juju add-unit easyrsa ## deploy the next leader

$ juju add-unit etcd

Expected:
The last etcd unit joins the cluster.

Actual:
The last unit will have an unverifiable server cert and will be stuck on "Waiting to retry etcd registration"

$ juju run --application etcd 'openssl verify /var/snap/etcd/common/{ca,server}.crt'
- Stdout: |
    /var/snap/etcd/common/ca.crt: OK
    /var/snap/etcd/common/server.crt: OK
  UnitId: etcd/0
- Stdout: |
    /var/snap/etcd/common/ca.crt: OK
    /var/snap/etcd/common/server.crt: OK
  UnitId: etcd/1
- Stdout: |
    /var/snap/etcd/common/ca.crt: OK
    /var/snap/etcd/common/server.crt: OK
  UnitId: etcd/2
- ReturnCode: 2
  Stderr: |
    CN = 10.0.9.157
    error 20 at 0 depth lookup: unable to get local issuer certificate
  Stdout: |
    /var/snap/etcd/common/ca.crt: OK
    error /var/snap/etcd/common/server.crt: verification failed
  UnitId: etcd/3

$ juju status
Model  Controller           Cloud/Region         Version  SLA          Timestamp
etcd   localhost-localhost  localhost/localhost  2.6.4    unsupported  14:53:02Z

App      Version  Status   Scale  Charm    Store       Rev  OS      Notes
easyrsa  3.0.1    active     1/2  easyrsa  jujucharms  254  ubuntu
etcd     3.2.10   waiting      4  etcd     jujucharms  434  ubuntu

Unit        Workload  Agent  Machine  Public address  Ports     Message
easyrsa/0   unknown   lost   0        10.0.9.125                agent lost, see 'juju show-status-log easyrsa/0'
easyrsa/1*  active    idle   5        10.0.9.180                Certificate Authority connected.
etcd/0*     active    idle   1        10.0.9.78       2379/tcp  Healthy with 3 known peers
etcd/1      active    idle   2        10.0.9.147      2379/tcp  Healthy with 3 known peers
etcd/2      active    idle   3        10.0.9.92       2379/tcp  Healthy with 3 known peers
etcd/3      waiting   idle   4        10.0.9.157                Waiting to retry etcd registration

Nobuto Murata (nobuto) wrote on 2019-07-03:

#1

etcd.yaml (445 bytes, text/plain)

Nobuto Murata (nobuto) wrote on 2019-07-03:

#2

juju-crashdump-cba2fdc3-0635-4a94-9ad5-9f9270efcd81.tar.xz (322.5 KiB, application/x-tar)

summary:

The second leader doesn't take over the previous leader's CA cert/key
- then initiate its own CA
+ then initiates its own CA

Nobuto Murata (nobuto) wrote on 2019-07-03:

#3

easyrsa-leader-get-with-1.yaml (9.2 KiB, text/plain)

output of `juju run --unit easyrsa/1 -- leader-get`

Nobuto Murata (nobuto) wrote on 2019-07-03:

#4

For example, certificate_authority has already been overwritten by the second leader, even though Juju leader storage held the proper certificate_authority before.

Nobuto Murata (nobuto) wrote on 2019-07-04:

#5

Subscribing ~field-high.

We don't need easyrsa to be HA in the active-active sense, but we do need to keep the original CA cert/key so that further server certs can be issued for other applications. The current behavior, where the second unit overwrites and deletes the original CA in Juju leader storage once the first unit is dead, is therefore not appropriate.

We are still using easyrsa for etcd to bootstrap Vault HA in existing customer deployments. Until the following bug is addressed as a new feature, this issue needs a hotfix; otherwise, from an operational point of view, a single physical host failure will force us to recover etcd-vault clusters.
https://bugs.launchpad.net/vault-charm/+bug/1835356

Nobuto Murata (nobuto) wrote on 2019-07-04:

#6

Basically, this part needs a condition that decides whether to fetch the existing CA cert/key from Juju leader storage or to create a new one.
https://github.com/charmed-kubernetes/layer-easyrsa/blob/eb064667bc052a123a0e04b8d5545e87a0265ff8/reactive/easyrsa.py#L155-L158
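A minimal sketch of the kind of condition meant here. This is not the charm's actual code or the eventual fix: a plain dict stands in for Juju leader storage, and the function name, the `create_ca` callback, the file names, and the `certificate_authority_key` key name are all assumptions made for illustration (only `certificate_authority` is mentioned in this report).

```python
import os

# Key names are assumptions for this sketch; only "certificate_authority"
# appears in this bug report.
CA_CERT_KEY = "certificate_authority"
CA_KEY_KEY = "certificate_authority_key"


def _write(path, text, mode):
    """Write text to path and set its permission bits."""
    with open(path, "w") as handle:
        handle.write(text)
    os.chmod(path, mode)


def restore_or_create_ca(leader_data, pki_dir, create_ca):
    """Reuse the CA held in leader storage if present, else create a new one.

    leader_data: dict standing in for Juju leader storage.
    pki_dir:     directory where ca.crt and ca.key are written.
    create_ca:   hypothetical callable returning (cert_pem, key_pem).
    Returns "restored" or "created".
    """
    cert = leader_data.get(CA_CERT_KEY)
    key = leader_data.get(CA_KEY_KEY)
    if cert and key:
        # A previous leader already published a CA: take it over.
        action = "restored"
    else:
        # Nothing stored yet: create a CA and publish it for later leaders.
        cert, key = create_ca()
        leader_data[CA_CERT_KEY] = cert
        leader_data[CA_KEY_KEY] = key
        action = "created"
    os.makedirs(pki_dir, exist_ok=True)
    _write(os.path.join(pki_dir, "ca.crt"), cert, 0o644)
    _write(os.path.join(pki_dir, "ca.key"), key, 0o600)
    return action
```

With this shape, the first leader creates and publishes the CA, and any later leader that finds both values in storage writes them to disk instead of generating a new pair, so certificates issued after a failover still chain to the original CA.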

Joseph Borg (joeborg) on 2019-07-05

Changed in charm-easyrsa:
assignee:	nobody → Joseph Borg (joeborg)
importance:	Undecided → High
status:	New → In Progress

Joseph Borg (joeborg) wrote on 2019-07-05:

#7

Reproduced on AWS.

Joseph Borg (joeborg) wrote on 2019-07-05:

#8

For me, though, the original etcd units also go into an error state:

etcd/0*  active   idle  1  52.200.51.51   2379/tcp  Errored with 0 known peers
etcd/1   active   idle  2  100.27.33.107  2379/tcp  Errored with 0 known peers
etcd/2   active   idle  3  107.21.70.193  2379/tcp  Errored with 0 known peers
etcd/3   waiting  idle  5  3.83.44.198              Waiting to retry etcd registration

Joseph Borg (joeborg) wrote on 2019-07-05:

#9

Pending PR: https://github.com/charmed-kubernetes/layer-easyrsa/pull/21

Joseph Borg (joeborg) on 2019-07-08

Changed in charm-easyrsa:
status:	In Progress → Fix Committed

Cory Johns (johnsca) on 2019-07-09

Changed in charm-easyrsa:
assignee:	Joseph Borg (joeborg) → Cory Johns (johnsca)
assignee:	Cory Johns (johnsca) → Joseph Borg (joeborg)

Tim Van Steenburgh (tvansteenburgh) on 2019-07-15

Changed in charm-easyrsa:
milestone:	none → 1.15+ck1

George Kraft (cynerva) wrote on 2019-07-30:

#10

Cherry-picked to stable branch:

https://github.com/charmed-kubernetes/layer-easyrsa/commit/827d2c2db5467f197b0a18d48d3cce12ffbadb94

Tim Van Steenburgh (tvansteenburgh) on 2019-08-15

Changed in charm-easyrsa:
status:	Fix Committed → Fix Released
