juju add-unit etcd does not work - falsely claims that a certificates relation is missing

Bug #1832883 reported by Hans van den Bogert on 2019-06-14
Affects: EasyRSA Charm
Importance: Undecided
Assigned to: Cory Johns

Bug Description

I can't seem to grow my etcd cluster from 2 members to 3 by doing `juju add-unit etcd`.

```
    Model Controller Cloud/Region Version SLA Timestamp
    controller maas maas 2.5.7 unsupported 22:19:33+02:00

    App Version Status Scale Charm Store Rev OS Notes
    easyrsa 3.0.1 active 1 easyrsa jujucharms 250 ubuntu
    etcd 3.1.10 blocked 3 etcd jujucharms 428 ubuntu
    flannel 0.10.0 active 5 flannel jujucharms 398 ubuntu
    kubernetes-master 1.14.2 active 1 kubernetes-master jujucharms 636 ubuntu
    kubernetes-worker 1.14.2 active 4 kubernetes-worker jujucharms 398 ubuntu exposed
...
...
...
etcd/18* active idle 10 192.168.1.156 2379/tcp Healthy with 2 known peers
etcd/20 active idle 0 192.168.1.185 2379/tcp Healthy with 2 known peers
etcd/32 blocked idle 32 192.168.1.222 Missing relation to certificate authority.
```

The logs are not really helpful for the uninitiated:

```
unit-etcd-32: 20:01:40 INFO unit.etcd/32.juju-log Reactive main running for hook update-status
unit-etcd-32: 20:01:40 DEBUG unit.etcd/32.juju-log tracer>
tracer: set flag config.set.port
tracer: ++ queue handler hooks/relations/tls-certificates/requires.py:74:joined:certificates
tracer: ++ queue handler reactive/etcd.py:105:prepare_tls_certificates
tracer: ++ queue handler reactive/etcd.py:190:set_db_ingress_address
tracer: ++ queue handler reactive/etcd.py:91:set_app_version
tracer: ++ queue handler reactive/tls_client.py:18:store_ca
tracer: ++ queue handler reactive/tls_client.py:39:store_server
tracer: ++ queue handler reactive/tls_client.py:63:store_client
unit-etcd-32: 20:01:40 DEBUG unit.etcd/32.juju-log tracer: set flag config.default.port
unit-etcd-32: 20:01:40 DEBUG unit.etcd/32.juju-log tracer: set flag config.set.management_port
unit-etcd-32: 20:01:40 DEBUG unit.etcd/32.juju-log tracer: set flag config.default.management_port
unit-etcd-32: 20:01:41 DEBUG unit.etcd/32.juju-log tracer: set flag config.default.nagios_servicegroups
unit-etcd-32: 20:01:41 DEBUG unit.etcd/32.juju-log tracer: set flag config.set.nagios_context
unit-etcd-32: 20:01:41 DEBUG unit.etcd/32.juju-log tracer: set flag config.default.nagios_context
unit-etcd-32: 20:01:41 DEBUG unit.etcd/32.juju-log tracer: set flag config.set.bind_to_all_interfaces
unit-etcd-32: 20:01:41 DEBUG unit.etcd/32.juju-log tracer: set flag config.default.bind_to_all_interfaces
unit-etcd-32: 20:01:41 DEBUG unit.etcd/32.juju-log tracer: set flag config.set.snapd_refresh
unit-etcd-32: 20:01:41 DEBUG unit.etcd/32.juju-log tracer: set flag config.default.snapd_refresh
unit-etcd-32: 20:01:41 DEBUG unit.etcd/32.juju-log tracer: set flag config.set.channel
unit-etcd-32: 20:01:41 DEBUG unit.etcd/32.juju-log tracer: set flag config.default.snap_proxy
unit-etcd-32: 20:01:41 DEBUG unit.etcd/32.juju-log tracer: set flag config.default.snap_proxy_url
unit-etcd-32: 20:01:41 INFO unit.etcd/32.juju-log Initializing Leadership Layer (is follower)
unit-etcd-32: 20:01:41 INFO unit.etcd/32.juju-log Initializing Snap Layer
unit-etcd-32: 20:01:41 DEBUG unit.etcd/32.update-status none
unit-etcd-32: 20:01:41 DEBUG unit.etcd/32.juju-log tracer>
tracer: starting handler dispatch, 57 flags set
tracer: set flag cdk-service-kicker.installed
tracer: set flag certificates.available
tracer: set flag certificates.batch.cert.available
tracer: set flag certificates.ca.available
tracer: set flag certificates.certs.available
tracer: set flag certificates.client.cert.available
tracer: set flag certificates.server.cert.available
tracer: set flag certificates.server.certs.available
tracer: set flag cluster.joined
tracer: set flag config.default.bind_to_all_interfaces
tracer: set flag config.default.management_port
tracer: set flag config.default.nagios_context
tracer: set flag config.default.nagios_servicegroups
tracer: set flag config.default.port
tracer: set flag config.default.snap_proxy
tracer: set flag config.default.snap_proxy_url
tracer: set flag config.default.snapd_refresh
tracer: set flag config.set.bind_to_all_interfaces
tracer: set flag config.set.channel
tracer: set flag config.set.management_port
tracer: set flag config.set.nagios_context
tracer: set flag config.set.port
tracer: set flag config.set.snapd_refresh
tracer: set flag db.connected
tracer: set flag endpoint.certificates.changed
tracer: set flag endpoint.certificates.changed.ca
tracer: set flag endpoint.certificates.changed.egress-subnets
tracer: set flag endpoint.certificates.changed.etcd_18.server.cert
tracer: set flag endpoint.certificates.changed.etcd_18.server.key
tracer: set flag endpoint.certificates.changed.etcd_20.server.cert
tracer: set flag endpoint.certificates.changed.etcd_20.server.key
tracer: set flag endpoint.certificates.changed.etcd_30.server.cert
tracer: set flag endpoint.certificates.changed.etcd_30.server.key
tracer: set flag endpoint.certificates.changed.etcd_31.server.cert
tracer: set flag endpoint.certificates.changed.etcd_31.server.key
tracer: set flag endpoint.certificates.changed.etcd_32.server.cert
tracer: set flag endpoint.certificates.changed.etcd_32.server.key
tracer: set flag endpoint.certificates.changed.ingress-address
tracer: set flag endpoint.certificates.changed.private-address
tracer: set flag endpoint.certificates.joined
tracer: set flag etcd.pillowmints
tracer: set flag etcd.service-restart.configured
tracer: set flag leadership.set.cluster
tracer: set flag leadership.set.leader_address
tracer: set flag leadership.set.snapd_refresh
tracer: set flag leadership.set.token
tracer: set flag snap.installed.core
tracer: set flag snap.installed.etcd
tracer: set flag snap.refresh.set
tracer: set flag snap.refresh.was-set
tracer: set flag tls_client.ca.saved
tracer: set flag tls_client.ca.written
tracer: set flag tls_client.ca_installed
tracer: set flag tls_client.certs.saved
tracer: set flag tls_client.server.certificate.saved
tracer: set flag tls_client.server.certificate.written
tracer: set flag tls_client.server.key.saved
unit-etcd-32: 20:01:41 DEBUG unit.etcd/32.juju-log tracer: hooks phase, 0 handlers queued
unit-etcd-32: 20:01:41 DEBUG unit.etcd/32.juju-log tracer>
tracer: main dispatch loop, 7 handlers queued
tracer: ++ queue handler hooks/relations/tls-certificates/requires.py:74:joined:certificates
tracer: ++ queue handler reactive/etcd.py:105:prepare_tls_certificates
tracer: ++ queue handler reactive/etcd.py:190:set_db_ingress_address
tracer: ++ queue handler reactive/etcd.py:91:set_app_version
tracer: ++ queue handler reactive/tls_client.py:18:store_ca
tracer: ++ queue handler reactive/tls_client.py:39:store_server
tracer: ++ queue handler reactive/tls_client.py:63:store_client
unit-etcd-32: 20:01:41 INFO unit.etcd/32.juju-log Invoking reactive handler: hooks/relations/tls-certificates/requires.py:74:joined:certificates
unit-etcd-32: 20:01:42 INFO unit.etcd/32.juju-log Invoking reactive handler: reactive/tls_client.py:39:store_server
unit-etcd-32: 20:01:42 INFO unit.etcd/32.juju-log Invoking reactive handler: reactive/tls_client.py:18:store_ca
unit-etcd-32: 20:01:42 INFO unit.etcd/32.juju-log Invoking reactive handler: reactive/etcd.py:105:prepare_tls_certificates
unit-etcd-32: 20:01:42 INFO unit.etcd/32.juju-log Invoking reactive handler: reactive/etcd.py:190:set_db_ingress_address
unit-etcd-32: 20:01:42 INFO unit.etcd/32.juju-log Invoking reactive handler: reactive/tls_client.py:63:store_client
unit-etcd-32: 20:01:42 INFO unit.etcd/32.juju-log Invoking reactive handler: reactive/etcd.py:91:set_app_version

```

Needless to say, yes, I checked: there is a relation between easyrsa and etcd -- otherwise my first two etcd units would not have been able to boot.

description: updated
description: updated
George Kraft (cynerva) wrote :

`juju add-unit etcd` works in general - I just ran it with no issues on a CDK deployment on AWS. We will need more info to understand what is happening here.

Can you tell us more about your deployment? What revisions of the etcd and easyrsa charms are you using?

Hans van den Bogert (hbogert) wrote :

I hope this suffices:

    Model Controller Cloud/Region Version SLA Timestamp
    controller maas maas 2.5.7 unsupported 22:19:33+02:00

    App Version Status Scale Charm Store Rev OS Notes
    easyrsa 3.0.1 active 1 easyrsa jujucharms 250 ubuntu
    etcd 3.1.10 blocked 3 etcd jujucharms 428 ubuntu
    flannel 0.10.0 active 5 flannel jujucharms 398 ubuntu
    kubernetes-master 1.14.2 active 1 kubernetes-master jujucharms 636 ubuntu
    kubernetes-worker 1.14.2 active 4 kubernetes-worker jujucharms 398 ubuntu exposed

description: updated
Hans van den Bogert (hbogert) wrote :

For the record, I had done a remove-unit before doing the add-unit. I read that older etcd charms had problems with downscaling etcd in general -- is that still an issue?

George Kraft (cynerva) wrote :

Thanks. Here are my observations so far.

The "Missing relation to certificate authority." message is stale. That message occurs normally during etcd deployment, but then resolves itself once the relation is established. In this case, we know the relation has been established (the certificates.available flag is set) but the charm has not progressed to the next step.

The `tls_client.client.certificate.saved` flag is missing. It is supposed to be set by the `store_client` handler[1]. The handler is running, but not setting the flag. Most likely, something is wrong with the certificate data coming from easyrsa.
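The diagnosis above hinges on how reactive charms gate progress on flags: a handler can be invoked on every dispatch yet decline to set its "done" flag until real data arrives. A minimal, self-contained sketch of that pattern (toy `FlagStore` stand-in and hypothetical relation keys -- not the actual layer-tls-client code):

```python
class FlagStore:
    """Toy stand-in for the charms.reactive flag registry."""
    def __init__(self):
        self.flags = set()

    def set_flag(self, name):
        self.flags.add(name)

    def is_set(self, name):
        return name in self.flags


def store_client(flags, relation_data):
    # The relation exists, so the handler is queued and invoked...
    cert = relation_data.get('client.cert')
    key = relation_data.get('client.key')
    if not (cert and key):
        # ...but if the CA never published client cert data, the handler
        # returns without setting the flag -- the stuck state seen here.
        return
    # (the real handler would write the cert/key to disk at this point)
    flags.set_flag('tls_client.client.certificate.saved')


flags = FlagStore()
store_client(flags, {})  # easyrsa published nothing
print(flags.is_set('tls_client.client.certificate.saved'))  # False
store_client(flags, {'client.cert': 'PEM...', 'client.key': 'PEM...'})
print(flags.is_set('tls_client.client.certificate.saved'))  # True
```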

> For the record, I had done a remove-unit before doing the add-unit.

I suspect this is related, but I'm still unable to reproduce.

> I read that older charms of etcd had problems with down scaling etcd in general -- is that still an issue?

Looking through open issues[2], there are some that are related to downscaling of etcd, but none that look like your case.

[1]: https://github.com/charmed-kubernetes/layer-tls-client/blob/295a54b97e7f47ee9a18e78b46f443d3963a6ad3/reactive/tls_client.py#L84
[2]: https://bugs.launchpad.net/charm-etcd

George Kraft (cynerva) wrote :

How old is this deployment? Did you initially deploy etcd rev 428 and easyrsa rev 250, or did you deploy older revisions and upgrade to those?

Output from these commands would help narrow this down further:

```
juju status --format yaml
juju debug-log -i etcd/32 --replay
juju debug-log -i easyrsa --replay
```

Alternatively, output from cdk-field-agent would help: https://github.com/charmed-kubernetes/cdk-field-agent/

Hans van den Bogert (hbogert) wrote :

model:
  name: controller
  type: iaas
  controller: maas
  cloud: maas
  version: 2.5.7
  model-status:
    current: available
    since: 10 Jun 2019 23:50:02+02:00
  sla: unsupported
machines:
  "0":
    juju-status:
      current: started
      since: 16 Jun 2019 18:51:39+02:00
      version: 2.5.7
    dns-name: 192.168.1.185
    ip-addresses:
    - 192.168.1.185
    - 10.1.60.0
    - 10.1.60.1
    - 172.17.0.1
    instance-id: m6pq4s
    machine-status:
      current: running
      message: Deployed
      since: 25 Dec 2017 00:10:06+01:00
    modification-status:
      current: idle
      since: 10 Jun 2019 23:48:56+02:00
    series: xenial
    network-interfaces:
      cni0:
        ip-addresses:
        - 10.1.52.1
        - 10.1.60.1
        mac-address: 5a:64:11:f0:3d:4e
        is-up: true
      docker0:
        ip-addresses:
        - 172.17.0.1
        mac-address: 02:42:ad:71:61:a7
        space: undefined
        is-up: true
      eno1:
        ip-addresses:
        - 192.168.1.185
        mac-address: 10:60:4b:5a:16:85
        gateway: 192.168.1.1
        space: undefined
        is-up: true
      flannel.1:
        ip-addresses:
        - 10.1.89.0
        - 10.1.1.0
        - 10.1.20.0
        - 10.1.5.0
        - 10.1.59.0
        - 10.1.60.0
        mac-address: de:0e:ee:7c:04:89
        is-up: true
    constraints: mem=3584M
    hardware: arch=amd64 cores=2 mem=4096M availability-zone=default
    controller-member-status: has-vote
  "10":
    juju-status:
      current: started
      since: 10 Jun 2019 23:50:24+02:00
      version: 2.5.7
    dns-name: 192.168.1.156
    ip-addresses:
    - 192.168.1.156
    - 10.1.18.0
    - 10.1.18.1
    - 172.17.0.1
    instance-id: s8mt6q
    machine-status:
      current: running
      message: Deployed
      since: 05 Jan 2018 11:10:39+01:00
    modification-status:
      current: idle
      since: 10 Jun 2019 23:48:56+02:00
    series: xenial
    network-interfaces:
      cni0:
        ip-addresses:
        - 10.1.47.1
        - 10.1.92.1
        - 10.1.38.1
        - 10.1.18.1
        mac-address: 92:e3:40:06:18:50
        is-up: true
      docker0:
        ip-addresses:
        - 172.17.0.1
        mac-address: 02:42:1a:be:a3:0d
        space: undefined
        is-up: true
      eno1:
        ip-addresses:
        - 192.168.1.156
        mac-address: 10:60:4b:5e:7a:8a
        gateway: 192.168.1.1
        space: undefined
        is-up: true
      flannel.1:
        ip-addresses:
        - 10.1.47.0
        - 10.1.92.0
        - 10.1.38.0
        - 10.1.18.0
        mac-address: 3a:d6:23:92:14:bd
        is-up: true
    hardware: arch=amd64 cores=4 mem=4096M availability-zone=default
  "25":
    juju-status:
      current: started
      since: 16 Jun 2019 18:04:37+02:00
      version: 2.5.7
    dns-name: 192.168.1.203
    ip-addresses:
    - 192.168.1.203
    - 10.1.66.0
    - 10.1.66.1
    - 172.17.0.1
    instance-id: cm7yma
    machine-status:
      current: running
      message: Deployed
      since: 30 Dec 2018 00:27:51+01:00
    modification-status:
      current: idle
      since: 10 Jun 2019 23:48:56+02:00
    series: xenial
    network-interfaces:
      cni0:
        ip-addr...

Hans van den Bogert (hbogert) wrote :

> How old is this deployment? Did you initially deploy etcd rev 428 and easyrsa rev 250, or did you deploy older revisions and upgrade to those?

They're pretty old; I had to migrate from etcd 2 to 3, for example.

Hans van den Bogert (hbogert) wrote :

Are the supplied logs enough for new insights? I still can't pinpoint it myself.

George Kraft (cynerva) wrote :

Thanks. Right now it looks like easyrsa isn't sending the client cert and key to etcd like it's supposed to. This is done by easyrsa's publish_global_client_cert handler[1], which only ever runs once. That's a problem if the relation between etcd and easyrsa has ever been removed. Do you know if that's ever happened in this deployment?
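The failure mode described here -- a handler guarded by a flag that it sets once and never clears -- can be sketched in isolation. This is a toy model of the pattern, not the easyrsa charm itself:

```python
class Charm:
    """Toy model of a run-once reactive handler guarded by a flag."""
    def __init__(self):
        self.flags = set()
        self.published = []  # remote units that received the client cert

    def publish_global_client_cert(self, remote_unit):
        if 'easyrsa.global-client-cert.created' in self.flags:
            return  # guard: already ran once, so do nothing
        self.published.append(remote_unit)
        self.flags.add('easyrsa.global-client-cert.created')


charm = Charm()
charm.publish_global_client_cert('etcd/18')  # first relation: cert sent

# Relation removed and re-added; a new remote unit joins, but the
# guard flag survived, so the handler short-circuits:
charm.publish_global_client_cert('etcd/32')
print(charm.published)  # ['etcd/18'] -- etcd/32 never got the cert

# The workaround clears the guard flag so the handler can run again:
charm.flags.discard('easyrsa.global-client-cert.created')
charm.publish_global_client_cert('etcd/32')
print(charm.published)  # ['etcd/18', 'etcd/32']
```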

I was able to reproduce the symptom by removing and re-adding the relation between easyrsa and etcd:

```
juju deploy charmed-kubernetes
# wait for deployment to settle
juju remove-relation easyrsa etcd
juju add-relation easyrsa etcd
juju add-unit etcd
```

After doing this, the new unit of etcd came up with "Missing relation to certificate authority." status and got stuck.

Here is a workaround:

```
juju run --unit easyrsa/0 -- charms.reactive clear_flag easyrsa.global-client-cert.created
```

This forces the easyrsa charm to publish the client cert again. Can you try that and see if it helps?

[1]: https://github.com/charmed-kubernetes/layer-easyrsa/blob/eb064667bc052a123a0e04b8d5545e87a0265ff8/reactive/easyrsa.py#L211

Cory Johns (johnsca) wrote :

I think this can naturally be fixed as part of https://bugs.launchpad.net/charm-easyrsa/+bug/1835258 so I've assigned Joseph.

Changed in charm-etcd:
status: New → Triaged
assignee: nobody → Joseph Borg (joeborg)
Cory Johns (johnsca) wrote :

Disregard that previous comment. The other bug doesn't touch on this, so I'm working on it now.

Changed in charm-etcd:
assignee: Joseph Borg (joeborg) → Cory Johns (johnsca)
status: Triaged → In Progress
affects: charm-etcd → charm-easyrsa
Hans van den Bogert (hbogert) wrote :

I tried the advice from #11; it didn't help. What can I send for further debugging? The newest etcd status in juju status is:

    etcd/34 waiting idle 34 192.168.1.222 Waiting to retry etcd registration

Relevant journalctl output from the host running etcd/34 follows:

```
Jul 09 19:47:50 node1 etcd.etcd[10362]: Running as system with data in /var/snap/etcd/81
Jul 09 19:47:50 node1 etcd.etcd[10362]: Configuration from /var/snap/etcd/common/etcd.conf.yml
Jul 09 19:47:50 node1 etcd[10362]: Loading server configuration from "/var/snap/etcd/common/etcd.conf.yml"
Jul 09 19:47:50 node1 etcd[10362]: etcd Version: 3.1.10
Jul 09 19:47:50 node1 etcd[10362]: Git SHA: Not provided (use ./build instead of go build)
Jul 09 19:47:50 node1 etcd[10362]: Go Version: go1.7.6
Jul 09 19:47:50 node1 etcd[10362]: Go OS/Arch: linux/amd64
Jul 09 19:47:50 node1 etcd[10362]: setting maximum number of CPUs to 4, total number of available CPUs is 4
Jul 09 19:47:50 node1 etcd[10362]: the server is already initialized as member before, starting as etcd member...
Jul 09 19:47:50 node1 etcd[10362]: peerTLS: cert = /var/snap/etcd/common/server.crt, key = /var/snap/etcd/common/server.key, ca = , trusted-ca = /var/snap/etcd/common/ca.crt, client-cert-auth = true
Jul 09 19:47:50 node1 etcd[10362]: listening for peers on https://0.0.0.0:2380
Jul 09 19:47:50 node1 etcd[10362]: The scheme of client url http://127.0.0.1:4001 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Jul 09 19:47:50 node1 etcd[10362]: The scheme of client url http://127.0.0.1:4001 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
Jul 09 19:47:50 node1 etcd[10362]: listening for client requests on 127.0.0.1:4001
Jul 09 19:47:50 node1 etcd[10362]: listening for client requests on 0.0.0.0:2379
Jul 09 19:47:50 node1 etcd[10362]: warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
Jul 09 19:47:50 node1 etcd[10362]: cannot fetch cluster info from peer urls: could not retrieve cluster information from the given urls
Jul 09 19:47:50 node1 systemd[1]: snap.etcd.etcd.service: Main process exited, code=exited, status=1/FAILURE
Jul 09 19:47:50 node1 systemd[1]: snap.etcd.etcd.service: Unit entered failed state.
Jul 09 19:47:50 node1 systemd[1]: snap.etcd.etcd.service: Failed with result 'exit-code'.
```

Cory Johns (johnsca) on 2019-07-09
Changed in charm-easyrsa:
status: In Progress → Fix Committed
Cory Johns (johnsca) wrote :

I fixed the cert-not-being-sent issue in https://github.com/charmed-kubernetes/layer-easyrsa/pull/22. From your latest comment it sounds like that did fix the original "Missing relation to certificate authority" error, but now that the unit has the client cert, you've hit another issue where it's still trying to use a non-SSL connection. I'm not sure why that would happen, but I'll look into it to see if I can find anything that might lead to that.

Cory Johns (johnsca) wrote :

I misread those log messages. It's properly listening on both the SSL and non-SSL port and it clearly has the client cert now, so I'm leaving the status as "Fix Committed" since the original issue has been resolved.

Hans, it sounds like you're now having an issue where the newer clients can't talk to their peers properly (the charm reported that it failed to register the peer, and the service errored with "cannot fetch cluster info from peer urls: could not retrieve cluster information from the given urls"). The Juju log for the new etcd unit should have more info about why the charm couldn't register the new unit and you might try checking the networking between the units and the status of the service on the leader as well. Regardless, since this is a new issue from the one originally reported, would you mind please opening it as a separate bug?

Changed in charm-easyrsa:
milestone: none → 1.15+ck1
Changed in charm-easyrsa:
status: Fix Committed → Fix Released