juju add-unit etcd does not work - falsely claims that a certificates relation is missing

Bug #1832883 reported by Hans van den Bogert on 2019-06-14
Affects: EasyRSA Charm
Importance: Undecided
Assigned to: Cory Johns

Bug Description

I can't seem to grow my etcd cluster from 2 members to 3 by doing `juju add-unit etcd`.

```
    Model Controller Cloud/Region Version SLA Timestamp
    controller maas maas 2.5.7 unsupported 22:19:33+02:00

    App Version Status Scale Charm Store Rev OS Notes
    easyrsa 3.0.1 active 1 easyrsa jujucharms 250 ubuntu
    etcd 3.1.10 blocked 3 etcd jujucharms 428 ubuntu
    flannel 0.10.0 active 5 flannel jujucharms 398 ubuntu
    kubernetes-master 1.14.2 active 1 kubernetes-master jujucharms 636 ubuntu
    kubernetes-worker 1.14.2 active 4 kubernetes-worker jujucharms 398 ubuntu exposed
...
...
...
etcd/18* active idle 10 192.168.1.156 2379/tcp Healthy with 2 known peers
etcd/20 active idle 0 192.168.1.185 2379/tcp Healthy with 2 known peers
etcd/32 blocked idle 32 192.168.1.222 Missing relation to certificate authority.
```

The logs are not really helpful for the uninitiated:

```
unit-etcd-32: 20:01:40 INFO unit.etcd/32.juju-log Reactive main running for hook update-status
unit-etcd-32: 20:01:40 DEBUG unit.etcd/32.juju-log tracer>
tracer: set flag config.set.port
tracer: ++ queue handler hooks/relations/tls-certificates/requires.py:74:joined:certificates
tracer: ++ queue handler reactive/etcd.py:105:prepare_tls_certificates
tracer: ++ queue handler reactive/etcd.py:190:set_db_ingress_address
tracer: ++ queue handler reactive/etcd.py:91:set_app_version
tracer: ++ queue handler reactive/tls_client.py:18:store_ca
tracer: ++ queue handler reactive/tls_client.py:39:store_server
tracer: ++ queue handler reactive/tls_client.py:63:store_client
unit-etcd-32: 20:01:40 DEBUG unit.etcd/32.juju-log tracer: set flag config.default.port
unit-etcd-32: 20:01:40 DEBUG unit.etcd/32.juju-log tracer: set flag config.set.management_port
unit-etcd-32: 20:01:40 DEBUG unit.etcd/32.juju-log tracer: set flag config.default.management_port
unit-etcd-32: 20:01:41 DEBUG unit.etcd/32.juju-log tracer: set flag config.default.nagios_servicegroups
unit-etcd-32: 20:01:41 DEBUG unit.etcd/32.juju-log tracer: set flag config.set.nagios_context
unit-etcd-32: 20:01:41 DEBUG unit.etcd/32.juju-log tracer: set flag config.default.nagios_context
unit-etcd-32: 20:01:41 DEBUG unit.etcd/32.juju-log tracer: set flag config.set.bind_to_all_interfaces
unit-etcd-32: 20:01:41 DEBUG unit.etcd/32.juju-log tracer: set flag config.default.bind_to_all_interfaces
unit-etcd-32: 20:01:41 DEBUG unit.etcd/32.juju-log tracer: set flag config.set.snapd_refresh
unit-etcd-32: 20:01:41 DEBUG unit.etcd/32.juju-log tracer: set flag config.default.snapd_refresh
unit-etcd-32: 20:01:41 DEBUG unit.etcd/32.juju-log tracer: set flag config.set.channel
unit-etcd-32: 20:01:41 DEBUG unit.etcd/32.juju-log tracer: set flag config.default.snap_proxy
unit-etcd-32: 20:01:41 DEBUG unit.etcd/32.juju-log tracer: set flag config.default.snap_proxy_url
unit-etcd-32: 20:01:41 INFO unit.etcd/32.juju-log Initializing Leadership Layer (is follower)
unit-etcd-32: 20:01:41 INFO unit.etcd/32.juju-log Initializing Snap Layer
unit-etcd-32: 20:01:41 DEBUG unit.etcd/32.update-status none
unit-etcd-32: 20:01:41 DEBUG unit.etcd/32.juju-log tracer>
tracer: starting handler dispatch, 57 flags set
tracer: set flag cdk-service-kicker.installed
tracer: set flag certificates.available
tracer: set flag certificates.batch.cert.available
tracer: set flag certificates.ca.available
tracer: set flag certificates.certs.available
tracer: set flag certificates.client.cert.available
tracer: set flag certificates.server.cert.available
tracer: set flag certificates.server.certs.available
tracer: set flag cluster.joined
tracer: set flag config.default.bind_to_all_interfaces
tracer: set flag config.default.management_port
tracer: set flag config.default.nagios_context
tracer: set flag config.default.nagios_servicegroups
tracer: set flag config.default.port
tracer: set flag config.default.snap_proxy
tracer: set flag config.default.snap_proxy_url
tracer: set flag config.default.snapd_refresh
tracer: set flag config.set.bind_to_all_interfaces
tracer: set flag config.set.channel
tracer: set flag config.set.management_port
tracer: set flag config.set.nagios_context
tracer: set flag config.set.port
tracer: set flag config.set.snapd_refresh
tracer: set flag db.connected
tracer: set flag endpoint.certificates.changed
tracer: set flag endpoint.certificates.changed.ca
tracer: set flag endpoint.certificates.changed.egress-subnets
tracer: set flag endpoint.certificates.changed.etcd_18.server.cert
tracer: set flag endpoint.certificates.changed.etcd_18.server.key
tracer: set flag endpoint.certificates.changed.etcd_20.server.cert
tracer: set flag endpoint.certificates.changed.etcd_20.server.key
tracer: set flag endpoint.certificates.changed.etcd_30.server.cert
tracer: set flag endpoint.certificates.changed.etcd_30.server.key
tracer: set flag endpoint.certificates.changed.etcd_31.server.cert
tracer: set flag endpoint.certificates.changed.etcd_31.server.key
tracer: set flag endpoint.certificates.changed.etcd_32.server.cert
tracer: set flag endpoint.certificates.changed.etcd_32.server.key
tracer: set flag endpoint.certificates.changed.ingress-address
tracer: set flag endpoint.certificates.changed.private-address
tracer: set flag endpoint.certificates.joined
tracer: set flag etcd.pillowmints
tracer: set flag etcd.service-restart.configured
tracer: set flag leadership.set.cluster
tracer: set flag leadership.set.leader_address
tracer: set flag leadership.set.snapd_refresh
tracer: set flag leadership.set.token
tracer: set flag snap.installed.core
tracer: set flag snap.installed.etcd
tracer: set flag snap.refresh.set
tracer: set flag snap.refresh.was-set
tracer: set flag tls_client.ca.saved
tracer: set flag tls_client.ca.written
tracer: set flag tls_client.ca_installed
tracer: set flag tls_client.certs.saved
tracer: set flag tls_client.server.certificate.saved
tracer: set flag tls_client.server.certificate.written
tracer: set flag tls_client.server.key.saved
unit-etcd-32: 20:01:41 DEBUG unit.etcd/32.juju-log tracer: hooks phase, 0 handlers queued
unit-etcd-32: 20:01:41 DEBUG unit.etcd/32.juju-log tracer>
tracer: main dispatch loop, 7 handlers queued
tracer: ++ queue handler hooks/relations/tls-certificates/requires.py:74:joined:certificates
tracer: ++ queue handler reactive/etcd.py:105:prepare_tls_certificates
tracer: ++ queue handler reactive/etcd.py:190:set_db_ingress_address
tracer: ++ queue handler reactive/etcd.py:91:set_app_version
tracer: ++ queue handler reactive/tls_client.py:18:store_ca
tracer: ++ queue handler reactive/tls_client.py:39:store_server
tracer: ++ queue handler reactive/tls_client.py:63:store_client
unit-etcd-32: 20:01:41 INFO unit.etcd/32.juju-log Invoking reactive handler: hooks/relations/tls-certificates/requires.py:74:joined:certificates
unit-etcd-32: 20:01:42 INFO unit.etcd/32.juju-log Invoking reactive handler: reactive/tls_client.py:39:store_server
unit-etcd-32: 20:01:42 INFO unit.etcd/32.juju-log Invoking reactive handler: reactive/tls_client.py:18:store_ca
unit-etcd-32: 20:01:42 INFO unit.etcd/32.juju-log Invoking reactive handler: reactive/etcd.py:105:prepare_tls_certificates
unit-etcd-32: 20:01:42 INFO unit.etcd/32.juju-log Invoking reactive handler: reactive/etcd.py:190:set_db_ingress_address
unit-etcd-32: 20:01:42 INFO unit.etcd/32.juju-log Invoking reactive handler: reactive/tls_client.py:63:store_client
unit-etcd-32: 20:01:42 INFO unit.etcd/32.juju-log Invoking reactive handler: reactive/etcd.py:91:set_app_version

```

Needless to say, yes, I checked: there is a relation between easyrsa and etcd -- otherwise my first two etcd units would not have been able to boot.

description: updated
description: updated
George Kraft (cynerva) wrote :

`juju add-unit etcd` works in general - I just ran it with no issues on a CDK deployment on AWS. We will need more info to understand what is happening here.

Can you tell us more about your deployment? What revisions of the etcd and easyrsa charms are you using?

Hans van den Bogert (hbogert) wrote :

I hope this suffices:

    Model Controller Cloud/Region Version SLA Timestamp
    controller maas maas 2.5.7 unsupported 22:19:33+02:00

    App Version Status Scale Charm Store Rev OS Notes
    easyrsa 3.0.1 active 1 easyrsa jujucharms 250 ubuntu
    etcd 3.1.10 blocked 3 etcd jujucharms 428 ubuntu
    flannel 0.10.0 active 5 flannel jujucharms 398 ubuntu
    kubernetes-master 1.14.2 active 1 kubernetes-master jujucharms 636 ubuntu
    kubernetes-worker 1.14.2 active 4 kubernetes-worker jujucharms 398 ubuntu exposed

description: updated
Hans van den Bogert (hbogert) wrote :

For the record, I had done a remove-unit before doing the add-unit. I read that older etcd charms had problems with downscaling etcd in general -- is that still an issue?

George Kraft (cynerva) wrote :

Thanks. Here are my observations so far.

The "Missing relation to certificate authority." message is stale. That message occurs normally during etcd deployment, but then resolves itself once the relation is established. In this case, we know the relation has been established (the certificates.available flag is set) but the charm has not progressed to the next step.

The `tls_client.client.certificate.saved` flag is missing. It is supposed to be set by the `store_client` handler[1]. The handler is running, but not setting the flag. Most likely, something is wrong with the certificate data coming from easyrsa.
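The diagnosis above hinges on how reactive charms gate progress on flags: a handler can be invoked on every dispatch yet decline to set its "done" flag until real data arrives. A minimal, self-contained sketch of that pattern (toy `FlagStore` stand-in and hypothetical relation keys -- not the actual layer-tls-client code):

```python
class FlagStore:
    """Toy stand-in for the charms.reactive flag registry."""
    def __init__(self):
        self.flags = set()

    def set_flag(self, name):
        self.flags.add(name)

    def is_set(self, name):
        return name in self.flags


def store_client(flags, relation_data):
    # The relation exists, so the handler is queued and invoked...
    cert = relation_data.get('client.cert')
    key = relation_data.get('client.key')
    if not (cert and key):
        # ...but if the CA never published client cert data, the handler
        # returns without setting the flag -- the stuck state seen here.
        return
    # (the real handler would write the cert/key to disk at this point)
    flags.set_flag('tls_client.client.certificate.saved')


flags = FlagStore()
store_client(flags, {})  # easyrsa published nothing
print(flags.is_set('tls_client.client.certificate.saved'))  # False
store_client(flags, {'client.cert': 'PEM...', 'client.key': 'PEM...'})
print(flags.is_set('tls_client.client.certificate.saved'))  # True
```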

> For the record, I had done a remove-unit before doing the add-unit.

I suspect this is related, but I'm still unable to reproduce.

> I read that older charms of etcd had problems with down scaling etcd in general -- is that still an issue?

Looking through open issues[2], there are some that are related to downscaling of etcd, but none that look like your case.

[1]: https://github.com/charmed-kubernetes/layer-tls-client/blob/295a54b97e7f47ee9a18e78b46f443d3963a6ad3/reactive/tls_client.py#L84
[2]: https://bugs.launchpad.net/charm-etcd

George Kraft (cynerva) wrote :

How old is this deployment? Did you initially deploy etcd rev 428 and easyrsa rev 250, or did you deploy older revisions and upgrade to those?

Output from these commands would help narrow this down further:

```
juju status --format yaml
juju debug-log -i etcd/32 --replay
juju debug-log -i easyrsa --replay
```

Alternatively, output from cdk-field-agent would help: https://github.com/charmed-kubernetes/cdk-field-agent/

Hans van den Bogert (hbogert) wrote :

model:
  name: controller
  type: iaas
  controller: maas
  cloud: maas
  version: 2.5.7
  model-status:
    current: available
    since: 10 Jun 2019 23:50:02+02:00
  sla: unsupported
machines:
  "0":
    juju-status:
      current: started
      since: 16 Jun 2019 18:51:39+02:00
      version: 2.5.7
    dns-name: 192.168.1.185
    ip-addresses:
    - 192.168.1.185
    - 10.1.60.0
    - 10.1.60.1
    - 172.17.0.1
    instance-id: m6pq4s
    machine-status:
      current: running
      message: Deployed
      since: 25 Dec 2017 00:10:06+01:00
    modification-status:
      current: idle
      since: 10 Jun 2019 23:48:56+02:00
    series: xenial
    network-interfaces:
      cni0:
        ip-addresses:
        - 10.1.52.1
        - 10.1.60.1
        mac-address: 5a:64:11:f0:3d:4e
        is-up: true
      docker0:
        ip-addresses:
        - 172.17.0.1
        mac-address: 02:42:ad:71:61:a7
        space: undefined
        is-up: true
      eno1:
        ip-addresses:
        - 192.168.1.185
        mac-address: 10:60:4b:5a:16:85
        gateway: 192.168.1.1
        space: undefined
        is-up: true
      flannel.1:
        ip-addresses:
        - 10.1.89.0
        - 10.1.1.0
        - 10.1.20.0
        - 10.1.5.0
        - 10.1.59.0
        - 10.1.60.0
        mac-address: de:0e:ee:7c:04:89
        is-up: true
    constraints: mem=3584M
    hardware: arch=amd64 cores=2 mem=4096M availability-zone=default
    controller-member-status: has-vote
  "10":
    juju-status:
      current: started
      since: 10 Jun 2019 23:50:24+02:00
      version: 2.5.7
    dns-name: 192.168.1.156
    ip-addresses:
    - 192.168.1.156
    - 10.1.18.0
    - 10.1.18.1
    - 172.17.0.1
    instance-id: s8mt6q
    machine-status:
      current: running
      message: Deployed
      since: 05 Jan 2018 11:10:39+01:00
    modification-status:
      current: idle
      since: 10 Jun 2019 23:48:56+02:00
    series: xenial
    network-interfaces:
      cni0:
        ip-addresses:
        - 10.1.47.1
        - 10.1.92.1
        - 10.1.38.1
        - 10.1.18.1
        mac-address: 92:e3:40:06:18:50
        is-up: true
      docker0:
        ip-addresses:
        - 172.17.0.1
        mac-address: 02:42:1a:be:a3:0d
        space: undefined
        is-up: true
      eno1:
        ip-addresses:
        - 192.168.1.156
        mac-address: 10:60:4b:5e:7a:8a
        gateway: 192.168.1.1
        space: undefined
        is-up: true
      flannel.1:
        ip-addresses:
        - 10.1.47.0
        - 10.1.92.0
        - 10.1.38.0
        - 10.1.18.0
        mac-address: 3a:d6:23:92:14:bd
        is-up: true
    hardware: arch=amd64 cores=4 mem=4096M availability-zone=default
  "25":
    juju-status:
      current: started
      since: 16 Jun 2019 18:04:37+02:00
      version: 2.5.7
    dns-name: 192.168.1.203
    ip-addresses:
    - 192.168.1.203
    - 10.1.66.0
    - 10.1.66.1
    - 172.17.0.1
    instance-id: cm7yma
    machine-status:
      current: running
      message: Deployed
      since: 30 Dec 2018 00:27:51+01:00
    modification-status:
      current: idle
      since: 10 Jun 2019 23:48:56+02:00
    series: xenial
    network-interfaces:
      cni0:
        ip-addr...

Hans van den Bogert (hbogert) wrote :

> How old is this deployment? Did you initially deploy etcd rev 428 and easyrsa rev 250, or did you deploy older revisions and upgrade to those?

They're pretty old; I had to migrate from etcd 2 to 3, for example.

Hans van den Bogert (hbogert) wrote :

Are the supplied logs enough for new insights? I still can't pinpoint it myself.

George Kraft (cynerva) wrote :

Thanks. Right now it looks like easyrsa isn't sending the client cert and key to etcd like it's supposed to. This is done by easyrsa's publish_global_client_cert handler[1], which only ever runs once. That's a problem if the relation between etcd and easyrsa has ever been removed. Do you know if that's ever happened in this deployment?
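The failure mode described here -- a handler guarded by a flag that it sets once and never clears -- can be sketched in isolation. This is a toy model of the pattern, not the easyrsa charm itself:

```python
class Charm:
    """Toy model of a run-once reactive handler guarded by a flag."""
    def __init__(self):
        self.flags = set()
        self.published = []  # remote units that received the client cert

    def publish_global_client_cert(self, remote_unit):
        if 'easyrsa.global-client-cert.created' in self.flags:
            return  # guard: already ran once, so do nothing
        self.published.append(remote_unit)
        self.flags.add('easyrsa.global-client-cert.created')


charm = Charm()
charm.publish_global_client_cert('etcd/18')  # first relation: cert sent

# Relation removed and re-added; a new remote unit joins, but the
# guard flag survived, so the handler short-circuits:
charm.publish_global_client_cert('etcd/32')
print(charm.published)  # ['etcd/18'] -- etcd/32 never got the cert

# The workaround clears the guard flag so the handler can run again:
charm.flags.discard('easyrsa.global-client-cert.created')
charm.publish_global_client_cert('etcd/32')
print(charm.published)  # ['etcd/18', 'etcd/32']
```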

I was able to reproduce the symptom by removing and re-adding the relation between easyrsa and etcd:

```
juju deploy charmed-kubernetes
# wait for deployment to settle
juju remove-relation easyrsa etcd
juju add-relation easyrsa etcd
juju add-unit etcd
```

After doing this, the new unit of etcd came up with "Missing relation to certificate authority." status and got stuck.

Here is a workaround:

```
juju run --unit easyrsa/0 -- charms.reactive clear_flag easyrsa.global-client-cert.created
```

This forces the easyrsa charm to publish the client cert again. Can you try that and see if it helps?

[1]: https://github.com/charmed-kubernetes/layer-easyrsa/blob/eb064667bc052a123a0e04b8d5545e87a0265ff8/reactive/easyrsa.py#L211

Cory Johns (johnsca) wrote :

I think this can naturally be fixed as part of https://bugs.launchpad.net/charm-easyrsa/+bug/1835258 so I've assigned Joseph.

Changed in charm-etcd:
status: New → Triaged
assignee: nobody → Joseph Borg (joeborg)
Cory Johns (johnsca) wrote :

Disregard that previous comment. The other bug doesn't touch on this, so I'm working on it now.

Changed in charm-etcd:
assignee: Joseph Borg (joeborg) → Cory Johns (johnsca)
status: Triaged → In Progress
affects: charm-etcd → charm-easyrsa
Hans van den Bogert (hbogert) wrote :

I tried the advice from #11; it didn't help. What can I send for further debugging? The newest etcd status in juju status is:

    etcd/34 waiting idle 34 192.168.1.222 Waiting to retry etcd registration

Relevant journalctl output from the host running etcd/34 follows:

```
Jul 09 19:47:50 node1 etcd.etcd[10362]: Running as system with data in /var/snap/etcd/81
Jul 09 19:47:50 node1 etcd.etcd[10362]: Configuration from /var/snap/etcd/common/etcd.conf.yml
Jul 09 19:47:50 node1 etcd[10362]: Loading server configuration from "/var/snap/etcd/common/etcd.conf.yml"
Jul 09 19:47:50 node1 etcd[10362]: etcd Version: 3.1.10
Jul 09 19:47:50 node1 etcd[10362]: Git SHA: Not provided (use ./build instead of go build)
Jul 09 19:47:50 node1 etcd[10362]: Go Version: go1.7.6
Jul 09 19:47:50 node1 etcd[10362]: Go OS/Arch: linux/amd64
Jul 09 19:47:50 node1 etcd[10362]: setting maximum number of CPUs to 4, total number of available CPUs is 4
Jul 09 19:47:50 node1 etcd[10362]: the server is already initialized as member before, starting as etcd member...
Jul 09 19:47:50 node1 etcd[10362]: peerTLS: cert = /var/snap/etcd/common/server.crt, key = /var/snap/etcd/common/server.key, ca = , trusted-ca = /var/snap/etcd/common/ca.crt, client-cert-auth = true
Jul 09 19:47:50 node1 etcd[10362]: listening for peers on https://0.0.0.0:2380
Jul 09 19:47:50 node1 etcd[10362]: The scheme of client url http://127.0.0.1:4001 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Jul 09 19:47:50 node1 etcd[10362]: The scheme of client url http://127.0.0.1:4001 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
Jul 09 19:47:50 node1 etcd[10362]: listening for client requests on 127.0.0.1:4001
Jul 09 19:47:50 node1 etcd[10362]: listening for client requests on 0.0.0.0:2379
Jul 09 19:47:50 node1 etcd[10362]: warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
Jul 09 19:47:50 node1 etcd[10362]: cannot fetch cluster info from peer urls: could not retrieve cluster information from the given urls
Jul 09 19:47:50 node1 systemd[1]: snap.etcd.etcd.service: Main process exited, code=exited, status=1/FAILURE
Jul 09 19:47:50 node1 systemd[1]: snap.etcd.etcd.service: Unit entered failed state.
Jul 09 19:47:50 node1 systemd[1]: snap.etcd.etcd.service: Failed with result 'exit-code'.
```

Cory Johns (johnsca) on 2019-07-09
Changed in charm-easyrsa:
status: In Progress → Fix Committed
Cory Johns (johnsca) wrote :

I fixed the cert-not-being-sent issue in https://github.com/charmed-kubernetes/layer-easyrsa/pull/22. From your latest comment it sounds like that did fix the original "Missing relation to certificate authority" error, but now that the unit has the client cert, you've hit another issue where it's still trying to use a non-SSL connection. I'm not sure why that would happen, but I'll look into it to see if I can find anything that might lead to that.

Cory Johns (johnsca) wrote :

I misread those log messages. It's properly listening on both the SSL and non-SSL port and it clearly has the client cert now, so I'm leaving the status as "Fix Committed" since the original issue has been resolved.

Hans, it sounds like you're now having an issue where the newer clients can't talk to their peers properly (the charm reported that it failed to register the peer, and the service errored with "cannot fetch cluster info from peer urls: could not retrieve cluster information from the given urls"). The Juju log for the new etcd unit should have more info about why the charm couldn't register the new unit and you might try checking the networking between the units and the status of the service on the leader as well. Regardless, since this is a new issue from the one originally reported, would you mind please opening it as a separate bug?

Changed in charm-easyrsa:
milestone: none → 1.15+ck1
Changed in charm-easyrsa:
status: Fix Committed → Fix Released