wrong etcd_connection_string on kubernetes-master charm

Bug #1831580 reported by Seyeong Kim on 2019-06-04
This bug affects 2 people
Affects  Status  Importance  Assigned to      Milestone
juju     -       High        Joseph Phillips  -
2.8      -       High        Joseph Phillips  -

Bug Description

Hello

This happens in a manual deployment environment.

1. Bootstrapped Juju 2.3.7 and deployed k8s at a specific charm revision.
2. I hit the error below; it is resolved by upgrading Juju to 2.4.7.
- subprocess.CalledProcessError: Command '['leader-set', 'auto_dns_provider=kube-dns']' returned non-zero exit status 1
3. So I upgraded the Juju controller and model to 2.4.7.
4. A different issue then appeared.
- I analyzed it and found a wrong IP in the kube-apiserver args (etcd-servers).
- --etcd-servers="https://127.0.1.1:2379,https://127.0.1.1:2379,https://127.0.1.1:2379" on kubernetes-master
- /var/snap/kube-apiserver/current/args

5. I also needed to change the canal argument manually, like you did before.
- It points to 127.0.1.1 as the master as well.
6. After I modified them manually, the k8s cluster worked fine.
7. Then I tried to upgrade the charms (etcd, kubernetes-master) to the latest revision.
8. The kubernetes-master configuration reverted to the wrong IP.
- --etcd-servers="https://127.0.1.1:2379,https://127.0.1.1:2379,https://127.0.1.1:2379"

I checked the code quickly and found that the function below provides this value:

etcd.get_connection_string()

Some additional information follows.
##########################################################
juju run --unit etcd/0 unit-get public-address
node-02.maas
juju run --unit etcd/0 unit-get private-address
node-02.maas
##########################################################
cat /var/snap/kube-apiserver/current/args

--advertise-address="127.0.1.1"
--min-request-timeout="300"
--etcd-cafile="/root/cdk/etcd/client-ca.pem"
--etcd-certfile="/root/cdk/etcd/client-cert.pem"
--etcd-keyfile="/root/cdk/etcd/client-key.pem"
--etcd-servers="https://127.0.1.1:2379,https://127.0.1.1:2379,https://127.0.1.1:2379"
--storage-backend="etcd3"
--tls-cert-file="/root/cdk/server.crt"
--tls-private-key-file="/root/cdk/server.key"
--insecure-bind-address="127.0.0.1"
--insecure-port="8080"
--audit-log-maxbackup="9"
--audit-log-maxsize="100"
--audit-log-path="/root/cdk/audit/audit.log"
--audit-policy-file="/root/cdk/audit/audit-policy.yaml"
--basic-auth-file="/root/cdk/basic_auth.csv"
--client-ca-file="/root/cdk/ca.crt"
--requestheader-allowed-names="system:kube-apiserver"
--requestheader-client-ca-file="/root/cdk/ca.crt"
--requestheader-extra-headers-prefix="X-Remote-Extra-"
--requestheader-group-headers="X-Remote-Group"
--requestheader-username-headers="X-Remote-User"
--service-account-key-file="/root/cdk/serviceaccount.key"
--token-auth-file="/root/cdk/known_tokens.csv"
--authorization-mode="AlwaysAllow"
--admission-control="NamespaceLifecycle,LimitRanger,ServiceAccount,PersistentVolumeLabel,DefaultStorageClass,DefaultTolerationSeconds,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,ResourceQuota"
--allow-privileged=false
--enable-aggregator-routing
--kubelet-certificate-authority="/root/cdk/ca.crt"
--kubelet-client-certificate="/root/cdk/client.crt"
--kubelet-client-key="/root/cdk/client.key"
--kubelet-preferred-address-types="[InternalIP,Hostname,InternalDNS,ExternalDNS,ExternalIP]"
--proxy-client-cert-file="/root/cdk/client.crt"
--proxy-client-key-file="/root/cdk/client.key"
--service-cluster-ip-range="172.18.0.0/16"
--v="4"

#############################################################
juju status

tnode01: Tue Jun 4 18:12:06 2019

Model  Controller  Cloud/Region  Version  SLA          Timestamp
maas   maas        maas          2.4.7    unsupported  18:12:07+09:00

App                    Version        Status   Scale  Charm                  Store       Rev  OS      Notes
canal                  0.10.0/2.6.12  active   5      canal                  jujucharms  604  ubuntu
easyrsa                3.0.1          active   1      easyrsa                jujucharms  231  ubuntu
etcd                   3.2.10         active   3      etcd                   jujucharms  426  ubuntu
kubeapi-load-balancer  1.10.3         active   1      kubeapi-load-balancer  jujucharms  58   ubuntu
kubernetes-master      1.12.8         blocked  2      kubernetes-master      jujucharms  678  ubuntu
kubernetes-worker      1.12.8         active   3      kubernetes-worker      jujucharms  536  ubuntu

Unit                      Workload     Agent      Machine  Public address  Ports     Message
easyrsa/0*                active       idle       0        node-01.maas              Certificate Authority connected.
etcd/0                    active       idle       1        node-02.maas    2379/tcp  Healthy with 3 known peers
etcd/1*                   active       idle       2        node-03.maas    2379/tcp  Healthy with 3 known peers
etcd/2                    active       idle       3        node-04.maas    2379/tcp  Healthy with 3 known peers
kubeapi-load-balancer/0*  active       idle       4        node-05.maas    443/tcp   Loadbalancer ready.
kubernetes-master/0       maintenance  idle       5        node-06.maas    6443/tcp  Writing kubeconfig file.
  canal/4                 active       idle                node-06.maas              Flannel subnet 172.19.88.1/24
kubernetes-master/1*      blocked      idle       6        node-07.maas    6443/tcp  Stopped services: kube-apiserver
  canal/3                 active       idle                node-07.maas              Flannel subnet 172.19.62.1/24
kubernetes-worker/0*      active       executing  7        node-08.maas              Kubernetes worker running.
  canal/0*                active       idle                node-08.maas              Flannel subnet 172.19.63.1/24
kubernetes-worker/1       active       executing  8        node-09.maas              Kubernetes worker running.
  canal/2                 active       idle                node-09.maas              Flannel subnet 172.19.19.1/24
kubernetes-worker/2       active       executing  9        node-10.maas              Kubernetes worker running.
  canal/1                 active       idle                node-10.maas              Flannel subnet 172.19.57.1/24

Entity  Meter status  Message
model   amber         user verification pending

Machine  State    DNS           Inst id              Series  AZ  Message
0        started  node-01.maas  manual:node-01.maas  xenial      Manually provisioned machine
1        started  node-02.maas  manual:node-02.maas  xenial      Manually provisioned machine
2        started  node-03.maas  manual:node-03.maas  xenial      Manually provisioned machine
3        started  node-04.maas  manual:node-04.maas  xenial      Manually provisioned machine
4        started  node-05.maas  manual:node-05.maas  xenial      Manually provisioned machine
5        started  node-06.maas  manual:node-06.maas  xenial      Manually provisioned machine
6        started  node-07.maas  manual:node-07.maas  xenial      Manually provisioned machine
7        started  node-08.maas  manual:node-08.maas  xenial      Manually provisioned machine
8        started  node-09.maas  manual:node-09.maas  xenial      Manually provisioned machine
9        started  node-10.maas  manual:node-10.maas  xenial      Manually provisioned machine

Seyeong Kim (seyeongkim) on 2019-06-04
tags: added: sts
Felipe Reyes (freyes) wrote :

The connection string comes from the etcd charm itself, which sets it on the relation under the key "connection_string". The string is built from the addresses returned by get_ingress_addresses() [0], which relies on network_get() and falls back to unit_private_ip(), so I'm adding a task for the etcd charm as well.

[0] https://github.com/juju-solutions/layer-etcd/blob/master/lib/etcd_lib.py#L4
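
For illustration only, here is a minimal Python sketch of the behaviour described above: choose an ingress address, fall back to the private address, and join one endpoint per unit into a connection string. This is not the charm's code. The helper names pick_ingress_address and build_connection_string are hypothetical, and the loopback check is an assumption about how a charm could guard against this bug.

```python
import ipaddress


def is_loopback(address):
    """Return True if address is a literal loopback IP such as 127.0.1.1."""
    try:
        return ipaddress.ip_address(address).is_loopback
    except ValueError:
        # Not an IP literal (for example a hostname); cannot judge it here.
        return False


def pick_ingress_address(ingress_addresses, private_address):
    """Hypothetical helper: first non-loopback ingress address, else the fallback."""
    for address in ingress_addresses:
        if not is_loopback(address):
            return address
    return private_address


def build_connection_string(addresses, port=2379, scheme="https"):
    """Join one endpoint per etcd unit, in the shape seen in --etcd-servers."""
    return ",".join("{}://{}:{}".format(scheme, a, port) for a in addresses)


# Without a guard, three units that each report 127.0.1.1 yield the broken value.
broken = build_connection_string(["127.0.1.1"] * 3)

# With the guard, each unit falls back to its private address instead.
units = ["node-02.maas", "node-03.maas", "node-04.maas"]
guarded = build_connection_string(
    [pick_ingress_address(["127.0.1.1"], private) for private in units]
)
```

Here broken reproduces the --etcd-servers value from the report, while guarded uses the node-0x.maas names returned by unit-get.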

Seyeong Kim (seyeongkim) wrote :

127.0.1.1 comes from ingress-addresses, as shown below.
I haven't analyzed further yet, but which code sets ingress-addresses?

I think it is the NetworksForRelation function in state/relationunit.go.

Still analyzing...

juju run --unit etcd/2 "network-get --format yaml db"
bind-addresses:
- macaddress: ""
  interfacename: ""
  addresses:
  - hostname: node-04.maas
    address: 127.0.1.1
    cidr: ""
egress-subnets:
- 10.0.0.6/32
ingress-addresses:
- 127.0.1.1
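
A quick way to spot this condition on other units is to check the parsed network-get data for loopback ingress addresses. The sketch below is only an illustration: it assumes the YAML above has already been loaded into a Python dict, and the function name loopback_ingress is hypothetical.

```python
import ipaddress

# Data shaped like the network-get output above, already parsed from YAML.
network_info = {
    "bind-addresses": [
        {
            "macaddress": "",
            "interfacename": "",
            "addresses": [
                {"hostname": "node-04.maas", "address": "127.0.1.1", "cidr": ""}
            ],
        }
    ],
    "egress-subnets": ["10.0.0.6/32"],
    "ingress-addresses": ["127.0.1.1"],
}


def loopback_ingress(info):
    """Return the ingress addresses that are loopback IPs."""
    bad = []
    for address in info.get("ingress-addresses", []):
        try:
            if ipaddress.ip_address(address).is_loopback:
                bad.append(address)
        except ValueError:
            # Hostnames are skipped; only IP literals are checked.
            continue
    return bad


suspicious = loopback_ingress(network_info)
```

A non-empty result means other units on the relation will be told to connect to themselves.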

George Kraft (cynerva) wrote :

Looks like a Juju bug. Why is network-get returning 127.0.1.1 as the ingress address?

Felipe Reyes (freyes) wrote :

@seyeong,

> This is case on manual deployment environment.

Does this mean you are using Juju's manual provider?

Seyeong Kim (seyeongkim) wrote :

@freyes

Right, I added the machines manually first, then deployed units on those machines.

Richard Harding (rharding) wrote :

Can you post the network setup of the machine (including all devices), please?

Changed in juju:
status: New → Incomplete
Seyeong Kim (seyeongkim) wrote :

@rharding

I have pasted the output of ip addr. Do you need anything else?

https://pastebin.ubuntu.com/p/5tcYFtPNGg/

Felipe Reyes (freyes) wrote :

Setting the juju task to new as Seyeong provided the info requested.

Changed in juju:
status: Incomplete → New

I was trying to deploy an OpenStack bundle [1] with the Juju manual provider and noticed the same bug with the percona-cluster charm:

$ juju run --unit neutron-api/0 "relation-get -r shared-db:28 - mysql/0"
allowed_units: neutron-api/0
db_host: 127.0.1.1
egress-subnets: 10.230.56.251/32
ingress-address: z-rotomvm21
password: 7yyftsyMk6ffsmSmzFPy22ZxkcLfV8Y5
private-address: z-rotomvm21

[1] https://pastebin.ubuntu.com/p/RXvT3csvzy/

Seyeong Kim (seyeongkim) wrote :

I managed to find out why this is happening in my environment.

I found that network-get gets its information from local DNS resolution, and 127.0.1.1 is returned when running nslookup for node-12.maas (the affected machine).

This is because manage_etc_hosts: true is the default for MAAS-deployed machines.

So I set user_data when deploying the MAAS machine, as below:

maas xtrusia machine deploy MACHINE_NAME distro_series=xenial user_data=I2Nsb3VkLWNvbmZpZwptYW5hZ2VfZXRjX2hvc3RzOiBmYWxzZQo=

The user_data value decodes to:

#cloud-config
manage_etc_hosts: false
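
The user_data argument is the base64 encoding of that cloud-config text. A minimal Python sketch of producing and checking the value, using only the standard library (the function names are illustrative):

```python
import base64

# The cloud-config shown above, including the trailing newline.
CLOUD_CONFIG = "#cloud-config\nmanage_etc_hosts: false\n"


def encode_user_data(text):
    """Base64-encode cloud-config text for use as a user_data value."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_user_data(encoded):
    """Decode a user_data value back to readable cloud-config text."""
    return base64.b64decode(encoded).decode("utf-8")


encoded = encode_user_data(CLOUD_CONFIG)
```

Decoding the string passed on the command line above gives back the same two-line cloud-config.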

After that, the symptom was gone.

Thanks.

no longer affects: charm-etcd
no longer affects: charm-kubernetes-master
Changed in juju:
status: New → Triaged
importance: Undecided → Low
tags: added: network
Felipe Reyes (freyes) wrote :

This bug was hit again, this time during the deployment of a proof of concept. The characteristics of the scenario were the same: a manual provider where etcd misbehaved because 127.0.0.1 was registered as the ingress address, and from that point on all the services related to etcd tried to use it.

On site, /etc/hosts was fixed, but it looks like etcd had already stored the incorrect address by calling set_db_ingress_address() [0], which internally calls conversation.set_remote() [1], and that ultimately boils down to a relation-set [2].

[0] https://github.com/charmed-kubernetes/layer-etcd/blob/master/reactive/etcd.py#L242
[1] https://github.com/juju-solutions/interface-etcd/blob/master/peers.py#L57
[2] https://github.com/juju-solutions/charms.reactive/blob/master/charms/reactive/relations.py#L773

Nick Niehoff (nniehoff) wrote :

To clarify Felipe's comment, this was hit again on a manual cloud with /etc/hosts configured with:

127.0.1.1 hostname.fqdn hostname

For completeness, it was 127.0.1.1 not 127.0.0.1 which matches Seyeong's findings as well.
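
To check a machine for this condition before deploying, something like the following Python sketch could be used. It is only an illustration: the function name loopback_names is hypothetical, it works on the text of a hosts file rather than reading /etc/hosts itself, and it skips only the literal name localhost.

```python
import ipaddress


def loopback_names(hosts_text):
    """Map each non-localhost name bound to a loopback address to that address."""
    found = {}
    for line in hosts_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        try:
            if not ipaddress.ip_address(address).is_loopback:
                continue
        except ValueError:
            # Skip entries whose first field is not a plain IP literal.
            continue
        for name in names:
            if name != "localhost":
                found[name] = address
    return found


sample = "127.0.0.1 localhost\n127.0.1.1 hostname.fqdn hostname\n"
hits = loopback_names(sample)
```

Any machine hostname that appears in the result will resolve locally to a loopback address, which matches the findings above.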

Tim Penhey (thumper) wrote :

The latest occurrences happened with Juju 2.7.

Ian Booth (wallyworld) on 2020-05-26
Changed in juju:
milestone: none → 2.7.7
importance: Low → High
Changed in juju:
status: Triaged → In Progress
assignee: nobody → Joseph Phillips (manadart)
Joseph Phillips (manadart) wrote :

Can we get the output of "juju show-machine x" where the offending unit is deployed?

Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released