nginx configuration is missing on non-leader units when VIP is set

Bug #2017814 reported by Jake Nabasny
Affects: Kubernetes API Load Balancer
Status: Fix Released
Importance: High
Assigned to: Adam Dyess
Milestone: 1.27+ck1

Bug Description

== SUMMARY ==

When either the "ha-cluster-vip" or "loadbalancer-ips" options are set in the kubeapi-load-balancer charm, the non-leader units do not receive the /etc/nginx/sites-available/apilb file. This is a critical omission because those units are not able to proxy requests. If the VIP points to one of them, all kubectl commands fail, as well as any functionality dependent on the load balancer.
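
A quick way to confirm which units rendered the config (a sketch using the same `juju run` syntax used elsewhere in this report; unit names will vary per deployment):

```
juju run -a kubeapi-load-balancer -- ls -l /etc/nginx/sites-available/apilb
```

Only the leader unit returns the file listing; the non-leader units report that the file does not exist.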

== WORKAROUND ==

1. Copy the missing nginx configuration from the leader to the affected units.

2. Restart nginx.
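
For example, a minimal sketch of those two steps (assuming the leader is kubeapi-load-balancer/1 and the affected follower is kubeapi-load-balancer/0, and assuming the charm enables the site via a symlink in sites-enabled; adjust unit numbers and paths to your deployment):

```
# Pull the rendered config from the leader unit
juju scp kubeapi-load-balancer/1:/etc/nginx/sites-available/apilb ./apilb

# Push it to an affected unit, enable it, and restart nginx
juju scp ./apilb kubeapi-load-balancer/0:/tmp/apilb
juju ssh kubeapi-load-balancer/0 '
  sudo mv /tmp/apilb /etc/nginx/sites-available/apilb &&
  sudo ln -sf /etc/nginx/sites-available/apilb /etc/nginx/sites-enabled/apilb &&
  sudo nginx -t && sudo systemctl restart nginx
'
```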

Adam Dyess (addyess)
Changed in charm-kubeapi-load-balancer:
milestone: none → 1.27+ck1
status: New → Triaged
Adam Dyess (addyess) wrote:

Can you capture the flags from all the units? I don't immediately see any reason why the non-leader units wouldn't write the config during the install_load_balancer hook, unless the server or API server certs were missing.

```
juju run -a kubeapi-load-balancer -- charms.reactive get_flags
```

Do the non-leaders have:
"Skipping due to missing cert"
or
"Skipping due to requests not ready"

in the logs from /var/log/juju/unit-kubeapi-load-balancer-*.log?

Navdeep (navdeep-bjn) wrote (last edit):

I am using Vault as the cert store, which gets initialized after everything in the cluster, including the load balancers, is up. So there is a window in which everything in the cluster has come up and is waiting for certs, and the certs are only issued after Vault is initialized.

After Vault is initialized and the certs are issued, only the leader gets/updates the config; the rest of the kubeapi-load-balancer units don't.

I do see "Skipping due to requests not ready" in the log files on the 2 non-leader units. The same message is not present in the leader's logs.

ubuntu@juju-d6fbef-3-lxd-3:/var/log/juju$ grep Skipping unit-kubeapi-load-balancer-0.log
2023-04-27 22:17:44 INFO unit.kubeapi-load-balancer/0.juju-log server.go:316 certificates:17: Skipping due to requests not ready
2023-04-27 22:18:36 INFO unit.kubeapi-load-balancer/0.juju-log server.go:316 Skipping due to requests not ready
2023-04-27 22:23:50 INFO unit.kubeapi-load-balancer/0.juju-log server.go:316 Skipping due to requests not ready
2023-04-27 22:28:18 INFO unit.kubeapi-load-balancer/0.juju-log server.go:316 Skipping due to requests not ready

ubuntu@juju-d6fbef-5-lxd-3:/var/log/juju$ grep Skipping unit-kubeapi-load-balancer-2.log
2023-04-27 22:17:44 INFO unit.kubeapi-load-balancer/2.juju-log server.go:316 certificates:17: Skipping due to requests not ready
2023-04-27 22:18:22 INFO unit.kubeapi-load-balancer/2.juju-log server.go:316 Skipping due to requests not ready
2023-04-27 22:24:21 INFO unit.kubeapi-load-balancer/2.juju-log server.go:316 Skipping due to requests not ready

The workaround I am doing right now to get the cluster working is a script that does the following:

1. Run `service juju-machine-XXX stop` on any kubeapi-load-balancer unit that has port 443 exposed (at cluster initialization that is only the leader).
2. Wait for a new leader to be elected. This populates the nginx config on the new leader unit.
3. Check whether the cluster has 3 kubeapi-load-balancer units with port 443 exposed.
4. If not, repeat step 1.
5. Once there are 3 kubeapi-load-balancer units with port 443 exposed, restart the Juju agent on all 3 units.

This populates the config on all 3 nodes, as each unit becomes leader at some point, and the workers can register and start talking to the control plane.
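
Roughly, the loop looks like the sketch below (an illustration only: unit and machine numbers are placeholders, the machine agent service name jujud-machine-<N> is an assumption about the Juju version in use, and `opened-ports` is a hook tool invoked through `juju run`):

```
# 1. See which kubeapi-load-balancer units have opened port 443
for unit in kubeapi-load-balancer/{0,1,2}; do
    echo "== $unit =="
    juju run --unit "$unit" -- opened-ports
done

# 2. Stop the machine agent on the unit currently holding 443 to force a new
#    leader election (replace 3 with that unit's machine number)
juju ssh kubeapi-load-balancer/0 'sudo systemctl stop jujud-machine-3'

# 3-5. Wait for a new leader, re-check opened ports, and repeat until all three
#      units show 443/tcp, then start the stopped jujud-machine-* services again
```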

George Kraft (cynerva) wrote:

This is caused by an interaction between LBConsumers.all_requests[1] and install_load_balancer[2]. Non-leader units are not able to see any requests from the relation. With no visible requests, they will always skip rendering site config.

To fix this, I think we either need to update all_requests so that non-leader units can see the requests (taking care to ensure that other charms aren't broken by this change), or we need to make the kubeapi-load-balancer leader share data with the non-leader units so they can proceed.

[1]: https://github.com/juju-solutions/loadbalancer-interface/blob/47babea8e654318adc7d2874eaa8fd9230ca2bd9/loadbalancer_interface/provides.py#L53-L56
[2]: https://github.com/charmed-kubernetes/charm-kubeapi-load-balancer/blob/669945bd901191a5b58053ccb1e134f6939dc6c6/reactive/load_balancer.py#L161-L163
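
A quick way to observe that split from outside the charm (a sketch; it assumes the `is-leader` hook tool is reachable via `juju run --unit` and that the unit numbers run 0-2):

```
# Correlate leadership with whether the site config was rendered
for unit in kubeapi-load-balancer/{0,1,2}; do
    echo "== $unit =="
    juju run --unit "$unit" -- is-leader
    juju ssh "$unit" 'test -f /etc/nginx/sites-available/apilb && echo present || echo missing'
done
```

Only the unit reporting True should show the apilb file as present.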

tags: added: backport-needed
Adam Dyess (addyess) wrote:

I did some testing with these PRs today and the results look promising. The non-leader units (followers) configured nginx as expected.

https://github.com/charmed-kubernetes/charm-kubeapi-load-balancer/pull/21
https://github.com/juju-solutions/loadbalancer-interface/pull/16

Changed in charm-kubeapi-load-balancer:
assignee: nobody → Adam Dyess (addyess)
status: Triaged → In Progress
importance: Undecided → High
Adam Dyess (addyess) wrote:

https://github.com/juju-solutions/loadbalancer-interface/pull/16 is merged, and a new loadbalancer-interface package has been pushed to PyPI.

Adam Dyess (addyess)
Changed in charm-kubeapi-load-balancer:
status: In Progress → Fix Committed
Navdeep (navdeep-bjn) wrote:

Will it be available in the 1.24 channel, or is it still pending a backport?

Adam Dyess (addyess) wrote:

Hi Navdeep. We are planning to backport to 1.24, 1.25, 1.26, and 1.27 in the next 2 weeks, with validation testing before releasing to the stable branches of those tracks. Do you need something untested to try?

Navdeep (navdeep-bjn) wrote:

No, I am ok with the timeline. Just need the fix in 1.24.

Adam Dyess (addyess) wrote:

The fix to the bug is now in the following charm channels:
1.27/candidate
1.26/candidate
1.25/candidate
1.24/candidate

$ juju upgrade-charm kubeapi-load-balancer --switch --channel=1.xx/candidate

Run the command above if you're desperate to try it out, but I'd recommend waiting for the stable channel release for production environments.

tags: removed: backport-needed
Adam Dyess (addyess) wrote:

Demonstration of functionality in ck-1.24, 1.25, and 1.26
https://paste.ubuntu.com/p/Cp7NtPHnrb/

Navdeep (navdeep-bjn) wrote:

I verified the fix in one of our test clusters. Please let me know once it is available in the stable branch.

Changed in charm-kubeapi-load-balancer:
status: Fix Committed → Fix Released