nginx configuration is missing on non-leader units when VIP is set

Bug #2017814 reported by Jake Nabasny
Affects: Kubernetes API Load Balancer
Status: Fix Released
Importance: High
Assigned to: Adam Dyess
Milestone: 1.27+ck1

Bug Description

== SUMMARY ==

When either the "ha-cluster-vip" or "loadbalancer-ips" options are set in the kubeapi-load-balancer charm, the non-leader units do not receive the /etc/nginx/sites-available/apilb file. This is a critical omission because those units are not able to proxy requests. If the VIP points to one of them, all kubectl commands fail, as well as any functionality dependent on the load balancer.
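
A quick way to confirm which units rendered the config (a sketch using the same `juju run` syntax used elsewhere in this report; unit names will vary per deployment):

```
juju run -a kubeapi-load-balancer -- ls -l /etc/nginx/sites-available/apilb
```

Only the leader unit returns the file listing; the non-leader units report that the file does not exist.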

== WORKAROUND ==

1. Copy the missing nginx configuration from the leader to the affected units.

2. Restart nginx.
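
For example, a minimal sketch of those two steps (assuming the leader is kubeapi-load-balancer/1 and the affected follower is kubeapi-load-balancer/0, and assuming the charm enables the site via a symlink in sites-enabled; adjust unit numbers and paths to your deployment):

```
# Pull the rendered config from the leader unit
juju scp kubeapi-load-balancer/1:/etc/nginx/sites-available/apilb ./apilb

# Push it to an affected unit, enable it, and restart nginx
juju scp ./apilb kubeapi-load-balancer/0:/tmp/apilb
juju ssh kubeapi-load-balancer/0 '
  sudo mv /tmp/apilb /etc/nginx/sites-available/apilb &&
  sudo ln -sf /etc/nginx/sites-available/apilb /etc/nginx/sites-enabled/apilb &&
  sudo nginx -t && sudo systemctl restart nginx
'
```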

Adam Dyess (addyess)
Changed in charm-kubeapi-load-balancer:
milestone: none → 1.27+ck1
status: New → Triaged
Adam Dyess (addyess) wrote:

Can you capture the flags from all the units? I don't immediately see any reason why the non-leader units wouldn't write the config during the install_load_balancer hook, unless the server or API server certs were missing.

```
juju run -a kubeapi-load-balancer -- charms.reactive get_flags
```

Do the non-leaders have:
"Skipping due to missing cert"
or
"Skipping due to requests not ready"

in the logs from /var/log/juju/unit-kubeapi-load-balancer-*.log?

Navdeep (navdeep-bjn) wrote (last edit):

I am using Vault as the cert store, which gets initialized after everything in the cluster, including the load balancers, is up. So there is a window in which everything in the cluster has come up and is waiting for certs, and the certs are only issued after Vault is initialized.

After Vault is initialized and the certs are issued, only the leader gets/updates the config; the rest of the kubeapi-load-balancer units don't.

I do see "Skipping due to requests not ready" in the log files on the 2 non-leader units. The same message is not present in the leader's logs.

ubuntu@juju-d6fbef-3-lxd-3:/var/log/juju$ grep Skipping unit-kubeapi-load-balancer-0.log
2023-04-27 22:17:44 INFO unit.kubeapi-load-balancer/0.juju-log server.go:316 certificates:17: Skipping due to requests not ready
2023-04-27 22:18:36 INFO unit.kubeapi-load-balancer/0.juju-log server.go:316 Skipping due to requests not ready
2023-04-27 22:23:50 INFO unit.kubeapi-load-balancer/0.juju-log server.go:316 Skipping due to requests not ready
2023-04-27 22:28:18 INFO unit.kubeapi-load-balancer/0.juju-log server.go:316 Skipping due to requests not ready

ubuntu@juju-d6fbef-5-lxd-3:/var/log/juju$ grep Skipping unit-kubeapi-load-balancer-2.log
2023-04-27 22:17:44 INFO unit.kubeapi-load-balancer/2.juju-log server.go:316 certificates:17: Skipping due to requests not ready
2023-04-27 22:18:22 INFO unit.kubeapi-load-balancer/2.juju-log server.go:316 Skipping due to requests not ready
2023-04-27 22:24:21 INFO unit.kubeapi-load-balancer/2.juju-log server.go:316 Skipping due to requests not ready

The workaround I am doing right now to get the cluster working is a script that does the following:

1. Run `service juju-machine-XXX stop` on any kubeapi-load-balancer unit that has port 443 exposed (at cluster initialization that is only the leader).
2. Wait for a new leader to be elected. This populates the nginx config on the new leader unit.
3. Check whether the cluster has 3 kubeapi-load-balancer units with port 443 exposed.
4. If not, repeat step 1.
5. Once there are 3 kubeapi-load-balancer units with port 443 exposed, restart the Juju agent on all 3 units.

This populates the config on all 3 nodes, as each unit becomes leader at some point, and the workers can register and start talking to the control plane.
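
Roughly, the loop looks like the sketch below (an illustration only: unit and machine numbers are placeholders, the machine agent service name jujud-machine-<N> is an assumption about the Juju version in use, and `opened-ports` is a hook tool invoked through `juju run`):

```
# 1. See which kubeapi-load-balancer units have opened port 443
for unit in kubeapi-load-balancer/{0,1,2}; do
    echo "== $unit =="
    juju run --unit "$unit" -- opened-ports
done

# 2. Stop the machine agent on the unit currently holding 443 to force a new
#    leader election (replace 3 with that unit's machine number)
juju ssh kubeapi-load-balancer/0 'sudo systemctl stop jujud-machine-3'

# 3-5. Wait for a new leader, re-check opened ports, and repeat until all three
#      units show 443/tcp, then start the stopped jujud-machine-* services again
```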

George Kraft (cynerva) wrote:

This is caused by an interaction between LBConsumers.all_requests[1] and install_load_balancer[2]. Non-leader units are not able to see any requests from the relation. With no visible requests, they will always skip rendering site config.

To fix this, I think we either need to update all_requests so that non-leader units can see the requests (taking care to ensure that other charms aren't broken by this change), or we need to make the kubeapi-load-balancer leader share data with the non-leader units so they can proceed.

[1]: https://github.com/juju-solutions/loadbalancer-interface/blob/47babea8e654318adc7d2874eaa8fd9230ca2bd9/loadbalancer_interface/provides.py#L53-L56
[2]: https://github.com/charmed-kubernetes/charm-kubeapi-load-balancer/blob/669945bd901191a5b58053ccb1e134f6939dc6c6/reactive/load_balancer.py#L161-L163
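
A quick way to observe that split from outside the charm (a sketch; it assumes the `is-leader` hook tool is reachable via `juju run --unit` and that the unit numbers run 0-2):

```
# Correlate leadership with whether the site config was rendered
for unit in kubeapi-load-balancer/{0,1,2}; do
    echo "== $unit =="
    juju run --unit "$unit" -- is-leader
    juju ssh "$unit" 'test -f /etc/nginx/sites-available/apilb && echo present || echo missing'
done
```

Only the unit reporting True should show the apilb file as present.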

tags: added: backport-needed
Adam Dyess (addyess) wrote:

I did some testing with these PRs today and the results look promising. The non-leader units (followers) configured nginx as expected.

https://github.com/charmed-kubernetes/charm-kubeapi-load-balancer/pull/21
https://github.com/juju-solutions/loadbalancer-interface/pull/16

Changed in charm-kubeapi-load-balancer:
assignee: nobody → Adam Dyess (addyess)
status: Triaged → In Progress
importance: Undecided → High
Adam Dyess (addyess) wrote:

https://github.com/juju-solutions/loadbalancer-interface/pull/16 is merged, and a new loadbalancer-interface package has been pushed to PyPI.

Adam Dyess (addyess)
Changed in charm-kubeapi-load-balancer:
status: In Progress → Fix Committed
Navdeep (navdeep-bjn) wrote:

Will it be available in the 1.24 channel, or is it still pending a backport?

Adam Dyess (addyess) wrote:

Hi Navdeep. We are planning to backport to 1.24, 1.25, 1.26, and 1.27 in the next 2 weeks, with validation testing before releasing to the stable branches of those tracks. Do you need something untested to try?

Navdeep (navdeep-bjn) wrote:

No, I am ok with the timeline. Just need the fix in 1.24.

Adam Dyess (addyess) wrote:

The fix to the bug is now in the following charm channels:
1.27/candidate
1.26/candidate
1.25/candidate
1.24/candidate

$ juju upgrade-charm kubeapi-load-balancer --switch --channel=1.xx/candidate

Run the command above if you're desperate to try it out, but I'd recommend waiting for the stable channel release for production environments.

tags: removed: backport-needed
Adam Dyess (addyess) wrote:

Demonstration of functionality in ck-1.24, 1.25, and 1.26
https://paste.ubuntu.com/p/Cp7NtPHnrb/

Navdeep (navdeep-bjn) wrote:

I verified the fix in one of our test clusters. Please let me know once it is available in the stable branch.

Changed in charm-kubeapi-load-balancer:
status: Fix Committed → Fix Released