Tigera units do not become active after the first installation of the bundle

Bug #2053143 reported by Ebrar Leblebici
This bug affects 1 person
Affects: Charm Calico Enterprise
Status: Fix Released
Importance: High
Assigned to: Kevin W Monroe
Milestone: 1.29+ck1

Bug Description

Hi,

After the deployment of the bundle, most of the tigera units become stuck in "waiting" state and some of them become stuck in "error" state.

For the "waiting" units we have the message:

tigera-operator POD not found

And for the "error" units we have the message:

hook-failed: "calico-enterprise-relation-changed"

The error output for the unit in the "error" state is here: https://pastebin.ubuntu.com/p/xrYrTXK3JV/

It seems to fail when it tries to label the nodes, because the apiserver is not running yet.

After running the "juju resolved" command for the units in "error" state, after approx. 10-15 minutes all the tigera units become "active" and "idle". So, there may be a race condition?

Regards,
Ebrar
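
The suspected race above (labeling nodes before the apiserver is up) can be illustrated with a small polling sketch. This is not code from the charm: `wait_until_ready`, `label_nodes_when_ready`, and the `api_ready` / `label_nodes` callables are hypothetical names used only for illustration, and the retry counts are arbitrary assumptions.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 5,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() up to `attempts` times; return True as soon as it succeeds."""
    for attempt in range(attempts):
        if probe():
            return True
        # Only pause between attempts, not after the last one.
        if attempt < attempts - 1:
            sleep(delay)
    return False


def label_nodes_when_ready(
    api_ready: Callable[[], bool],   # hypothetical: True once the apiserver answers
    label_nodes: Callable[[], None],  # hypothetical: the step that failed in the hook
    attempts: int = 5,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Label nodes only once the API server responds; otherwise report 'waiting'."""
    if not wait_until_ready(api_ready, attempts, delay, sleep):
        return "waiting"  # stay in a waiting state instead of failing the hook
    label_nodes()
    return "active"
```

With a probe that only succeeds on its third call, `label_nodes_when_ready` returns "active" after three probes; with a probe that never succeeds it returns "waiting" and never calls `label_nodes`.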

Adrian Flynn (flynna) wrote :

This bug requires manual user intervention and is a blocker to deploying Charmed K8s in nightly automated cluster builds.

Kevin W Monroe (kwmonroe) wrote :

Thanks for the report. Will you share the bundle used to hit this? Or otherwise let me know the steps to reproduce?

Changed in charm-calico-enterprise:
assignee: nobody → Kevin W Monroe (kwmonroe)
importance: Undecided → High
milestone: none → 1.29+ck1
status: New → Incomplete
Adrian Flynn (flynna) wrote :

Unable to share the bundle on a publicly accessible site.

Changed in charm-calico-enterprise:
status: Incomplete → In Progress
Kevin W Monroe (kwmonroe) wrote :

Taking a closer look at the failure from the description, I can tell the charm is going through lifecycle hooks faster than it should; e.g. attempting to use kubectl before the api server is up.

This charm has significant config requirements, and if they're not all provided at deployment (and if the hook timing isn't quite right), we'll end up in the described failed state.

I made adjustments to the charm so that it only proceeds as far as the provided config allows. In other words, if you haven't provided image registry credentials, it won't try to pull images. If you haven't provided CIDR ranges, it won't attempt to configure BGP peering data.

PR for review:

https://github.com/charmed-kubernetes/charm-calico-enterprise/pull/5
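
For readers following along, the "only proceed as far as the provided config allows" idea can be sketched roughly as below. This is an illustrative sketch, not the code in the PR: the step names, the config keys (`registry_credentials`, `cidr_ranges`), and the `plan_steps` helper are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping of each deployment step to the config keys it needs.
# The real charm's option names may differ.
STEPS: List[Tuple[str, List[str]]] = [
    ("pull_images", ["registry_credentials"]),
    ("configure_bgp_peering", ["cidr_ranges"]),
]


def plan_steps(config: Dict[str, str], api_ready: bool) -> Tuple[List[str], List[str]]:
    """Return (steps that can run now, reasons the remaining steps are blocked)."""
    if not api_ready:
        # Nothing that needs kubectl should run before the API server is up.
        return [], ["API server not reachable yet"]

    runnable: List[str] = []
    blocked: List[str] = []
    for step, required in STEPS:
        missing = [key for key in required if not config.get(key)]
        if missing:
            blocked.append(f"{step}: missing {', '.join(missing)}")
        else:
            runnable.append(step)
    return runnable, blocked
```

For example, `plan_steps({"registry_credentials": "user:pass"}, api_ready=True)` returns `(["pull_images"], ["configure_bgp_peering: missing cidr_ranges"])`, so a unit could report a blocked or waiting status with the reason instead of failing a hook.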

Adrian Flynn (flynna) wrote :

Over the last 24 hours I have had to deploy a charmed cluster 4 times before it would deploy successfully. This is really not a place we want to be when deploying Kubernetes clusters.

Is the PR above likely to fix things? Any ETA when the fix will be available?

I have updated support case 00380420.

Thanks

Regards

Adrian

Changed in charm-calico-enterprise:
status: In Progress → Fix Committed
Kevin W Monroe (kwmonroe) wrote :

Overnight builds should get this landed in the 1.29/candidate channel, with promotion to 1.29/stable shortly after (pending CI for the whole 1.29+ck1 release).

Ebrar Leblebici (birru2) wrote :

Thank you, Kevin, for your effort on this one. However, we also need a backport to 1.28. Can you please help with that, too?

Changed in charm-calico-enterprise:
status: Fix Committed → Fix Released