2.8.8: CAAS applications stuck in "installing agent"

Bug #1914144 reported by Jason Hobbs
This bug affects 1 person

Affects: Canonical Juju | Status: Invalid | Importance: Critical | Assigned to: Yang Kelvin Liu | Milestone: 2.8.8

Bug Description

On juju 2.8.8, I bootstrapped Juju on top of Charmed Kubernetes (CK) running in AWS.

I then deployed kubeflow.

Out of 32 applications, only one came up:
https://paste.ubuntu.com/p/gXWHGrXbJP/

The others don't have any pods:
https://paste.ubuntu.com/p/dYVfCRxkCq/

Debug log from the controller:
https://paste.ubuntu.com/p/tdnpvQsQZR/

logs from the model operator pod:
https://paste.ubuntu.com/p/GcfVqtsfqZ/

Here's an example test run with crashdumps:
https://solutions.qa.canonical.com/testruns/testRun/030f3bde-386f-4fe5-b71a-b3d3c74bbb67

Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote :

This PR should fix the issue in 2.8: https://github.com/juju/juju/pull/12580

Changed in juju:
importance: Undecided → Critical
status: New → Triaged
status: Triaged → In Progress
assignee: nobody → Yang Kelvin Liu (kelvin.liu)
John A Meinel (jameinel)
Changed in juju:
status: In Progress → Fix Committed
milestone: none → 2.8.8
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Seems like we're still hitting this with what I believe is the new build:
  2.8/candidate: 2.8.8 2021-02-03 (15283) 72MB classic

juju status:
https://paste.ubuntu.com/p/NskCDbhwKF/

get pods:
https://paste.ubuntu.com/p/J9QYH4pnQS/

debug log from the controller:
https://paste.ubuntu.com/p/jCQdQghySr/

logs from the model operator pod:
https://paste.ubuntu.com/p/PmHgGXbynq/

Changed in juju:
status: Fix Committed → New
Revision history for this message
John A Meinel (jameinel) wrote :

That debug log looks like a rather unhappy controller:
controller-0: 16:13:33 INFO juju.state using client-side transactions
controller-0: 16:13:58 WARNING juju.worker.httpserver http: TLS handshake error from 10.1.84.1:43218: EOF

That line about client-side transactions is issued whenever we instantiate a new state object, which should only be happening infrequently.
On a long-lived LXD controller of mine, I see only 10 such lines in total.

It would be good to see the pod and logs for the controller itself. It *looks* like it might be bouncing continuously, but I would expect to see more of that in debug-log if that were true.

Revision history for this message
John A Meinel (jameinel) wrote :

controller-0: 16:47:37 WARNING juju.worker.httpserver http: TLS handshake error from 10.1.15.0:3606: EOF

These are also surprising. They look like something is trying to connect but not actually completing a TLS handshake (e.g., making a plain HTTP request rather than an HTTPS one), or possibly it doesn't have the right CA cert and so doesn't end up trusting the controller certificate.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Looks like it's happening with 2.8.7 too. Leaving the cdo-release-blocker tag in place because we can't do a release without testing this.

Here are pods and logs from the controller too; let me know if there is something else you want to see:
https://paste.ubuntu.com/p/M3ff2vpCRx/

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

It looks like something is wrong with our CK.

We killed a controller pod and it didn't come back up on its own, which it should have:

https://paste.ubuntu.com/p/vWpz262cJd/

We'll need to look into why CK isn't working.

It was working last week, here was a successful deploy:
https://solutions.qa.canonical.com/testruns/testRun/151910a6-1661-4779-9358-74849494685b

One difference I see is the mysql-innodb-cluster version:
successful: mysql-innodb-cluster 8.0.22 active 3 mysql-innodb-cluster jujucharms 1 ubuntu
unsuccessful: mysql-innodb-cluster 8.0.23 active 3 mysql-innodb-cluster jujucharms 1 ubuntu

Anyhow, will need to dig more into CK state next.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We think CK is causing problems because pods aren't being restarted. If you look at the log here, we delete the juju model operator pod and it doesn't come back up; instead the StatefulSet reports a FailedCreate condition:

https://paste.ubuntu.com/p/vWpz262cJd/

I updated the bug description with a link to a crashdump.

description: updated
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Lots of errors like this in the kubernetes master journal log:

Feb 04 16:43:50 ip-172-31-44-198 kube-controller-manager.daemon[146381]: I0204 16:43:50.000166 146381 event.go:291] "Event occurred" object="kubeflow/minio-operator" kind="StatefulSet" apiVersion="apps/v1" type="Warning" reason="FailedCreate" message="create Pod minio-operator-0 in StatefulSet minio-operator failed error: Internal error occurred: failed calling webhook \"admission-webhook.kubeflow.org\": Post \"https://admission-webhook.kubeflow.svc:443/apply-poddefault?timeout=30s\": x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0"
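That error comes from Go 1.15+ (which recent kube-apiserver builds use for webhook calls): the x509 library no longer matches hostnames against the legacy Common Name field, only against Subject Alternative Names. A minimal standalone Go sketch of the difference — the self-signed helper and hostname here are illustrative, not the charm's actual code:

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// selfSigned generates a throwaway self-signed certificate with the
// given Common Name and a (possibly empty) list of DNS SANs.
func selfSigned(cn string, sans []string) *x509.Certificate {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: cn},
		DNSNames:     sans, // Go 1.15+ matches hostnames against SANs only
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	return cert
}

func main() {
	host := "admission-webhook.kubeflow.svc"

	// CN only, no SANs: hostname verification fails on Go 1.15+.
	fmt.Println("CN-only: ", selfSigned(host, nil).VerifyHostname(host))

	// Same name carried as a SAN: verification succeeds.
	fmt.Println("with SAN:", selfSigned(host, []string{host}).VerifyHostname(host))
}
```

So a webhook serving cert generated with only CN=admission-webhook.kubeflow.svc will be rejected by the apiserver until it is regenerated with that name in the SAN list (the GODEBUG=x509ignoreCN=0 escape hatch the error mentions was temporary and has since been removed from Go).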

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

According to knkski this may be a charm issue: "we just have to update the certificate generation code in the admission-webhook charm"; see https://github.com/juju-solutions/bundle-kubeflow/issues/308

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This is working now with updates to the kubeflow charms.

Changed in juju:
status: New → Invalid
tags: removed: cdo-release-blocker