2.8.8: CAAS applications stuck in "installing agent"

Bug #1914144 reported by Jason Hobbs
This bug affects 1 person

Affects: Canonical Juju | Status: Invalid | Importance: Critical | Assigned to: Yang Kelvin Liu | Milestone: 2.8.8

Bug Description

On juju 2.8.8, I bootstrapped Juju on top of Charmed Kubernetes (CK) running in AWS.

I then deployed kubeflow.

Out of 32 applications, only one came up:
https://paste.ubuntu.com/p/gXWHGrXbJP/

The others don't have any pods:
https://paste.ubuntu.com/p/dYVfCRxkCq/

Debug log from the controller:
https://paste.ubuntu.com/p/tdnpvQsQZR/

logs from the model operator pod:
https://paste.ubuntu.com/p/GcfVqtsfqZ/

Here's an example test run with crashdumps:
https://solutions.qa.canonical.com/testruns/testRun/030f3bde-386f-4fe5-b71a-b3d3c74bbb67

Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote :

This PR should fix the issue in 2.8: https://github.com/juju/juju/pull/12580

Changed in juju:
importance: Undecided → Critical
status: New → Triaged
status: Triaged → In Progress
assignee: nobody → Yang Kelvin Liu (kelvin.liu)
John A Meinel (jameinel)
Changed in juju:
status: In Progress → Fix Committed
milestone: none → 2.8.8
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Seems like we're still hitting this with what I believe is the new build:
  2.8/candidate: 2.8.8 2021-02-03 (15283) 72MB classic

juju status:
https://paste.ubuntu.com/p/NskCDbhwKF/

get pods:
https://paste.ubuntu.com/p/J9QYH4pnQS/

debug log from the controller:
https://paste.ubuntu.com/p/jCQdQghySr/

logs from the model operator pod:
https://paste.ubuntu.com/p/PmHgGXbynq/

Changed in juju:
status: Fix Committed → New
Revision history for this message
John A Meinel (jameinel) wrote :

That debug log looks like a rather unhappy controller:
controller-0: 16:13:33 INFO juju.state using client-side transactions
controller-0: 16:13:58 WARNING juju.worker.httpserver http: TLS handshake error from 10.1.84.1:43218: EOF

That line about client-side transactions is issued whenever we instantiate a new state object, which should only be happening infrequently.
On a long-lived LXD controller of mine, I see only 10 such lines in total.

It would be good to see the pod and logs for the controller itself. It *looks* like it might be bouncing continuously, but I would expect to see more of that in debug-log if that were true.

Revision history for this message
John A Meinel (jameinel) wrote :

controller-0: 16:47:37 WARNING juju.worker.httpserver http: TLS handshake error from 10.1.15.0:3606: EOF

These are also surprising. They look like something is trying to connect but not actually completing a TLS handshake (e.g., making a plain HTTP request rather than an HTTPS one), or possibly it doesn't have the right CA cert and so doesn't end up trusting the controller certificate.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Looks like it's happening with 2.8.7 too. Leaving the cdo-release-blocker tag in place because we can't do a release without testing this.

Here are pods and logs from the controller too; let me know if there is something else you want to see:
https://paste.ubuntu.com/p/M3ff2vpCRx/

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

It looks like something is wrong with our CK.

We killed a controller pod and it didn't come back up on its own, which it should have:

https://paste.ubuntu.com/p/vWpz262cJd/

We'll need to look into why CK isn't working.

It was working last week, here was a successful deploy:
https://solutions.qa.canonical.com/testruns/testRun/151910a6-1661-4779-9358-74849494685b

One difference I see is the mysql-innodb-cluster version:
successful: mysql-innodb-cluster 8.0.22 active 3 mysql-innodb-cluster jujucharms 1 ubuntu
unsuccessful: mysql-innodb-cluster 8.0.23 active 3 mysql-innodb-cluster jujucharms 1 ubuntu

Anyhow, will need to dig more into CK state next.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We think CK is causing problems because pods aren't being restarted. If you look at the log here, we delete the juju model operator pod and it doesn't come back up; instead the StatefulSet reports a FailedCreate condition:

https://paste.ubuntu.com/p/vWpz262cJd/

I updated the bug description with a link to a crashdump.

description: updated
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Lots of errors like this in the kubernetes master journal log:

Feb 04 16:43:50 ip-172-31-44-198 kube-controller-manager.daemon[146381]: I0204 16:43:50.000166 146381 event.go:291] "Event occurred" object="kubeflow/minio-operator" kind="StatefulSet" apiVersion="apps/v1" type="Warning" reason="FailedCreate" message="create Pod minio-operator-0 in StatefulSet minio-operator failed error: Internal error occurred: failed calling webhook \"admission-webhook.kubeflow.org\": Post \"https://admission-webhook.kubeflow.svc:443/apply-poddefault?timeout=30s\": x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0"
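That error comes from Go 1.15+ (which recent kube-apiserver builds use for webhook calls): the x509 library no longer matches hostnames against the legacy Common Name field, only against Subject Alternative Names. A minimal standalone Go sketch of the difference — the self-signed helper and hostname here are illustrative, not the charm's actual code:

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// selfSigned generates a throwaway self-signed certificate with the
// given Common Name and a (possibly empty) list of DNS SANs.
func selfSigned(cn string, sans []string) *x509.Certificate {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: cn},
		DNSNames:     sans, // Go 1.15+ matches hostnames against SANs only
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	return cert
}

func main() {
	host := "admission-webhook.kubeflow.svc"

	// CN only, no SANs: hostname verification fails on Go 1.15+.
	fmt.Println("CN-only: ", selfSigned(host, nil).VerifyHostname(host))

	// Same name carried as a SAN: verification succeeds.
	fmt.Println("with SAN:", selfSigned(host, []string{host}).VerifyHostname(host))
}
```

So a webhook serving cert generated with only CN=admission-webhook.kubeflow.svc will be rejected by the apiserver until it is regenerated with that name in the SAN list (the GODEBUG=x509ignoreCN=0 escape hatch the error mentions was temporary and has since been removed from Go).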

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

According to knkski this may be a charm issue: "we just have to update the certificate generation code in the admission-webhook charm"; see https://github.com/juju-solutions/bundle-kubeflow/issues/308

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This is working now with updates to the kubeflow charms.

Changed in juju:
status: New → Invalid
tags: removed: cdo-release-blocker