2.9rc2: kubeflow deploy fails on microk8s

Bug #1902945 reported by Jason Hobbs on 2020-11-04
This bug affects 1 person
Affects: juju
Importance: High
Assigned to: Simon Richardson

Bug Description

I'm trying to deploy kubeflow on microk8s using juju 2.9rc2. It fails, with most applications getting stuck trying to fetch the oci-image resource:

http://paste.ubuntu.com/p/vqS3DsGwmy/

steps to reproduce:

bootstrap juju 2.9rc2 against microk8s
juju deploy cs:kubeflow

This works fine with 2.8.6, but fails with 2.9rc2.

I'm not sure what logs to grab or how to grab them - let me know what additional info you need.

Jason Hobbs (jason-hobbs) wrote :

Marked as a release blocker.

tags: added: cdo-release-blocker
Ian Booth (wallyworld) wrote :

This doesn't look like a juju issue - it seems there's an issue with the upstream image repo, eg rate limit?
Can you kubectl describe the affected pods and pull out the error messages?

We've had kubeflow deployed to 2.9 as part of the kubeflow CI with no issues.

Changed in juju:
status: New → Incomplete
Jason Hobbs (jason-hobbs) wrote :

Here's describe and logs for one of the failing pods:

http://paste.ubuntu.com/p/wRR7rZ45Yj/

Changed in juju:
status: Incomplete → New
Jason Hobbs (jason-hobbs) wrote :

This works reliably with 2.8 and never with 2.9, which leads me to believe it's an issue related to the juju version change.

Changed in juju:
status: New → Triaged
status: Triaged → In Progress
importance: Undecided → High
Ian Booth (wallyworld) wrote :

Juju 2.9 has become stricter about how it handles charm channels. Support for snap-like semantics is also being introduced, so that a track and a risk can be specified instead of a Juju channel.
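
To illustrate the track and risk idea, here is a minimal Python sketch of splitting a snap-style channel string. It is not Juju's parser: the names `Channel`, `RISKS` and `parse_channel` are invented for this example, branches are ignored, and the rule that a missing risk means stable is an assumption based on how snaps behave.

```python
from typing import NamedTuple, Optional

# Risk levels as shown in the "charm show" channel listing in this thread.
RISKS = ("stable", "candidate", "beta", "edge")


class Channel(NamedTuple):
    track: Optional[str]  # None means "no track given"
    risk: str


def parse_channel(text: str = "") -> Channel:
    """Split "track/risk", a bare track, or a bare risk (hypothetical helper).

    Assumption: a missing risk defaults to "stable".
    """
    parts = [p for p in text.split("/") if p]
    if not parts:
        return Channel(None, "stable")
    if len(parts) == 1:
        if parts[0] in RISKS:
            return Channel(None, parts[0])
        return Channel(parts[0], "stable")
    if len(parts) == 2 and parts[1] in RISKS:
        return Channel(parts[0], parts[1])
    raise ValueError(f"unrecognised channel: {text!r}")
```

With this sketch an empty string and a bare track both resolve to the stable risk, which is the default the rest of this comment relies on.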

The issue here is that the kubeflow bundle effectively has a default channel of "stable": it specifies no channel, so a default of stable is assumed. There are no charm-specific channels either, so Juju will look for stable versions. But most of the kubeflow charms are only on edge. The ones that do have stable versions (eg katib) do deploy properly.

So the kubeflow bundle needs to specify a default risk of edge, or it needs to add edge to the charms which need it.

Having said that, there does appear to be a juju bug where a charm specific channel does not override the bundle value. So that needs fixing. But the kubeflow bundle also needs updating.
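
The precedence described here (a charm-specific channel should win over the bundle value, which in turn wins over the implicit stable default) can be sketched in Python as follows. This is an illustration only, not Juju code: `effective_channels` and the plain dict of application names are hypothetical, and the real bundle format is not modelled.

```python
from typing import Dict, Optional


def effective_channels(
    apps: Dict[str, Optional[str]],
    bundle_default: Optional[str] = None,
) -> Dict[str, str]:
    """Resolve the channel used for each application (hypothetical helper).

    apps maps an application name to its charm-specific channel, or None
    when the bundle gives no channel for that application.
    """
    resolved = {}
    for name, app_channel in apps.items():
        # Intended precedence: charm-specific, then bundle-wide, then "stable".
        resolved[name] = app_channel or bundle_default or "stable"
    return resolved
```

For example, `effective_channels({"katib": None, "argo-controller": "edge"}, "stable")` keeps argo-controller on edge, which is the override that this comment says was not being honoured.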

It worked with 2.8 because 2.8 was broken in how it searched for charms: it accepted any charm risk if none was otherwise specified, i.e. it did not default to stable. Snaps do default to stable, and we are adopting the snap behaviour for 2.9.
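
The difference between the two behaviours can be modelled with a small Python sketch. It only mirrors the description in this comment and is not taken from the Juju source: the function names are invented, and the `published` mapping of risk to revision number is a made-up stand-in for what the store knows about a charm.

```python
from typing import Dict, Optional

RISKS = ("stable", "candidate", "beta", "edge")


def resolve_lenient(published: Dict[str, int], channel: Optional[str] = None) -> Optional[int]:
    """Model of the 2.8 behaviour described above (assumption).

    With no channel requested, any risk that has a revision is accepted.
    """
    if channel is not None:
        return published.get(channel)
    for risk in RISKS:
        if risk in published:
            return published[risk]
    return None


def resolve_strict(published: Dict[str, int], channel: Optional[str] = None) -> Optional[int]:
    """Model of the 2.9 behaviour described above (assumption).

    With no channel requested, only "stable" is considered.
    """
    return published.get(channel or "stable")
```

A charm published only to edge resolves under the lenient model and returns `None` under the strict one, which matches the deploy working on 2.8 and stalling on 2.9.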

Changed in juju:
milestone: none → 2.9-rc3
status: In Progress → Triaged
Changed in juju:
assignee: nobody → Simon Richardson (simonrichardson)
status: Triaged → In Progress
Jason Hobbs (jason-hobbs) wrote :

Is this going to break everyone using ~id numbers in their bundles when those charm~id's aren't in stable? I understand that there is a behavior change server side, but it also seems like juju could handle that to ensure we don't break existing bundles.

Kenneth Koski (knkski) wrote :

I'm not sure what you mean by most of the kubeflow charms are on the edge channel. For example:

$ charm show kubeflow-dashboard
Kubeflow Central Dashboard

Name kubeflow-dashboard
Owner kubeflow-charmers
Revision 0
Supported Series kubernetes
Tags ai, bigdata, kubeflow, machine-learning, tensorflow
Subordinate false
Promulgated true
Home page https://github.com/juju-solutions/bundle-kubeflow
Bugs url https://github.com/juju-solutions/bundle-kubeflow/issues
Read everyone
Write

CHANNEL CURRENT
stable true
candidate true
beta true
edge false

Kenneth Koski (knkski) wrote :

Ah, I see. The stable bundle points at charms that are on the edge channel. The charms haven't been getting promoted to stable due to a recent change to the CD process, so I'll fix that.

That being said, Juju should really be a lot more vocal about this situation. Seems like low-hanging fruit to automatically set the message to e.g. "Bundle specified stable track; argo-controller-10 is not marked stable".
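
As a rough sketch of what such a message could look like in code, the Python below builds one line per pinned charm revision that is not released to the requested channel. The helper name `unpromoted_messages` and the input mapping are hypothetical, the katib-0 entry is made up for contrast, and the wording simply reuses the example message above.

```python
from typing import Dict, List, Set


def unpromoted_messages(channel: str, released_in: Dict[str, Set[str]]) -> List[str]:
    """Build one message per charm revision missing from `channel`.

    released_in maps a "name-revision" string to the set of channels
    that revision is released to (illustrative data, not a store API).
    """
    return [
        f"Bundle specified {channel} track; {charm} is not marked {channel}"
        for charm, channels in sorted(released_in.items())
        if channel not in channels
    ]


if __name__ == "__main__":
    example = {
        "argo-controller-10": {"edge"},
        "katib-0": {"stable", "edge"},  # made-up entry for contrast
    }
    for line in unpromoted_messages("stable", example):
        print(line)
```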

Given the bundle doesn't specify which track/channel at all, the old
behavior of just grabbing the charms from wherever they are should continue
to work.

I'm not sure I agree that the "old behaviour" should continue to work. It's not going to work when we move to the new charmhub store backend (currently behind a feature flag), and we'll have to check whether it will work in the same way once the charmstore is served by the shim API.

If you look at the way snaps work, if you attempt to install a snap from a stable channel it won't look into edge and pick that one. Instead it will fail fast and say no snap is available (but does give hints about other channels).
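
A fail-fast lookup in that style might look like the Python sketch below. It is a model of the behaviour described, not snapd or Juju code: `NoRevisionInChannel`, `lookup` and the `published` mapping are all invented for the example.

```python
from typing import Dict


class NoRevisionInChannel(LookupError):
    """Raised when nothing is published to the requested channel (hypothetical)."""


def lookup(name: str, published: Dict[str, int], channel: str = "stable") -> int:
    """Return the revision in `channel`, or fail with a hint about other channels.

    published maps a channel name to a revision number (illustrative data).
    """
    if channel in published:
        return published[channel]
    others = ", ".join(sorted(published)) or "none"
    raise NoRevisionInChannel(
        f"{name}: no revision in {channel!r}; available channels: {others}"
    )
```

The point of the design is that the lookup never silently falls back to a riskier channel; the hint leaves the choice to accept edge with the operator.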

Changed in juju:
status: In Progress → Fix Committed
John A Meinel (jameinel) wrote :

I think this is a case of "accidentally supporting unsafe behavior" and we
have to deal with the fallout of breaking continuity. You can specify the
channel in existing bundles and 2.8 and 2.9 will support them (so a bundle
updated to be correct with 2.9 will be supported by 2.8). But leaving the
behavior of "if you don't specify then break the channel promise" is not
something that we can commit to supporting in the future.

The fact that the charmstore itself is going to stop supporting the old
behavior is also part of the motivation for why we need to stop supporting
it in 2.9 rather than waiting.

Jason Hobbs (jason-hobbs) wrote :

I'm confused about why you would consider getting the exact charm revision you
specified unsafe.

Pete Vander Giessen (petevg) wrote :

I think that this bug reflects some difficulty in the transition to handling channels more like snaps.

I agree that the correct behavior is to only deploy charms from edge if the operator has explicitly accepted the edge risk. But that's not how things used to work, and existing bundles will break.

Do we want to push the change in behavior off until Juju 3.0.0? That would allow 2.9 to essentially operate like the rest of the 2.x series. The new charm store is going to be behind a feature flag, anyway.

The only catch is that the charm store shim may break us, even if we make an effort to support the old behavior ...

John A Meinel (jameinel) wrote :

So if you are referencing an exact version of the charm, this could be ok.
That said I'm not sure that the new charm store supports the same revision
semantics of the existing store. (Because of how the new store tracks
architecture versions for charms, I'm not sure that revision numbers that
worked in the old store continue to work once the data is migrated.)
That said, we still need to understand what track the version is on for
purposes of things like "juju upgrade-charm app" knowing the channel.

Jason Hobbs (jason-hobbs) wrote :

Ok. This bug is about using the specific charm numbers no longer working if
the charms weren't in stable; was that fixed?

Pete Vander Giessen (petevg) wrote :

Re-opening this, as we need to address the break in bundles that specify exact revision.

Open question: will the new charm hub support these bundles?

Changed in juju:
status: Fix Committed → Triaged
assignee: Simon Richardson (simonrichardson) → nobody
Pete Vander Giessen (petevg) wrote :

This is fix committed for Juju 2.9-rc3.

There is a longer term issue w/ how revisions work w/ the new CharmHub, but that is behind a feature flag for 2.9. Bundles may break in Juju 3.0.0, which we will document and talk about ahead of time.

Changed in juju:
status: Triaged → Fix Committed
John A Meinel (jameinel) wrote :

We may need to consider this wrt the feature flag being removed for 2.9
final.

Changed in juju:
assignee: nobody → Simon Richardson (simonrichardson)