[2.9] pod/node affinity for sidecar charms not implemented

Bug #1993716 reported by Syed Mohammad Adnan Karim
Affects: Canonical Juju
Status: Fix Released
Importance: Wishlist
Assigned to: Ian Booth
Milestone: 2.9.38

Bug Description

Juju version: 2.9.35-ubuntu-amd64
Kubernetes version: v1.24.6

My k8s cluster has some nodes with GPUs and some without. I am trying to deploy kubeflow 1.6/stable only on the nodes without a GPU. The nodes without a GPU have a specific label, "mldatanode: true". I have tried to follow the examples in the following threads:

- https://discourse.charmhub.io/t/pod-priority-and-affinity-in-juju-charms/4091
  In my kubeflow bundle I have the following constraints for all the applications:
    constraints: tags=mldatanode=true,^mlgpunode=true
  I also tried to deploy a single application with the cli:
    juju deploy istio-pilot --channel 1.11 --constraints="tags=mldatanode=true,^mlgpunode=true"
  Neither of these worked; pods were still placed on GPU nodes.

- https://discourse.charmhub.io/t/mapping-juju-concepts-to-kubernetes/2627
  I also tried to use the hostname of the nodes in the bundle:
    to: [kubernetes.io/hostname=gpu-less-node-1]
  This also did not work as expected and placed pods on GPU nodes.

Furthermore, when creating a model, Juju does not support placement directives for the modeloperator pod. I think it should, as that pod also ends up on a GPU node.

I will try to work around this by cordoning/uncordoning the GPU nodes as I deploy.
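
For reference, that workaround is roughly the following (the mlgpunode=true label is the one on the GPU nodes referred to above, and kubectl cordon/uncordon accept a label selector; the exact deploy flags may differ for your setup):

    # make the GPU nodes unschedulable before deploying
    kubectl cordon -l mlgpunode=true
    juju deploy kubeflow --channel 1.6/stable --trust
    # re-enable scheduling on the GPU nodes afterwards
    kubectl uncordon -l mlgpunode=true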

Thomas Miller (tlmiller)
Changed in juju:
assignee: nobody → Thomas Miller (tlmiller)
Thomas Miller (tlmiller)
Changed in juju:
assignee: Thomas Miller (tlmiller) → Ian Booth (wallyworld)
Revision history for this message
Ian Booth (wallyworld) wrote :

Juju translates constraint tags prefixed with "pod.", "anti-pod." or "node." into pod/node affinity selectors.
Your constraints are: --constraints="tags=mldatanode=true,^mlgpunode=true"
The tag keys don't have the required prefixes, so they are not translated into affinity selection expressions.

Here's a contrived example:

juju deploy somecharm --constraints="tags=node.foo=a|b|c,^bar=d|e|f,^foo=g|h,pod.foo=1|2|3,^pod.bar=4|5|6,anti-pod.afoo=x|y|z,^anti-pod.abar=7|8|9"

would result in

kubectl get -o json statefulset.apps/somecharm | jq .spec.template.spec.affinity
{
  "nodeAffinity": {
    "requiredDuringSchedulingIgnoredDuringExecution": {
      "nodeSelectorTerms": [
        {
          "matchExpressions": [
            {
              "key": "bar",
              "operator": "NotIn",
              "values": [
                "d",
                "e",
                "f"
              ]
            },
            {
              "key": "foo",
              "operator": "NotIn",
              "values": [
                "g",
                "h"
              ]
            },
            {
              "key": "foo",
              "operator": "In",
              "values": [
                "a",
                "b",
                "c"
              ]
            }
          ]
        }
      ]
    }
  },
  "podAffinity": {
    "requiredDuringSchedulingIgnoredDuringExecution": [
      {
        "labelSelector": {
          "matchExpressions": [
            {
              "key": "bar",
              "operator": "NotIn",
              "values": [
                "4",
                "5",
                "6"
              ]
            },
            {
              "key": "foo",
              "operator": "In",
              "values": [
                "1",
                "2",
                "3"
              ]
            }
          ]
        },
        "topologyKey": ""
      }
    ]
  },
  "podAntiAffinity": {
    "requiredDuringSchedulingIgnoredDuringExecution": [
      {
        "labelSelector": {
          "matchExpressions": [
            {
              "key": "abar",
              "operator": "NotIn",
              "values": [
                "7",
                "8",
                "9"
              ]
            },
            {
              "key": "afoo",
              "operator": "In",
              "values": [
                "x",
                "y",
                "z"
              ]
            }
          ]
        },
        "topologyKey": ""
      }
    ]
  }
}

Changed in juju:
assignee: Ian Booth (wallyworld) → nobody
status: New → Incomplete
Revision history for this message
Ian Booth (wallyworld) wrote :

Can you try with the required constraint syntax and see if it works?

Revision history for this message
Syed Mohammad Adnan Karim (karimsye) wrote :

Unfortunately it did not work for me yet.
I updated my kubeflow bundle to contain constraints for all applications in the following forms:

    constraints: tags=node.mldatanode=true,^mlgpunode=true
    constraints: tags="node.mldatanode=true,^mlgpunode=true"
    constraints: tags="node.mldatanode=true,^node.mlgpunode=true"

and redeployed multiple times, but the pods still land on the GPU nodes labelled mlgpunode=true.

Revision history for this message
Ian Booth (wallyworld) wrote :

To help understand what is happening, we need the k8s statefulset config info, as shown in comment #1, plus the full node config info with the labels/tags etc.
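
Something along these lines should capture both (namespace and application name are placeholders to fill in; --show-labels is standard kubectl):

    kubectl get statefulset -n <model-namespace> <application> -o json | jq .spec.template.spec.affinity
    kubectl get nodes --show-labels
    kubectl get nodes -o yaml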

Revision history for this message
Syed Mohammad Adnan Karim (karimsye) wrote :

Here is the example of the kubeflow-dashboard-operator that ends up on a GPU node (las2-mlgpu43).
The application is specified in the bundle as follows:
  kubeflow-dashboard:
    charm: kubeflow-dashboard
    channel: 1.6/stable
    scale: 1
    _github_repo_name: kubeflow-dashboard-operator
    constraints: tags=node.mldatanode=true,^node.mlgpunode=true

$ kubectl get statefulsets.apps -n kubeflow kubeflow-dashboard-operator -o json | jq .spec.template.spec.affinity
null

Here is the full YAML for the kubeflow-dashboard-operator statefulset: https://pastebin.canonical.com/p/RGb2YyzRr3/
Here is the full JSON for the kubeflow-dashboard-operator statefulset: https://pastebin.canonical.com/p/Y6pYgHxQzD/

The cluster has the following nodes:
NAME STATUS ROLES AGE VERSION
las2-mlgpu41 Ready <none> 29d v1.24.3
las2-mlgpu43 Ready <none> 28d v1.24.3
lv01-mlkfwapp-l01 Ready <none> 34d v1.24.6
lv01-mlkfwapp-l02 Ready <none> 34d v1.24.6
lv01-mlkfwapp-l03 Ready <none> 34d v1.24.6
lv01-mlkfwapp-l04 Ready <none> 34d v1.24.6
lv01-mlkfwapp-l05 Ready <none> 34d v1.24.6
lv1-mlksapp-l01 Ready control-plane,master 35d v1.24.6
lv1-mlksapp-l02 Ready control-plane,master 35d v1.24.6
lv1-mlksapp-l03 Ready control-plane,master 35d v1.24.6
lv1-mlksapp-l04 Ready control-plane,master 35d v1.24.6
lv1-mlksapp-l05 Ready control-plane,master 35d v1.24.6

Here is the full YAML for the nodes in the cluster: https://pastebin.canonical.com/p/gjqR2hnjYh/
Here is the full JSON for the nodes in the cluster: https://pastebin.canonical.com/p/R8PKD3DBPq/

Revision history for this message
Ian Booth (wallyworld) wrote :

This

$ kubectl get statefulsets.apps -n kubeflow kubeflow-dashboard-operator -o json | jq .spec.template.spec.affinity
null

seems to show that the constraints aren't being applied. Maybe it's a bundle processing bug - I wonder what happens if the charm is deployed on its own, outside of any bundle.
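
That would be something along the lines of the following (channel and constraints taken from the bundle entry above; the namespace matches the Juju model name):

    juju deploy kubeflow-dashboard --channel 1.6/stable --constraints="tags=node.mldatanode=true,^node.mlgpunode=true"
    kubectl get statefulsets -n kubeflow -o json | jq '.items[].spec.template.spec.affinity'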

Revision history for this message
Ian Booth (wallyworld) wrote (last edit ):

I tested with a sidecar charm and it worked as expected. This is on Juju 3.0; I suspect it will be the same on 2.9.

The issue is that kubeflow-dashboard is an older "podspec" charm, which is deprecated. It seems that somewhere along the way, some of the work to implement sidecar charms broke affinity for podspec charms.

Revision history for this message
Ian Booth (wallyworld) wrote :

Ah, I just noticed you are looking at the operator statefulset. This is not where the node/pod affinity is applied. Podspec charms deploy 2 statefulsets:

1. one for the operator itself (the charm)
2. one for the workload

The workload statefulset is created when the charm sets the podspec; this is where the affinity selectors are applied. There are no affinity rules applied to the operator pod.
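
In practice that means checking the workload's resource rather than the operator's, roughly as follows (resource names are illustrative; depending on the charm the workload may be a Deployment rather than a StatefulSet):

    # operator statefulset (runs the charm) - affinity is not applied here
    kubectl get statefulset -n kubeflow kubeflow-dashboard-operator -o json | jq .spec.template.spec.affinity
    # workload created when the charm sets the podspec - affinity should appear here
    kubectl get deployments,statefulsets -n kubeflow -o json | jq '.items[] | select(.metadata.name=="kubeflow-dashboard") | .spec.template.spec.affinity'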

With the transition to sidecar charms, the charm container and workload containers all run in the same pod, so the affinity rules are applied to that single pod; that's why the sidecar example I tried worked.

We're not doing any more work on podspec charms, so we will not be adding affinity support to the podspec charm operator pod.

Changed in juju:
status: Incomplete → Won't Fix
Revision history for this message
Camille Rodriguez (camille.rodriguez) wrote :

@Ian Not supporting that feature on podspec charms means that charmed kubeflow will not support placement directives for another 1-1.5 years. This is critical functionality for any charmed app deployment on Kubernetes. What would be the level of effort required to make this backward compatible with podspec charms?

Revision history for this message
Syed Mohammad Adnan Karim (karimsye) wrote :

I just tried this in both a bundle and via the CLI (constraints="tags=mldatanode=true,^mlgpunode=true") with a sidecar charm (training-operator), and it did not respect the node affinity (it did not show up in the statefulset):
https://pastebin.canonical.com/p/mp7v5BVYT4/

Revision history for this message
Ian Booth (wallyworld) wrote :

Your tag names are missing the required "node." and/or "pod." prefixes :-)

Revision history for this message
Syed Mohammad Adnan Karim (karimsye) wrote :

Sorry, that was a typo, but just to be sure I tried again with:
juju deploy training-operator --constraints="tags=node.mldatanode=true,^node.mlgpunode=true"
and it still ended up on a node labelled with mlgpunode=true. Here is the deployed statefulset YAML again:
https://pastebin.canonical.com/p/X7h3wq9hP9/

Revision history for this message
Ian Booth (wallyworld) wrote :

I guess you're using Juju 2.9.

I checked the code, and it seems affinity support for sidecar charms was only added in Juju 3.x. I had thought both podspec and sidecar charms supported it even in 2.9, but it seems 2.9 doesn't support affinity for sidecar charms. We can look at adding this support.

@Camille to clarify, 2.9 does support affinity for podspec charms, but only for the workload pod, not the operator pod that runs the charm. The original thinking, way back when this was done, was that it's the workload that needs access to GPUs etc. Because podspec charms spin up 2 statefulsets, one for the operator and one for the workload, there's no good way to use the constraint syntax to provide a different set of affinity rules for the operator vs workload pods. But you can use the same approach already used in sidecar charms for changing the cluster in ways Juju doesn't support: use the k8s API client from the charm to update the operator's statefulset pod template.
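
Purely as an illustration of the shape of that change (Ian's suggestion is to apply the equivalent from the charm via a k8s API client; this is the same patch made by hand with kubectl, using the node labels from this cluster, and Juju may well overwrite it):

    kubectl -n kubeflow patch statefulset kubeflow-dashboard-operator --type merge -p '
    {"spec": {"template": {"spec": {"affinity": {"nodeAffinity": {
      "requiredDuringSchedulingIgnoredDuringExecution": {"nodeSelectorTerms": [
        {"matchExpressions": [
          {"key": "mldatanode", "operator": "In", "values": ["true"]},
          {"key": "mlgpunode", "operator": "NotIn", "values": ["true"]}
        ]}
      ]}
    }}}}}}'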

summary: - placement directives for k8s cloud not working
+ [2.9] pod/node affinity for sidecar charms not implemented
Changed in juju:
milestone: none → 2.9.38
status: Won't Fix → Triaged
importance: Undecided → Wishlist
Revision history for this message
Ian Booth (wallyworld) wrote :

I backported support for affinity for sidecar charms from Juju 3:
https://github.com/juju/juju/pull/14897

Changed in juju:
assignee: nobody → Ian Booth (wallyworld)
status: Triaged → In Progress
Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Revision history for this message
Camille Rodriguez (camille.rodriguez) wrote :

Hi Ian - can you provide a timeline for the backport to be packaged and available to use?

Revision history for this message
Ian Booth (wallyworld) wrote :

We hope to have a 2.9.38 candidate out next week (3.0.2 is currently being tested).
Until then you can try with the edge snap.
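
For reference, that would be something like the following (assuming the snap-installed client and a 2.9 track on the snap; channel names may differ):

    sudo snap refresh juju --channel=2.9/edge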

Changed in juju:
status: Fix Committed → Fix Released