tridentctl commands fail with "could not find a Trident pod in the trident namespace"

Bug #2023116 reported by Erickson Silva de Oliveira
Affects: StarlingX
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Brief Description
-----------------
[sysadmin@controller-0 ~(keystone_admin)]$ tridentctl -n trident version
Error: could not find a Trident pod in the trident namespace. You may need to use the -n option to specify the correct namespace
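
The controller pod carries the app=controller.csi.trident.netapp.io label (see the describe output below), so whether any such pod is actually running can be checked directly, for example:

$ kubectl get pods -n trident -l app=controller.csi.trident.netapp.io
# If no pod is listed as Running here, tridentctl has nothing to talk to and fails as above.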

Severity
--------
Critical: System/Feature is not usable due to the defect

Expected Behavior
------------------
The NetApp Trident driver should be running 21.04.1, and I should be able to view the installed version with `tridentctl version --namespace trident`.

Actual Behavior
----------------
The command above fails. Looking at the pods in the trident namespace, the trident-csi controller pod is stuck in Pending and two of the node pods are only partially ready (1/2); see the logs below.

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
2 controllers, 1 worker
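
The node layout and roles can be confirmed with, for example:

$ kubectl get nodes -o wide
$ system host-list   # StarlingX host inventory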

Timestamp/Logs
--------------
[sysadmin@controller-0 ~(keystone_admin)]$ tridentctl -n trident version
Error: could not find a Trident pod in the trident namespace. You may need to use the -n option to specify the correct namespace
[sysadmin@controller-0 ~(keystone_admin)]$ kubectl get pods -n trident
NAME                           READY   STATUS    RESTARTS   AGE
trident-csi-5f44c44567-7x9qr   0/5     Pending   0          120m
trident-csi-dbwpz              2/2     Running   2          30h
trident-csi-qq4ft              1/2     Running   2          30h
trident-csi-tvxwh              1/2     Running   0          30h
[sysadmin@controller-0 ~(keystone_admin)]$ kubectl describe pod -n trident trident-csi-5f44c44567-7x9qr
Name: trident-csi-5f44c44567-7x9qr
Namespace: trident
Priority: 0
Node: <none>
Labels: app=controller.csi.trident.netapp.io
                pod-template-hash=5f44c44567
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/trident-csi-5f44c44567
Containers:
  trident-main:
    Image: registry.local:9001/docker.io/netapp/trident:21.04.1
    Ports: 8678/TCP, 8001/TCP
    Host Ports: 0/TCP, 0/TCP
    Command:
      /trident_orchestrator
    Args:
      --crd_persistence
      --k8s_pod
      --https_rest
      --https_port=8678
      --csi_node_name=$(KUBE_NODE_NAME)
      --csi_endpoint=$(CSI_ENDPOINT)
      --csi_role=controller
      --log_format=text
      --address=127.0.0.1
      --port=8677
      --metrics
      --metrics_port=8001
    Liveness: exec [tridentctl -s 127.0.0.1:8677 version] delay=120s timeout=90s period=120s #success=1 #failure=2
    Environment:
      KUBE_NODE_NAME: (v1:spec.nodeName)
      CSI_ENDPOINT: unix://plugin/csi.sock
      TRIDENT_SERVER: 127.0.0.1:8677
    Mounts:
      /certs from certs (ro)
      /plugin from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vrk65 (ro)
  csi-provisioner:
    Image: registry.local:9001/quay.io/k8scsi/csi-provisioner:v2.1.1
    Port: <none>
    Host Port: <none>
    Args:
      --v=2
      --timeout=600s
      --csi-address=$(ADDRESS)
      --retry-interval-start=8s
      --retry-interval-max=30s
    Environment:
      ADDRESS: /var/lib/csi/sockets/pluginproxy/csi.sock
    Mounts:
      /var/lib/csi/sockets/pluginproxy/ from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vrk65 (ro)
  csi-attacher:
    Image: registry.local:9001/quay.io/k8scsi/csi-attacher:v3.1.0
    Port: <none>
    Host Port: <none>
    Args:
      --v=2
      --timeout=60s
      --retry-interval-start=10s
      --csi-address=$(ADDRESS)
    Environment:
      ADDRESS: /var/lib/csi/sockets/pluginproxy/csi.sock
    Mounts:
      /var/lib/csi/sockets/pluginproxy/ from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vrk65 (ro)
  csi-resizer:
    Image: registry.local:9001/quay.io/k8scsi/csi-resizer:v1.1.0
    Port: <none>
    Host Port: <none>
    Args:
      --v=2
      --timeout=300s
      --csi-address=$(ADDRESS)
    Environment:
      ADDRESS: /var/lib/csi/sockets/pluginproxy/csi.sock
    Mounts:
      /var/lib/csi/sockets/pluginproxy/ from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vrk65 (ro)
  csi-snapshotter:
    Image: registry.local:9001/quay.io/k8scsi/csi-snapshotter:v3.0.3
    Port: <none>
    Host Port: <none>
    Args:
      --v=2
      --timeout=300s
      --csi-address=$(ADDRESS)
    Environment:
      ADDRESS: /var/lib/csi/sockets/pluginproxy/csi.sock
    Mounts:
      /var/lib/csi/sockets/pluginproxy/ from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vrk65 (ro)
Conditions:
  Type Status
  PodScheduled False
Volumes:
  socket-dir:
    Type: EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit: <unset>
  certs:
    Type: Secret (a volume populated by a Secret)
    SecretName: trident-csi
    Optional: false
  asup-dir:
    Type: EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit: 1Gi
  kube-api-access-vrk65:
    Type: Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds: 3607
    ConfigMapName: kube-root-ca.crt
    ConfigMapOptional: <nil>
    DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: kubernetes.io/arch=amd64
                             kubernetes.io/os=linux
                             node-role.kubernetes.io/master=
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 30s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 30s
Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Warning FailedScheduling 120m default-scheduler 0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 1 node(s) had taint {services: disabled}, that the pod didn't tolerate.
  Warning FailedScheduling 120m default-scheduler 0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 1 node(s) had taint {services: disabled}, that the pod didn't tolerate.
  Warning FailedScheduling 100m default-scheduler 0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning FailedScheduling 87m default-scheduler 0/3 nodes are available: 1 node(s) had taint {services: disabled}, that the pod didn't tolerate, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning FailedScheduling 64m default-scheduler 0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning FailedScheduling 63m default-scheduler 0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning FailedScheduling 58m default-scheduler 0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
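
The scheduling failures above indicate the controller pod is restricted by its node selector to nodes labelled node-role.kubernetes.io/master=, but it has no toleration for the node-role.kubernetes.io/master taint on the controllers (its only tolerations are not-ready/unreachable), and the worker carries a services:disabled taint. Hence the toleration patch in the workaround below. The taints can be inspected with, for example:

$ kubectl describe nodes | grep -E '^Name:|^Taints:'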

Workaround
----------
$ cat <<EOF > ~/trident_patch.yml
spec:
  template:
    spec:
      tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: "Exists"
        effect: "NoSchedule"
      - key: "node-role.kubernetes.io/control-plane"
        operator: "Exists"
        effect: "NoSchedule"

EOF

$ kubectl patch deployment trident-csi -n trident --patch "$(cat ~/trident_patch.yml)"
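
Once the deployment is patched, the controller pod should get rescheduled onto a controller node; this can be confirmed with:

$ kubectl rollout status deployment/trident-csi -n trident
$ kubectl get pods -n trident
$ tridentctl -n trident version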
