When restarting a stateful set deployment, Pods get stuck waiting for a Cinder-backed PVC

Bug #1871455 reported by Ryan Farrell
This bug affects 1 person
Affects: CDK Addons
Status: Incomplete
Importance: High
Assigned to: Unassigned

Bug Description

Certain deployments are not able to restart their PVC-attached pods without incurring long wait times (between 15 minutes and 8 hours) due to waiting on the Cinder-backed PVC.

This is reproducible using the nextcloud Helm chart:
1- First, install Helm.
2- Then add the official stable Helm repository:
$ helm repo add stable https://kubernetes-charts.storage.googleapis.com
3- Install nextcloud:
$ helm install nextcloud stable/nextcloud --set persistence.enabled=true --set persistence.storageClass=cdk-cinder
4- Wait for the pod to be ready, then run the following (the deployment name should be the same as the release name above):
$ kubectl rollout restart deployment nextcloud
5- The old pod will be killed, then the new one gets stuck at ContainerCreating; describing it (see the commands after this list) should show the error "Unable to attach or mount volumes: unmounted volumes=[html], unattached volumes=[default-token-p7hn7 html]: timed out waiting for the condition"
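
To surface that error, a pod describe along these lines should show the FailedMount event in the Events section; the pod name below is illustrative and will differ per deployment:

$ kubectl get pods
$ kubectl describe pod <nextcloud-pod-name>
# Events should include something like:
#   Warning  FailedMount  ...  Unable to attach or mount volumes: ... timed out waiting for the condition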

In this example, we observe that the pod remains stuck for ~15 minutes; however, our customer has other stateful sets which experience this issue and have much longer wait times.

# From '$ kubectl cluster-info dump', searching for 'nextcloud', we see this:
I0407 15:04:28.304113 1 event.go:221] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"canonical", Name:"nextcloud-nextcloud", UID:"92e8a238-0898-4828-bbb3-31662dfeeb7f", APIVersion:"v1", ResourceVersion:"76811277", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "canonical/nextcloud-nextcloud"
I0407 15:04:37.295060 1 controller.go:671] successfully created PV {GCEPersistentDisk:nil AWSElasticBlockStore:nil HostPath:nil Glusterfs:nil NFS:nil RBD:nil ISCSI:nil Cinder:nil CephFS:nil FC:nil Flocker:nil FlexVolume:nil AzureFile:nil VsphereVolume:nil Quobyte:nil AzureDisk:nil PhotonPersistentDisk:nil PortworxVolume:nil ScaleIO:nil Local:nil StorageOS:nil CSI:&CSIPersistentVolumeSource{Driver:cinder.csi.openstack.org,VolumeHandle:96684e99-0b1e-4373-a319-0dadb4fb8030,ReadOnly:false,FSType:ext4,VolumeAttributes:map[string]string{storage.kubernetes.io/csiProvisionerIdentity: 1585018739102-8081-csi-cinderplugin,},ControllerPublishSecretRef:nil,NodeStageSecretRef:nil,NodePublishSecretRef:nil,}}
I0407 15:04:37.295547 1 controller.go:1026] provision "canonical/nextcloud-nextcloud" class "cdk-cinder": volume "pvc-92e8a238-0898-4828-bbb3-31662dfeeb7f" provisioned
I0407 15:04:37.295927 1 controller.go:1040] provision "canonical/nextcloud-nextcloud" class "cdk-cinder": trying to save persistentvolume "pvc-92e8a238-0898-4828-bbb3-31662dfeeb7f"
I0407 15:04:37.304034 1 controller.go:1047] provision "canonical/nextcloud-nextcloud" class "cdk-cinder": persistentvolume "pvc-92e8a238-0898-4828-bbb3-31662dfeeb7f" saved
I0407 15:04:37.304308 1 controller.go:1088] provision "canonical/nextcloud-nextcloud" class "cdk-cinder": succeeded
I0407 15:04:37.304670 1 event.go:221] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"canonical", Name:"nextcloud-nextcloud", UID:"92e8a238-0898-4828-bbb3-31662dfeeb7f", APIVersion:"v1", ResourceVersion:"76811277", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-92e8a238-0898-4828-bbb3-31662dfeeb7f
I0407 15:07:41.677034 1 controller.go:1097] delete "pvc-031bbaec-69b5-4252-99d3-3093e073fdd8": started
E0407 15:07:41.908855 1 controller.go:1120] delete "pvc-031bbaec-69b5-4252-99d3-3093e073fdd8": volume deletion failed: rpc error: code = Unknown desc = Cannot delete the volume "b765f8b4-1fa8-4f36-b3e3-5d99ef03607f", it's still attached to a node
W0407 15:07:41.909044 1 controller.go:726] Retrying syncing volume "pvc-031bbaec-69b5-4252-99d3-3093e073fdd8" because failures 0 < threshold 15
E0407 15:07:41.909167 1 controller.go:741] error syncing volume "pvc-031bbaec-69b5-4252-99d3-3093e073fdd8": rpc error: code = Unknown desc = Cannot delete the volume "b765f8b4-1fa8-4f36-b3e3-5d99ef03607f", it's still attached to a node
I0407 15:07:41.909515 1 event.go:221] Event(v1.ObjectReference{Kind:"PersistentVolume", Namespace:"", Name:"pvc-031bbaec-69b5-4252-99d3-3093e073fdd8", UID:"f5f5fe24-4ce4-44aa-8cd6-40f6b04a0e22", APIVersion:"v1", ResourceVersion:"76812195", FieldPath:""}): type: 'Warning' reason: 'VolumeFailedDelete' rpc error: code = Unknown desc = Cannot delete the volume "b765f8b4-1fa8-4f36-b3e3-5d99ef03607f", it's still attached to a node
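
The "still attached to a node" failure can be cross-checked from the OpenStack side. A sketch with the openstack CLI, assuming credentials for the project are available (the volume ID is taken from the log above):

$ openstack volume show b765f8b4-1fa8-4f36-b3e3-5d99ef03607f -c status -c attachments

If Cinder still reports the volume as in-use with an attachment to the old worker node, the CSI controller's calls will keep failing until that attachment is cleaned up.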

Although seemingly similar in effect, this does NOT appear to be related to LP1853566, because we confirmed that the volume is mounted on the Kubernetes worker and that it has the expected symlink in /dev/disk/by-id.

$ lsblk |grep vdk
vdk 252:160 0 8G 0 disk /var/lib/kubelet/pods/99b1bac3-07c5-4919-a3b8-8e4810423562/volumes/kubernetes.io~csi/pvc-92e8a238-0898-4828-bbb3-31662dfeeb7f/mount
$ ls -al /dev/disk/by-id/ | grep vdk
lrwxrwxrwx 1 root root 9 Apr 7 16:42 virtio-96684e99-0b1e-4373-a -> ../../vdk
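
It can also help to compare what Kubernetes itself believes about the attachment, since the CSI controller reconciles against VolumeAttachment objects; a quick check (the output row here is illustrative):

$ kubectl get volumeattachments
# NAME      ATTACHER                  PV                                          NODE       ATTACHED
# csi-...   cinder.csi.openstack.org  pvc-92e8a238-0898-4828-bbb3-31662dfeeb7f    <worker>   true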

Changed in cdk-addons:
importance: Undecided → High
status: New → Triaged
Changed in cdk-addons:
assignee: nobody → Cory Johns (johnsca)
George Kraft (cynerva)
Changed in cdk-addons:
importance: High → Medium
Revision history for this message
George Kraft (cynerva) wrote :

~field-high is subscribed to this

Changed in cdk-addons:
importance: Medium → High
Revision history for this message
Cory Johns (johnsca) wrote :

I was running into some environmental issues yesterday while trying to reproduce this, but I'm going to continue to work on this today.

Changed in cdk-addons:
status: Triaged → In Progress
Revision history for this message
Cory Johns (johnsca) wrote :

I have been unable to reproduce this with either the latest (1.19) release of CK or the 1.17+ck2 bundle, both on OpenStack Queens (serverstack). The new pod always goes to Running within about 20 seconds.

It's worth noting that the cloud-provider-openstack component was updated with the 1.19 release of CK, and that the Helm chart for nextcloud in the stable repo is now deprecated in favor of https://github.com/nextcloud/helm/tree/master/charts/nextcloud, but I tested with the one from stable just to be sure.

Can you please confirm whether this is still an issue and, if so, the versions of the CK charms, the K8s release, the openstack-cloud-controller-manager image, and the underlying OpenStack and/or Cinder release involved? (A sketch of commands to gather these follows.)
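
A rough sketch of commands that could collect those details, assuming the default CDK application names; the grep pattern is a guess at how the controller-manager image is named:

$ juju status kubernetes-master kubernetes-worker
$ kubectl version --short
$ kubectl get pods -n kube-system -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}' | grep -i openstack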

Changed in cdk-addons:
status: In Progress → Incomplete
Changed in cdk-addons:
assignee: Cory Johns (johnsca) → nobody