timeout waiting for volumes in k8s charm

Bug #1864401 reported by Evan Hanson
Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned
Milestone: 2.8-next

Bug Description

We're running into an intermittent issue deploying charms to our k8s platform where the agent pod hits a timeout while waiting for a volume, causing the application to become stuck in the "allocating" state. The root cause isn't really Juju's concern, since it's a failure in the underlying platform, but I'm filing an issue here as well because the problem is never surfaced in Juju, so a user wouldn't know about it unless they went digging after becoming suspicious about how long the deploy was taking.
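
For completeness, the failure is only visible from the Kubernetes side if you know to look for it. Roughly speaking (the namespace and claim name below are placeholders, not from our cluster), something like this shows the stalled volume:

    # events for the model's namespace, oldest first
    kubectl -n <model-namespace> get events --sort-by={.metadata.creationTimestamp}

    # claims backing the operator/unit pods; a claim stuck in Pending
    # usually points at the storage class or provisioner
    kubectl -n <model-namespace> get pvc
    kubectl -n <model-namespace> describe pvc <claim-name>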

Our first reaction was to look for a way to set a timeout for charm deploy operations: we know the characteristics of a successful deploy versus one that will never finish, and in this case could confidently set a two-minute timeout. Is there a way to achieve this? It would give users a flexible way to find out when things go haywire under the hood.
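
As a stopgap we've been sketching a simple watchdog along these lines; it's only a sketch (the jq path matches the JSON layout from our Juju version and may differ elsewhere, and "ubuntu" is the application from the pastes below):

    #!/bin/sh
    # Give the application two minutes to reach "active"; otherwise dump
    # the storage view and fail the job.
    app=ubuntu
    deadline=$(( $(date +%s) + 120 ))
    while :; do
        state=$(juju status --format=json | jq -r ".applications[\"$app\"][\"application-status\"].current")
        [ "$state" = "active" ] && break
        if [ "$(date +%s)" -ge "$deadline" ]; then
            echo "timed out waiting for $app (last state: $state)" >&2
            juju status --storage "$app"
            exit 1
        fi
        sleep 5
    done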

- juju status: https://paste.ubuntu.com/p/FyHqmGzxGn/
- juju debug-log: https://paste.ubuntu.com/p/SVyFvtmBDV/
- k8s events: https://paste.ubuntu.com/p/qK6GT3ByhJ/
- k8s status: https://paste.ubuntu.com/p/8v9Nt4JR4G/

This bug is similar to [1] in that it's caused by the platform taking a long time to do something before eventually timing out, leaving Juju stuck in the "allocating" state, but it's for a different cloud type and the suggested solution on that ticket might not be applicable here.

[1]: https://bugs.launchpad.net/juju/+bug/1828894

Evan Hanson (evhan)
description: updated
Revision history for this message
Ian Booth (wallyworld) wrote :

juju status --storage

would normally be expected to show that storage allocation has an issue. By default we don't show relations and storage in status (--relations and --storage are needed).

status --format yaml also shows everything

Can you confirm that status --storage does or doesn't surface the error?

Revision history for this message
Evan Hanson (evhan) wrote :

Of course now that I need to do so I can't reproduce this hang...

We actually made it this far without noticing the --storage flag for status, so that may do it; I didn't check. I will confirm as soon as I can trigger the timeout again.

Evan Hanson (evhan)
Changed in juju:
assignee: nobody → Evan Hanson (evhan)
Revision history for this message
Evan Hanson (evhan) wrote :

OK, reproduced this. Adding the --storage and --relations flags doesn't seem to change the output. But note the timeout is occurring while setting up the operator pod, not the workload one, in case that changes things. It doesn't get to the point of creating the workload pod when this hang is encountered.

https://paste.ubuntu.com/p/FSRJGDkzJw/ juju status --storage --relations
https://paste.ubuntu.com/p/yhXXQZxWdc/ juju status --format yaml
https://paste.ubuntu.com/p/GNpcDhZvsP/ kubectl -n example get pod/ubuntu-operator-0 --output yaml
https://paste.ubuntu.com/p/m3X67wVrkD/ kubectl -n example get events --sort-by={.metadata.creationTimestamp}

Changed in juju:
assignee: Evan Hanson (evhan) → nobody
Revision history for this message
Ian Booth (wallyworld) wrote :

Thanks for the extra info - we'll see if we can better surface any k8s error in status so that at least the user is aware that something needs looking at.

Changed in juju:
milestone: none → 2.8-beta1
status: New → Triaged
importance: Undecided → High
Ian Booth (wallyworld)
Changed in juju:
milestone: 2.8-beta1 → 2.8.1
Revision history for this message
Tim Penhey (thumper) wrote :

I think this requires slightly more thought and design around juju status with errors, particularly from storage.

Changed in juju:
milestone: 2.8.1 → 2.8-next
Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: High → Low
tags: added: expirebugs-bot