timeout waiting for volumes in k8s charm
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Canonical Juju | Triaged | Low | Unassigned | |
Bug Description
We're running into an intermittent issue when deploying charms to our k8s platform: the agent pod hits a timeout while waiting for a volume, leaving the application stuck in the "allocating" state. The root cause isn't really Juju's concern, since it's a failure in the underlying platform, but I'm filing an issue here as well because the problem is never surfaced in Juju, so a user wouldn't know about it unless they went digging after becoming suspicious of how long the deploy was taking.
Our first reaction to this issue was to look for some way to set a timeout for charm deploy operations, since we know the characteristics of a successful deploy vs. one that will never finish and could confidently set a two-minute timeout in this case. Is there a way to achieve this? That might offer a flexible way for users to make sure they know about it when things go haywire under the hood.
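If no built-in deploy timeout is available, one client-side workaround is to poll `juju status` from the deploy script and bail out after a deadline. Below is a minimal sketch; the function name, arguments, and the commented-out `juju status` invocation are hypothetical, not an existing Juju feature:

```shell
#!/usr/bin/env bash
# wait_for_output CMD PATTERN [TIMEOUT_SECS] [INTERVAL_SECS]
# Repeatedly runs CMD until its output matches PATTERN, failing after
# TIMEOUT_SECS. Returns 0 on a match, 1 on timeout.
wait_for_output() {
  local cmd=$1 pattern=$2 timeout=${3:-120} interval=${4:-5}
  local elapsed=0
  while (( elapsed < timeout )); do
    if $cmd | grep -q "$pattern"; then
      return 0
    fi
    sleep "$interval"
    elapsed=$(( elapsed + interval ))
  done
  echo "timed out after ${timeout}s waiting for '$pattern'" >&2
  return 1
}

# Hypothetical usage: fail the deploy script if the unit isn't active
# after the two-minute window described above.
# wait_for_output "juju status --format=oneline myapp" "active" 120 5
```

This at least turns a silently stuck deploy into a hard failure the operator sees, though it doesn't help `juju status` itself report the underlying volume error.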
- juju status: https:/
- juju debug-log: https:/
- k8s events: https:/
- k8s status: https:/
This bug is similar to [1] in that it's caused by the platform taking a long time to do something before eventually timing out, leaving Juju stuck in the "allocating" state, but it's for a different cloud type and the suggested solution on that ticket may not be applicable here.
Changed in juju:
description: updated
Changed in juju:
assignee: nobody → Evan Hanson (evhan)
Changed in juju:
milestone: 2.8-beta1 → 2.8.1
`juju status --storage` would normally be expected to show that storage allocation has an issue. By default we don't show relations and storage in status (`--relations` and `--storage` are needed). `status --format yaml` also shows everything.
Can you confirm whether `status --storage` surfaces the error?