intermittent snapstore outage should not trigger an error on update-status

Bug #1887973 reported by Alexander Balderson
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
Etcd Charm | Fix Released | High | Unassigned |
Kubernetes Control Plane Charm | Fix Released | High | Unassigned |
Kubernetes Worker Charm | Fix Released | High | Unassigned |
Snap Layer | Fix Released | Undecided | Unassigned |

Bug Description

A running k8s deployment's etcd units went into error because they could not run snap refresh --list. It doesn't make sense for the charm to go into error just because it couldn't ask the snap store for its most recent revision. I would expect the charm to go into blocked or waiting instead, and to retry the hook until the snap store comes back up.
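
The expected behaviour can be sketched roughly as below. This is only an illustration, not the layer-snap code: `check_refresh_available_safely`, `list_refreshes` and `set_status` are hypothetical names, and `set_status` stands in for something like charmhelpers' `hookenv.status_set`.

```python
import subprocess


def check_refresh_available_safely(list_refreshes, set_status):
    """Query available refreshes without letting a store outage fail the hook.

    list_refreshes: callable returning a list of snap names (hypothetical).
    set_status: callable taking (state, message), e.g. a wrapper around
        hookenv.status_set (assumption, not taken from the charm).

    Returns the list of refreshable snaps, or None if the store could not
    be queried this time around.
    """
    try:
        refreshes = list_refreshes()
    except (subprocess.CalledProcessError, OSError) as err:
        # Report the problem as a workload status instead of raising, so the
        # hook exits cleanly and the next update-status run can try again.
        set_status("waiting", "snap store unavailable, will retry: {}".format(err))
        return None
    return refreshes
```

With a handler shaped like this, a failed store query leaves the unit in waiting rather than error, and the periodic update-status hook acts as the retry.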

2020-07-17 06:59:39 ERROR juju-log Hook error:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-etcd-1/.venv/lib/python3.6/site-packages/charms/reactive/__init__.py", line 73, in main
    hookenv._run_atstart()
  File "/var/lib/juju/agents/unit-etcd-1/.venv/lib/python3.6/site-packages/charmhelpers/core/hookenv.py", line 1332, in _run_atstart
    callback(*args, **kwargs)
  File "/var/lib/juju/agents/unit-etcd-1/charm/reactive/snap.py", line 93, in check_refresh_available
    available_refreshes = snap.get_available_refreshes()
  File "lib/charms/layer/snap.py", line 385, in get_available_refreshes
    out = subprocess.check_output(['snap', 'refresh', '--list']).decode('utf8')
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['snap', 'refresh', '--list']' returned non-zero exit status 1.

2020-07-17 06:59:39 DEBUG update-status Traceback (most recent call last):
2020-07-17 06:59:39 DEBUG update-status File "/var/lib/juju/agents/unit-etcd-1/charm/hooks/update-status", line 22, in <module>
2020-07-17 06:59:39 DEBUG update-status main()
2020-07-17 06:59:39 DEBUG update-status File "/var/lib/juju/agents/unit-etcd-1/.venv/lib/python3.6/site-packages/charms/reactive/__init__.py", line 73, in main
2020-07-17 06:59:39 DEBUG update-status hookenv._run_atstart()
2020-07-17 06:59:39 DEBUG update-status File "/var/lib/juju/agents/unit-etcd-1/.venv/lib/python3.6/site-packages/charmhelpers/core/hookenv.py", line 1332, in _run_atstart
2020-07-17 06:59:39 DEBUG update-status callback(*args, **kwargs)
2020-07-17 06:59:39 DEBUG update-status File "/var/lib/juju/agents/unit-etcd-1/charm/reactive/snap.py", line 93, in check_refresh_available
2020-07-17 06:59:39 DEBUG update-status available_refreshes = snap.get_available_refreshes()
2020-07-17 06:59:39 DEBUG update-status File "lib/charms/layer/snap.py", line 385, in get_available_refreshes
2020-07-17 06:59:39 DEBUG update-status out = subprocess.check_output(['snap', 'refresh', '--list']).decode('utf8')
2020-07-17 06:59:39 DEBUG update-status File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
2020-07-17 06:59:39 DEBUG update-status **kwargs).stdout
2020-07-17 06:59:39 DEBUG update-status File "/usr/lib/python3.6/subprocess.py", line 438, in run
2020-07-17 06:59:39 DEBUG update-status output=stdout, stderr=stderr)
2020-07-17 06:59:39 DEBUG update-status subprocess.CalledProcessError: Command '['snap', 'refresh', '--list']' returned non-zero exit status 1.
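
Both tracebacks end in an unguarded `subprocess.check_output` call. A defensive variant might look something like the sketch below. It is not the fix that landed in layer-snap; the `cmd` and `timeout` parameters, the `None` return value, and the assumption that the listing starts with a `Name` header row followed by one snap per line are all illustrative.

```python
import subprocess


def get_available_refreshes(cmd=("snap", "refresh", "--list"), timeout=60):
    """Return snap names with pending refreshes, [] if none, None if unknown.

    None means the query itself failed (for example, the store was
    unreachable), which lets callers tell "no refreshes" apart from
    "could not ask".
    """
    try:
        out = subprocess.check_output(
            list(cmd), stderr=subprocess.STDOUT, timeout=timeout
        ).decode("utf8")
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
        # Non-zero exit, hang, or missing binary: treat as "unknown".
        return None

    lines = out.splitlines()
    # Assumption: a listing begins with a header row starting with "Name";
    # anything else (such as an "up to date" message) means nothing to refresh.
    if not lines or not lines[0].startswith("Name"):
        return []
    # First whitespace-separated column of each remaining row is the snap name.
    return [line.split()[0] for line in lines[1:] if line.strip()]
```

Returning `None` instead of raising leaves the decision about unit status to the caller.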

George Kraft (cynerva) wrote :

Looks like this is part of the code that was added to layer-snap for snap coherence support[1]. This will need to be fixed in layer-snap.

[1]: https://github.com/stub42/layer-snap/commit/2d3872544653c2fcec4a9b57d594d976b7fd1042

Changed in charm-etcd:
importance: Undecided → High
Changed in charm-kubernetes-master:
importance: Undecided → High
Changed in charm-kubernetes-worker:
importance: Undecided → High
Changed in charm-etcd:
status: New → Triaged
Changed in charm-kubernetes-master:
status: New → Triaged
Changed in charm-kubernetes-worker:
status: New → Triaged
Chris Johnston (cjohnston) wrote :

Is this possibly already fixed and just needs to be released?

https://github.com/stub42/layer-snap/blob/master/lib/charms/layer/snap.py#L382

tags: added: sts
Chris Johnston (cjohnston) wrote :

Possibly released in etcd/532 and kubernetes-worker/683?

Chris Johnston (cjohnston) wrote :
Tim Van Steenburgh (tvansteenburgh) wrote :

Thanks Chris, you're right.

The fix to the snap layer is included in the current candidate revisions of the charms, which will be released to stable with the 1.18+ck2 bugfix release later this week. The revisions are:

kubernetes-worker-692
kubernetes-master-865
etcd-531

Changed in layer-snap:
status: New → Fix Released
Changed in charm-etcd:
milestone: none → 1.18+ck2
Changed in charm-kubernetes-master:
milestone: none → 1.18+ck2
Changed in charm-kubernetes-worker:
milestone: none → 1.18+ck2
Changed in charm-etcd:
status: Triaged → Fix Committed
Changed in charm-kubernetes-master:
status: Triaged → Fix Committed
Changed in charm-kubernetes-worker:
status: Triaged → Fix Committed
Changed in charm-etcd:
status: Fix Committed → Fix Released
Changed in charm-kubernetes-master:
status: Fix Committed → Fix Released
Changed in charm-kubernetes-worker:
status: Fix Committed → Fix Released
Michael Skalka (mskalka) wrote :

We have seen another instance of this bug using etcd-531 during this test run: https://solutions.qa.canonical.com/qa/testRun/7228687f-a7b2-4e7d-9afa-322263933d4c

Link to bundle: https://oil-jenkins.canonical.com/artifacts/7228687f-a7b2-4e7d-9afa-322263933d4c/config/config/bundle.yaml
Link to crashdump: https://oil-jenkins.canonical.com/artifacts/7228687f-a7b2-4e7d-9afa-322263933d4c/generated/generated/openstack/juju-crashdump-openstack-2020-08-18-02.01.35.tar.gz

In this instance the etcd charm's "update-status" hook failed to refresh the snap, resulting in the error seen on the test page.

Changed in charm-etcd:
status: Fix Released → New
George Kraft (cynerva) wrote :

Thank you. Yes, this fix was not released with 1.18+ck2, since the layer-snap commit[1] was not backported to the stable branch of our fork[2].

I do not believe we have plans to do a 1.18+ck3 release, so this should go out with CK 1.19.

[1]: https://github.com/stub42/layer-snap/commit/9c69a33ea0586ac68a4e47d6b55b3a0374b96b26
[2]: https://github.com/charmed-kubernetes/layer-snap/commits/stable

Changed in charm-etcd:
milestone: 1.18+ck2 → 1.19
Changed in charm-kubernetes-master:
milestone: 1.18+ck2 → 1.19
Changed in charm-kubernetes-worker:
milestone: 1.18+ck2 → 1.19
Changed in charm-etcd:
status: New → Fix Committed
Changed in charm-kubernetes-master:
status: Fix Released → Fix Committed
Changed in charm-kubernetes-worker:
status: Fix Released → Fix Committed
Changed in charm-etcd:
status: Fix Committed → Fix Released
Changed in charm-kubernetes-master:
status: Fix Committed → Fix Released
Changed in charm-kubernetes-worker:
status: Fix Committed → Fix Released