cannot deploy bundle, cannot resolve URL, TLS handshake timeout

Bug #1906372 reported by Aurelien Lourot
This bug affects 6 people
Affects: Canonical Juju
Status: Confirmed
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

Maybe related to lp:1899793?

We have been seeing this on OSCI more than 5 times per day for the last 1 or 2 weeks on random bundles and charms:

ERROR cannot deploy bundle: cannot resolve URL "cs:~openstack-charmers-next/ceph-mon": cannot resolve charm URL "cs:~openstack-charmers-next/ceph-mon": cannot get "/~openstack-charmers-next/ceph-mon/meta/any?include=id&include=supported-series&include=published": Get "https://api.jujucharms.com/charmstore/v5/~openstack-charmers-next/ceph-mon/meta/any?include=id&include=supported-series&include=published": net/http: TLS handshake timeout

It may not be a Juju issue, but it seems to correlate with a recent upgrade from Juju 2.7 to Juju 2.8.6 on OSCI.

Example:
https://openstack-ci-reports.ubuntu.com/artifacts/test_charm_pipeline_func_full/openstack/charm-cinder-ceph/761552/1/7525/consoleText.test_charm_func_full_10591.txt

Tags: cdo-qa
Revision history for this message
Ian Booth (wallyworld) wrote :

Juju issues an HTTP GET request to the charm store

https://api.jujucharms.com/charmstore/v5/~openstack-charmers-next/ceph-mon/meta/any?include=id&include=supported-series&include=published

to fetch metadata about the charm. This HTTP request is timing out.

It sure seems like there are underlying connectivity issues external to Juju here.
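
For illustration, a minimal Go sketch of the same metadata request using the standard library's default client (this is not Juju's actual client code); the default transport gives the TLS handshake 10 seconds, which is where "net/http: TLS handshake timeout" comes from:

---

package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// http.DefaultClient uses http.DefaultTransport, whose TLSHandshakeTimeout
	// is 10 seconds; if the handshake takes longer, the request fails with
	// "net/http: TLS handshake timeout".
	url := "https://api.jujucharms.com/charmstore/v5/~openstack-charmers-next/ceph-mon/meta/any?include=id&include=supported-series&include=published"

	resp, err := http.Get(url)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("%s (%d bytes)\n", resp.Status, len(body))
}

---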

Revision history for this message
Pen Gale (pengale) wrote :

This should get fixed when we move to the new shim API.

Changed in juju:
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
Alvaro Uria (aluria) wrote :

Hi,

We're consistently seeing this problem when running functional tests on an OpenStack charm. Please have a look at the attached PDF. In all but one of the tests, the issue was related to this bug. The charmstore times out after 10s.

Is there a "juju deploy" option to increase the timeout (eg. 15 or 20s)?

Revision history for this message
Pen Gale (pengale) wrote :

Bumping to high. It sounds like we might be able to increase a timeout value to fix this, and it feels like something that might be affecting production deployments, in addition to the test environment here.

Changed in juju:
importance: Medium → High
milestone: none → 2.8.8
Revision history for this message
Pen Gale (pengale) wrote :

Per conversation in sync, it is probably better to fix this by being smarter about re-using the TLS connection, rather than setting a higher timeout.

(If we bump up the timeout, we're probably going to run into other issues talking to the store down the line.)
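
A rough sketch of the connection-reuse idea, assuming a single shared net/http client with keep-alive connections so the handshake cost is paid once per host rather than once per request (a sketch of the approach, not the actual Juju change):

---

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// One shared client: idle keep-alive connections are reused for later
// requests to the same host, so the TLS handshake happens once rather
// than on every call.
var sharedClient = &http.Client{
	Transport: &http.Transport{
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 10,
		IdleConnTimeout:     90 * time.Second,
	},
	Timeout: 60 * time.Second,
}

func fetch(url string) error {
	resp, err := sharedClient.Get(url)
	if err != nil {
		return err
	}
	// Draining and closing the body lets the connection return to the
	// idle pool so it can be reused.
	defer resp.Body.Close()
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}

func main() {
	url := "https://api.jujucharms.com/charmstore/v5/~openstack-charmers-next/ceph-mon/meta/any"
	for i := 0; i < 3; i++ {
		if err := fetch(url); err != nil {
			fmt.Println("request failed:", err)
		}
	}
}

---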

Revision history for this message
Alvaro Uria (aluria) wrote :

The OpenStack team also suggested an approach similar to tenacity.retry, where connection timeouts would be retried once (or more, if an argument like "--connection-retry" could be passed).
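
A small Go sketch of that suggestion, retrying only on timeout errors; the retry count and the "--connection-retry" flag are hypothetical, not existing Juju options:

---

package main

import (
	"errors"
	"fmt"
	"net"
	"net/http"
)

// fetchWithRetry retries the GET up to `retries` extra times, but only when
// the failure is a timeout (e.g. "net/http: TLS handshake timeout"). The
// retry count stands in for a hypothetical "--connection-retry" option.
func fetchWithRetry(url string, retries int) (*http.Response, error) {
	var lastErr error
	for attempt := 0; attempt <= retries; attempt++ {
		resp, err := http.Get(url)
		if err == nil {
			return resp, nil
		}
		lastErr = err
		var netErr net.Error
		if !errors.As(err, &netErr) || !netErr.Timeout() {
			break // only timeouts are retried
		}
	}
	return nil, lastErr
}

func main() {
	resp, err := fetchWithRetry("https://api.jujucharms.com/charmstore/v5/~openstack-charmers-next/ceph-mon/meta/any", 1)
	if err != nil {
		fmt.Println("giving up:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}

---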

Revision history for this message
Alireza Nasri (sysnasri) wrote :

Is there a temporary workaround for this?

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Subscribed to field-high; this is affecting solutions-qa release testing.

Changed in juju:
milestone: 2.8.8 → 2.8.9
Revision history for this message
Joshua Genet (genet022) wrote :

Here's what we believe is another manifestation of this.
We run a Kubernetes test suite that's failing to pull an image.

---

containerd_2/var/log/syslog:Feb 9 11:20:23 juju-074d0d-7 containerd[41538]: time="2021-02-09T11:20:23.753972008Z" level=error msg="PullImage
"rocks.canonical.com/cdk/jujusolutions/jujud-operator:2.8.8"
failed" error="failed to pull and unpack image
"rocks.canonical.com/cdk/jujusolutions/jujud-operator:2.8.8": failed to resolve reference "rocks.canonical.com/cdk/jujusolutions/jujud-operator:2.8.8": failed to do request: Head https://rocks.canonical.com/v2/cdk/jujusolutions/jujud-operator/manifests/2.8.8: net/http: TLS handshake timeout"

---

Example run:
https://solutions.qa.canonical.com/testruns/testRun/c9df119d-7bc6-4ebc-8f7c-7240897c6f85

Juju status at the bottom of this page:
https://oil-jenkins.canonical.com/job/fce_build/9620/console

Juju model config and crashdump:
https://oil-jenkins.canonical.com/artifacts/c9df119d-7bc6-4ebc-8f7c-7240897c6f85/generated/generated/kubernetes/juju-crashdump-kubernetes-2021-02-09-11.24.47.tar.gz

All artifacts:
https://oil-jenkins.canonical.com/artifacts/c9df119d-7bc6-4ebc-8f7c-7240897c6f85/index.html

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1906372] Re: cannot deploy bundle, cannot resolve URL, TLS handshake timeout

So this feels like something on the order of "your VMs are not getting enough entropy in order to generate private keys for TLS connections". I don't know that that is the case, but I worry that just doing retries on Juju's behalf won't make things better (as it just consumes more of whatever limited resource is causing TLS handshakes to fail).

Revision history for this message
Ian Booth (wallyworld) wrote :

What's interesting is that there are now two external services affected, called by two separate clients:

1. charm store (used by the juju client)
2. rocks (used by k8s itself, i.e. containerd)

Given that containerd is also affected, i.e. it connects to rocks to pull an image entirely outside of Juju, this doesn't look like a Juju issue per se, and really does appear to be an artifact of the deployment environment.

Revision history for this message
Michael Skalka (mskalka) wrote :

Ian,

I'm sorry, but I don't buy the "it's your environment" line here. The 2.8.7 client was blessed on Dec 11th, 2020, and we shut our CI off the following Wednesday for the holiday break. We started seeing this sporadically on Jan 6th [0], basically a day after turning our CI back on. Assuming you didn't release on a Friday, that was only a few days for this issue to present. Between the 2.8.7 release tests and today, nothing has changed within our test lab that could have caused this.

This has also been confirmed by the OpenStack engineering team [1] and at least one community member.

So either the Juju client has a defect, or the charm store is working poorly. Either way, it's a Juju issue.

0. https://solutions.qa.canonical.com/bugs/bugs/bug/1906372
1. https://bugs.launchpad.net/juju/+bug/1906372/comments/3

Revision history for this message
Ian Booth (wallyworld) wrote :

It's also containerd, independent of Juju.

Below, the containerd service is trying to pull an image from rocks.canonical.com and gets the TLS handshake error. Juju is not involved here.

containerd_2/var/log/syslog:Feb 9 11:20:23 juju-074d0d-7 containerd[41538]: time="2021-02-09T11:20:23.753972008Z" level=error msg="PullImage
"rocks.canonical.com/cdk/jujusolutions/jujud-operator:2.8.8"
failed" error="failed to pull and unpack image
"rocks.canonical.com/cdk/jujusolutions/jujud-operator:2.8.8": failed to resolve reference "rocks.canonical.com/cdk/jujusolutions/jujud-operator:2.8.8": failed to do request: Head https://rocks.canonical.com/v2/cdk/jujusolutions/jujud-operator/manifests/2.8.8: net/http: TLS handshake timeout"

Ian Booth (wallyworld)
Changed in juju:
milestone: 2.8.9 → none
Revision history for this message
Nobuto Murata (nobuto) wrote :

K8s/containerd retries pulling images, doesn't it?
https://kubernetes.io/docs/concepts/containers/images/#imagepullbackoff

I'm not saying Juju is doing something wrong here, but having retries and backoffs in Juju when pulling resources from charmhub and such would make our lives much easier.
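
A sketch of what retry-with-backoff could look like on the client side, in the same spirit as Kubernetes' image-pull backoff; the attempt count and delays are illustrative only, not existing Juju behaviour:

---

package main

import (
	"fmt"
	"net/http"
	"time"
)

// getWithBackoff retries a failed GET with exponentially growing pauses
// (2s, 4s, 8s), loosely mirroring Kubernetes' image-pull backoff.
func getWithBackoff(url string) (*http.Response, error) {
	delay := 2 * time.Second
	var lastErr error
	for attempt := 1; attempt <= 4; attempt++ {
		resp, err := http.Get(url)
		if err == nil {
			return resp, nil
		}
		lastErr = err
		if attempt < 4 {
			fmt.Printf("attempt %d failed (%v); retrying in %s\n", attempt, err, delay)
			time.Sleep(delay)
			delay *= 2
		}
	}
	return nil, lastErr
}

func main() {
	resp, err := getWithBackoff("https://api.jujucharms.com/charmstore/v5/~openstack-charmers-next/ceph-mon/meta/any")
	if err != nil {
		fmt.Println("giving up:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}

---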

tags: added: cdo-qa