Test dns provider fails to exec into pod

Bug #1915042 reported by Alexander Balderson
Affects: Charmed Kubernetes Testing
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

Running k8s on top of vSphere, using Calico as the CNI.

The test reports:
2021-02-06 19:10:36.256252 Traceback (most recent call last):
2021-02-06 19:10:36.256261 File "/home/ubuntu/k8s-validation/jobs/integration/validation.py", line 1792, in test_dns_provider
2021-02-06 19:10:36.256270 await verify_dns_resolution(fresh=True)
2021-02-06 19:10:36.256279 File "/home/ubuntu/k8s-validation/jobs/integration/validation.py", line 1693, in verify_dns_resolution
2021-02-06 19:10:36.256288 await kubectl(model, f"exec validate-dns -- host {name}")
2021-02-06 19:10:36.256297 File "/home/ubuntu/k8s-validation/jobs/integration/utils.py", line 539, in kubectl
2021-02-06 19:10:36.256307 return await juju_run(
2021-02-06 19:10:36.256315 File "/home/ubuntu/k8s-validation/jobs/integration/utils.py", line 533, in juju_run
2021-02-06 19:10:36.256324 raise JujuRunError(cmd, result)
2021-02-06 19:10:36.256333 integration.utils.JujuRunError: `/snap/bin/kubectl --kubeconfig /root/.kube/config exec validate-dns -- host www.ubuntu.com` failed:
2021-02-06 19:10:36.256342
2021-02-06 19:10:36.256351 error: cannot exec into a container in a completed pod; current phase is Succeeded

It also fails to clean up the k8s model during the post-deployment phase:

2021-02-06 19:09:49.213744 FAILEDCleaning up k8s model
2021-02-06 19:09:49.303620 Disconnecting k8s model
2021-02-06 19:09:49.308088 Destroying k8s model
2021-02-06 19:10:35.972318 Removing k8s cloud

The last pass for this test on our vSphere cluster was Friday, the 5th of February, after switching to rocks by setting "caas-image-repo" in our model config.

The testrun can be found at: https://solutions.qa.canonical.com/testruns/testRun/681710cc-62e5-4304-836e-bc2c4b8b5aac

with logs at: https://oil-jenkins.canonical.com/artifacts/681710cc-62e5-4304-836e-bc2c4b8b5aac/index.html
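
The exec fails because validate-dns is a one-shot pod: once its container exits, the pod moves to phase Succeeded and there is no running container left to exec into. A minimal sketch for confirming the phase, assuming the kubectl helper from integration.utils seen in the traceback, and that it returns the juju_run result with a stdout attribute:

    from integration.utils import kubectl  # helper module named in the traceback

    async def pod_phase(model, pod="validate-dns"):
        # A finished one-shot pod reports "Succeeded"; exec requires "Running".
        result = await kubectl(model, f"get pod {pod} -o jsonpath='{{.status.phase}}'")
        return result.stdout.strip()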

Jason Hobbs (jason-hobbs) wrote:

Sub'd to field-high; this is blocking Solutions QA release testing.

George Kraft (cynerva) wrote:

> error: cannot exec into a container in a completed pod; current phase is Succeeded

Looks like the validate-dns pod exited early. Most likely it failed to apt update and apt install[1].

I need logs from the validate-dns pod. Unfortunately, those logs are not in the crashdump because test_dns_provider removed the pod in a finally clause[2] (a sketch for capturing them before cleanup follows the links below).

I'll see if I can repro locally.

[1]: https://github.com/charmed-kubernetes/jenkins/blob/c7b59b7ae8c723d0f415f2afc6627018e50527fa/jobs/integration/templates/validate-dns-spec.yaml#L11
[2]: https://github.com/charmed-kubernetes/jenkins/blob/c7b59b7ae8c723d0f415f2afc6627018e50527fa/jobs/integration/validation.py#L1821-L1826
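
One way to preserve those logs would be to dump them before the cleanup deletes the pod. A minimal sketch, assuming the async kubectl helper and JujuRunError from integration.utils that appear in the tracebacks; verify_dns_resolution is the existing check in validation.py:

    import contextlib

    from integration.utils import JujuRunError, kubectl  # names from the tracebacks

    async def verify_then_capture(model):
        try:
            await verify_dns_resolution(fresh=True)  # existing check in validation.py
        finally:
            # Best-effort capture so the pod's output lands in the crashdump.
            with contextlib.suppress(JujuRunError):
                print(await kubectl(model, "logs validate-dns"))
                print(await kubectl(model, "describe pod validate-dns"))
            await kubectl(model, "delete pod validate-dns --ignore-not-found")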

George Kraft (cynerva) wrote:

If you have other test runs that have hit this, please link them here. I want to see if the failure is always on line 1792 or if it sometimes happens elsewhere in the test too.

George Kraft (cynerva) wrote:

I'm unable to reproduce this on our team's vsphere. Please provide more test runs for us to look at.

Changed in charmed-kubernetes-testing:
status: New → Incomplete
Chris Sanders (chris.sanders) wrote:

I'm removing field-high as this has remained incomplete for several days. If you return to this, please provide more information and re-subscribe if needed.

Alexander Balderson (asbalderson) wrote:

We've also hit this on AWS. You can view all the runs that have hit this bug at:
https://solutions.qa.canonical.com/bugs/bugs/bug/1915042

George Kraft (cynerva)
Changed in charmed-kubernetes-testing:
status: Incomplete → New
Michael Skalka (mskalka) wrote:

Saw another instance of this bug here: https://solutions.qa.canonical.com/testruns/testRun/a3fc7ae7-5ffb-4df7-a97f-7d14ea9dadaf while running test_dns_provider:

=================================== FAILURES ===================================
______________________________ test_dns_provider _______________________________
Traceback (most recent call last):
  File "/home/ubuntu/k8s-validation/jobs/integration/validation.py", line 1737, in test_dns_provider
    await verify_dns_resolution(fresh=True)
  File "/home/ubuntu/k8s-validation/jobs/integration/validation.py", line 1693, in verify_dns_resolution
    await kubectl(model, f"exec validate-dns -- host {name}")
  File "/home/ubuntu/k8s-validation/jobs/integration/utils.py", line 539, in kubectl
    return await juju_run(
  File "/home/ubuntu/k8s-validation/jobs/integration/utils.py", line 533, in juju_run
    raise JujuRunError(cmd, result)
integration.utils.JujuRunError: `/snap/bin/kubectl --kubeconfig /root/.kube/config exec validate-dns -- host www.ubuntu.com` failed:

error: cannot exec into a container in a completed pod; current phase is Succeeded
------------------------------ Captured log setup ------------------------------
WARNING juju.client.connection:connection.py:706 unknown facade CAASModelOperator
WARNING juju.client.connection:connection.py:730 unexpected facade CAASModelOperator found, unable to decipher version to use
WARNING juju.model:model.py:905 unknown delta type: id
WARNING juju.client.connection:connection.py:706 unknown facade CAASModelOperator
WARNING juju.client.connection:connection.py:730 unexpected facade CAASModelOperator found, unable to decipher version to use
WARNING juju.model:model.py:905 unknown delta type: id
- generated xml file: /home/ubuntu/project/generated/kubernetes/k8s-suite/test_dns_provider-junit.xml -
----- generated html file: file:///home/ubuntu/k8s-validation/report.html ------
=========================== short test summary info ============================
FAILED jobs/integration/validation.py::test_dns_provider - integration.utils....
======================== 1 failed in 388.48s (0:06:28) =========================
ERROR: InvocationError for command /home/ubuntu/k8s-validation/.tox/py3/bin/pytest -v -s --junit-xml=/home/ubuntu/project/generated/kubernetes/k8s-suite/test_dns_provider-junit.xml --controller=foundations-maas --model=kubernetes /home/ubuntu/k8s-validation/jobs/integration/validation.py::test_dns_provider (exited with code 1)
___________________________________ summary ____________________________________
ERROR: py3: commands failed

George Kraft (cynerva) wrote:

Thanks. It looks like the test is mostly failing in the same spot, after the CoreDNS charm is deployed.

I'm not able to confirm exactly what's happening, but the most likely culprit is that the test does not wait for the CoreDNS charm's pods to come up before trying to verify DNS. (Waiting for "active" status on the charm is not enough.)

I recommend the following fixes:
1. Replace the validate-dns pod's image[1] with something that has host/nslookup/dig baked in (busybox, maybe?) so we don't have to run `apt install` as a start command. That command creates a lot of race conditions - what if we kubectl exec into the pod before it's done?
2. Change restartPolicy[2] to Always.
3. Add retries to verify_dns_resolution and verify_no_dns_resolution[3].
4. After deploying CoreDNS, before verifying DNS[4], call wait_for_pods_ready (a sketch of fixes 3 and 4 follows the links below).

[1]: https://github.com/charmed-kubernetes/jenkins/blob/9f180e2be0d209a6b82be93bda8f9623cd133bf8/jobs/integration/templates/validate-dns-spec.yaml#L9
[2]: https://github.com/charmed-kubernetes/jenkins/blob/9f180e2be0d209a6b82be93bda8f9623cd133bf8/jobs/integration/templates/validate-dns-spec.yaml#L12
[3]: https://github.com/charmed-kubernetes/jenkins/blob/9f180e2be0d209a6b82be93bda8f9623cd133bf8/jobs/integration/validation.py#L1709-L1724
[4]: https://github.com/charmed-kubernetes/jenkins/blob/9f180e2be0d209a6b82be93bda8f9623cd133bf8/jobs/integration/validation.py#L1811-L1812
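
A minimal sketch of fixes 3 and 4, assuming the kubectl helper and JujuRunError from integration.utils seen in the tracebacks; the retry signature, wait_for_pods_ready, and the CoreDNS label selector are assumptions, not the repo's actual code:

    import asyncio

    from integration.utils import JujuRunError, kubectl  # names from the tracebacks

    async def verify_dns_resolution_with_retries(model, name="www.ubuntu.com",
                                                 attempts=5, delay=10):
        # Fix 3: retry the exec-based check to ride out pod startup races.
        for attempt in range(1, attempts + 1):
            try:
                await kubectl(model, f"exec validate-dns -- host {name}")
                return
            except JujuRunError:
                if attempt == attempts:
                    raise
                await asyncio.sleep(delay)

    async def wait_for_pods_ready(model, selector="app=coredns"):
        # Fix 4: block until the matching pods report Ready.
        # The label selector for the CoreDNS charm's pods is an assumption.
        await kubectl(model, f"wait --for=condition=Ready pod -l {selector} --timeout=120s")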

Changed in charmed-kubernetes-testing:
importance: Undecided → High
status: New → Triaged