Test sans fails on baremetal deploys

Bug #1896542 reported by Alexander Balderson
This bug affects 1 person
Affects                     Status   Importance  Assigned to  Milestone
Charmed Kubernetes Testing  Triaged  High        Unassigned
vault-charm                 Triaged  High        Unassigned

Bug Description

On our bare-metal k8s deployments, test_sans is failing with the error:
Traceback (most recent call last):
  File "/home/ubuntu/k8s-validation/jobs/integration/validation.py", line 1057, in test_sans
    timeout_msg="extra sans config did not propagate to server certs",
  File "/home/ubuntu/k8s-validation/jobs/integration/utils.py", line 202, in retry_async_with_timeout
    raise asyncio.TimeoutError(timeout_msg)
concurrent.futures._base.TimeoutError: extra sans config did not propagate to server certs
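
For reference, the helper named in the traceback appears to poll an async check until a deadline and then raise asyncio.TimeoutError with the message above. A minimal sketch of that behavior (parameter names and defaults are assumptions; the real jobs/integration/utils.py may differ):

import asyncio
import time

async def retry_async_with_timeout(func, args, timeout_insec=600,
                                    retry_interval_insec=5,
                                    timeout_msg="timeout"):
    # Poll the async predicate (e.g. "all server certs contain the extra SAN")
    # until it returns True or the deadline passes.
    deadline = time.time() + timeout_insec
    while time.time() < deadline:
        if await func(*args):
            return True
        await asyncio.sleep(retry_interval_insec)
    raise asyncio.TimeoutError(timeout_msg)  # the error reported above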

a bit more from the logs on the error:

2020-09-18 20:09:42,915 - subp.py[DEBUG]: Running command ['/bin/sh', '/run/cloud-init/tmp/tmpuolmztio.sh'] with allowed return codes [0] (shell=False, capture=False)
2020-09-18 20:09:42,974 - util.py[WARNING]: Failed to run bootcmd module bootcmd
2020-09-18 20:09:42,974 - util.py[DEBUG]: Failed to run bootcmd module bootcmd
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_bootcmd.py", line 103, in handle
    subp.subp(cmd, env=env, capture=False)
  File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 291, in subp
    raise ProcessExecutionError(stdout=out, stderr=err,
cloudinit.subp.ProcessExecutionError: Unexpected error while running command.
Command: ['/bin/sh', '/run/cloud-init/tmp/tmpuolmztio.sh']
Exit code: 1
Reason: -
Stdout: -
Stderr: -
2020-09-18 20:09:42,979 - handlers.py[DEBUG]: finish: init-network/config-bootcmd: FAIL: running config-bootcmd with frequency always
2020-09-18 20:09:42,980 - url_helper.py[DEBUG]: [0/1] open 'http://10-244-40-0--21.maas-internal:5248/MAAS/metadata/status/hemhrt' with {'url': 'http://10-244-40-0--21.maas-internal:5248/MAAS/metadata/status/hemhrt', 'allow_redirects': True, 'method': 'POST', 'headers': {'Authorization': 'OAuth oauth_nonce="136753292208031713661600459782", oauth_timestamp="1600459782", oauth_version="1.0", oauth_signature_method="PLAINTEXT", oauth_consumer_key="wMkUJcu6SvBxfFUSDH", oauth_token="YbFzkKjZ23GS3F6RsD", oauth_signature="%26fugqELQPU9FexMeU3ffLZMaw8jd5kUfh"'}} configuration
2020-09-18 20:09:42,988 - url_helper.py[DEBUG]: Read from http://10-244-40-0--21.maas-internal:5248/MAAS/metadata/status/hemhrt (204, 0b) after 1 attempts
2020-09-18 20:09:42,989 - util.py[WARNING]: Running module bootcmd (<module 'cloudinit.config.cc_bootcmd' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_bootcmd.py'>) failed
2020-09-18 20:09:42,989 - util.py[DEBUG]: Running module bootcmd (<module 'cloudinit.config.cc_bootcmd' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_bootcmd.py'>) failed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 848, in _run_modules
    ran, _r = cc.run(run_name, mod.handle, func_args,
  File "/usr/lib/python3/dist-packages/cloudinit/cloud.py", line 54, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python3/dist-packages/cloudinit/helpers.py", line 185, in run
    results = functor(*args)
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_bootcmd.py", line 103, in handle
    subp.subp(cmd, env=env, capture=False)
  File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 291, in subp
    raise ProcessExecutionError(stdout=out, stderr=err,
cloudinit.subp.ProcessExecutionError: Unexpected error while running command.
Command: ['/bin/sh', '/run/cloud-init/tmp/tmpuolmztio.sh']
Exit code: 1
Reason: -
Stdout: -
Stderr: -

A link to testruns failing with this error will be included once they have been tagged.

Revision history for this message
Alexander Balderson (asbalderson) wrote :

testruns which displayed this bug can be found at
https://solutions.qa.canonical.com/bugs/bugs/bug/1896542

Revision history for this message
George Kraft (cynerva) wrote :

Looks like test_sans is getting confused because kubeapi-load-balancer is present in this deployment, but not related to anything. Removing kubeapi-load-balancer from your bundle/overlay should fix this.

Changed in charmed-kubernetes-testing:
status: New → Invalid
Revision history for this message
Konstantinos Kaskavelis (kaskavel) wrote :

We saw this again in https://solutions.qa.canonical.com/testruns/testRun/96f7e6de-6de6-40f9-884d-65bbef1f749f and https://solutions.qa.canonical.com/v2/testruns/24b447e7-a677-4820-85d3-95d9b0a4d88e. This time we do relate kubeapi-load-balancer, with the following relations:

- ['filebeat:beats-host', 'kubeapi-load-balancer:juju-info']
- ['hacluster-kubeapi-load-balancer:ha', 'kubeapi-load-balancer:ha']
- ['hacluster-kubeapi-load-balancer:nrpe-external-master', 'nrpe:nrpe-external-master']
- ['kubeapi-load-balancer:apiserver', 'kubernetes-control-plane:kube-api-endpoint']
- ['kubeapi-load-balancer:certificates', 'vault:certificates']
- ['kubeapi-load-balancer:juju-info', 'nrpe:general-info']
- ['kubeapi-load-balancer:juju-info', 'telegraf:juju-info']
- ['kubeapi-load-balancer:loadbalancer', 'kubernetes-control-plane:loadbalancer']
- ['kubeapi-load-balancer:nrpe-external-master', 'nrpe:nrpe-external-master']
- ['kubeapi-load-balancer:website', 'kubernetes-worker:kube-api-endpoint']

Can we investigate this?

George Kraft (cynerva)
Changed in charmed-kubernetes-testing:
status: Invalid → New
Revision history for this message
Jeffrey Chang (modern911) wrote :
Revision history for this message
George Kraft (cynerva) wrote :

Looking at test run 07adb558-f37c-4474-9f97-267f5b243827...

Output from `juju show-unit vault/1` indicates that kubeapi-load-balancer/0 has updated the requested sans on the relation (with "santest.example.com" being the new one):

kubeapi-load-balancer/0:
  in-scope: true
  data:
    certificate_name: 4b5072b4-9720-4878-ab36-5292e4df2904
    common_name: 10.246.166.212
    egress-subnets: 10.246.169.65/32
    ingress-address: 10.246.169.65
    private-address: 10.246.169.65
    sans: '["10.246.166.212", "10.246.169.65", "juju-04a9d1-0-lxd-1", "juju-04a9d1-0-lxd-1.nosilo.lab1.solutionsqa", "santest.example.com"]'

Vault decides to reuse a cached certificate, instead of issuing a new one:

2023-04-04 04:49:56 INFO unit.vault/1.juju-log server.go:316 certificates:57: Processing certificate request from kubeapi-load-balancer_0 for 10.246.166.212
2023-04-04 04:49:56 INFO unit.vault/1.juju-log server.go:316 certificates:57: Reusing certificate for unit 'kubeapi-load-balancer_0' and CN '10.246.166.212' from cache.

Since kubeapi-load-balancer and kubernetes-control-plane never receive new certificates from vault, they continue to serve with the old one, which does not have santest.example.com in its SAN list:

X509v3 Subject Alternative Name:
    DNS:10.246.166.212, DNS:juju-04a9d1-0-lxd-1, DNS:juju-04a9d1-0-lxd-1.nosilo.lab1.solutionsqa, IP Address:10.246.166.212, IP Address:10.246.169.65

This caching behavior in vault appears to be new, as the change that introduced it was merged just 2 months ago[1]. I would call this a bug in the vault charm's handling of updated SANs on the relation.

[1]: https://opendev.org/openstack/charm-vault/commit/15ab73ea723d158f9105f0d603515e5d3ddaa78c
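
For illustration only: the log lines above read as if the certificate cache is keyed on the unit name and CN alone, so an updated SAN list on the relation never invalidates the cached entry. A rough sketch of that suspected failure mode (hypothetical names, not the charm's actual code):

# Hypothetical sketch of the suspected bug: the cache key ignores SANs,
# so a changed SAN list never triggers re-issuing a certificate.
cert_cache = {}

def get_server_certificate(unit_name, cn, sans, issue_cert):
    key = (unit_name, cn)              # SANs are not part of the key
    if key in cert_cache:
        return cert_cache[key]         # "Reusing certificate ... from cache."
    cert = issue_cert(cn, sans)
    cert_cache[key] = cert
    return cert

# A fix would also need to compare the requested SANs against the cached
# certificate (or include them in the key) before reusing it.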

Revision history for this message
George Kraft (cynerva) wrote :

All four test runs from comment #4 appear to follow the same pattern as described in comment #5. The kubernetes-control-plane and kubeapi-load-balancer units update requested SANs on the relation, but vault reuses cached certificates and never issues new certificates with the updated SANs.

Revision history for this message
George Kraft (cynerva) wrote :

Regarding the earlier failures from comment #3 which predate vault's certificate caching, it looks like test_sans failed to get certificate information from kubeapi-load-balancer/0:

Traceback (most recent call last):
  File "/home/ubuntu/k8s-validation/jobs/integration/validation.py", line 1117, in test_sans
    await retry_async_with_timeout(
  File "/home/ubuntu/k8s-validation/jobs/integration/utils.py", line 198, in retry_async_with_timeout
    if await func(*args):
  File "/home/ubuntu/k8s-validation/jobs/integration/validation.py", line 1099, in all_certs_in_place
    certs = await get_server_certs()
  File "/home/ubuntu/k8s-validation/jobs/integration/validation.py", line 1079, in get_server_certs
    action = await juju_run(
  File "/home/ubuntu/k8s-validation/jobs/integration/utils.py", line 578, in juju_run
    raise JujuRunError(unit, cmd, result)
integration.utils.JujuRunError: `openssl s_client -connect 127.0.0.1:443 </dev/null 2>/dev/null | openssl x509 -text` failed on kubeapi-load-balancer/0:

unable to load certificate
140264670332224:error:0909006C:PEM routines:get_name:no start line:../crypto/pem/pem_lib.c:745:Expecting: TRUSTED CERTIFICATE

It looks like this can happen when the 127.0.0.1:443 endpoint is down, i.e. when nginx or kube-apiserver are restarting. Note that this "Expecting: TRUSTED CERTIFICATE" output is specific to focal; on jammy, openssl outputs something different.

I think we need to update test_sans to handle nginx and kube-apiserver restarts gracefully. Maybe just catch this JujuRunError and retry.
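
A minimal sketch of that idea, reusing the names from the traceback above (all_certs_in_place, get_server_certs, JujuRunError); the exact shape of the real fix may differ:

async def all_certs_in_place():
    # get_server_certs() and JujuRunError are the existing helpers from
    # validation.py / utils.py shown in the traceback above.
    # Treat a failed openssl probe (endpoint briefly down while nginx or
    # kube-apiserver restarts) as "not ready yet" so the outer
    # retry_async_with_timeout loop simply tries again.
    try:
        certs = await get_server_certs()
    except JujuRunError:
        return False
    # Illustrative check: the extra SAN should appear in every served cert.
    return all("santest.example.com" in cert for cert in certs)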

Changed in charmed-kubernetes-testing:
importance: Undecided → High
status: New → Triaged
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Hello. I've investigated the issue caused by the charm-vault cache v3 patch and reproduced the failure reported in comment #4 above. I cleaned up the env, redeployed, made code changes, and can no longer reproduce the issue; the tests now pass. So I submitted patch v4 [1].

Here are the instructions.

1) deploy bundle (attached below in next comment)

2) unlock vault and authorize

3) git clone https://github.com/charmed-kubernetes/jenkins.git /home/ubuntu/k8s-validation

4) cd /home/ubuntu/k8s-validation; git checkout 1.24

5) tox --workdir .tox -c /home/ubuntu/k8s-validation/tox.ini -e py3 -- pytest -v -s --junit-xml=/home/ubuntu/project/generated/kubernetes-maas/k8s-suite/test_status_messages-junit.xml --controller=ganso --model=k8s-sqa2 /home/ubuntu/k8s-validation/jobs/integration/validation.py::test_status_messages

6) tox --workdir .tox -c /home/ubuntu/k8s-validation/tox.ini -e py3 -- pytest -v -s --junit-xml=/home/ubuntu/project/generated/kubernetes-maas/k8s-suite/test_sans-junit.xml --controller=ganso --model=k8s-sqa2 /home/ubuntu/k8s-validation/jobs/integration/validation.py::test_sans

The --controller and --model parameters in steps (5) and (6) will need to be adjusted for your environment.

[1] https://review.opendev.org/c/openstack/charm-vault/+/883947

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :
Changed in vault-charm:
status: New → Triaged
importance: Undecided → High