x509 Certificate Validation For LXD Clouds and Credentials

Bug #2003135 reported by Thomas Miller
This bug affects 2 people
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Thomas Miller
Milestone: 2.9.43

Bug Description

I have been working with Erik over on Discourse about an error he is seeing with some of his LXD clouds and his Juju controller. See https://discourse.charmhub.io/t/cant-deploy-controller-seems-no-to-like-certificates/8073/7 for the ongoing conversation.

I've met with Erik to look over the affected controller and clouds. From my initial look, I can see that both the server certificate and the client certificate provided to Juju are correct and match what is on the LXD host and the client's machine.

The error message:
2023-01-13 12:55:21 INFO juju.worker.provisioner provisioner_task.go:1348 trying machine 0 StartInstance in availability zone dwellir1
2023-01-13 12:55:21 WARNING juju.worker.provisioner provisioner_task.go:1363 machine 0 failed to start in availability zone dwellir1: Post "https://192.168.111.2:8443/1.0/instances?project=default": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.2
2023-01-13 12:55:21 WARNING juju.worker.provisioner provisioner_task.go:1405 failed to start machine 0 (Post "https://192.168.111.2:8443/1.0/instances?project=default": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.2), retrying in 10s (10 more attempts)
2023-01-13 12:55:31 INFO juju.worker.provisioner provisioner_task.go:1348 trying machine 0 StartInstance in availability zone dwellir1
2023-01-13 12:55:31 WARNING juju.worker.provisioner provisioner_task.go:1363 machine 0 failed to start in availability zone dwellir1: Get "https://192.168.111.2:8443/1.0/images/aliases/juju%2Fbionic%2Famd64?project=default": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.2
2023-01-13 12:55:31 WARNING juju.worker.provisioner provisioner_task.go:1405 failed to start machine 0 (Get "https://192.168.111.2:8443/1.0/images/aliases/juju%2Fbionic%2Famd64?project=default": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.2), retrying in 10s (9 more attempts)

This error comes from the Go standard library, which is trying to validate the TLS connection with the LXD host.

LXD signs all server certificates with just the loopback addresses, and from my reading of the LXD client code this check should usually be skipped. The current task is to establish under what circumstances the IP address check on the certificate is not skipped, and to figure out how this Juju deployment is getting into this state.

As of now we have no reproducible case for this bug, and it appears to affect only one Juju controller.
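For context, one common way for a Go client to trust a specific server certificate without checking its addresses is certificate pinning. The sketch below is an assumption about the general technique, not the code used by the LXD client or by Juju; pinnedTLSConfig and selfSigned are hypothetical helpers. It disables the standard chain and hostname verification and replaces it with a byte-for-byte comparison against the known server certificate.

```go
package main

import (
	"bytes"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"errors"
	"fmt"
	"math/big"
	"net"
	"time"
)

// pinnedTLSConfig returns a client config that accepts exactly one server
// certificate and ignores its SANs. Hypothetical helper, not LXD or Juju code.
func pinnedTLSConfig(pinned *x509.Certificate) *tls.Config {
	return &tls.Config{
		MinVersion: tls.VersionTLS12,
		// Turns off the standard chain and hostname checks; the callback
		// below is then the only verification that runs.
		InsecureSkipVerify: true,
		VerifyPeerCertificate: func(rawCerts [][]byte, _ [][]*x509.Certificate) error {
			if len(rawCerts) == 0 {
				return errors.New("server presented no certificate")
			}
			if !bytes.Equal(rawCerts[0], pinned.Raw) {
				return errors.New("server certificate does not match the pinned certificate")
			}
			return nil
		},
	}
}

// selfSigned creates a throwaway certificate valid only for the given IPs.
func selfSigned(serial int64, ips ...string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(serial),
		Subject:      pkix.Name{CommonName: "example-lxd-server"}, // assumed name
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
	}
	for _, ip := range ips {
		tmpl.IPAddresses = append(tmpl.IPAddresses, net.ParseIP(ip))
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	server, err := selfSigned(1, "127.0.0.1", "::1")
	if err != nil {
		panic(err)
	}
	other, err := selfSigned(2, "127.0.0.1", "::1")
	if err != nil {
		panic(err)
	}
	cfg := pinnedTLSConfig(server)
	// Call the callback directly, as the TLS handshake would.
	fmt.Println("pinned cert:", cfg.VerifyPeerCertificate([][]byte{server.Raw}, nil))
	fmt.Println("other cert: ", cfg.VerifyPeerCertificate([][]byte{other.Raw}, nil))
}
```

With a configuration like this, the address the client dials does not matter, because the SANs are never consulted. A client that reaches the LXD host through a code path without such a configuration falls back to the standard check and can fail the way the log shows.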

Current Juju controller version is 2.9.37.

Thomas Miller (tlmiller)
Changed in juju:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Thomas Miller (tlmiller)
Erik Lönroth (erik-lonroth) wrote :

More input.

Today I upgraded the controller from 2.9.37 to 2.9.38:

    juju upgrade-controller

That worked.

From there, I normally upgrade all models with this command:

    for m in $(juju models --all --format json | jq -r '.models[]["model-uuid"]'); do echo $m; juju upgrade-model -m $m; done

It basically loops over all models in the controller and tries to upgrade them.

This works only partially. The output of the first round looks like this:

e444c390-4053-4838-868a-8436dd861b20
best version:
    2.9.38
started upgrade to 2.9.38
e444c390-4053-4838-868a-8436dd861b20
best version:
    2.9.38
started upgrade to 2.9.38
2c7a52c8-76e3-4b49-8f0d-d4e7f75ddc9e
no upgrades available
997bd7de-5062-4a92-8ee9-627b16c3c3d4
best version:
    2.9.38
started upgrade to 2.9.38
ERROR cannot find tool version from simple streams: creating environ for model "controller" (2eb4342a-966c-446d-8fec-3e06bd45c61b): Get "https://192.168.211.2:8443/1.0/profiles?project=default": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.211.2
ERROR some agents have not upgraded to the current model version 2.9.37: machine-0, machine-3, unit-besu-0, unit-prysm-beacon-1
cdccba01-df55-493a-8f80-23e376840d4c
best version:
    2.9.38
started upgrade to 2.9.38
a68e2aae-e590-494e-8f0f-c193ba07101a
best version:
    2.9.38
started upgrade to 2.9.38

... and so on, with successful upgrades mixed with errors.

So I keep running this command, over and over, until most models are upgraded.

The certificate errors go away after multiple runs, and eventually all models are upgraded.

===== Unrelated =====

At this point, only one model remains which gives an error like this:

    juju upgrade-model -m d84f172a-9f81-4cd5-8759-2cc786cdec41
    ERROR some agents have not upgraded to the current model version 2.9.37: machine-0, machine-3, unit-besu-0, unit-prysm-beacon-1

So I inspected the model and saw that some agents have lost communication with the controller (see the attached screenshot).

This is all fine, since the machines in that model are turned off, but I think the "ERROR" should be reduced to a WARNING.

Erik Lönroth (erik-lonroth) wrote :

I have just upgraded from 2.9.37 -> 2.9.38

The problem still remains.

Thomas Miller (tlmiller) wrote :
Changed in juju:
milestone: none → 2.9.43
Harry Pidcock (hpidcock)
Changed in juju:
status: Triaged → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released