Comment 7 for bug 2028867

John A Meinel (jameinel) wrote :

While reproducing this, there is still one issue that is quite important. Namely, we don't report any errors to the user while this is happening.

Specifically, I tweaked my config to reproduce this bug, forgot about the change, and came back 1 week later. The issue is that we report *0* warnings or failures except at debug level.

So `juju bootstrap lxd lxd` in this situation just sits there, comes back after 20 minutes with 'failed to bootstrap', and kills your instances. There is also nothing in cloud-init-output.log (because the failure is client side).

I think at a minimum we should be trying to surface this error:
```
09:12:41 DEBUG juju.provider.common bootstrap.go:669 connection attempt for 10.8.158.125 failed: /home/jameinel/.ssh/config: line 3: Bad configuration option: pubkeyacceptedalgorithms
/home/jameinel/.ssh/config: terminating, 1 bad configuration options
```

I know that the reason we don't surface SSH errors by default is because we *expect* that the controller won't be up immediately, and so we don't want to scare users by saying that we failed to connect.

But we need something that can say: "I think I can retry this error, but I have been retrying it for 1 minute, so I should surface something".

Note that the errors that we explicitly want to suppress are these:
09:18:46 DEBUG juju.provider.common bootstrap.go:669 connection attempt for 10.8.158.251 failed: ssh: connect to host 10.8.158.251 port 22: Connection refused

09:18:52 DEBUG juju.provider.common bootstrap.go:669 connection attempt for 10.8.158.251 failed: /var/lib/juju/nonce.txt does not exist

Those are both cases where the machine hasn't finished initializing, and it is a race condition between the client trying to connect and the machine not being done with cloud-init.

But "terminating, 1 bad configuration options" is a permanent failure that needs human intervention.
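One way to express that distinction is sketched below in Go. It is not what Juju does today: `classifyConnectFailure`, the `failureKind` values, and the marker strings are assumptions for illustration, with the markers taken from the log lines quoted in this comment. Matching on ssh's output text is fragile, so treat it as a starting point only.

```go
package main

import "strings"

// failureKind is a hypothetical classification of a failed connection attempt.
type failureKind int

const (
	failureUnknown   failureKind = iota // not recognised; keep retrying, maybe escalate later
	failureTransient                    // machine is still coming up; safe to hide
	failurePermanent                    // needs human intervention; surface immediately
)

// Marker substrings taken from the log lines quoted in this comment.
var (
	permanentMarkers = []string{
		"Bad configuration option",
		"bad configuration options",
	}
	transientMarkers = []string{
		"Connection refused",
		"nonce.txt does not exist",
	}
)

// classifyConnectFailure inspects the text of a failed connection attempt.
// Permanent markers are checked first so that a broken client-side config
// is never mistaken for a machine that is merely still booting.
func classifyConnectFailure(output string) failureKind {
	for _, m := range permanentMarkers {
		if strings.Contains(output, m) {
			return failurePermanent
		}
	}
	for _, m := range transientMarkers {
		if strings.Contains(output, m) {
			return failureTransient
		}
	}
	return failureUnknown
}
```

With something like this, the bootstrap loop could print a `failurePermanent` result to the user right away, keep `failureTransient` at debug level, and fall back to a time-based escalation for anything it does not recognise.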