juju restore failed with "error: cannot update machines: machine update failed: ssh command failed: "

Bug #1434437 reported by Alex Kang on 2015-03-20
This bug affects 1 person
Affects: juju-core | Importance: Undecided | Assigned to: Unassigned
Affects: 1.22 | Importance: High | Assigned to: Eric Snow

Bug Description

Hi

The restore failed in my MAAS environment (juju version 1.21.3).

During the restore, bootstrapping from the backup file appeared to work.

But it failed while updating the service machines. The logs are below:

updating all machines
updating machine: 1

updating machine: 2

updating machine: 3

2015-03-20 06:36:45 DEBUG juju.utils.ssh ssh.go:244 using OpenSSH ssh client
2015-03-20 06:36:45 DEBUG juju.utils.ssh ssh.go:244 using OpenSSH ssh client
2015-03-20 06:36:45 DEBUG juju.utils.ssh ssh.go:244 using OpenSSH ssh client
2015-03-20 06:36:45 ERROR juju.plugins.restore restore.go:513 failed to update machine 1: ssh command failed: ("@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\r\n@ WARNING: POSSIBLE DNS SPOOFING DETECTED! @\r\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\r\nThe ECDSA host key for ceph01.maas has changed,\r\nand the key for the corresponding IP address 10.100.1.152\r\nhas a different value. This could either mean that\r\nDNS SPOOFING is happening or the IP address for the host\r\nand its host key have changed at the same time.\r\nOffending key for IP in /home/ubuntu/.ssh/known_hosts:11\r\n remove with: ssh-keygen -f \"/home/ubuntu/.ssh/known_hosts\" -R 10.100.1.152\r\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\r\n@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @\r\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\r\nIT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!\r\nSomeone could be eavesdropping on you right now (man-in-the-middle attack)!\r\nIt is also possible that a host key has just been changed.\r\nThe fingerprint for the ECDSA key sent by the remote host is\nca:3e:d2:b4:e4:83:68:4e:4b:b9:d1:b1:83:eb:20:d9.\r\nPlease contact your system administrator.\r\nAdd correct host key in /home/ubuntu/.ssh/known_hosts to get rid of this message.\r\nOffending ECDSA key in /home/ubuntu/.ssh/known_hosts:5\r\n remove with: ssh-keygen -f \"/home/ubuntu/.ssh/known_hosts\" -R ceph01.maas\r\nKeyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.\r\n+ cd /var/lib/juju/agents\nbash: line 2: cd: /var/lib/juju/agents: No such file or directory\n"): subprocess encountered error code 1
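
The OpenSSH warning in the log line above already names the commands that clear the stale known_hosts entries. The following is a minimal sketch, assuming the error text is available as a string with its literal \r\n escapes intact, that extracts those commands so they can be reviewed before anything is run. The names `ERROR_TEXT`, `HINT` and `stale_host_key_commands` are hypothetical. Clearing the entries would address only the host key warning, not the missing /var/lib/juju/agents directory reported at the end of the same message.

```python
import re

# Excerpt of the ssh error quoted above; raw strings keep the literal
# "\r\n" and "\"" escapes exactly as they appear in the juju log line.
ERROR_TEXT = (
    r'Offending key for IP in /home/ubuntu/.ssh/known_hosts:11\r\n'
    r' remove with: ssh-keygen -f \"/home/ubuntu/.ssh/known_hosts\" -R 10.100.1.152\r\n'
    r'Offending ECDSA key in /home/ubuntu/.ssh/known_hosts:5\r\n'
    r' remove with: ssh-keygen -f \"/home/ubuntu/.ssh/known_hosts\" -R ceph01.maas\r\n'
)

# Matches OpenSSH's "remove with:" hint, tolerating backslash-escaped quotes.
HINT = re.compile(r'remove with: ssh-keygen -f \\?"([^"\\]+)\\?" -R ([^\s\\]+)')


def stale_host_key_commands(error_text):
    """Return the ssh-keygen removal commands suggested in the warning."""
    commands = []
    for known_hosts, host in HINT.findall(error_text):
        cmd = ["ssh-keygen", "-f", known_hosts, "-R", host]
        if cmd not in commands:  # the same hint can appear more than once
            commands.append(cmd)
    return commands
```

Each returned list could be passed to `subprocess.check_call` once you have confirmed that the host key change is expected.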

Before the restore, juju status showed this:

ubuntu@maas01:~$ juju status
environment: maas
machines:
  "0":
    agent-state: started
    agent-version: 1.21.3.1
    dns-name: bootstrap01.maas
    instance-id: /MAAS/api/1.0/nodes/node-9d6be350-cc79-11e4-9b0c-525400b3849e/
    series: trusty
    hardware: arch=amd64 cpu-cores=2 mem=2048M tags=bootstrap,virtual
    state-server-member-status: has-vote
  "1":
    agent-state: started
    agent-version: 1.21.3.1
    dns-name: ceph02.maas
    instance-id: /MAAS/api/1.0/nodes/node-7d34cec6-c7a1-11e4-a137-525400b3849e/
    series: trusty
    hardware: arch=amd64 cpu-cores=2 mem=2048M tags=ceph,virtual
  "2":
    agent-state: started
    agent-version: 1.21.3.1
    dns-name: ceph01.maas
    instance-id: /MAAS/api/1.0/nodes/node-7648fa38-c7a1-11e4-a137-525400b3849e/
    series: trusty
    hardware: arch=amd64 cpu-cores=2 mem=2048M tags=ceph,virtual
  "3":
    agent-state: started
    agent-version: 1.21.3.1
    dns-name: ceph03.maas
    instance-id: /MAAS/api/1.0/nodes/node-8649c0de-c7a1-11e4-a137-525400b3849e/
    series: trusty
    hardware: arch=amd64 cpu-cores=2 mem=2048M tags=ceph,virtual
services:
  ceph:
    charm: local:trusty/ceph-105
    exposed: false
    relations:
      mon:
      - ceph
    units:
      ceph/0:
        agent-state: started
        agent-version: 1.21.3.1
        machine: "1"
        public-address: ceph02.maas
      ceph/1:
        agent-state: started
        agent-version: 1.21.3.1
        machine: "2"
        public-address: ceph01.maas
      ceph/2:
        agent-state: started
        agent-version: 1.21.3.1
        machine: "3"
        public-address: ceph03.maas
networks:
  maas-eth0:
    provider-id: maas-eth0
    cidr: 10.100.1.0/24

I have attached a full log of the restore.
Thanks

Alex Kang (thkang0) wrote :
Alex Kang (thkang0) on 2015-03-20
description: updated
Curtis Hovey (sinzui) on 2015-03-20
tags: added: backup-restore maas-provider
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.23-beta1
milestone: 1.23-beta1 → 1.23-beta2
Changed in juju-core:
milestone: 1.23-beta2 → 1.24-alpha1
Changed in juju-core:
assignee: nobody → Eric Snow (ericsnowcurrently)
Eric Snow (ericsnowcurrently) wrote :

Could you see if this is still a problem if you run with juju 1.22? I'm checking from my side too.

Eric Snow (ericsnowcurrently) wrote :

The error implies that the /var/lib/juju/agents directory is missing on one of the non-state machines in the environment you are trying to restore. Could you check each of the 3 machines in that environment to make sure they have the directory?

Also, is there a chance that at the time of the backup there was a machine set up in juju that has since been removed from juju (and the agents directory deleted) but is still reachable via SSH?
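
One way to run the directory check on all three machines is sketched below. It assumes the hostnames from the juju status output in the description and an SSH login as the `ubuntu` user, which is an assumption and not stated in this thread. `has_agents_dir` and `check_all` are hypothetical helper names, and the `run` parameter exists only so the check can be exercised without a real SSH connection.

```python
import subprocess

AGENTS_DIR = "/var/lib/juju/agents"
# Hostnames taken from the juju status output in the bug description.
MACHINES = ["ceph01.maas", "ceph02.maas", "ceph03.maas"]


def has_agents_dir(host, user="ubuntu", run=subprocess.call):
    """Return True if AGENTS_DIR exists on host (`test -d` exits with 0)."""
    cmd = ["ssh", f"{user}@{host}", "test", "-d", AGENTS_DIR]
    return run(cmd) == 0


def check_all(machines=MACHINES, run=subprocess.call):
    """Map each hostname to whether the agents directory is present."""
    return {host: has_agents_dir(host, run=run) for host in machines}
```

Calling `check_all()` from the client machine would return a dictionary such as `{"ceph01.maas": False, ...}`. A False value can also mean that SSH itself failed, so it is worth running the command by hand on any machine that reports False.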

Alex Kang (thkang0) wrote :

There is no /var/lib/juju/agents directory on the non-state machines.
I think the restore process deleted that directory while it was running.
The machines are still reachable via ssh even though juju restore failed. I can connect to the non-state machines using their IP addresses.

By the way, how can I get juju 1.22?
I am using the PPA ppa:juju/stable.

Alex Kang (thkang0) wrote :

I upgraded juju from 1.21.3 to 1.22 and tried to back up and restore, but it failed again.

I attached logs of what I did, plus the state machine log.

The state machine could not start the state API server.

Changed in juju-core:
status: Triaged → In Progress
Eric Snow (ericsnowcurrently) wrote :

From the logs it looks like you're running into a different failure mode under 1.22 and it probably isn't restore-related. The key entry is:

  2015-03-25 02:30:01 DEBUG juju.mongo open.go:122 TLS handshake failed: x509: certificate is valid for localhost, juju-apiserver, bootstrap04.maas, not juju-mongodb

We ran into this in juju CI when we upgraded to 1.22. See lp:1434680, which has since been fixed. Could you verify?
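
To make the mismatch in that log entry explicit, the sketch below splits the x509 message into the names the certificate covers and the name that was requested. It relies only on the message format visible in the entry quoted above. `X509` and `parse_x509_mismatch` are hypothetical names.

```python
import re

# Format taken from the log entry quoted above:
#   x509: certificate is valid for <name>, <name>, ..., not <requested name>
X509 = re.compile(r"x509: certificate is valid for (?P<names>.+), not (?P<wanted>\S+)$")


def parse_x509_mismatch(line):
    """Return (names on the certificate, requested name), or None if no match."""
    match = X509.search(line.strip())
    if match is None:
        return None
    names = [name.strip() for name in match.group("names").split(",")]
    return names, match.group("wanted")
```

For the entry above this yields the three names on the certificate and `juju-mongodb` as the requested name, which is not among them.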

no longer affects: juju-core/1.23
Changed in juju-core:
milestone: 1.24-alpha1 → none
assignee: Eric Snow (ericsnowcurrently) → nobody
importance: High → Undecided
status: In Progress → Invalid
Alex Kang (thkang0) wrote :

I upgraded to 1.22 two days ago, which I understood to be the release containing the fix for lp:1434680 that you mentioned.

This is my normal deployment process:

1. I destroyed the whole environment with "juju destroy-environment maas"
2. I bootstrapped a machine : juju bootstrap --constraints tags=bootstrap --upload-tools
3. I deployed the ceph environment : juju-deployer --config bundle.yaml ceph
4. I backed it up : juju backups create
5. I simulated a state machine failure : deleted the machine from MAAS
6. I restored the environment from the backup file : juju-restore --constraints tags=bootstrap backupfile.tar.gz

Following this process, I got the error shown above.
bootstrap04.maas is the old state machine, which I deleted from MAAS for this test.
But is juju restore still using the old hostname? Does that mean I have to configure the new server exactly like the old one?
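
For anyone trying to reproduce this, the six steps can be written down as data. The sketch below is a dry-run plan only and executes nothing. It assumes the command in step 1 was meant to be `juju destroy-environment`, and it treats step 5 as a manual action in MAAS. `REPRO_STEPS` and `plan` are hypothetical names.

```python
import shlex

# Commands as listed in the comment above. None marks the manual step
# (deleting the state machine from MAAS), which has no command in the thread.
REPRO_STEPS = [
    "juju destroy-environment maas",
    "juju bootstrap --constraints tags=bootstrap --upload-tools",
    "juju-deployer --config bundle.yaml ceph",
    "juju backups create",
    None,
    "juju-restore --constraints tags=bootstrap backupfile.tar.gz",
]


def plan(steps=REPRO_STEPS):
    """Return (step number, argv or None) pairs without running anything."""
    return [(number, shlex.split(cmd) if cmd else None)
            for number, cmd in enumerate(steps, start=1)]
```

Printing `plan()` gives a numbered checklist whose argv lists could be handed to `subprocess` one at a time, stopping at step 5 to delete the machine by hand.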

Eric Snow (ericsnowcurrently) wrote :

Did you pull and build juju from source, or update the package via apt-get? It's possible that the package wasn't quite up to date. I ask because the message I found in the logs definitely indicates that juju failed due to lp:1434680 (or that the bug isn't actually fixed). I'm not well enough versed in the mechanisms behind the distro package release to give you a more confident expectation (you'd have to ask sinzui). However, I'm still fairly confident that the bug is fixed, which would mean the juju you ran wasn't quite up to date yet.

Alex Kang (thkang0) wrote :

I updated to version 1.22 from the PPA, and dpkg shows that I am running 1.22:

ubuntu@maas01:~$ dpkg -l | grep juju
ii juju-core 1.22.0-0ubuntu1~14.04.2~juju1 amd64 Juju is devops distilled - client
ii juju-deployer 0.4.3-0ubuntu1~ubuntu14.04.1~ppa1 all Deploy complex stacks of services using Juju
ii python-jujuclient 0.50.1-2 amd64 Python API client for juju-core

I also got the same error in another environment.

I deployed an environment with juju 1.22 and restarted the state machine, but it cannot start the juju API service:

2015-03-26 02:28:28 INFO juju.worker runner.go:261 start "api"
2015-03-26 02:28:28 INFO juju.api apiclient.go:252 dialing "wss://localhost:17070/"
2015-03-26 02:28:28 INFO juju.api apiclient.go:260 error dialing "wss://localhost:17070/": websocket.Dial wss://localhost:17070/: dial tcp 127.0.0.1:17070: connection refused
2015-03-26 02:28:28 ERROR juju.worker runner.go:219 exited "api": unable to connect to "wss://localhost:17070/"
2015-03-26 02:28:28 INFO juju.worker runner.go:253 restarting "api" in 3s
2015-03-26 02:28:28 DEBUG juju.mongo open.go:122 TLS handshake failed: x509: certificate is valid for localhost, juju-apiserver, bootstrap05.maas, not juju-mongodb
2015-03-26 02:28:29 DEBUG juju.mongo open.go:122 TLS handshake failed: x509: certificate is valid for localhost, juju-apiserver, bootstrap05.maas, not juju-mongodb
2015-03-26 02:28:30 DEBUG juju.mongo open.go:122 TLS handshake failed: x509: certificate is valid for localhost, juju-apiserver, bootstrap05.maas, not juju-mongodb
2015-03-26 02:28:30 DEBUG juju.mongo open.go:122 TLS handshake failed: x509: certificate is valid for localhost, juju-apiserver, bootstrap05.maas, not juju-mongodb
2015-03-26 02:28:30 DEBUG juju.mongo open.go:122 TLS handshake failed: x509: certificate is valid for localhost, juju-apiserver, bootstrap05.maas, not juju-mongodb
2015-03-26 02:28:31 DEBUG juju.mongo open.go:122 TLS handshake failed: x509: certificate is valid for localhost, juju-apiserver, bootstrap05.maas, not juju-mongodb
2015-03-26 02:28:31 INFO juju.worker runner.go:261 start "api"
2015-03-26 02:28:31 INFO juju.api apiclient.go:252 dialing "wss://localhost:17070/"
2015-03-26 02:28:31 INFO juju.api apiclient.go:260 error dialing "wss://localhost:17070/": websocket.Dial wss://localhost:17070/: dial tcp 127.0.0.1:17070: connection refused

Is there any way to verify that juju is the latest version?

Eric Snow (ericsnowcurrently) wrote :

Thanks for your patience on this. I've verified with Curtis (sinzui) that the fix for lp:1434680 will be in 1.20.1, which will be released in the next few days. Sorry for the confusion.

Alex Kang (thkang0) wrote :

Thanks Eric.
I will test it again when you release 1.20.1 and let you know if I have a problem.

Eric Snow (ericsnowcurrently) wrote :

Sorry, Alex. I meant 1.22.1, not 1.20.1.
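
Given that the fix is expected in 1.22.1, one way to check an installed client package is to compare the upstream part of its dpkg version against that release. The following is a minimal sketch using the `dpkg -l` line quoted earlier in this thread. `installed_juju_core`, `upstream_version` and `FIXED_IN` are hypothetical names, and the check covers only the client package, not the tools running on the state server.

```python
import re

# The release named in the comment above as carrying the lp:1434680 fix.
FIXED_IN = (1, 22, 1)

# Line copied from the dpkg -l output quoted earlier in this thread.
DPKG_LINE = "ii juju-core 1.22.0-0ubuntu1~14.04.2~juju1 amd64 Juju is devops distilled - client"


def installed_juju_core(dpkg_output):
    """Return the juju-core version string from `dpkg -l` output, or None."""
    for line in dpkg_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "ii" and fields[1] == "juju-core":
            return fields[2]
    return None


def upstream_version(package_version):
    """Turn '1.22.0-0ubuntu1~14.04.2~juju1' into the tuple (1, 22, 0)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", package_version)
    if match is None:
        raise ValueError("unrecognised version: %r" % package_version)
    return tuple(int(part) for part in match.groups())


def has_fix(dpkg_output):
    """True if the installed juju-core is at or past FIXED_IN."""
    version = installed_juju_core(dpkg_output)
    return version is not None and upstream_version(version) >= FIXED_IN
```

For the package line quoted earlier, `has_fix(DPKG_LINE)` is False, because 1.22.0 sorts before 1.22.1.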

Andrew Love (andrew-love) wrote :

I have the exact same issue in an environment I cannot destroy.

I am now at juju version 1.22.1-utopic on a management server. Should this version have the fix included?

How do I fix an existing state server (the bootstrap server) in a non-destructive fashion? At the moment I cannot get the state server service to listen on port 17070 due to the error loop:

2015-04-14 13:47:10 DEBUG juju.mongo open.go:122 TLS handshake failed: x509: certificate is valid for localhost, juju-apiserver, cloud-node-03.maas, not juju-mongodb
2015-04-14 13:47:11 INFO juju.worker runner.go:261 start "api"
2015-04-14 13:47:11 INFO juju.api apiclient.go:252 dialing "wss://localhost:17070/"
2015-04-14 13:47:11 INFO juju.api apiclient.go:260 error dialing "wss://localhost:17070/": websocket.Dial wss://localhost:17070/: dial tcp 127.0.0.1:17070: connection refused
2015-04-14 13:47:11 ERROR juju.worker runner.go:219 exited "api": unable to connect to "wss://localhost:17070/"
2015-04-14 13:47:11 INFO juju.worker runner.go:253 restarting "api" in 3s

Andrew Love (andrew-love) wrote :

As an addition to the above, the state server contains the following in /var/lib/juju/tools/machine-0/downloaded-tools.txt :

{"version":"1.22.0-trusty-amd64","url":"https://streams.canonical.com/juju/tools/releases/juju-1.22.0-trusty-amd64.tgz","sha256":"ea1d9d1af149823a931b16091174fccbcd770f77c77888e50586477c8b0c7892","size":9524428}

If my juju client (management server) is at 1.22.1 and my state server is at 1.22.0, how does one push juju upgrades to the state server?
(This may be the same answer as the question I asked above.)
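
The version gap described above can be read directly from downloaded-tools.txt. The sketch below parses the JSON quoted in this comment and compares its release number with the client version reported earlier (1.22.1-utopic). `TOOLS_JSON`, `release_number` and `tools_behind_client` are hypothetical names. It only reports the difference and does not upgrade anything.

```python
import json

# Contents of /var/lib/juju/tools/machine-0/downloaded-tools.txt as quoted above.
TOOLS_JSON = (
    '{"version":"1.22.0-trusty-amd64",'
    '"url":"https://streams.canonical.com/juju/tools/releases/juju-1.22.0-trusty-amd64.tgz",'
    '"sha256":"ea1d9d1af149823a931b16091174fccbcd770f77c77888e50586477c8b0c7892",'
    '"size":9524428}'
)


def release_number(version):
    """Turn '1.22.0-trusty-amd64' or '1.22.1-utopic' into a tuple of ints."""
    return tuple(int(part) for part in version.split("-")[0].split("."))


def tools_behind_client(tools_json, client_version):
    """True if the state server tools are older than the client."""
    tools_version = json.loads(tools_json)["version"]
    return release_number(tools_version) < release_number(client_version)
```

With the values from this comment, `tools_behind_client(TOOLS_JSON, "1.22.1-utopic")` is True: the state server tools are at 1.22.0 while the client is at 1.22.1.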
