backup restore timeout restarting agent

Bug #1304116 reported by Curtis Hovey on 2014-04-08
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
juju-core
High
Horacio Durán

Bug Description

Between r2535 and r2545 A behaviour changed that broke juju restore. Restore reports success. It does an update to the units in the services and exits. The test call juju status checking the status of the machine and unit agents. We see machine 1 and unit ubuntu/0 restart. Machine 0's agent does not return.

restored bootstrap machine
opening state
updating all machines
updating machine: 1
ssh command succedded: "jujud-machine-1 stop/waiting\njujud-machine-1 start/running, process 13871\njujud-unit-ubuntu-0 stop/waiting\njujud-unit-ubuntu-0 start/running, process 13876\n"
updated machine 1
down: 0, 1, ubuntu/0
...
down: 0
...
Timed out waiting for agents to start in test-function-hp-1.

Related branches

Curtis Hovey (sinzui) wrote :

This issue happen as the fix for bug 1291022 was being prepared. This new behaviour affects slow and fast networks.

Curtis Hovey (sinzui) wrote :

I captured a few calls to status as the test was run. This warning looks like the env was changed, not properly restored after the restore:
    WARNING unknown config field "proxy-ssh"

Runnung "Juju ssh 0" fails
    ERROR state/api: websocket.Dial wss://10.0.3.1:17072/: dial tcp 10.0.3.1:17072: connection refused

ssh -i ~/cloud-city/staging-juju-rsa ec2-54-209-166-213.compute-1.amazonaws.com
    ssh: connect to host ec2-54-209-166-213.compute-1.amazonaws.com port 22: Connection timed out

Using the aws console to get the public IP, I tried to ssh with the key tahand was immediately rejected, though that
    ssh -i ~/cloud-city/staging-juju-rsa 54.209.157.232
    Permission denied (publickey)

BEFORE BACKUP
$ juju status -e test-function-aws-1
environment: test-function-aws-1
machines:
  "0":
    agent-state: started
    agent-version: 1.19.0
    dns-name: ec2-54-209-166-213.compute-1.amazonaws.com
    instance-id: i-c334b0e0
    instance-state: running
    series: precise
    hardware: arch=amd64 cpu-cores=1 cpu-power=300 mem=3840M root-disk=8192M
  "1":
    agent-state: started
    dns-name: ec2-54-209-96-248.compute-1.amazonaws.com
    instance-id: i-0331b520
    instance-state: running
    series: precise
    hardware: arch=amd64 cpu-cores=1 cpu-power=300 mem=3840M root-disk=8192M
services:
  ubuntu:
    charm: local:precise/ubuntu-1
    exposed: false
    units:
      ubuntu/0:
        agent-state: started
        machine: "1"
        public-address: ec2-54-209-96-248.compute-1.amazonaws.com

AFTER RESTORE.
$ juju status -e test-function-aws-1
WARNING unknown config field "proxy-ssh"
environment: test-function-aws-1
machines:
  "0":
    agent-state: down
    agent-state-info: (started)
    agent-version: 1.19.0
    dns-name: ec2-54-209-166-213.compute-1.amazonaws.com
    instance-id: i-bb3fbb98
    instance-state: running
    series: precise
    hardware: arch=amd64 cpu-cores=1 cpu-power=300 mem=3840M root-disk=8192M
  "1":
    agent-state: started
    agent-version: 1.19.0
    dns-name: ec2-54-209-96-248.compute-1.amazonaws.com
    instance-id: i-0331b520
    instance-state: running
    series: precise
    hardware: arch=amd64 cpu-cores=1 cpu-power=300 mem=3840M root-disk=8192M
services:
  ubuntu:
    charm: local:precise/ubuntu-1
    exposed: false
    units:
      ubuntu/0:
        agent-state: started
        agent-version: 1.19.0
        machine: "1"
        public-address: ec2-54-209-96-248.compute-1.amazonaws.com

Curtis Hovey (sinzui) wrote :

Small oops in the previous message. I forgot to be user ubuntu when I tried the public IP from the console. I could get in and found these procs running.

ssh -i ~/cloud-city/id_rsa ubuntu@54.84.28.104
ubuntu@ip-172-31-32-176:~$ ps ax | grep juju
15499 ? Ssl 0:06 /usr/bin/mongod --auth --dbpath=/var/lib/juju/db --sslOnNormalPorts --sslPEMKeyFile /var/lib/juju/server.pem --sslPEMKeyPassword xxxxxxx --bind_ip 0.0.0.0 --port 37017 --noprealloc --syslog --smallfiles
15542 ? Rsl 0:22 /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju --machine-id 0 --debug

I captures some logs

PS ssh -i ~/cloud-city/id_rsa <email address hidden> times out

Curtis Hovey (sinzui) wrote :

The all-machines log from the restores state-server machine.

Curtis Hovey (sinzui) wrote :

Log of just the new state-server.

Curtis Hovey (sinzui) on 2014-04-11
Changed in juju-core:
importance: Critical → High
Curtis Hovey (sinzui) on 2014-04-11
Changed in juju-core:
milestone: 1.19.1 → 1.19.0
Curtis Hovey (sinzui) on 2014-04-14
Changed in juju-core:
milestone: 1.19.0 → 1.19.1
Curtis Hovey (sinzui) wrote :

Starting with lp:juju-core r2627, the failure has changed. The restore plugin does not restore after the new bootstrap node is brought up.

Fetching tools: curl -sSfw 'tools from %{url_effective} downloaded: HTTP %{http_code}; time %{time_total}s; size %{size_download} bytes; speed %{speed_download} bytes/s ' -o $bin/tools.tar.gz 'http://juju-dist.s3.amazonaws.com/testing/tools/releases/juju-1.19.0-precise-amd64.tgz'
Bootstrapping Juju machine agent
Starting Juju machine agent (jujud-machine-0)
error: cannot open state: no reachable servers

Curtis Hovey (sinzui) wrote :

juju restored fails with this message
    error: cannot open state: no reachable servers
    ERROR exit status 1

This is what is happening on the new bootstrap machine.

2014-04-15 21:15:19 INFO juju.state.api apiclient.go:209 error dialing "wss://10.3.203.200:17070/": websocket.Dial wss://10.3.203.200:17070/: dial tcp 10.3.203.200:17070: no route to host
2014-04-15 21:15:19 ERROR juju runner.go:220 worker: exited "api": timed out connecting to "wss://10.3.203.200:17070/"
2014-04-15 21:15:19 INFO juju runner.go:254 worker: restarting "api" in 3s
2014-04-15 21:15:22 INFO juju runner.go:262 worker: start "api"
2014-04-15 21:15:22 INFO juju.state.api apiclient.go:201 dialing "wss://10.3.203.200:17070/"

Attached is the log from the new machine.

Curtis Hovey (sinzui) wrote :

All test results are available at
    http://ec2-54-84-137-170.compute-1.amazonaws.com:8080/job/functional-backup-restore-devel/
Each test has a link to the console output.

The most recent test I see
    http://ec2-54-84-137-170.compute-1.amazonaws.com:8080/job/functional-backup-restore-devel/641/console
shows that lp:juju-core r2657 backup succeeded, but restore failed for precise.

John A Meinel (jameinel) on 2014-04-23
Changed in juju-core:
assignee: nobody → Horacio Durán (hduran-8)
Curtis Hovey (sinzui) on 2014-04-25
Changed in juju-core:
milestone: 1.19.1 → 1.20.0
Changed in juju-core:
milestone: 1.20.0 → next-stable
Go Bot (go-bot) on 2014-05-13
Changed in juju-core:
status: Triaged → Fix Committed
Curtis Hovey (sinzui) on 2014-05-14
Changed in juju-core:
milestone: next-stable → 1.19.3
Curtis Hovey (sinzui) on 2014-05-30
Changed in juju-core:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers