juju restore fails with "/var/lib/juju/agents: No such file or directory"

Bug #1431372 reported by Aaron Bentley
This bug affects 1 person
Affects     Status        Importance  Assigned to    Milestone
juju-core   Fix Released  High        Horacio Durán
1.24        Fix Released  High        Horacio Durán

Bug Description

Juju restore can die with "cd: /var/lib/juju/agents: No such file or directory"

Since juju restore is responsible for placing /var/lib/juju/agents on the machine, it should not blindly assume that it is installed.

log: http://data.vapour.ws/juju-ci/products/version-2440/functional-backup-restore/build-2334/consoleText

Full error:
ERROR failed to update machine 1: ssh command failed: ("Warning: Permanently added 'juju-functional-backup-restore-l1lbl7q7m0.cloudapp.net,104.45.213.33' (ECDSA) to the list of known hosts.\r\n+ cd /var/lib/juju/agents\nbash: line 2: cd: /var/lib/juju/agents: No such file or directory\n"): subprocess encountered error code 1
error: cannot update machines: machine update failed: ssh command failed: ("Warning: Permanently added 'juju-functional-backup-restore-l1lbl7q7m0.cloudapp.net,104.45.213.33' (ECDSA) to the list of known hosts.\r\n+ cd /var/lib/juju/agents\nbash: line 2: cd: /var/lib/juju/agents: No such file or directory\n"): subprocess encountered error code 1
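The failing step in the trace above is the bare `cd /var/lib/juju/agents`. A minimal sketch of the kind of guard the description asks for follows. The function name `check_agents_dir` is hypothetical, and this is not the restore script's actual code; it only illustrates testing for the directory before changing into it.

```shell
#!/bin/bash
# Hypothetical guard: report a missing agents directory explicitly
# instead of letting a bare `cd` abort the remote script.
# The default path is the one named in the error message above.
check_agents_dir() {
    local agents_dir="${1:-/var/lib/juju/agents}"
    if [ ! -d "$agents_dir" ]; then
        echo "agents directory missing: $agents_dir" >&2
        return 1
    fi
    echo "agents directory present: $agents_dir"
}
```

A caller could run `check_agents_dir && cd /var/lib/juju/agents`, and decide separately what to do when the check fails.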

Alex Kang (thkang0) wrote :

I get the same error when I restore the environment.

The juju version I am using is 1.21.3.

Eric Snow (ericsnowcurrently) wrote :

Yeah, this looks a lot like lp:1434437.

Changed in juju-core:
assignee: nobody → Horacio Durán (hduran-8)
Horacio Durán (hduran-8) wrote :

This is different. From what I can see, the error in this case happens when updating the units: the restore script SSHes into each unit and updates agent.conf to point to the newly created API server.
Perhaps this should just be logged as an error, and the script should continue updating so that as many machines as possible get updated, since we cannot go back at this point.
As a side note, if /var/lib/juju/agents is not there, something is broken in the unit; perhaps it is still coming up?
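The log-and-continue behaviour suggested in this comment can be sketched as a loop that records failures instead of stopping at the first one. This is an illustration only: `update_all` and the command passed to it are hypothetical placeholders, not juju's restore code, and the real per-machine step would be the ssh update of agent.conf.

```shell
#!/bin/bash
# update_all CMD MACHINE...: run CMD once per machine, log each failure
# to stderr, and keep going. CMD is a placeholder for the real
# per-machine update step (assumption: it takes the machine id as $1).
update_all() {
    local update_cmd="$1"; shift
    local failed=0
    local machine
    for machine in "$@"; do
        if ! "$update_cmd" "$machine"; then
            echo "ERROR failed to update machine $machine, continuing" >&2
            failed=$((failed + 1))
        fi
    done
    echo "machines failed: $failed"
    # Non-zero exit status if any machine could not be updated.
    [ "$failed" -eq 0 ]
}
```

The function still returns a non-zero status when any update failed, so the caller can report a partial restore without having abandoned the remaining machines.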

Alex Kang (thkang0) wrote :

Yes, there is no /var/lib/juju/agents on the machines.
As a result, juju status shows all of the machines as "down".

Horacio Durán (hduran-8) wrote :

Can you please list the contents of /var/lib/juju, if any? I am a bit puzzled by the fact that there is no /var/lib/juju/agents (in the context of this error, this actually happened in one of the units).

Alex Kang (thkang0) wrote :

The juju-core version I tested before is deprecated.
Now that 1.22 has been released, I tested backup and restore again.

The result looks the same as before.

Backup is fine, but restore fails with errors. The logs are below:

Installing package: cloud-image-utils
Fetching tools: curl -sSfw 'tools from %{url_effective} downloaded: HTTP %{http_code}; time %{time_total}s; size %{size_download} bytes; speed %{speed_download} bytes/s ' --retry 10 -o $bin/tools.tar.gz <[https://streams.canonical.com/juju/tools/releases/juju-1.22.0-trusty-amd64.tgz]>
Bootstrapping Juju machine agent
Starting Juju machine agent (jujud-machine-0)
2015-04-07 02:02:41 INFO juju.cmd cmd.go:113 Bootstrap complete
connecting to newly bootstrapped instance
2015-04-07 02:02:41 DEBUG juju.environs utils.go:93 StateServerInstances returned: [/MAAS/api/1.0/nodes/node-2389fbd2-d28f-11e4-8436-525400b3849e/]
2015-04-07 02:02:41 INFO juju.api apiclient.go:252 dialing "wss://bootstrap03.maas:17070/"
2015-04-07 02:02:41 DEBUG juju.api apiclient.go:258 error dialing "wss://bootstrap03.maas:17070/", will retry: websocket.Dial wss://bootstrap03.maas:17070/: dial tcp 10.100.1.151:17070: connection refused
.......................................................................................... ==> same error
2015-04-07 02:14:04 INFO juju.api apiclient.go:260 error dialing "wss://10.100.1.151:17070/": websocket.Dial wss://10.100.1.151:17070/: dial tcp 10.100.1.151:17070: connection refused
opening state
error: cannot connect to api server: unable to connect to "wss://10.100.1.151:17070/"

And this is the log from the juju state machine:
015-04-07 02:29:11 DEBUG juju.mongo open.go:122 TLS handshake failed: x509: certificate is valid for localhost, juju-apiserver, bootstrap05.maas, not juju-mongodb

It is the same error I posted before.

ubuntu@bootstrap03:~$ ls /var/lib/juju
agents db nonce.txt server.pem shared-secret system-identity tools

There is no juju directory on the machines:
ubuntu@ceph01:~$ ls /var/lib/juju
ls: cannot access /var/lib/juju: No such file or directory

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.24-alpha1 → 1.24.0
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.24.0 → 1.25.0
Horacio Durán (hduran-8) wrote :

I checked on an OpenStack deployed with juju 1.20.14: all of the ceph nodes had a /var/lib/juju directory.
I used:
 juju run --service=ceph 'ls -l /var/lib/juju/agents/'
Nevertheless, I will add an exception for this. We should not stop restoring because one of the machines is not in shape: this is beyond the point of no return, so we should do as much as we can. There are many reasons why a machine could be in a weird state, and it is quite possible that the integrity of the services is fine and that our backup includes machines that were removed afterwards.

Eric Snow (ericsnowcurrently) wrote :

As indicated in lp:1434437, the following line from the log indicates a bug that has been fixed in 1.22.1:

015-04-07 02:29:11 DEBUG juju.mongo open.go:122 TLS handshake failed: x509: certificate is valid for localhost, juju-apiserver, bootstrap05.maas, not juju-mongodb

Horacio Durán (hduran-8) wrote :

@ericsnowcurrently indeed, I was talking about the initial bug of this report.

Horacio Durán (hduran-8) wrote :
Ian Booth (wallyworld)
Changed in juju-core:
status: Triaged → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released