manual provisioned systems stuck in pending on arm64

Bug #1302205 reported by dann frazier on 2014-04-03
16
This bug affects 3 people
Affects Status Importance Assigned to Milestone
juju-core
Critical
Ian Booth
1.18
Critical
Ian Booth
juju-core (Ubuntu)
Undecided
Unassigned
Trusty
High
Unassigned

Bug Description

This is on arm64, juju-core version 1.17.7-0ubuntu1.

I setup some arm64 systems (VMs running the trusty cloud image) to test with juju's manual provider. I was able to bootstrap w/o problem, and able to deploy the juju-gui to the bootstrap node. I then add-machined a couple other VMs, and this reported success. I can login into these systems and see jujud running. However, these machines never leave the "agent-state: pending" state.

juju debug-log is emitting the following messages periodically:

ubuntu@ms01a:~$ juju debug-log
machine-0: 2014-04-03 20:26:32 WARNING juju.worker.instanceupdater updater.go:231 cannot get instance info for instance "manual:10.0.128.7": no instances found
machine-0: 2014-04-03 20:27:12 WARNING juju.worker.instanceupdater updater.go:231 cannot get instance info for instance "manual:10.0.128.8": no instances found
machine-0: 2014-04-03 20:43:36 WARNING juju.worker.instanceupdater updater.go:231 cannot get instance info for instance "manual:10.0.128.7": no instances found
machine-0: 2014-04-03 20:44:16 WARNING juju.worker.instanceupdater updater.go:231 cannot get instance info for instance "manual:10.0.128.8": no instances found
machine-0: 2014-04-03 21:00:40 WARNING juju.worker.instanceupdater updater.go:231 cannot get instance info for instance "manual:10.0.128.7": no instances found
machine-0: 2014-04-03 21:01:20 WARNING juju.worker.instanceupdater updater.go:231 cannot get instance info for instance "manual:10.0.128.8": no instances found
machine-0: 2014-04-03 21:17:44 WARNING juju.worker.instanceupdater updater.go:231 cannot get instance info for instance "manual:10.0.128.7": no instances found
machine-0: 2014-04-03 21:18:24 WARNING juju.worker.instanceupdater updater.go:231 cannot get instance info for instance "manual:10.0.128.8": no instances found
machine-0: 2014-04-03 21:34:48 WARNING juju.worker.instanceupdater updater.go:231 cannot get instance info for instance "manual:10.0.128.7": no instances found
machine-0: 2014-04-03 21:35:28 WARNING juju.worker.instanceupdater updater.go:231 cannot get instance info for instance "manual:10.0.128.8": no instances found

Related branches

dann frazier (dannf) wrote :
dann frazier (dannf) wrote :
dann frazier (dannf) wrote :
dann frazier (dannf) wrote :

ubuntu@ms01a:~$ juju status
environment: manual
machines:
  "0":
    agent-state: started
    agent-version: 1.17.7
    dns-name: 10.0.128.4
    instance-id: 'manual:'
    series: trusty
    hardware: arch=arm64 cpu-cores=1 mem=1895M
  "1":
    agent-state: pending
    agent-version: 1.17.7
    dns-name: 10.0.128.7
    instance-id: manual:10.0.128.7
    series: trusty
    hardware: arch=arm64 cpu-cores=1 mem=1895M
  "2":
    agent-state: pending
    agent-version: 1.17.7
    dns-name: 10.0.128.8
    instance-id: manual:10.0.128.8
    series: trusty
    hardware: arch=arm64 cpu-cores=1 mem=1895M
services:
  juju-gui:
    charm: local:trusty/juju-gui-108
    exposed: true
    units:
      juju-gui/0:
        agent-state: started
        agent-version: 1.17.7
        machine: "0"
        open-ports:
        - 80/tcp
        - 443/tcp
        public-address: 10.0.128.4
  mysql:
    charm: local:trusty/mysql-311
    exposed: false
    relations:
      cluster:
      - mysql
    units:
      mysql/0:
        agent-state: pending
        machine: "1"
        public-address: 10.0.128.7

Sean Feole (sfeole) wrote :

I've seen this before, #Bug 1280953 ?

Curtis Hovey (sinzui) on 2014-04-04
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
tags: added: add-machine manual-provider
Curtis Hovey (sinzui) on 2014-04-04
Changed in juju-core:
importance: High → Critical
milestone: none → 1.18.0
Kapil Thangavelu (hazmat) wrote :

the debug-log output here is normal fwiw on manual provider. it looks odd though as the machines do connect back to state server, but are listed as pending in status.

dann frazier (dannf) wrote :

On Kapil's suggestion, I tried logging into a pending node and restarting the machine agent. This worked:

ubuntu@ms01a:~$ juju status
environment: manual
machines:
  "0":
    agent-state: started
    agent-version: 1.17.7
    dns-name: 10.0.128.4
    instance-id: 'manual:'
    series: trusty
    hardware: arch=arm64 cpu-cores=1 mem=1895M
  "1":
    agent-state: started
    agent-version: 1.17.7
    dns-name: 10.0.128.7
    instance-id: manual:10.0.128.7
    series: trusty
    hardware: arch=arm64 cpu-cores=1 mem=1895M
services: {}

Quesiton is - why does it always seem to fail the first time?

Kapil Thangavelu (hazmat) wrote :

so there's a panic on the state server here which seems to be interfering with the normal registration of the machines, bouncing the machine agents on the pending machines post the panic seems to have worked around the issue.

dann frazier (dannf) wrote :

Workaround seems to have just been temporary. I added a second node, then ran juju status, and both non-bootstrap nodes now appear to be down:

ubuntu@ms01a:~$ juju status
environment: manual
machines:
  "0":
    agent-state: started
    agent-version: 1.17.7
    dns-name: 10.0.128.4
    instance-id: 'manual:'
    series: trusty
    hardware: arch=arm64 cpu-cores=1 mem=1895M
  "1":
    agent-state: down
    agent-state-info: (started)
    agent-version: 1.17.7
    dns-name: 10.0.128.7
    instance-id: manual:10.0.128.7
    series: trusty
    hardware: arch=arm64 cpu-cores=1 mem=1895M
  "2":
    agent-state: down
    agent-state-info: (started)
    agent-version: 1.17.7
    dns-name: 10.0.128.8
    instance-id: manual:10.0.128.8
    series: trusty
    hardware: arch=arm64 cpu-cores=1 mem=1895M
services: {}

dann frazier (dannf) wrote :
dann frazier (dannf) wrote :
dann frazier (dannf) wrote :
dann frazier (dannf) wrote :

Sorry, last machine-0 log was out of date - here's the current one.

Curtis Hovey (sinzui) on 2014-04-05
Changed in juju-core:
milestone: 1.18.0 → 1.19.0
John A Meinel (jameinel) on 2014-04-07
summary: - manual provisioned systems stuck in pending
+ manual provisioned systems stuck in pending on arm64
Changed in juju-core:
milestone: 1.19.0 → 1.19.1
dann frazier (dannf) wrote :

I've been trying to debug this on my own, but I'm enough of a novice that I'm not really sure where to start. Here's a couple things I've noticed..

I do see that that machine-1's jujud makes a connection to the bootstrap node that remains open for a while, but that socket later closes and never appears to reconnect. This seems to be around the time the agent-state switches to down.

I see this error on the bootstrap node (whether or not I specify the port in my environments.yaml):
/var/log/juju/machine-0.log:2014-04-08 17:08:25 ERROR juju.worker environ.go:56 loaded invalid environment configuration: storage-port: expected int, got float64(8040)

I tried to reproduce this using EC2-provisioned VMs, but the problem was not reproducible there.

dann frazier (dannf) wrote :

Another observation - when I deploy a charm (tried juju-gui and mysql), I see errors like this in the unit log:

/bin/sh: 1: exec: /var/lib/juju/tools/unit-mysql-0/jujud: not found

Ian Booth (wallyworld) wrote :

These warnings:

machine-0: 2014-04-03 21:35:28 WARNING juju.worker.instanceupdater updater.go:231 cannot get instance info for instance "manual:10.0.128.8": no instances found

are log noise related to the instance poller waiting for the machine to be marked as provisioned. They appear to be symptoms rather than the root cause. There's another bug raised to deal with this: bug 1304132

Ian Booth (wallyworld) wrote :

I've found the cause for one of the errors shown in the logs:

2014-04-04 16:31:59 ERROR juju.worker environ.go:56 loaded invalid environment configuration: storage-port: expected int, got float64(8040)

There's a configuration issue that needs to be fixed for manual provider.
I'm not yet certain if this error prevents the various moving parts from initialising correctly in order to cause the observed symptoms, but the error needs to be fixed nonetheless.

On Tue, Apr 8, 2014 at 6:58 PM, Ian Booth <email address hidden> wrote:
> I've found the cause for one of the errors shown in the logs:
>
> 2014-04-04 16:31:59 ERROR juju.worker environ.go:56 loaded invalid
> environment configuration: storage-port: expected int, got float64(8040)
>
> There's a configuration issue that needs to be fixed for manual provider.
> I'm not yet certain if this error prevents the various moving parts from initialising correctly in order to cause the observed symptoms, but the error needs to be fixed nonetheless.

Thanks Ian.

If direct access to the systems in question would help, grab me on IRC
(dannf). We've got an openstack setup where its relatively easy/quick to launch
new instances.

Ian Booth (wallyworld) on 2014-04-09
Changed in juju-core:
assignee: nobody → Ian Booth (wallyworld)
status: Triaged → In Progress
Ian Booth (wallyworld) on 2014-04-09
Changed in juju-core:
milestone: 1.19.1 → 1.19.0

I observe exactly the same issue with juju/custom-openstack.

Juju-status report: https://pastebin.canonical.com/108228/ (agent-state is pending for both mysql and wordpress).

ubuntu@juju-openstack-machine-2 $ cat /var/log/juju/unit-wordpress-0.log
/bin/sh: 1: exec: /var/lib/juju/tools/unit-wordpress-0/jujud: not found
<...>
/bin/sh: 1: exec: /var/lib/juju/tools/unit-wordpress-0/jujud: not found

Interesting observation is that I have jujud downloaded but to a different folder:

ubuntu@juju-openstack-machine-2 $ ls /var/lib/juju/tools/machine-2
downloaded-tools.txt jujud

John A Meinel (jameinel) wrote :

There is supposed to be a symlink of /var/lib/juju/tools => /var/lib/juju/$JUJU_VERSION_NUMBER which would provide the tools that a given agent is supposed to use.

The machine agent is responsible for creating the Unit agent job and for creating the symlink for what tools that agent should be using. It looks like there is a bug where some of the set up (getting a unit agent script configured in upstart) is completed, but the rest (creating the tools symlink) is not.

Ian, can you give an update where you feel this is at?

Changed in juju-core:
milestone: 1.19.0 → 1.19.1
Ian Booth (wallyworld) wrote :

Current status is that 3 bugs affecting manual provisioning on all providers have been fixed. I've been fighting with gcc-go on ARM trying to reproduce the issue after the bug fixes. Turns out there's an incompatibility between gcc-go and ARM and trusty which I've been told a workaround for. So I can continue to look into the issue. Hopefully I'll know more soon.

Curtis Hovey (sinzui) on 2014-04-11
Changed in juju-core:
milestone: 1.19.1 → 1.19.0
Ian Booth (wallyworld) wrote :

I've verified that this appears to be fixed now deploying using the manual provider on ARM.

Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui) on 2014-04-15
Changed in juju-core:
status: Fix Committed → Fix Released
Robie Basak (racb) wrote :

I think I hit this on 1.18.1-0ubuntu1 on Trusty. Adding an LXC container on the manual provider failed. It appears fixed in ppa:juju/stable in 1.18.2-0ubuntu1~14.04.1~juju1. Adding a Trusty task.

Changed in juju-core (Ubuntu Trusty):
importance: Undecided → High
Curtis Hovey (sinzui) on 2015-04-06
Changed in juju-core (Ubuntu):
status: New → Fix Released
Changed in juju-core (Ubuntu Trusty):
status: New → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers