Machines fail in 'down' state during bundle deployment

Bug #1626484 reported by Gary Mackenzie on 2016-09-22
This bug affects 2 people
Affects: juju | Importance: High | Assigned to: Unassigned
Affects: 2.0 | Importance: Undecided | Assigned to: Unassigned

Bug Description

When deploying a bundle (multiple bundles tried) using Juju 2 RC1 to AWS (multiple regions tried), the requested machines intermittently fail to provision and end up in the 'down' state, as shown below:

MACHINE STATE DNS INS-ID SERIES AZ
0 started 54.170.48.100 i-9f398b12 trusty eu-west-1b
1 started 54.216.208.71 i-280dfb19 trusty eu-west-1a
2 down pending trusty
3 down pending trusty
4 down pending trusty
5 down pending trusty
6 down pending trusty

I have tried:

- Destroying and recreating controller in a different AWS region
- Multiple bundles (apache-processing-mapreduce and bigtop-processing-mapreduce specifically tested).

Have not tried:

- Different providers

Full juju status:

garym@latitude:/$ juju status
MODEL CONTROLLER CLOUD/REGION VERSION
mapreduce osd-nordics aws/eu-west-1 2.0-rc1

APP VERSION STATUS SCALE CHARM STORE REV OS NOTES
client waiting 1 hadoop-client jujucharms 3 ubuntu
ganglia unknown 1 ganglia jujucharms 2 ubuntu
ganglia-node waiting 0 ganglia-node jujucharms 2 ubuntu
namenode waiting 0/1 apache-bigtop-namenode jujucharms 13 ubuntu
plugin blocked 1 apache-bigtop-plugin jujucharms 9 ubuntu
resourcemanager waiting 0/1 apache-bigtop-resourcemanager jujucharms 12 ubuntu
slave waiting 0/3 apache-bigtop-slave jujucharms 11 ubuntu

UNIT WORKLOAD AGENT MACHINE PUBLIC-ADDRESS PORTS MESSAGE
client/0 waiting idle 0 54.170.48.100 Waiting for Plugin to become ready
  plugin/0 blocked idle 54.170.48.100 missing required namenode relation
ganglia/0 unknown idle 1 54.216.208.71 80/tcp
namenode/0 waiting allocating 2 waiting for machine
resourcemanager/0 waiting allocating 3 waiting for machine
slave/0 waiting allocating 4 waiting for machine
slave/1 waiting allocating 5 waiting for machine
slave/2 waiting allocating 6 waiting for machine

MACHINE STATE DNS INS-ID SERIES AZ
0 started 54.170.48.100 i-9f398b12 trusty eu-west-1b
1 started 54.216.208.71 i-280dfb19 trusty eu-west-1a
2 down pending trusty
3 down pending trusty
4 down pending trusty
5 down pending trusty
6 down pending trusty

RELATION PROVIDES CONSUMES TYPE
hadoop-plugin client plugin subordinate
node ganglia ganglia-node regular
juju-info ganglia-node namenode regular
juju-info ganglia-node resourcemanager regular
juju-info ganglia-node slave regular
juju-info namenode ganglia-node subordinate
namenode namenode plugin regular
namenode namenode resourcemanager regular
namenode namenode slave regular
resourcemanager plugin resourcemanager regular
juju-info resourcemanager ganglia-node subordinate
resourcemanager resourcemanager slave regular
juju-info slave ganglia-node subordinate

No related logs in juju debug-log during deployment.

Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.0.0
Changed in juju:
assignee: nobody → Richard Harding (rharding)
Richard Harding (rharding) wrote :

Can you please provide the output of juju status --format=yaml and the Juju logs from the controller?

Gary Mackenzie (5ello) wrote :

As requested:

garym@latitude:~$ juju status --format=yaml
model:
  name: mapreduce
  controller: osd-nordics
  cloud: aws
  region: eu-west-1
  version: 2.0-rc1
machines:
  "0":
    juju-status:
      current: started
      since: 22 Sep 2016 11:13:17+01:00
      version: 2.0-rc1
    dns-name: 54.170.48.100
    instance-id: i-9f398b12
    machine-status:
      current: running
      message: running
      since: 22 Sep 2016 11:11:34+01:00
    series: trusty
    hardware: arch=amd64 cores=1 cpu-power=300 mem=3840M root-disk=8192M availability-zone=eu-west-1b
  "1":
    juju-status:
      current: started
      since: 22 Sep 2016 11:13:18+01:00
      version: 2.0-rc1
    dns-name: 54.216.208.71
    instance-id: i-280dfb19
    machine-status:
      current: running
      message: running
      since: 22 Sep 2016 11:11:30+01:00
    series: trusty
    hardware: arch=amd64 cores=1 cpu-power=300 mem=3840M root-disk=8192M availability-zone=eu-west-1a
  "2":
    juju-status:
      current: down
      message: agent is not communicating with the server
      since: 22 Sep 2016 11:31:30+01:00
    instance-id: pending
    machine-status:
      current: pending
      since: 22 Sep 2016 11:11:06+01:00
    series: trusty
  "3":
    juju-status:
      current: down
      message: agent is not communicating with the server
      since: 22 Sep 2016 11:32:29+01:00
    instance-id: pending
    machine-status:
      current: pending
      since: 22 Sep 2016 11:11:06+01:00
    series: trusty
  "4":
    juju-status:
      current: down
      message: agent is not communicating with the server
      since: 22 Sep 2016 11:13:03+01:00
    instance-id: pending
    machine-status:
      current: pending
      since: 22 Sep 2016 11:11:06+01:00
    series: trusty
  "5":
    juju-status:
      current: down
      message: agent is not communicating with the server
      since: 22 Sep 2016 11:13:39+01:00
    instance-id: pending
    machine-status:
      current: pending
      since: 22 Sep 2016 11:11:07+01:00
    series: trusty
  "6":
    juju-status:
      current: down
      message: agent is not communicating with the server
      since: 22 Sep 2016 11:14:15+01:00
    instance-id: pending
    machine-status:
      current: pending
      since: 22 Sep 2016 11:11:07+01:00
    series: trusty
applications:
  client:
    charm: cs:trusty/hadoop-client-3
    series: trusty
    os: ubuntu
    charm-origin: jujucharms
    charm-name: hadoop-client
    charm-rev: 3
    can-upgrade-to: cs:trusty/hadoop-client-5
    exposed: false
    application-status:
      current: waiting
      message: Waiting for Plugin to become ready
      since: 29 Sep 2016 06:11:22+01:00
    relations:
      hadoop:
      - plugin
    units:
      client/0:
        workload-status:
          current: waiting
          message: Waiting for Plugin to become ready
          since: 29 Sep 2016 06:11:22+01:00
        juju-status:
          current: idle
          since: 29 Sep 2016 06:11:22+01:00
          version: 2.0-rc1
        machine: "0"
        public-address: 54.170.48.100
        subordinates:
          plugin/0:
            workload-status:
              current: blocked
              message: mis...

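For anyone triaging similar output, here is a minimal sketch of how the stuck machines could be picked out of the status programmatically. It assumes that `juju status --format=json` uses the same key names as the YAML pasted above (`machines`, `juju-status`, `current`, `instance-id`, `message`); this has not been checked against 2.0-rc1. The names `down_machines` and `load_status` are hypothetical helpers, not part of Juju.

```python
import json
import subprocess


def load_status():
    """Run `juju status --format=json` and return the parsed document.

    Assumption: the JSON keys mirror the YAML keys shown in this report.
    """
    out = subprocess.run(
        ["juju", "status", "--format=json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(out)


def down_machines(status):
    """Return (machine id, instance id, message) for every machine whose
    agent status is 'down'."""
    result = []
    for machine_id, machine in sorted(status.get("machines", {}).items()):
        agent = machine.get("juju-status", {})
        if agent.get("current") == "down":
            result.append(
                (machine_id, machine.get("instance-id"), agent.get("message"))
            )
    return result


# Trimmed sample built from the values pasted in this report.
sample_status = {
    "machines": {
        "0": {
            "juju-status": {"current": "started"},
            "instance-id": "i-9f398b12",
        },
        "2": {
            "juju-status": {
                "current": "down",
                "message": "agent is not communicating with the server",
            },
            "instance-id": "pending",
        },
    }
}

for entry in down_machines(sample_status):
    print(entry)
```

Against a live controller, `down_machines(load_status())` would be the equivalent call. A machine that is reported as down while its instance id is still `pending` matches the pattern described in this report: the instance was never started, so there is no agent to report in.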

Gary Mackenzie (5ello) wrote :

I have also tested and reproduced on Azure, logs below:

garym@latitude:~$ juju status --format=yaml
model:
  name: bigtop
  controller: osd-nordics-azure
  cloud: azure
  region: northeurope
  version: 2.0-rc1
machines:
  "0":
    juju-status:
      current: down
      message: agent is not communicating with the server
      since: 22 Sep 2016 16:14:37+01:00
    instance-id: pending
    machine-status:
      current: pending
      since: 22 Sep 2016 16:09:41+01:00
    series: trusty
  "1":
    juju-status:
      current: started
      since: 22 Sep 2016 16:24:54+01:00
      version: 2.0-rc1
    dns-name: 52.169.11.116
    instance-id: machine-1
    machine-status:
      current: running
      since: 22 Sep 2016 16:24:49+01:00
    series: trusty
    hardware: arch=amd64 cores=1 mem=1792M root-disk=30720M
  "2":
    juju-status:
      current: down
      message: agent is not communicating with the server
      since: 22 Sep 2016 16:11:26+01:00
    instance-id: pending
    machine-status:
      current: pending
      since: 22 Sep 2016 16:09:46+01:00
    series: trusty
  "3":
    juju-status:
      current: down
      message: agent is not communicating with the server
      since: 22 Sep 2016 16:12:03+01:00
    instance-id: pending
    machine-status:
      current: pending
      since: 22 Sep 2016 16:09:49+01:00
    series: trusty
  "4":
    juju-status:
      current: down
      message: agent is not communicating with the server
      since: 22 Sep 2016 16:12:45+01:00
    instance-id: pending
    machine-status:
      current: pending
      since: 22 Sep 2016 16:09:51+01:00
    series: trusty
  "5":
    juju-status:
      current: down
      message: agent is not communicating with the server
      since: 22 Sep 2016 16:13:22+01:00
    instance-id: pending
    machine-status:
      current: pending
      since: 22 Sep 2016 16:09:53+01:00
    series: trusty
  "6":
    juju-status:
      current: down
      message: agent is not communicating with the server
      since: 22 Sep 2016 16:13:59+01:00
    instance-id: pending
    machine-status:
      current: pending
      since: 22 Sep 2016 16:09:56+01:00
    series: trusty
applications:
  client:
    charm: cs:trusty/hadoop-client-3
    series: trusty
    os: ubuntu
    charm-origin: jujucharms
    charm-name: hadoop-client
    charm-rev: 3
    can-upgrade-to: cs:trusty/hadoop-client-5
    exposed: false
    application-status:
      current: waiting
      message: waiting for machine
      since: 22 Sep 2016 16:09:40+01:00
    relations:
      hadoop:
      - plugin
    units:
      client/0:
        workload-status:
          current: waiting
          message: waiting for machine
          since: 22 Sep 2016 16:09:40+01:00
        juju-status:
          current: allocating
          since: 22 Sep 2016 16:09:40+01:00
        machine: "0"
  ganglia:
    charm: cs:trusty/ganglia-2
    series: trusty
    os: ubuntu
    charm-origin: jujucharms
    charm-name: ganglia
    charm-rev: 2
    exposed: false
    application-status:
      current: unknown
      since: 22 Sep 2016 16:27:04+01:00
    relations:
      node:
      - ganglia-node
    units:
      ganglia/0:
        workload-status:
     ...


Richard Harding (rharding) wrote :

Hmm, so you have a bundle you were using? The error in the log that looks interesting is:

ERROR juju.provisioner provisioner_task.go:682 cannot start instance for machine "2": cannot run instances: The specified instance type can only be used in a VPC. A subnet ID or network interface ID is required to carry out the request. (VPCResourceNotSpecified)

I'm curious if this was the default instance or if you've supplied some constraints which triggered some failure to land on an instance that is usable.
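To find every occurrence of this kind of failure in a controller log, something like the sketch below might help. It only assumes the line shape of the single ERROR line quoted above (a quoted machine id, a reason, and an optional trailing error code in parentheses); other provisioner messages may be formatted differently. `parse_provisioner_error` is a hypothetical helper name.

```python
import re

# Pattern modelled on the one ERROR line quoted in this report.
PROVISIONER_ERROR = re.compile(
    r'cannot start instance for machine "(?P<machine>[^"]+)": '
    r'(?P<reason>.*?)(?: \((?P<code>\w+)\))?$'
)


def parse_provisioner_error(line):
    """Return (machine id, error code or None, reason) for a
    'cannot start instance' log line, or None if the line does not match."""
    match = PROVISIONER_ERROR.search(line.strip())
    if match is None:
        return None
    return match.group("machine"), match.group("code"), match.group("reason")


log_line = (
    'ERROR juju.provisioner provisioner_task.go:682 cannot start instance '
    'for machine "2": cannot run instances: The specified instance type can '
    'only be used in a VPC. A subnet ID or network interface ID is required '
    'to carry out the request. (VPCResourceNotSpecified)'
)

machine, code, reason = parse_provisioner_error(log_line)
print(machine, code)
```

Grouping the results by error code would show whether all of the down machines failed for the same reason or whether several different provider errors are involved.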

Gary Mackenzie (5ello) wrote :

This is a straight 'juju deploy [apache-processing-mapreduce|bigtop-processing-mapreduce]' of the bundle from the charm store, with no constraints or anything unusual. The Azure account was a brand new one created specifically to test this, so nothing odd is set up at all. It was intended as a simple demo to a customer of a big data deployment...

It sounds like the bundle is looking for a certain spec and the nearest match is an AWS instance type which is only available in a VPC, but that doesn't explain the failure in Azure...

Changed in juju:
milestone: 2.0.0 → 2.1.0
Neil Jerram (neil-jerram) wrote :

I've also been seeing this for a while with RC1 and RC2, with GCE as the cloud provider. It is still occurring with RC3.

Changed in juju:
assignee: Richard Harding (rharding) → nobody
Anastasia (anastasia-macmood) wrote :

Removing 2.1 milestone as we will not be addressing this issue in 2.1.
Marking as Won't Fix for 2.0 as we are not planning another 2.0.x release.

Changed in juju:
milestone: 2.1-rc2 → none
Tim McNamara (tim-clicks) wrote :

I believe this has been resolved in recent Juju releases. Please re-open or file another bug report if you encounter more issues with deploying bundles.

Changed in juju:
status: Triaged → Fix Released