juju machine agents suiciding

Bug #1345014 reported by Kapil Thangavelu
Affects          Status        Importance  Assigned to      Milestone
juju-core        Fix Released  High        Andrew Wilkins
juju-core 1.20   Fix Released  High        Andrew Wilkins

Bug Description

I've got a system where several containers are created and come up, but they then suicide after contacting the state server (and remove their upstart jobs), with their status (and their units) forever stuck in pending.

i.e. a status excerpt:

      0/lxc/6:
        agent-state: pending
        instance-id: juju-machine-0-lxc-6
        series: trusty
        hardware: arch=amd64
      0/lxc/7:
        agent-state: pending
        instance-id: juju-machine-0-lxc-7
        series: trusty
        hardware: arch=amd64

Per cloud-init the machine comes up fine; per the agent log it suicides:

2014-07-19 13:33:27 INFO juju.cmd supercommand.go:37 running jujud [1.20.1-trusty-amd64 gc]
2014-07-19 13:33:27 INFO juju.cmd.jujud machine.go:156 machine agent machine-0-lxc-7 start (1.20.1-trusty-amd64 [gc])
2014-07-19 13:33:27 DEBUG juju.agent agent.go:377 read agent config, format "1.18"
2014-07-19 13:33:27 INFO juju.worker runner.go:260 start "api"
2014-07-19 13:33:27 INFO juju.worker runner.go:260 start "statestarter"
2014-07-19 13:33:27 INFO juju.worker runner.go:260 start "termination"
2014-07-19 13:33:27 INFO juju.state.api apiclient.go:242 dialing "wss://192.168.9.74:17070/"
2014-07-19 13:33:27 INFO juju.state.api apiclient.go:176 connection established to "wss://192.168.9.74:17070/"
2014-07-19 13:33:29 INFO juju.state.api apiclient.go:242 dialing "wss://192.168.9.74:17070/"
2014-07-19 13:33:29 INFO juju.state.api apiclient.go:176 connection established to "wss://192.168.9.74:17070/"
2014-07-19 13:33:31 ERROR juju.worker runner.go:207 fatal "api": agent should be terminated
2014-07-19 13:33:31 DEBUG juju.worker runner.go:241 killing "statestarter"
2014-07-19 13:33:31 DEBUG juju.worker runner.go:241 killing "termination"
2014-07-19 13:33:31 INFO juju.cmd supercommand.go:329 command finished

Re machine-0.log, around that same time there are only these lines:

2014-07-19 13:31:14 ERROR juju.state.unit unit.go:556 unit nova-cloud-controller/0 cannot get assigned machine: unit "nova-cloud-controller/0" is not assigned to a machine
2014-07-19 13:31:14 ERROR juju.state.unit unit.go:556 unit nova-cloud-controller/0 cannot get assigned machine: unit "nova-cloud-controller/0" is not assigned to a machine

The agent's upstart job is not on the system, and nothing is running. There are several containers exhibiting this.

summary: - juju agents suicide if state server under load
+ juju agents suiciding
summary: - juju agents suiciding
+ juju machine agents suiciding
Revision history for this message
Kapil Thangavelu (hazmat) wrote :

Also to note: both the machines and the units assigned to them are left in pending forever.

Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.21-alpha1
tags: added: deploy lxc
tags: added: cloud-installer landscape
Revision history for this message
Andrew Wilkins (axwalk) wrote :

There is a race in the provisioner that may explain this. If the machine agent on the provisioned instance comes up before the provisioner has recorded its instance ID in state, then the machine agent may connect to state and find that it doesn't exist in state, and then kill itself. I can reproduce this easily by adding a sleep in the provisioner.
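A minimal sketch of that race, for illustration only (hypothetical in-memory state and goroutines standing in for the provisioner and the machine agent; none of this is juju-core code): if the agent queries state before the provisioner's record of its instance ID lands, the agent concludes it has been removed and terminates.

package main

import (
	"fmt"
	"sync"
	"time"
)

// state stands in for the state server's record of provisioned machines.
type state struct {
	mu          sync.Mutex
	instanceIDs map[string]string // machine ID -> instance ID
}

func (s *state) recordInstanceID(machine, instance string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.instanceIDs[machine] = instance
}

func (s *state) provisioned(machine string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, ok := s.instanceIDs[machine]
	return ok
}

func main() {
	st := &state{instanceIDs: make(map[string]string)}

	// "Provisioner": the container is already running, but the instance ID
	// is only recorded in state after a delay (standing in for a loaded
	// state server, or the sleep mentioned above).
	go func() {
		time.Sleep(2 * time.Second)
		st.recordInstanceID("0/lxc/7", "juju-machine-0-lxc-7")
	}()

	// "Machine agent": comes up immediately on the new container, asks
	// state about itself, and terminates if it finds no record.
	time.Sleep(100 * time.Millisecond)
	if !st.provisioned("0/lxc/7") {
		fmt.Println(`fatal "api": agent should be terminated`)
		return
	}
	fmt.Println("agent running")
}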

Andrew Wilkins (axwalk)
Changed in juju-core:
status: Triaged → In Progress
assignee: nobody → Andrew Wilkins (axwalk)
Revision history for this message
Andrew Wilkins (axwalk) wrote :
Andrew Wilkins (axwalk)
Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released