Agent shutdown can cause cert updater channel already closed panic

Bug #1472729 reported by Andreas Hasenack
This bug affects 1 person
Affects      Status         Importance   Assigned to   Milestone
juju-core    Fix Released   High         Ian Booth
1.24         Fix Released   High         Ian Booth

Bug Description

A Landscape cloud deployment is stuck right after bootstrap. The Landscape juju client has been getting an "upgrade in progress" error from juju for about 20 minutes now. This is the first occurrence:

Jul 8 17:26:16 job-handler-1 INFO Traceback (failure with no frames): <class 'canonical.juju.errors.RequestError'>: upgrade in progress - Juju functionality is limited

This environment was bootstrapped with the setting "agent-version: 1.24.1", so it shouldn't have even tried to upgrade the tools. Bootstrap was kicked off at 17:17:52.

A juju status in that env works fine now:
$ juju status
environment: "3"
machines:
  "0":
    agent-state: started
    agent-version: 1.24.1
    dns-name: barley.scapestack
    instance-id: /MAAS/api/1.0/nodes/node-65d52b5c-546c-11e4-821d-2c59e54ace74/
    series: trusty
    hardware: arch=amd64 cpu-cores=4 mem=16384M
    state-server-member-status: has-vote
services: {}

As does a simple ubuntu deployment --to lxc:0
environment: "3"
machines:
  "0":
    agent-state: started
    agent-version: 1.24.1
    dns-name: barley.scapestack
    instance-id: /MAAS/api/1.0/nodes/node-65d52b5c-546c-11e4-821d-2c59e54ace74/
    series: trusty
    containers:
      0/lxc/0:
        agent-state: started
        agent-version: 1.24.1
        dns-name: 10.96.7.179
        instance-id: juju-machine-0-lxc-0
        series: trusty
        hardware: arch=amd64
    hardware: arch=amd64 cpu-cores=4 mem=16384M
    state-server-member-status: has-vote
services:
  ubuntu:
    charm: cs:trusty/ubuntu-3
    exposed: false
    service-status:
      current: unknown
      since: 08 Jul 2015 17:45:12Z
    units:
      ubuntu/0:
        workload-status:
          current: unknown
          since: 08 Jul 2015 17:45:12Z
        agent-status:
          current: idle
          since: 08 Jul 2015 17:45:15Z
          version: 1.24.1
        agent-state: started
        agent-version: 1.24.1
        machine: 0/lxc/0
        public-address: 10.96.7.179

I also see a panic in machine-0.log after a lot of EOF errors:
(...)
2015-07-08 17:26:09 ERROR juju.worker runner.go:219 exited "authenticationworker": watcher iteration error: EOF
2015-07-08 17:26:09 ERROR juju.worker runner.go:219 exited "authenticationworker": watcher iteration error: EOF
panic: runtime error: send on closed channel
        panic: runtime error: close of closed channel

goroutine 626 [running]:
runtime.panic(0x189df20, 0x3fb45b5)
        /usr/lib/go/src/pkg/runtime/panic.c:266 +0xb6
github.com/juju/juju/worker/certupdater.(*CertificateUpdater).TearDown(0xc21035aa00, 0xc200000000, 0x30)
        /build/buildd/juju-core-1.24.1/src/github.com/juju/juju/worker/certupdater/certupdater.go:133 +0x2b
(...)

juju logs attached, and bits of landscape logs too, useful for timestamps.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Landscape's job-handler.log. Shows the bootstrap starting at 17:17:52 and subsequent errors.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Landscape's juju-sync log. That's our juju client daemon that keeps an open connection to the juju state server, recording changes to the environment.

You can see that it connected to the state server at 17:26:44, and at 17:42:44 was my manual deploy of "ubuntu --to lxc:0".

Revision history for this message
Curtis Hovey (sinzui) wrote :

1.24.1 was never released because it had regressions, notably upgrade bugs that make it difficult to fix once it is installed. 1.24.1 is unsafe and not suitable for production use. 1.24.2 is the current stable juju and addresses upgrade issues found in 1.23.x and 1.24.0.

In my experience with this case, Juju actually aborted the upgrade and reverted to the previous agents. In one test, I waited more than 60 minutes and had to restart a few agents before the upgrade completed.

Changed in juju-core:
status: New → Triaged
importance: Undecided → Medium
tags: added: upgrade-juju
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Note that the environment was bootstrapped with "agent-version: 1.24.1", so it shouldn't even have tried to upgrade, if that's what it did. We started pinning the agent like that exactly because we do not want juju upgrading behind our backs whenever it thinks it's cool.

summary: - juju stuck in "upgrade in progress " for 20min
+ Deploy with pinned agent-version still tried to upgrade
summary: - Deploy with pinned agent-version still tried to upgrade
+ Deploy with pinned agent-version still tried to upgrade, and panic()ed
summary: - Deploy with pinned agent-version still tried to upgrade, and panic()ed
+ Deploy with pinned agent-version still tried to upgrade, panic()ed
Curtis Hovey (sinzui)
tags: added: regression
Changed in juju-core:
importance: Medium → High
milestone: none → 1.25.0
Revision history for this message
Ian Booth (wallyworld) wrote : Re: Deploy with pinned agent-version still tried to upgrade, panic()ed

I can see no evidence of Juju upgrading beyond 1.24.1. The debug log messages are a little misleading if you don't know Juju's internals. At startup, Juju locks the full API while it checks whether an upgrade is needed. During this check, and while any upgrade is being performed, an "upgrade in progress" debug message may be emitted if an attempt is made to use the API. In this case, that attempt is the bootstrap process polling to see if the agent has started.
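
To illustrate the startup behaviour described above, here is a minimal Go sketch of an API gate that rejects calls until the upgrade check has finished. The apiGate type and its newAPIGate, unlock and check functions are hypothetical names made up for this example; they are not Juju's actual implementation. Only the error text is taken from the report.

```go
package main

import (
	"errors"
	"fmt"
)

// Error text as seen by the Landscape client in this report.
var errUpgradeInProgress = errors.New("upgrade in progress - Juju functionality is limited")

// apiGate is a hypothetical gate: API calls are rejected until the
// startup upgrade check (and any upgrade) has completed.
type apiGate struct {
	unlocked chan struct{} // closed exactly once when the API may be used
}

func newAPIGate() *apiGate {
	return &apiGate{unlocked: make(chan struct{})}
}

// unlock opens the gate. In this sketch it must be called only once.
func (g *apiGate) unlock() {
	close(g.unlocked)
}

// check returns errUpgradeInProgress while the gate is still locked.
func (g *apiGate) check() error {
	select {
	case <-g.unlocked:
		return nil
	default:
		return errUpgradeInProgress
	}
}

func main() {
	g := newAPIGate()
	fmt.Println(g.check()) // rejected: the upgrade check is still running
	g.unlock()
	fmt.Println(g.check()) // <nil>: the API is available
}
```

A client polling during the locked window sees the error even though no upgrade is taking place, which would match the symptom reported here.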

The real cause of the EOF messages has been fixed in bug 1468581.

The remaining issue, the attempt to close an already closed channel on restart due to error, will be fixed.
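
One common way to make a teardown safe to call more than once is to guard the close with sync.Once. The Go sketch below is illustrative only: certUpdater and its fields are hypothetical names, not the code in worker/certupdater, and this is not necessarily how the fix was implemented.

```go
package main

import (
	"fmt"
	"sync"
)

// certUpdater is a hypothetical stand-in for a worker that owns a
// notification channel. It is not the juju CertificateUpdater type.
type certUpdater struct {
	changed   chan struct{}
	closeOnce sync.Once
}

func newCertUpdater() *certUpdater {
	return &certUpdater{changed: make(chan struct{})}
}

// TearDown closes the channel at most once, so a second call made
// during an error-driven restart is a no-op rather than a panic.
func (c *certUpdater) TearDown() {
	c.closeOnce.Do(func() { close(c.changed) })
}

func main() {
	u := newCertUpdater()
	u.TearDown()
	u.TearDown() // without the guard: "close of closed channel"

	// Receiving from a closed channel reports ok == false.
	_, open := <-u.changed
	fmt.Println("channel open:", open)
}
```

The guard only covers the double close. A sender that might run after teardown would need its own coordination to avoid the "send on closed channel" half of the trace.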

summary: - Deploy with pinned agent-version still tried to upgrade, panic()ed
+ Agent shutdown can cause cert updater channel already closed panic
Ian Booth (wallyworld)
Changed in juju-core:
assignee: nobody → Ian Booth (wallyworld)
status: Triaged → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released