all units have false hook errors after reboot

Bug #1499356 reported by Edward Hope-Morley
This bug affects 3 people
Affects: juju-core
Status: Expired
Importance: High
Assigned to: Unassigned

Bug Description

I have a multi-node deployment with services running in LXC containers across nodes (HA). When I reboot a node, once it comes back up, all units on that node show an error for their workload-status. This issue has been seen in the past and was supposedly fixed, but I am running Juju 1.24.5 and still see it.
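
For reference, a minimal reproduction sketch (the machine number is illustrative; any node hosting units should do):

  juju status                # all units started, agents idle
  juju ssh 1 sudo reboot     # reboot one of the hosting nodes
  # wait for the node to come back up, then:
  juju status                # units on that node now report hook errors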

# Juju status for one of the services:

environment: test
machines:
  "1":
    agent-state: started
    agent-version: 1.24.5
    dns-name: krueger.maas
    instance-id: /MAAS/api/1.0/nodes/node-5f080e98-5d26-11e5-947f-525400719f4d/
    series: trusty
    containers:
      1/lxc/0:
        agent-state: started
        agent-version: 1.24.5
        dns-name: 10.232.16.32
        instance-id: juju-machine-1-lxc-0
        series: trusty
        hardware: arch=amd64
    hardware: arch=amd64 cpu-cores=32 mem=32768M tags=api
  "2":
    agent-state: started
    agent-version: 1.24.5
    dns-name: kearns.maas
    instance-id: /MAAS/api/1.0/nodes/node-602c60c6-5d26-11e5-807f-525400719f4d/
    series: trusty
    containers:
      2/lxc/0:
        agent-state: started
        agent-version: 1.24.5
        dns-name: 10.232.16.44
        instance-id: juju-machine-2-lxc-0
        series: trusty
        hardware: arch=amd64
    hardware: arch=amd64 cpu-cores=32 mem=32768M tags=api
  "3":
    agent-state: started
    agent-version: 1.24.5
    dns-name: doble.maas
    instance-id: /MAAS/api/1.0/nodes/node-614b0700-5d26-11e5-ad7c-525400719f4d/
    series: trusty
    containers:
      3/lxc/0:
        agent-state: started
        agent-version: 1.24.5
        dns-name: 10.232.16.45
        instance-id: juju-machine-3-lxc-0
        series: trusty
        hardware: arch=amd64
    hardware: arch=amd64 cpu-cores=32 mem=32768M tags=api
services:
  cinder:
    charm: local:trusty/cinder-136
    exposed: false
    service-status:
      current: unknown
      since: 17 Sep 2015 11:48:58Z
    relations:
      amqp:
      - rabbitmq-server
      ceph:
      - ceph
      cinder-volume-service:
      - nova-cloud-controller
      cluster:
      - cinder
      ha:
      - cinder-hacluster
      identity-service:
      - keystone
      image-service:
      - glance
      shared-db:
      - percona-cluster
    units:
      cinder/0:
        workload-status:
          current: unknown
          since: 17 Sep 2015 11:46:25Z
        agent-status:
          current: idle
          since: 24 Sep 2015 12:32:35Z
          version: 1.24.5
        agent-state: started
        agent-version: 1.24.5
        machine: 1/lxc/0
        public-address: 10.232.16.32
        subordinates:
          cinder-hacluster/0:
            workload-status:
              current: error
              message: 'hook failed: "config-changed"'
              since: 24 Sep 2015 12:31:35Z
            agent-status:
              current: idle
              since: 24 Sep 2015 12:31:35Z
              version: 1.24.5
            agent-state: error
            agent-state-info: 'hook failed: "config-changed"'
            agent-version: 1.24.5
            upgrading-from: local:trusty/hacluster-0
            public-address: 10.232.16.32
      cinder/1:
        workload-status:
          current: unknown
          since: 17 Sep 2015 11:48:58Z
        agent-status:
          current: idle
          since: 24 Sep 2015 12:32:21Z
          version: 1.24.5
        agent-state: started
        agent-version: 1.24.5
        machine: 2/lxc/0
        public-address: 10.232.16.44
        subordinates:
          cinder-hacluster/2:
            workload-status:
              current: unknown
              since: 17 Sep 2015 11:50:43Z
            agent-status:
              current: idle
              since: 17 Sep 2015 11:52:48Z
              version: 1.24.5
            agent-state: started
            agent-version: 1.24.5
            upgrading-from: local:trusty/hacluster-0
            public-address: 10.232.16.44
      cinder/2:
        workload-status:
          current: unknown
          since: 17 Sep 2015 11:48:51Z
        agent-status:
          current: idle
          since: 24 Sep 2015 12:32:20Z
          version: 1.24.5
        agent-state: started
        agent-version: 1.24.5
        machine: 3/lxc/0
        public-address: 10.232.16.45
        subordinates:
          cinder-hacluster/1:
            workload-status:
              current: unknown
              since: 17 Sep 2015 11:50:42Z
            agent-status:
              current: idle
              since: 24 Sep 2015 12:27:55Z
              version: 1.24.5
            agent-state: started
            agent-version: 1.24.5
            upgrading-from: local:trusty/hacluster-0
            public-address: 10.232.16.45
  cinder-hacluster:
    charm: local:trusty/hacluster-0
    exposed: false
    service-status: {}
    relations:
      ha:
      - cinder
      hanode:
      - cinder-hacluster
    subordinate-to:
    - cinder
networks:
  maas-eth0:
    provider-id: maas-eth0
    cidr: 10.232.16.0/21

# Looking at the unit logs there are no errors:

ubuntu@maas-boot-vm-dc1:~⟫ juju ssh cinder/0 sudo grep ERROR /var/log/juju/unit-cinder-0.log
Warning: Permanently added '10.232.16.5' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.232.16.32' (ECDSA) to the list of known hosts.
sudo: unable to resolve host juju-machine-1-lxc-0
Connection to 10.232.16.32 closed.

On the juju state-server I see a ton of this in /var/log/juju/machine-0.log:

2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
...
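
A quick way to gauge how often this recurs (assuming the default log path on the state server):

  juju ssh 0 "grep -c 'error writing response: EOF' /var/log/juju/machine-0.log"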

Tags: sts
Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.25-beta2
Changed in juju-core:
milestone: 1.25-beta2 → 1.24.7
Mark Ramm (mark-ramm)
Changed in juju-core:
importance: High → Critical
Revision history for this message
Cheryl Jennings (cherylj) wrote :

Taking a look...

Changed in juju-core:
assignee: nobody → Cheryl Jennings (cherylj)
Revision history for this message
Cheryl Jennings (cherylj) wrote :

Trying to reproduce. In the meantime, can you include machine and unit logs?

Revision history for this message
Edward Hope-Morley (hopem) wrote :

the reboot was at around 12:16

Revision history for this message
Cheryl Jennings (cherylj) wrote :

There's definitely something going wrong with machine 0. It looks like some of the workers were having problems connecting to mongo, but didn't get restarted with a new connection.

There have been a number of bugs recently relating to recovering from EOF errors. I'm still digging around those to see if this was fixed in a later release.

Revision history for this message
Tim Penhey (thumper) wrote :

There is certainly something strange going on, but in order to track it down, I'm hoping you can add some extra logging config for us:

  juju set-env logging-config=juju=DEBUG

Then ideally bounce the machine agent or agents that are running the API servers. Are you running in HA mode? The status you showed above did not list machine-0, yet the machine-0.log was attached.

The logging configuration change should propagate to all the agents in the environment. If you then restart one of the nodes, we should have much more useful debugging information.
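
Concretely, that is something like the following (assuming a trusty state server, where the machine agent runs as an upstart job named after the machine ID):

  juju set-env logging-config=juju=DEBUG
  # bounce the machine agent on the state server (machine 0 here):
  juju ssh 0 sudo service jujud-machine-0 restart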

Thanks.

Revision history for this message
Edward Hope-Morley (hopem) wrote : Re: [Bug 1499356] Re: all units have false hook errors after reboot

I'm not running the state-server in HA, so there is just one state-server
unit. I don't have this env running anymore, but it is easily reproducible,
so I'll provide more info when I hit it again.


Revision history for this message
William Reade (fwereade) wrote :

Will this be addressed by the auto-retry work that bogdanteleaga/axw are doing? Currently, if you happen to reboot while a hook is executing, the agent will consider the hook to have errored out and will wait for user intervention; this will change shortly.
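
For reference, the user intervention in question is juju resolved; e.g., for the failed subordinate in the status output above:

  juju resolved --retry cinder-hacluster/0   # re-run the failed config-changed hook
  juju resolved cinder-hacluster/0           # or clear the error without re-running the hook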

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Well, FTR, if I reboot a node that has, say, 6 units running on it (1 unit of each of 6 services), and none of them are seemingly doing anything (agent-state is started and they have been quiet for several hours), all units will go to the error state on reboot.

Revision history for this message
Cheryl Jennings (cherylj) wrote :

Talked with bogdanteleaga this morning, and we're not sure that this is caused by reboots happening while a hook is running. To investigate further, we need to wait for a recreate with logging set to DEBUG, as mentioned in comment #7.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

I redeployed my environment with --debug using Juju 1.24.6-0ubuntu1~14.04.1~juju1 and I no longer see this problem. I will continue rebooting nodes to see if the problem reoccurs and will report back if I hit anything, but for the moment I guess we can put this on ice.

Curtis Hovey (sinzui)
Changed in juju-core:
importance: Critical → High
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.24.7 → 1.24.8
Revision history for this message
Cheryl Jennings (cherylj) wrote :

Moving to incomplete as we're waiting for further debug logs.

Changed in juju-core:
status: Triaged → Incomplete
Revision history for this message
Edwin Gnichtel (ned-4) wrote :

After an upgrade from 1.23.3 to 1.25.0, we also appear to be seeing this issue in our production environment.

We are seeing:

2015-11-10 02:17:30 ERROR juju.rpc server.go:573 error writing response: EOF
2015-11-10 02:17:30 ERROR juju.rpc server.go:573 error writing response: EOF
2015-11-10 02:17:30 ERROR juju.rpc server.go:573 error writing response: EOF
2015-11-10 02:17:30 ERROR juju.rpc server.go:573 error writing response: EOF

in machine-0.log, and we are seeing similar "unknown" workload-status values after rebooting a machine (the example below is juju-gui):

  jujuwebgui03:
    charm: cs:trusty/juju-gui-41
    exposed: true
    service-status:
      current: unknown
      since: 07 Nov 2015 04:32:51Z
    units:
      jujuwebgui03/0:
        workload-status:
          current: unknown
          since: 07 Nov 2015 04:32:51Z
        agent-status:
          current: idle
          since: 10 Nov 2015 02:52:55Z
          version: 1.25.0
        agent-state: started
        agent-version: 1.25.0
        machine: 6/lxc/12
        open-ports:
        - 80/tcp
        - 443/tcp
        public-address: 10.0.1.38

Incidentally, we have also hit https://bugs.launchpad.net/juju-gui/+bug/1485249, though I don't think it's related.

Happy to provide logs/debug info; I will start by enabling "juju set-env logging-config=juju=DEBUG".

Revision history for this message
Cheryl Jennings (cherylj) wrote :

Yes, please enable DEBUG logging and upload the logs once you recreate. I'll take a look once they're up.

Revision history for this message
Edwin Gnichtel (ned-4) wrote :

Cheryl,

Which specific logs do you want us to attach? I have enabled the extended logging per your request and have rebooted one of the machine instances.

-N

Revision history for this message
Cheryl Jennings (cherylj) wrote :

It would be useful to have (one way to collect them is sketched after the list):
- The machine logs from each state server
- The machine logs from machines hosting units with hook failures
- The unit logs from units with hook failures
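
One way to collect these, assuming the default /var/log/juju paths (the machine and unit names below match the status output earlier in this report):

  juju ssh 0 'sudo cat /var/log/juju/machine-0.log' > machine-0.log
  juju ssh 1/lxc/0 'sudo cat /var/log/juju/machine-1-lxc-0.log' > machine-1-lxc-0.log
  juju ssh cinder-hacluster/0 'sudo cat /var/log/juju/unit-cinder-hacluster-0.log' > unit-cinder-hacluster-0.log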

Revision history for this message
Edwin Gnichtel (ned-4) wrote :

Sorry for the delay; our logs are being reviewed and sanitized since they are from a production environment. They should be attached here in the next 24 hours or so.

Thanks,

-N

Changed in juju-core:
milestone: 1.24.8 → none
Changed in juju-core:
assignee: Cheryl Jennings (cherylj) → nobody
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for juju-core because there has been no activity for 60 days.]

Changed in juju-core:
status: Incomplete → Expired