all units have false hook errors after reboot

Bug #1499356 reported by Edward Hope-Morley
This bug affects 3 people
Affects: juju-core
Status: Expired
Importance: High
Assigned to: Unassigned

Bug Description

I have a multi-node deployment with services running in LXC containers across nodes (HA). When I reboot a node, once it comes back up, all units on that node show an error for their workload-status. This issue has been seen in the past and was supposedly fixed, but I am running Juju 1.24.5 and still see it.
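
For reference, a minimal reproduction sketch (the machine number is illustrative; any node hosting units should do):

  juju status                # all units started, agents idle
  juju ssh 1 sudo reboot     # reboot one of the hosting nodes
  # wait for the node to come back up, then:
  juju status                # units on that node now report hook errors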

# Juju status for one of the services:

environment: test
machines:
  "1":
    agent-state: started
    agent-version: 1.24.5
    dns-name: krueger.maas
    instance-id: /MAAS/api/1.0/nodes/node-5f080e98-5d26-11e5-947f-525400719f4d/
    series: trusty
    containers:
      1/lxc/0:
        agent-state: started
        agent-version: 1.24.5
        dns-name: 10.232.16.32
        instance-id: juju-machine-1-lxc-0
        series: trusty
        hardware: arch=amd64
    hardware: arch=amd64 cpu-cores=32 mem=32768M tags=api
  "2":
    agent-state: started
    agent-version: 1.24.5
    dns-name: kearns.maas
    instance-id: /MAAS/api/1.0/nodes/node-602c60c6-5d26-11e5-807f-525400719f4d/
    series: trusty
    containers:
      2/lxc/0:
        agent-state: started
        agent-version: 1.24.5
        dns-name: 10.232.16.44
        instance-id: juju-machine-2-lxc-0
        series: trusty
        hardware: arch=amd64
    hardware: arch=amd64 cpu-cores=32 mem=32768M tags=api
  "3":
    agent-state: started
    agent-version: 1.24.5
    dns-name: doble.maas
    instance-id: /MAAS/api/1.0/nodes/node-614b0700-5d26-11e5-ad7c-525400719f4d/
    series: trusty
    containers:
      3/lxc/0:
        agent-state: started
        agent-version: 1.24.5
        dns-name: 10.232.16.45
        instance-id: juju-machine-3-lxc-0
        series: trusty
        hardware: arch=amd64
    hardware: arch=amd64 cpu-cores=32 mem=32768M tags=api
services:
  cinder:
    charm: local:trusty/cinder-136
    exposed: false
    service-status:
      current: unknown
      since: 17 Sep 2015 11:48:58Z
    relations:
      amqp:
      - rabbitmq-server
      ceph:
      - ceph
      cinder-volume-service:
      - nova-cloud-controller
      cluster:
      - cinder
      ha:
      - cinder-hacluster
      identity-service:
      - keystone
      image-service:
      - glance
      shared-db:
      - percona-cluster
    units:
      cinder/0:
        workload-status:
          current: unknown
          since: 17 Sep 2015 11:46:25Z
        agent-status:
          current: idle
          since: 24 Sep 2015 12:32:35Z
          version: 1.24.5
        agent-state: started
        agent-version: 1.24.5
        machine: 1/lxc/0
        public-address: 10.232.16.32
        subordinates:
          cinder-hacluster/0:
            workload-status:
              current: error
              message: 'hook failed: "config-changed"'
              since: 24 Sep 2015 12:31:35Z
            agent-status:
              current: idle
              since: 24 Sep 2015 12:31:35Z
              version: 1.24.5
            agent-state: error
            agent-state-info: 'hook failed: "config-changed"'
            agent-version: 1.24.5
            upgrading-from: local:trusty/hacluster-0
            public-address: 10.232.16.32
      cinder/1:
        workload-status:
          current: unknown
          since: 17 Sep 2015 11:48:58Z
        agent-status:
          current: idle
          since: 24 Sep 2015 12:32:21Z
          version: 1.24.5
        agent-state: started
        agent-version: 1.24.5
        machine: 2/lxc/0
        public-address: 10.232.16.44
        subordinates:
          cinder-hacluster/2:
            workload-status:
              current: unknown
              since: 17 Sep 2015 11:50:43Z
            agent-status:
              current: idle
              since: 17 Sep 2015 11:52:48Z
              version: 1.24.5
            agent-state: started
            agent-version: 1.24.5
            upgrading-from: local:trusty/hacluster-0
            public-address: 10.232.16.44
      cinder/2:
        workload-status:
          current: unknown
          since: 17 Sep 2015 11:48:51Z
        agent-status:
          current: idle
          since: 24 Sep 2015 12:32:20Z
          version: 1.24.5
        agent-state: started
        agent-version: 1.24.5
        machine: 3/lxc/0
        public-address: 10.232.16.45
        subordinates:
          cinder-hacluster/1:
            workload-status:
              current: unknown
              since: 17 Sep 2015 11:50:42Z
            agent-status:
              current: idle
              since: 24 Sep 2015 12:27:55Z
              version: 1.24.5
            agent-state: started
            agent-version: 1.24.5
            upgrading-from: local:trusty/hacluster-0
            public-address: 10.232.16.45
  cinder-hacluster:
    charm: local:trusty/hacluster-0
    exposed: false
    service-status: {}
    relations:
      ha:
      - cinder
      hanode:
      - cinder-hacluster
    subordinate-to:
    - cinder
networks:
  maas-eth0:
    provider-id: maas-eth0
    cidr: 10.232.16.0/21

# Looking at the unit logs there are no errors:

ubuntu@maas-boot-vm-dc1:~⟫ juju ssh cinder/0 sudo grep ERROR /var/log/juju/unit-cinder-0.log
Warning: Permanently added '10.232.16.5' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.232.16.32' (ECDSA) to the list of known hosts.
sudo: unable to resolve host juju-machine-1-lxc-0
Connection to 10.232.16.32 closed.

On the juju state-server I see a ton of this in /var/log/juju/machine-0.log:

2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
...
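
A quick way to gauge how often this recurs (assuming the default log path on the state server):

  juju ssh 0 "grep -c 'error writing response: EOF' /var/log/juju/machine-0.log"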

Tags: sts
Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.25-beta2
Changed in juju-core:
milestone: 1.25-beta2 → 1.24.7
Mark Ramm (mark-ramm)
Changed in juju-core:
importance: High → Critical
Revision history for this message
Cheryl Jennings (cherylj) wrote :

Taking a look...

Changed in juju-core:
assignee: nobody → Cheryl Jennings (cherylj)
Revision history for this message
Cheryl Jennings (cherylj) wrote :

Trying to reproduce. In the meantime, can you include machine and unit logs?

Revision history for this message
Edward Hope-Morley (hopem) wrote :

the reboot was at around 12:16

Revision history for this message
Cheryl Jennings (cherylj) wrote :

There's definitely something going wrong with machine 0. It looks like some of the workers were having problems connecting to mongo, but didn't get restarted with a new connection.

There have been a number of bugs recently relating to recovering from EOF errors. I'm still digging around those to see if this was fixed in a later release.

Revision history for this message
Tim Penhey (thumper) wrote :

There is certainly something strange going on, but in order to track it down, I'm hoping you can add some extra logging config for us:

  juju set-env logging-config=juju=DEBUG

Then ideally bounce the machine agent or agents that are running the API servers. Are you running in HA mode? The status you showed above did not list machine-0, yet the machine-0.log was attached.

The logging configuration change should propagate to all the agents in the environment. If you then restart one of the nodes, we should have much more useful debugging information.
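
Concretely, that is something like the following (assuming a trusty state server, where the machine agent runs as an upstart job named after the machine ID):

  juju set-env logging-config=juju=DEBUG
  # bounce the machine agent on the state server (machine 0 here):
  juju ssh 0 sudo service jujud-machine-0 restart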

Thanks.

Revision history for this message
Edward Hope-Morley (hopem) wrote : Re: [Bug 1499356] Re: all units have false hook errors after reboot

I'm not running the state-server in HA, so there is just one state-server
unit. I don't have this env running anymore, but it is easily reproducible,
so I'll provide more info when I hit it again.


Revision history for this message
William Reade (fwereade) wrote :

Will this be addressed by the auto-retry work that bogdanteleaga/axw are doing? Currently, if you happen to reboot while a hook is executing, the agent will consider the hook to have errored out and will wait for user intervention; this will change shortly.
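
For reference, the user intervention in question is juju resolved; e.g., for the failed subordinate in the status output above:

  juju resolved --retry cinder-hacluster/0   # re-run the failed config-changed hook
  juju resolved cinder-hacluster/0           # or clear the error without re-running the hook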

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Well, FTR, if I reboot a node that has, say, 6 units running on it (1 unit of each of 6 services), and none of them are seemingly doing anything (agent-state is started and they have been quiet for several hours), all units will go to the error state on reboot.

Revision history for this message
Cheryl Jennings (cherylj) wrote :

Talked with bogdanteleaga this morning, and we're not sure that this is caused by reboots happening while a hook is running. To investigate further, we need to wait for a recreate with logging set to DEBUG, as mentioned in comment #7.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

I redeployed my environment with --debug using Juju 1.24.6-0ubuntu1~14.04.1~juju1 and I no longer see this problem. I will continue rebooting nodes to see if the problem reoccurs and will report back if I hit anything, but for the moment I guess we can put this on ice.

Curtis Hovey (sinzui)
Changed in juju-core:
importance: Critical → High
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.24.7 → 1.24.8
Revision history for this message
Cheryl Jennings (cherylj) wrote :

Moving to incomplete as we're waiting for further debug logs.

Changed in juju-core:
status: Triaged → Incomplete
Revision history for this message
Edwin Gnichtel (ned-4) wrote :

After an upgrade from 1.23.3 to 1.25.0, we also appear to be seeing this issue in our production environment.

We are seeing:

2015-11-10 02:17:30 ERROR juju.rpc server.go:573 error writing response: EOF
2015-11-10 02:17:30 ERROR juju.rpc server.go:573 error writing response: EOF
2015-11-10 02:17:30 ERROR juju.rpc server.go:573 error writing response: EOF
2015-11-10 02:17:30 ERROR juju.rpc server.go:573 error writing response: EOF

in machine-0.log, and we are seeing similar "unknown" workload-status values after rebooting a machine (the example below is juju-gui):

  jujuwebgui03:
    charm: cs:trusty/juju-gui-41
    exposed: true
    service-status:
      current: unknown
      since: 07 Nov 2015 04:32:51Z
    units:
      jujuwebgui03/0:
        workload-status:
          current: unknown
          since: 07 Nov 2015 04:32:51Z
        agent-status:
          current: idle
          since: 10 Nov 2015 02:52:55Z
          version: 1.25.0
        agent-state: started
        agent-version: 1.25.0
        machine: 6/lxc/12
        open-ports:
        - 80/tcp
        - 443/tcp
        public-address: 10.0.1.38

Incidentally, we have also hit https://bugs.launchpad.net/juju-gui/+bug/1485249, though I don't think it's related.

Happy to provide logs/debug info; I will start by enabling "juju set-env logging-config=juju=DEBUG".

Revision history for this message
Cheryl Jennings (cherylj) wrote :

Yes, please enable DEBUG logging and upload the logs once you recreate. I'll take a look once they're up.

Revision history for this message
Edwin Gnichtel (ned-4) wrote :

Cheryl,

Which specific logs do you want us to attach? I have enabled the extended logging per your request and have rebooted one of the machine instances.

-N

Revision history for this message
Cheryl Jennings (cherylj) wrote :

It would be useful to have (one way to collect them is sketched after the list):
- The machine logs from each state server
- The machine logs from machines hosting units with hook failures
- The unit logs from units with hook failures
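
One way to collect these, assuming the default /var/log/juju paths (the machine and unit names below match the status output earlier in this report):

  juju ssh 0 'sudo cat /var/log/juju/machine-0.log' > machine-0.log
  juju ssh 1/lxc/0 'sudo cat /var/log/juju/machine-1-lxc-0.log' > machine-1-lxc-0.log
  juju ssh cinder-hacluster/0 'sudo cat /var/log/juju/unit-cinder-hacluster-0.log' > unit-cinder-hacluster-0.log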

Revision history for this message
Edwin Gnichtel (ned-4) wrote :

Sorry for the delay; our logs are being reviewed and sanitized since they are from a production environment. They should be attached here in the next 24 hours or so.

Thanks,

-N

Changed in juju-core:
milestone: 1.24.8 → none
Changed in juju-core:
assignee: Cheryl Jennings (cherylj) → nobody
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for juju-core because there has been no activity for 60 days.]

Changed in juju-core:
status: Incomplete → Expired