all units have false hook errors after reboot
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| juju-core | Expired | High | Unassigned | |
Bug Description
I have a multinode deployment with services running in LXC across nodes (HA). When I reboot a node, once it comes back up all units on that node show an error for the workload-status. This is an issue that has been seen in the past and supposedly fixed but I am running Juju 1.24.5 and I still see this.
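For anyone triaging, a quick way to see the reported symptom is to ask status for just the affected service; a minimal sketch (the service name cinder is taken from the output below):

```shell
# Show the per-unit workload-status / agent-status for one service (Juju 1.x CLI)
juju status --format=yaml cinder
```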
# Juju status for one of the services:
environment: test
machines:
  "1":
    agent-state: started
    agent-version: 1.24.5
    dns-name: krueger.maas
    instance-id: /MAAS/api/
    series: trusty
    containers:
      1/lxc/0:
        dns-name: 10.232.16.32
        series: trusty
        hardware: arch=amd64
    hardware: arch=amd64 cpu-cores=32 mem=32768M tags=api
  "2":
    agent-state: started
    agent-version: 1.24.5
    dns-name: kearns.maas
    instance-id: /MAAS/api/
    series: trusty
    containers:
      2/lxc/0:
        dns-name: 10.232.16.44
        series: trusty
        hardware: arch=amd64
    hardware: arch=amd64 cpu-cores=32 mem=32768M tags=api
  "3":
    agent-state: started
    agent-version: 1.24.5
    dns-name: doble.maas
    instance-id: /MAAS/api/
    series: trusty
    containers:
      3/lxc/0:
        dns-name: 10.232.16.45
        series: trusty
        hardware: arch=amd64
    hardware: arch=amd64 cpu-cores=32 mem=32768M tags=api
services:
  cinder:
    charm: local:trusty/
    exposed: false
    service-status:
      current: unknown
      since: 17 Sep 2015 11:48:58Z
    relations:
      amqp:
      - rabbitmq-server
      ceph:
      - ceph
      cinder-
      - nova-cloud-
      cluster:
      - cinder
      ha:
      - cinder-hacluster
      identity-
      - keystone
      image-
      - glance
      shared-db:
      - percona-cluster
    units:
      cinder/0:
        workload-status:
          current: unknown
          since: 17 Sep 2015 11:46:25Z
        agent-status:
          current: idle
          since: 24 Sep 2015 12:32:35Z
          version: 1.24.5
        machine: 1/lxc/0
      cinder/1:
        workload-status:
          current: unknown
          since: 17 Sep 2015 11:48:58Z
        agent-status:
          current: idle
          since: 24 Sep 2015 12:32:21Z
          version: 1.24.5
        machine: 2/lxc/0
      cinder/2:
        workload-status:
          current: unknown
          since: 17 Sep 2015 11:48:51Z
        agent-status:
          current: idle
          since: 24 Sep 2015 12:32:20Z
          version: 1.24.5
        machine: 3/lxc/0
  cinder-hacluster:
    charm: local:trusty/
    exposed: false
    service-status: {}
    relations:
      ha:
      - cinder
      hanode:
      - cinder-hacluster
    subordinate-to:
    - cinder
networks:
  maas-eth0:
    provider-id: maas-eth0
    cidr: 10.232.16.0/21
# Looking at the unit logs there are no errors:
ubuntu@
Warning: Permanently added '10.232.16.5' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.232.16.32' (ECDSA) to the list of known hosts.
sudo: unable to resolve host juju-machine-
Connection to 10.232.16.32 closed.
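For reference, the unit logs mentioned above live under /var/log/juju/ inside each container; a minimal sketch of checking one of them (unit cinder/0 taken from the status output above):

```shell
# Tail the unit agent's own log inside the container (standard Juju 1.x log path)
juju ssh cinder/0 'sudo tail -n 200 /var/log/juju/unit-cinder-0.log'

# Look specifically for hook failures around the reboot window
juju ssh cinder/0 'sudo grep -i "hook failed" /var/log/juju/unit-cinder-0.log | tail'
```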
On the juju state-server I see a ton of this in /var/log/
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
2015-09-24 12:26:48 ERROR juju.rpc server.go:573 error writing response: EOF
...
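To correlate the flood with the reboot time, the messages can be counted and bracketed directly on the state server; a sketch assuming the usual Juju 1.x log location:

```shell
# Quantify the flood and bracket it in time on the state server
grep -c 'error writing response: EOF' /var/log/juju/machine-0.log
grep 'error writing response: EOF' /var/log/juju/machine-0.log | head -n 1   # first occurrence
grep 'error writing response: EOF' /var/log/juju/machine-0.log | tail -n 1   # latest occurrence
```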
| tags: | added: sts |
| Changed in juju-core: | |
| status: | New → Triaged |
| importance: | Undecided → High |
| milestone: | none → 1.25-beta2 |
| Changed in juju-core: | |
| milestone: | 1.25-beta2 → 1.24.7 |
| Changed in juju-core: | |
| importance: | High → Critical |
| Cheryl Jennings (cherylj) wrote : | #1 |
Taking a look...
| Changed in juju-core: | |
| assignee: | nobody → Cheryl Jennings (cherylj) |
| Cheryl Jennings (cherylj) wrote : | #2 |
Trying to reproduce. In the meantime, can you include machine and unit logs?
| Edward Hope-Morley (hopem) wrote : | #3 |
| Edward Hope-Morley (hopem) wrote : | #4 |
| Edward Hope-Morley (hopem) wrote : | #5 |
The reboot was at around 12:16.
| Cheryl Jennings (cherylj) wrote : | #6 |
There's definitely something going wrong with machine 0. It looks like some of the workers were having problems connecting to mongo, but didn't get restarted with a new connection.
There have been a number of bugs recently relating to recovering from EOF errors. I'm still digging around those to see if this was fixed in a later release.
| Tim Penhey (thumper) wrote : | #7 |
There is certainly something strange going on, but in order to track it down, I'm hoping you can add some extra logging config for us:
juju set-env logging-
Then ideally bounce the machine agent or agents that are running the API servers. Are you running in HA mode? The status you showed above did not list machine-0 yet the machine-0.log was added.
The logging configuration change should propagate to all the agents in the environment. If you then restart one of the nodes, we should have much more useful debugging information.
Thanks.
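The set-env command above is truncated in the report; assuming it was the usual logging-config change, the requested steps would look roughly like this (the exact value and the machine number are assumptions):

```shell
# Presumed form of the truncated request: raise agent logging to DEBUG
# (logging-config is the Juju 1.x environment setting; exact value is an assumption)
juju set-env logging-config='<root>=DEBUG'

# Bounce the machine agent running the API server (upstart job name on a
# trusty state server; use whichever machine hosts your state server)
juju ssh 0 'sudo service jujud-machine-0 restart'
```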
| Edward Hope-Morley (hopem) wrote : Re: [Bug 1499356] Re: all units have false hook errors after reboot | #8 |
I'm not running the state-server in HA, so just one unit of state-server. I don't have this env running anymore, but it is easily reproducible, so I'll provide more info when I hit it again.
| William Reade (fwereade) wrote : | #9 |
Will this be addressed by the auto-retry work that bogdanteleaga/axw are doing? Currently, if you happen to reboot while a hook is executing, the agent will consider the hook to have errored out and will wait for user intervention; this will change shortly.
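Until that auto-retry work lands, the user intervention mentioned here is the usual manual retry; a sketch for one of the affected units (unit name taken from the status output earlier):

```shell
# Re-run the failed hook on an errored unit (the manual intervention referred to above)
juju resolved --retry cinder/0

# Or mark it resolved without re-running the hook
juju resolved cinder/0
```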
| Edward Hope-Morley (hopem) wrote : | #10 |
Well, FTR, if I reboot a node that has, say, 6 units running on it (1 unit of each of 6 services) and none of them are seemingly doing anything (agent-state is started and they have been quiet for several hours), all units will go to the error state on reboot.
| Cheryl Jennings (cherylj) wrote : | #11 |
Talked with bogdanteleaga this morning, and we're not sure that this is caused by reboots happening while the hook is running. To investigate this further, we need to wait for a recreate with logging set to DEBUG, as mentioned in comment #7.
| Edward Hope-Morley (hopem) wrote : | #12 |
I redeployed my environment with --debug using Juju 1.24.6-
| Changed in juju-core: | |
| importance: | Critical → High |
| Changed in juju-core: | |
| milestone: | 1.24.7 → 1.24.8 |
| Cheryl Jennings (cherylj) wrote : | #13 |
Moving to incomplete as we're waiting for further debug logs.
| Changed in juju-core: | |
| status: | Triaged → Incomplete |
| Edwin Gnichtel (ned-4) wrote : | #14 |
After an upgrade from 1.23.3 to 1.25.0, we also appear to be seeing this issue in our production environment.
We are seeing:
2015-11-10 02:17:30 ERROR juju.rpc server.go:573 error writing response: EOF
2015-11-10 02:17:30 ERROR juju.rpc server.go:573 error writing response: EOF
2015-11-10 02:17:30 ERROR juju.rpc server.go:573 error writing response: EOF
2015-11-10 02:17:30 ERROR juju.rpc server.go:573 error writing response: EOF
in machine-0.log, and we are seeing similar "unknowns" under workload-status after rebooting a machine, as shown below (the example is juju-gui):
jujuwebgui03:
  charm: cs:trusty/
  exposed: true
  service-status:
    current: unknown
    since: 07 Nov 2015 04:32:51Z
  units:
    jujuwebgu
      workload-status:
        current: unknown
        since: 07 Nov 2015 04:32:51Z
      agent-status:
        current: idle
        since: 10 Nov 2015 02:52:55Z
        version: 1.25.0
      machine: 6/lxc/12
      open-ports:
      - 80/tcp
      - 443/tcp
Incidentally we also have hit: https:/
Happy to provide logs/debug info; I will start by enabling "juju set-env logging-
| Cheryl Jennings (cherylj) wrote : | #15 |
Yes, please enable DEBUG logging and upload the logs once you recreate. I'll take a look once they're up.
| Edwin Gnichtel (ned-4) wrote : | #16 |
Cheryl,
Which specific logs do you want us to attach? I have enabled the extended logging per your request and have rebooted one of the machine instances.
-N
| Cheryl Jennings (cherylj) wrote : | #17 |
It would be useful to have the following (a collection sketch follows this list):
- The machine logs from each state server
- The machine logs from machines hosting units with hook failures
- The unit logs from units with hook failures
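A minimal collection sketch, assuming the standard Juju 1.x log layout; the machine and unit names below are placeholders to be swapped for the ones in your status output:

```shell
# State server machine agent log (machine 0 here)
juju ssh 0 'sudo cat /var/log/juju/machine-0.log' > machine-0.log

# Machine agent log from a host whose units show hook failures (machine 6 as an example)
juju ssh 6 'sudo cat /var/log/juju/machine-6.log' > machine-6.log

# Log from a failing unit itself (unit name is a placeholder)
juju ssh jujuwebgui03/0 'sudo cat /var/log/juju/unit-jujuwebgui03-0.log' > unit-jujuwebgui03-0.log
```

The "Warning: Permanently added ..." lines that juju ssh prints (as seen earlier in this report) go to stderr, so the redirected files stay clean.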
| Edwin Gnichtel (ned-4) wrote : | #18 |
Sorry for the delay, our logs are being reviewed and sanitized since they are from a production environment. Should have them attached here in the next 24 hours or so.
Thanks,
-N
| Changed in juju-core: | |
| milestone: | 1.24.8 → none |
| Changed in juju-core: | |
| assignee: | Cheryl Jennings (cherylj) → nobody |
| Launchpad Janitor (janitor) wrote : | #19 |
[Expired for juju-core because there has been no activity for 60 days.]
| Changed in juju-core: | |
| status: | Incomplete → Expired |

