Juju add-unit stuck on new lxc containers

Bug #1466570 reported by Jorge Niedbalski
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
juju-core
Expired
Undecided
Unassigned
1.20
Won't Fix
Medium
Cheryl Jennings

Bug Description

[Environment]

Trusty 14.04.2
Juju 1.20.11

[Description]

Juju add-unit hangs for new lxc containers.

- This environment has 3 state servers running, and 3 hypervisor machines for adding LXC guests.
- Jujud is running in all of the agents/states , listening correctly on 17070.

1) Running juju add-unit

ubuntu@customer:~$ juju -v --debug add-unit rabbitmq-slo --to lxc:3

2015-06-17 16:22:21 INFO juju.cmd supercommand.go:37 running juju [1.20.11-trusty-amd64 gc]
2015-06-17 16:22:21 DEBUG juju.conn api.go:187 trying cached API connection settings
2015-06-17 16:22:21 INFO juju.conn api.go:270 connecting to API addresses: [bootstrap.customer.net:17070 bootstrap.customer.net:17070 10.92.68.32:17070 juju-state2.customer.net:17070 juju-state2.customer.net:17070 10.92.68.178:17070 juju-state1.customer.net:17070 juju-state1.customer.net:17070 10.92.68.177:17070]
2015-06-17 16:22:21 INFO juju.state.api apiclient.go:242 dialing "wss://bootstrap.customer.net:17070/environment/89abab6f-d45d-4f19-8a29-88d759e98a53/api"
2015-06-17 16:22:21 INFO juju.state.api apiclient.go:176 connection established to "wss://bootstrap.customer.net:17070/environment/89abab6f-d45d-4f19-8a29-88d759e98a53/api"
2015-06-17 16:22:22 DEBUG juju.conn api.go:407 API addresses changed from ["bootstrap.customer.net:17070" "bootstrap.customer.net:17070" "10.92.68.32:17070" "juju-state2.customer.net:17070" "juju-state2.customer.net:17070" "10.92.68.178:17070" "juju-state1.customer.net:17070" "juju-state1.customer.net:17070" "10.92.68.177:17070"] to ["bootstrap.customer.net:17070" "bootstrap.customer.net:17070" "10.92.68.32:17070" "juju-state1.customer.net:17070" "juju-state1.customer.net:17070" "10.92.68.177:17070" "juju-state2.customer.net:17070" "juju-state2.customer.net:17070" "10.92.68.178:17070"]
2015-06-17 16:22:22 INFO juju.conn api.go:418 updated API connection settings cache

- What's printed at that point on the state server juju log:

2015-06-17 14:18:51 ERROR juju.rpc server.go:554 error writing response: EOF
2015-06-17 14:18:51 ERROR juju.rpc server.go:554 error writing response: EOF
2015-06-17 14:18:51 ERROR juju.rpc server.go:554 error writing response: EOF
2015-06-17 14:18:51 ERROR juju.rpc server.go:554 error writing response: EOF
2015-06-17 14:18:51 ERROR juju.rpc server.go:554 error writing response: EOF
2015-06-17 14:18:51 ERROR juju.rpc server.go:554 error writing response: EOF
2015-06-17 14:18:51 ERROR juju.rpc server.go:554 error writing response: EOF
2015-06-17 15:18:07 INFO juju.cmd supercommand.go:37 running jujud [1.20.11.1-trusty-amd64 gc]
2015-06-17 15:18:07 INFO juju.cmd.jujud machine.go:158 machine agent machine-0 start (1.20.11.1-trusty-amd64 [gc])
2015-06-17 15:18:07 DEBUG juju.agent agent.go:377 read agent config, format "1.18"
2015-06-17 15:18:07 INFO juju.cmd.jujud machine.go:169 no upgrade steps required or upgrade steps for 1.20.11.1 have already be
en run.
2015-06-17 15:18:07 INFO juju.worker runner.go:260 start "api"
2015-06-17 15:18:07 INFO juju.worker runner.go:260 start "statestarter"
2015-06-17 15:18:07 INFO juju.worker runner.go:260 start "termination"
2015-06-17 15:18:07 INFO juju.state.api apiclient.go:242 dialing "wss://localhost:17070/"
2015-06-17 15:18:07 INFO juju.worker runner.go:260 start "state"
2015-06-17 15:18:07 INFO juju.state.api apiclient.go:250 error dialing "wss://localhost:17070/": websocket.Dial wss://localhost
:17070/: dial tcp 127.0.0.1:17070: connection refused
2015-06-17 15:18:07 ERROR juju.worker runner.go:218 exited "api": unable to connect to "wss://localhost:17070/"
2015-06-17 15:18:07 INFO juju.worker runner.go:252 restarting "api" in 3s
2015-06-17 15:18:07 INFO juju.mongo mongo.go:171 Ensuring mongo server is running; data directory /var/lib/juju; port 37017
2015-06-17 15:18:07 INFO juju.mongo mongo.go:326 installing juju-mongodb
2015-06-17 15:18:07 INFO juju.utils.apt apt.go:132 Running: [apt-get --option=Dpkg::Options::=--force-confold --option=Dpkg::op
tions::=--force-unsafe-io --assume-yes --quiet install juju-mongodb]
2015-06-17 15:18:08 DEBUG juju.mongo mongo.go:275 using mongod: /usr/lib/juju/bin/mongod --version: "db version v2.4.9\nWed Jun
 17 15:18:08.599 git version: nogitversion\n"
2015-06-17 15:18:08 DEBUG juju.mongo mongo.go:201 mongo exists as expected
2015-06-17 15:18:08 INFO juju.state open.go:47 opening state, mongo addresses: ["127.0.0.1:37017"]; entity "machine-0"
2015-06-17 15:18:08 DEBUG juju.state open.go:48 dialing mongo
2015-06-17 15:18:08 INFO juju.mongo open.go:104 dialled mongo successfully
2015-06-17 15:18:08 INFO juju.mongo open.go:104 dialled mongo successfully
2015-06-17 15:18:08 INFO juju.mongo open.go:104 dialled mongo successfully
2015-06-17 15:18:08 INFO juju.mongo open.go:104 dialled mongo successfully
2015-06-17 15:18:08 DEBUG juju.state open.go:53 connection established
2015-06-17 15:18:08 INFO juju.mongo open.go:104 dialled mongo successfully
2015-06-17 15:18:09 INFO juju.mongo open.go:104 dialled mongo successfully
2015-06-17 15:18:10 INFO juju.worker.singular singular.go:57 runner created; isMaster false
2015-06-17 15:18:10 DEBUG juju.utils gomaxprocs.go:24 setting GOMAXPROCS to 2
2015-06-17 15:18:10 INFO juju.worker runner.go:260 start "instancepoller"
2015-06-17 15:18:10 INFO juju.worker runner.go:260 start "peergrouper"
2015-06-17 15:18:10 INFO juju.worker runner.go:260 start "apiserver"
2015-06-17 15:18:10 INFO juju.state.apiserver apiserver.go:70 listening on "[::]:17070"
2015-06-17 15:18:10 INFO juju.worker.singular singular.go:104 standby "cleaner"
2015-06-17 15:18:10 DEBUG juju.provider.maas environprovider.go:30 opening environment "maas".
2015-06-17 15:18:10 INFO juju.worker.instanceupdater worker.go:50 instance poller received inital environment configuration
2015-06-17 15:18:10 INFO juju.worker.singular singular.go:104 standby "resumer"
2015-06-17 15:18:10 INFO juju.worker.singular singular.go:104 standby "minunitsworker"
2015-06-17 15:18:10 INFO juju.worker runner.go:260 start "cleaner"
2015-06-17 15:18:10 INFO juju.worker runner.go:260 start "resumer"
2015-06-17 15:18:10 INFO juju.worker runner.go:260 start "minunitsworker"
2015-06-17 15:18:10 DEBUG juju.provider.maas environprovider.go:30 opening environment "maas".
2015-06-17 15:18:10 INFO juju.mongo open.go:104 dialled mongo successfully
2015-06-17 15:18:10 INFO juju.mongo open.go:104 dialled mongo successfully
2015-06-17 15:18:10 DEBUG juju.worker.peergrouper worker.go:422 found new machine "0"
2015-06-17 15:18:10 DEBUG juju.worker.peergrouper worker.go:422 found new machine "81"
2015-06-17 15:18:10 DEBUG juju.worker.peergrouper worker.go:422 found new machine "82"
2015-06-17 15:18:10 INFO juju.mongo open.go:104 dialled mongo successfully
2015-06-17 15:18:10 INFO juju.worker runner.go:260 start "api"

- What's printed on the machine 3 ( the hypervisor guest ):

2015-06-17 15:49:00 INFO juju.worker runner.go:260 start "api"
2015-06-17 15:49:00 INFO juju.worker runner.go:260 start "statestarter"
2015-06-17 15:49:00 INFO juju.state.api apiclient.go:242 dialing "wss://bootstrap.lsd-slo.bskyb.com:17070/"
2015-06-17 15:49:00 INFO juju.worker runner.go:260 start "termination"
2015-06-17 15:49:00 INFO juju.state.api apiclient.go:242 dialing "wss://juju-state1.lsd-slo.bskyb.com:17070/"
2015-06-17 15:49:00 INFO juju.state.api apiclient.go:242 dialing "wss://juju-state2.lsd-slo.bskyb.com:17070/"
2015-06-17 15:49:00 INFO juju.state.api apiclient.go:176 connection established to "wss://juju-state1.lsd-slo.bskyb.com:17070/"
2015-06-17 15:49:01 INFO juju.worker runner.go:260 start "upgrader"
2015-06-17 15:49:01 INFO juju.worker runner.go:260 start "upgrade-steps"
2015-06-17 15:49:01 INFO juju.worker runner.go:260 start "machiner"
2015-06-17 15:49:01 INFO juju.worker runner.go:260 start "apiaddressupdater"
2015-06-17 15:49:01 INFO juju.worker runner.go:260 start "logger"
2015-06-17 15:49:01 DEBUG juju.worker.logger logger.go:35 initial log config: "<root>=DEBUG"
2015-06-17 15:49:01 INFO juju.worker runner.go:260 start "machineenvironmentworker"
2015-06-17 15:49:01 DEBUG juju.worker.logger logger.go:60 logger setup
2015-06-17 15:49:01 DEBUG juju.worker.machineenvironment machineenvironmentworker.go:71 write system files: true
2015-06-17 15:49:01 INFO juju.worker runner.go:260 start "rsyslog"
2015-06-17 15:49:01 INFO juju.worker runner.go:260 start "authenticationworker"
2015-06-17 15:49:01 DEBUG juju.worker.rsyslog worker.go:75 starting rsyslog worker mode 1 for "machine-3" ""
2015-06-17 15:49:01 DEBUG juju.container.kvm kvm.go:69 kvm-ok output:
INFO: /dev/kvm exists
KVM acceleration can be used

2015-06-17 15:49:01 DEBUG juju.worker.logger logger.go:45 reconfiguring logging from "<root>=DEBUG" to "juju.container.lxc=TRACE;unit=DEBUG"
2015-06-17 16:18:42 ERROR juju.provisioner container_initialisation.go:158 cannot get tools from machine for kvm container
2015-06-17 16:18:42 WARNING juju.provisioner container_initialisation.go:114 not stopping machine agent container watcher due to error: initialising container infrastructure on host machine: connection is shut down
2015-06-17 16:18:42 ERROR juju.provisioner container_initialisation.go:96 starting container provisioner for kvm: initialising container infrastructure on host machine: connection is shut down
2015-06-17 16:18:42 ERROR juju.provisioner container_initialisation.go:158 cannot get tools from machine for lxc container
2015-06-17 16:18:42 WARNING juju.provisioner container_initialisation.go:114 not stopping machine agent container watcher due to error: initialising container infrastructure on host machine: connection is shut down
2015-06-17 16:18:42 ERROR juju.provisioner container_initialisation.go:96 starting container provisioner for lxc: initialising container infrastructure on host machine: connection is shut down
2015-06-17 16:18:42 ERROR juju.state.api.watcher watcher.go:68 error trying to stop watcher: connection is shut down
2015-06-17 16:18:42 ERROR juju.state.api.watcher watcher.go:68 error trying to stop watcher: connection is shut down
2015-06-17 16:18:42 ERROR juju.state.api.watcher watcher.go:68 error trying to stop watcher: connection is shut down
2015-06-17 16:18:42 ERROR juju.state.api.watcher watcher.go:68 error trying to stop watcher: connection is shut down
2015-06-17 16:18:42 ERROR juju.state.api.watcher watcher.go:68 error trying to stop watcher: connection is shut down
2015-06-17 16:18:42 ERROR juju.state.api.watcher watcher.go:68 error trying to stop watcher: connection is shut down
2015-06-17 16:18:42 ERROR juju.state.api.watcher watcher.go:68 error trying to stop watcher: connection is shut down
2015-06-17 16:18:42 ERROR juju.state.api.watcher watcher.go:68 error trying to stop watcher: connection is shut down
2015-06-17 16:18:42 ERROR juju.state.api.watcher watcher.go:68 error trying to stop watcher: connection is shut down
2015-06-17 16:18:42 ERROR juju.worker runner.go:207 fatal "upgrader": connection is shut down
2015-06-17 16:18:42 ERROR juju.worker runner.go:207 fatal "authenticationworker": connection is shut down
2015-06-17 16:18:42 ERROR juju.worker runner.go:207 fatal "machiner": connection is shut down
2015-06-17 16:18:42 ERROR juju.worker runner.go:207 fatal "logger": connection is shut down

Aaron Bentley (abentley)
Changed in juju-core:
status: New → Incomplete
Revision history for this message
Cheryl Jennings (cherylj) wrote :

The container is failing to start because it cannot get the tools required to do so. It is getting an error that the connection to the state server is shut down. It is troubling that:
1 - this happens every time, and
2 - this makes the add-unit command hang, instead of just failing.

I see that there are rpc errors from the state server that indicate it could not write a response, and I am wondering if there's something about this specific response that's causing the rpc connection to shut down.

I have asked for a recreate with trace level logging with logs from the state servers and machine 3. In the meantime, I am still digging through the code and attempting a local recreate.

Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

@cherylj,

Things has become worse. The following log entries are being written in the state servers:

2015-06-17 15:19:13 DEBUG juju.worker.logger logger.go:45 reconfiguring logging from "<root>=DEBUG" to "juju.worker.runner=TRACE;unit=DEBUG"
2015-06-18 10:39:53 ERROR juju.rpc server.go:554 error writing response: EOF
2015-06-19 01:49:20 ERROR juju.worker.peergrouper worker.go:138 peergrouper loop terminated: tomb: dying
2015-06-19 01:49:20 ERROR juju.worker runner.go:207 fatal "peergrouper": watcher iteration error: EOF
2015-06-19 01:49:20 ERROR juju.state.apiserver.common resource.go:102 error stopping *state.entityWatcher resource: watcher iteration error: EOF
2015-06-19 01:49:20 ERROR juju.state.apiserver.common resource.go:102 error stopping *state.relationUnitsWatcher resource: watcher iteration error: EOF
2015-06-19 01:49:20 ERROR juju.state.apiserver.common resource.go:102 error stopping *state.relationUnitsWatcher resource: watcher iteration error: EOF
2015-06-19 01:49:20 ERROR juju.state.apiserver.common resource.go:102 error stopping *state.entityWatcher resource: watcher iteration error: EOF
2015-06-19 01:49:20 ERROR juju.state.apiserver.common resource.go:102 error stopping *state.entityWatcher resource: watcher iteration error: EOF
2015-06-19 01:49:20 ERROR juju.state.apiserver.common resource.go:102 error stopping *state.entityWatcher resource: watcher iteration error: EOF
2015-06-19 01:49:20 ERROR juju.state.apiserver.common resource.go:102 error stopping *state.entityWatcher resource: watcher iteration error: EOF
2015-06-19 01:49:20 ERROR juju.state.apiserver.common resource.go:102 error stopping *state.relationUnitsWatcher resource: watcher iteration error: EOF
2015-06-19 01:49:20 ERROR juju.state.apiserver.common resource.go:102 error stopping *state.relationUnitsWatcher resource: watcher iteration error: EOF
2015-06-19 01:49:20 ERROR juju.state.apiserver.common resource.go:102 error stopping *state.relationUnitsWatcher resource: watcher iteration error: EOF
2015-06-19 01:49:20 ERROR juju.state.apiserver.common resource.go:102 error stopping *state.relationUnitsWatcher resource: watcher iteration error: EOF
2015-06-19 01:49:20 ERROR juju.state.apiserver.common resource.go:102 error stopping *state.entityWatcher resource: watcher iteration error: EOF
2015-06-19 01:49:20 ERROR juju.state.apiserver.common resource.go:102 error stopping *state.relationUnitsWatcher resource: watcher iteration error: EOF
2015-06-19 01:49:20 ERROR juju.state.apiserver.common resource.go:102 error stopping *state.relationUnitsWatcher resource: watcher iteration error: EOF
2015-06-19 01:49:20 ERROR juju.state.apiserver.common resource.go:102 error stopping *state.entityWatcher resource: watcher iteration error: EOF

And since those errors started it is now just spitting out the following message every few seconds:

2015-06-19 08:55:59 ERROR juju.worker runner.go:218 exited "api": unable to connect to "wss://localhost:17070/"

Revision history for this message
Cheryl Jennings (cherylj) wrote :

Discussed this problem a bit more with other core team members. It is highly likely that the add-unit command is hanging because of transactions that are hanging as a result of problems with the mongodb. To verify, we would like to take a look at a backup / dump of mongo.

There were a number of txn fixes that went into 1.22.5 and it is highly recommended that this installation is upgraded to that level.

First, we should get the mongo backup / dump to examine and confirm, and then attempt to repair the database.

Revision history for this message
Cheryl Jennings (cherylj) wrote :

Looked through the database with Menno this afternoon. We tried running the mgopurge tool, but didn't see any missing transactions. There were no transactions in an unexpected state. We're asking John to chime in on this once he starts his day.

In the meantime, I'm going to try and recreate the lxc failure to see if that leads to a recreate of the add-unit hang.

Revision history for this message
Cheryl Jennings (cherylj) wrote :

I wasn't able to recreate the add-unit hang.

I spoke with Menno more today and his discussions with John did not turn up much. At this point, we need to start looking more broadly at the state servers.

Could we get an sosreport from the three state servers in the environment?

tags: removed: sts
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for juju-core because there has been no activity for 60 days.]

Changed in juju-core:
status: Incomplete → Expired
Revision history for this message
Anastasia (anastasia-macmood) wrote :

targeted against old milestone

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.