local provider machine agent uses 100% CPU after host reboot

Reported by Māris Fogels on 2012-05-30
This bug affects 8 people
Affects                  Importance   Assigned to
pyjuju                   High         Benjamin Saller
pyjuju 0.5 series        High         Unassigned
juju (Ubuntu)            High         Unassigned
juju (Ubuntu) Precise    High         Unassigned

Bug Description

The juju program uses 100% CPU if the host is rebooted while running an LXC environment. The CPU consumption does not happen immediately after reboot, but starts within a day or two of regular system use.

Destroying the environment kills the processes.

Here is the runaway process command:

  /usr/bin/python -m juju.agents.machine --nodaemon --logfile /tmp/local-juju/mars-local/machine-agent.log --session-file /var/run/juju/mars-local-machine-agent.zksession

Steps to reproduce:

 1. Create an environment with the local (LXC) provider
 2. Set up the wordpress + mysql deployment as per the juju tutorial
 3. Reboot the host system
 4. Use the system for a day

What happens: juju runs away, taking 100% of a CPU with it

I am using juju version 0.5+bzr537-1juju5~precise1

Clint Byrum (clint-fewbar) wrote :

Hi Māris. When I try this with the latest juju on precise, I don't get a runaway process. Mine just reconnects every few seconds...

[pid 1090] poll([{fd=4, events=POLLIN}, {fd=10, events=POLLIN|POLLOUT}], 2, 3333) = 1 ([{fd=10, revents=POLLIN|POLLOUT|POLLERR|POLLHUP}])
[pid 1090] getsockopt(10, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
[pid 1090] close(10) = 0
[pid 1090] poll([{fd=4, events=POLLIN}], 1, 3333) = 0 (Timeout)
[pid 1090] socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 10
[pid 1090] setsockopt(10, SOL_TCP, TCP_NODELAY, [1], 4) = 0
[pid 1090] fcntl(10, F_GETFL) = 0x2 (flags O_RDWR)
[pid 1090] fcntl(10, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 1090] connect(10, {sa_family=AF_INET, sin_port=htons(55429), sin_addr=inet_addr("192.168.122.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid 1090] poll([{fd=4, events=POLLIN}, {fd=10, events=POLLIN|POLLOUT}], 2, 3333) = 1 ([{fd=10, revents=POLLIN|POLLOUT|POLLERR|POLLHUP}])
[pid 1090] getsockopt(10, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
[pid 1090] close(10) = 0
[pid 1090] poll([{fd=4, events=POLLIN}], 1, 3333) = 0 (Timeout)
[pid 1090] socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 10
[pid 1090] setsockopt(10, SOL_TCP, TCP_NODELAY, [1], 4) = 0
[pid 1090] fcntl(10, F_GETFL) = 0x2 (flags O_RDWR)
[pid 1090] fcntl(10, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 1090] connect(10, {sa_family=AF_INET, sin_port=htons(55429), sin_addr=inet_addr("192.168.122.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid 1090] poll([{fd=4, events=POLLIN}, {fd=10, events=POLLIN|POLLOUT}], 2, 3333) = 1 ([{fd=10, revents=POLLIN|POLLOUT|POLLERR|POLLHUP}])
[pid 1090] getsockopt(10, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
[pid 1090] close(10) = 0
[pid 1090] poll([{fd=4, events=POLLIN}], 1, 3333

Can you strace the runaway process for a while with

strace -f -o /tmp/strace-machine-agent.log -p $PID_OF_MACHINE_AGENT

And then attach that as well?

Thanks!

Changed in juju:
status: New → Incomplete
importance: Undecided → High
Māris Fogels (mars) wrote :

Here is the output of strace, as requested.

Changed in juju:
status: Incomplete → New
Māris Fogels (mars) wrote :

You said your system reconnects every few seconds? My system reconnects hundreds of times a second. That 22MiB logfile was generated in about ten seconds.

Clint Byrum (clint-fewbar) wrote :

After some IRC debugging with mars, it's clear to me that there is something deep within libzookeeper that causes it to reconnect furiously after ZooKeeper has been unavailable for a long time.

I think the right thing to do here, then, is to not bring up the agents at boot time until ZooKeeper and the containers are also brought up at boot time.
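The strace above is, in effect, a reconnect loop with no delay between attempts. A minimal sketch of the capped exponential backoff that would keep a long ZooKeeper outage from pegging a CPU (illustrative only; these names are not juju or libzookeeper APIs):

```python
import time

def backoff_delays(base=0.1, cap=30.0, factor=2.0):
    """Yield reconnect delays that grow geometrically up to a cap."""
    delay = base
    while True:
        yield delay
        delay = min(delay * factor, cap)

def reconnect(connect, max_attempts=10, sleep=time.sleep):
    """Call connect() until it succeeds, sleeping a growing delay between tries."""
    delays = backoff_delays()
    for _ in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            sleep(next(delays))  # back off instead of retrying immediately
    raise ConnectionError("could not connect before giving up")
```

With a 0.1 s base and a 30 s cap, a client waiting out a long outage spends almost all of its time asleep rather than spinning through socket()/connect()/close() hundreds of times a second.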

Changed in juju:
status: New → Confirmed
status: Confirmed → Triaged
Changed in juju:
milestone: none → galapagos
assignee: nobody → Benjamin Saller (bcsaller)
Neal McBurnett (nealmcb) wrote :

I tried the strace suggested in #2, with the output below - pretty much the same as Māris's.

The nice thing is that doing the strace also stopped the 100% CPU usage, and the process I traced is no longer there.

The odd thing is that there were two processes at 100% CPU according to htop, but only one according to ps, and only one of my 4 CPUs was pegged. I traced the one that was not visible in ps, and now it is gone.

1343 poll([{fd=4, events=POLLIN}], 1, 0) = 0 (Timeout)
1343 poll([{fd=4, events=POLLIN}], 1, 0) = 0 (Timeout)
1343 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 10
1343 setsockopt(10, SOL_TCP, TCP_NODELAY, [1], 4) = 0
1343 fcntl(10, F_GETFL) = 0x2 (flags O_RDWR)
1343 fcntl(10, F_SETFL, O_RDWR|O_NONBLOCK) = 0
1343 connect(10, {sa_family=AF_INET, sin_port=htons(52489), sin_addr=inet_addr("192.168.122.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
1343 close(10) = 0

Neal McBurnett (nealmcb) wrote :

Another symptom related to this seems to be this output in the machine-agent.log, after each boot, as well as when I did the strace described in comment #2:

$ tail -10 /home/neal/juju/neal-local/machine-agent.log
2012-06-21 10:27:29,961: twisted@ERROR: Unhandled error in Deferred:
2012-06-21 10:27:29,987: twisted@ERROR: Unhandled Error
Traceback (most recent call last):
Failure: txzookeeper.client.ConnectionTimeoutException: could not connect before timeout

2012-06-22 07:03:32,081: twisted@ERROR: Unhandled error in Deferred:
2012-06-22 07:03:32,082: twisted@ERROR: Unhandled Error
Traceback (most recent call last):
Failure: txzookeeper.client.ConnectionTimeoutException: could not connect before timeout

Changed in juju:
milestone: galapagos → honolulu
Clint Byrum (clint-fewbar) wrote :

Neal, thanks for the extra info. This is pretty well understood now. There are really two bugs here. One is that we are starting the machine agent on reboot, which we shouldn't do since we know there will be no ZooKeeper on reboot; the other is that libzookeeper seems to loop uncontrollably on reconnecting at some point.

We'll fix the former in juju by not attempting to restart the machine agent. The latter, I don't think we'll get to, unfortunately.
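For anyone hitting this before the fix lands: the restart behaviour comes from the agent's upstart job, and upstart on precise honours one-word override files. A sketch of the relevant stanzas (the job name here is illustrative; the real one is derived from your environment name, e.g. juju-neal-local-machine-agent):

```
# /etc/init/juju-<env>-machine-agent.conf  (relevant stanzas only)
# "respawn" makes upstart restart the agent whenever its process dies
respawn
exec /usr/bin/python -m juju.agents.machine --nodaemon ...

# /etc/init/juju-<env>-machine-agent.override
# a one-line override keeps upstart from starting the job at boot
manual
```

Removing the override file later re-enables the job without touching the .conf itself.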

Neal McBurnett (nealmcb) wrote :

Thanks, Clint - makes sense, though I wonder how we'll configure services that need to restart on reboot.

For the record, here are some more kern.log messages, which accompanied the strace that I talked about before, and give some more background on why the CPU calmed down.

Jun 22 07:03:21 rf kernel: [74141.409402] init: juju-neal-local-machine-agent main process (1100) killed by TRAP signal
Jun 22 07:03:21 rf kernel: [74141.429782] init: juju-neal-local-machine-agent main process ended, respawning

Based on the respawn, I guess it may come back, so instructions for any newbies who run across this bug on how to most easily destroy the environment would help. I tried to dig up what I had done in my testing via the status command, but it doesn't help:

$ juju status
could not connect before timeout
2012-06-22 18:04:00,125 ERROR could not connect before timeout

I can't run 'juju destroy-service' without knowing what service I started ;) So e.g. how can we figure that out?

juju destroy-environment will clean things out and clear out the upstart jobs. The machine agent is a singleton on the host in a local provider environment, regardless of the services deployed in containers.


It would be good to allow an environment to remain in place but not run amok, so the user doesn't have to recreate it every time.

Neal McBurnett (nealmcb) wrote :

Adrien, I agree, something less destructive than juju destroy-environment would be helpful for people who run into this. The "lxc-list" command will show which containers exist, but juju destroy-service fails for me after a reboot as well. It's still unclear how to avoid the CPU churning while also preserving the environment.

Is there a workaround to cleanly resume after reboot, restarting the zookeeper, agents, etc?

Restarting ZooKeeper will stop the CPU load, as the machine agent will be able to connect. However, the containers would still need to be started; the machine agent should do this, but I don't believe it's implemented as such yet (detecting an existing container and restarting it vs. starting a new one).

-k


summary: - Juju uses 100% CPU after host reboot
+ local provider machine agent uses 100% CPU after host reboot
Selene Scriven (toykeeper) wrote :

I discovered this shortly after my CPU fan driver stopped working. The two issues combined to make my notebook overheat and shut down.

Changed in juju:
status: Triaged → In Progress
Changed in juju (Ubuntu):
importance: Undecided → High
Changed in juju (Ubuntu Precise):
importance: Undecided → High
Changed in juju (Ubuntu):
status: New → Triaged
Changed in juju (Ubuntu Precise):
status: New → Triaged
milestone: none → ubuntu-12.04.1
Changed in juju (Ubuntu Precise):
milestone: ubuntu-12.04.1 → none
Clint Byrum (clint-fewbar) wrote :

We don't want to backport this to 0.5.x until the issues with lxc requiring root access are addressed.

Changed in juju:
status: In Progress → Fix Committed
Andreas Hasenack (ahasenack) wrote :

This burns my lap everyday on precise :(

Changed in juju:
status: Fix Committed → Fix Released
Clint Byrum (clint-fewbar) wrote :

Fixed in 0.6 upstream.

Changed in juju (Ubuntu):
status: Triaged → Fix Released