Interruption of network connectivity should be handled gracefully

Bug #846106 reported by Adam Gandelman
This bug affects 4 people
Affects: pyjuju
Status: Fix Released
Importance: High
Assigned to: Kapil Thangavelu
Milestone: florence

Bug Description

While deploying to hardware, we've hit a couple of instances where network connectivity has been briefly interrupted while hooks were executing. The result is disconnect errors on the service unit and 'null' state from the client. In one instance, the formula's install hook is configuring a new network bridge and requires a networking restart to activate the new interface. In the other, powernap was meddling with the network stack after boot and caused brief interruptions of connectivity to zookeeper. In any case, juju should do everything it can to handle and recover from network interruption.

Attached is the tail end of an install hook.
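As a rough illustration of the kind of recovery the description asks for, here is a minimal retry-with-backoff sketch in Python; retry_call and its parameters are illustrative only and not pyjuju API:

    import socket
    import time

    # Illustrative sketch only; not pyjuju's actual retry logic.
    def retry_call(fn, attempts=5, delay=1.0, backoff=2.0):
        """Invoke fn(), retrying transient network errors with exponential backoff."""
        for attempt in range(attempts):
            try:
                return fn()
            except socket.error:
                if attempt == attempts - 1:
                    raise  # retries exhausted; surface the real error
                time.sleep(delay)
                delay *= backoff

Wrapping the client's zookeeper calls in something like this would turn a few seconds of lost connectivity into a short stall rather than a disconnect error and a 'null' state from the client.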


Revision history for this message
Clint Byrum (clint-fewbar) wrote :

This is definitely a problem and has been seen multiple times. It's probably going to have to be split into several smaller bugs, each detailing one thing that has to be fixed.

Changed in juju:
status: New → Confirmed
importance: Undecided → High
tags: added: hardware production
Revision history for this message
Juan L. Negron (negronjl) wrote :

I did some meddling with the zookeeper node:

- deployed several charms, added relations, etc.
- ssh'd into zookeeper and ran apt-get update and apt-get dist-upgrade; this seemed to work fine once the zookeeper node came back up.
- when I manually killed the zookeeper process, it never came back.

The last point seems strange to me given the importance of having zookeeper up and running for the entire juju deployment to work properly. It appears that there isn't anything watching the zookeeper process to ensure that it stays running.

Another strange thing I saw: after juju destroy-service <service> and juju terminate-machine <machine>, I ended up with two nodes (the zookeeper node and one of my charms). The strange part came when I ran juju destroy-environment. My initial (pre-zookeeper-meddling) environment had 6 instances (5 charms and the zookeeper node), but after my experiment I had only two; yet juju was still trying (and waiting) for 6 instances to terminate, not 2.

It appears that something went out of sync.

-Juan

Revision history for this message
Clint Byrum (clint-fewbar) wrote : Re: [Bug 846106] Re: Interruption of network connectivity should be handled gracefully

Excerpts from Juan L. Negron's message of Thu Sep 29 19:45:18 UTC 2011:
> I did some meddling with the zookeeper node:
>
> - deployed several charms, added relations, etc.
> - ssh'd into zookeeper and ran apt-get update and apt-get dist-upgrade; this seemed to work fine once the zookeeper node came back up.
> - when I manually killed the zookeeper process, it never came back.
>
> The last point seems strange to me given the importance of having
> zookeeper up and running for the entire juju deployment to work
> properly. It appears that there isn't anything watching the zookeeper
> process to ensure that it stays running.
>

I opened bug #862762 to address this. zookeeperd can be migrated from
sysvinit to upstart so that the process is respawned if it dies.

It's important to note that in the future there will be multiple instances
running zookeeper, so it's less critical, I think. We'll also, I think,
need to be able to point any monitoring systems in an environment at
the bootstrap node easily. I believe the long-term plan is for the bootstrap
node(s) to be managed as a service.

> Another strange thing I saw: after juju destroy-service <service>
> and juju terminate-machine <machine>, I ended up with two nodes (the
> zookeeper node and one of my charms). The strange part came when I ran
> juju destroy-environment. My initial (pre-zookeeper-meddling)
> environment had 6 instances (5 charms and the zookeeper node), but
> after my experiment I had only two; yet juju was still trying (and
> waiting) for 6 instances to terminate, not 2.
>
> It appears that something went out of sync.
>

Was this by chance with OpenStack instances? I found a nasty bug regarding
terminating instances here:

https://bugs.launchpad.net/ubuntu/+source/txaws/+bug/862595

Revision history for this message
Kapil Thangavelu (hazmat) wrote :

I did some testing today on a slightly easier scenario, namely killing zookeeper for extended periods when there is no active hook execution. The agents go into polling mode, trying to reconnect to zk every few seconds. For standalone zk, even after extended periods, when the server is brought back the existing agent sessions are still active, so the agents reconnect and everything continues as normal.
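For illustration, the session semantics at work here can be sketched with the kazoo client library (an assumption for the example's sake; pyjuju itself talks to zookeeper through txzookeeper). A zookeeper session outlives brief TCP disconnects, so as long as the server returns within the session timeout, the agent's watches and ephemeral presence nodes survive:

    # Illustrative only: kazoo is not the library pyjuju used; this just
    # demonstrates the reconnect/session behaviour described above.
    import time

    from kazoo.client import KazooClient
    from kazoo.handlers.threading import KazooTimeoutError
    from kazoo.protocol.states import KazooState

    def on_state_change(state):
        if state == KazooState.SUSPENDED:
            # TCP connection dropped, but the server-side session may still
            # be alive; the client keeps retrying in the background.
            print("suspended: retrying in the background")
        elif state == KazooState.CONNECTED:
            # Reconnected within the session timeout: watches and ephemeral
            # (presence) nodes are intact, as observed above.
            print("reconnected: session and watches intact")
        elif state == KazooState.LOST:
            # Session expired; the agent must re-register its presence.
            print("session lost: must re-establish state")

    client = KazooClient(hosts="127.0.0.1:2181", timeout=10.0)
    client.add_listener(on_state_change)

    # Poll until the server is reachable instead of failing on a brief
    # outage, mirroring the agents' reconnect-every-few-seconds behaviour.
    while True:
        try:
            client.start(timeout=15)
            break
        except KazooTimeoutError:
            time.sleep(5)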

Changed in juju:
milestone: none → florence
description: updated
Changed in juju:
assignee: nobody → Kapil Thangavelu (hazmat)
Changed in juju:
status: Confirmed → In Progress
Changed in juju:
status: In Progress → Triaged
status: Triaged → Fix Released