Interruption of network connectivity should be handled gracefully

Bug #846106 reported by Adam Gandelman
This bug affects 4 people
Affects: pyjuju
Status: Fix Released
Importance: High
Assigned to: Kapil Thangavelu
Milestone: florence

Bug Description

While deploying to hardware, we've hit a couple of instances where network connectivity has been briefly interrupted while hooks were executing. The result is disconnect errors on the service unit and 'null' state from the client. In one instance, the formula's install hook is configuring a new network bridge and requires a networking restart to activate the new interface. In the other, powernap was meddling with the network stack after boot and caused brief interruptions of connectivity to zookeeper. In any case, juju should do everything it can to handle and recover from network interruption.

Attached is the tail end of an install hook.
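As a rough illustration of the kind of recovery the description asks for, here is a minimal retry-with-backoff sketch in Python; retry_call and its parameters are illustrative only and not pyjuju API:

    import socket
    import time

    # Illustrative sketch only; not pyjuju's actual retry logic.
    def retry_call(fn, attempts=5, delay=1.0, backoff=2.0):
        """Invoke fn(), retrying transient network errors with exponential backoff."""
        for attempt in range(attempts):
            try:
                return fn()
            except socket.error:
                if attempt == attempts - 1:
                    raise  # retries exhausted; surface the real error
                time.sleep(delay)
                delay *= backoff

Wrapping the client's zookeeper calls in something like this would turn a few seconds of lost connectivity into a short stall rather than a disconnect error and a 'null' state from the client.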


Revision history for this message
Clint Byrum (clint-fewbar) wrote :

This is definitely a problem and has been seen multiple times. It's probably going to have to be split into several smaller bugs, each detailing one thing that has to be fixed.

Changed in juju:
status: New → Confirmed
importance: Undecided → High
tags: added: hardware production
Revision history for this message
Juan L. Negron (negronjl) wrote :

I did some meddling with the zookeeper node:

- deployed several charms, added relations, etc.
- ssh'd into zookeeper and ran apt-get update and apt-get dist-upgrade; this seemed to work fine once the zookeeper node came back up.
- when I manually killed the zookeeper process, it never came back.

The last point seems strange to me given the importance of having zookeeper up and running for the entire juju deployment to work properly. It appears that there isn't anything watching the zookeeper process to ensure that it stays running.

Another strange thing I saw: after juju destroy-service <service> and juju terminate-machine <machine>, I ended up with two nodes (the zookeeper node and one of my charms). The strange part came when I ran juju destroy-environment. My initial (pre-zookeeper-meddling) environment had 6 instances (5 charms and the zookeeper node), but after my experiment I had only two; yet juju was still trying (and waiting) for 6 instances to terminate, not 2.

It appears that something went out of sync.

-Juan

Revision history for this message
Clint Byrum (clint-fewbar) wrote : Re: [Bug 846106] Re: Interruption of network connectivity should be handled gracefully

Excerpts from Juan L. Negron's message of Thu Sep 29 19:45:18 UTC 2011:
> I did some meddling with the zookeeper node:
>
> - deployed several charms, added relations, etc.
> - ssh'd into zookeeper and ran apt-get update and apt-get dist-upgrade; this seemed to work fine once the zookeeper node came back up.
> - when I manually killed the zookeeper process, it never came back.
>
> The last point seems strange to me given the importance of having
> zookeeper up and running for the entire juju deployment to work
> properly. It appears that there isn't anything watching the zookeeper
> process to ensure that it stays running.
>

I opened bug #862762 to address this. zookeeperd can be migrated from
sysvinit to upstart so that the process is respawned if it dies.

It's important to note that in the future there will be multiple instances
running zookeeper, so it's less critical, I think. We'll also, I think,
need to be able to point any monitoring systems in an environment at
the bootstrap node easily. I believe the long-term plan is for the bootstrap
node(s) to be managed as a service.

> Another strange thing I saw: after juju destroy-service <service>
> and juju terminate-machine <machine>, I ended up with two nodes (the
> zookeeper node and one of my charms). The strange part came when I ran
> juju destroy-environment. My initial (pre-zookeeper-meddling)
> environment had 6 instances (5 charms and the zookeeper node), but
> after my experiment I had only two; yet juju was still trying (and
> waiting) for 6 instances to terminate, not 2.
>
> It appears that something went out of sync.
>

Was this by chance with OpenStack instances? I found a nasty bug regarding
terminating instances here:

https://bugs.launchpad.net/ubuntu/+source/txaws/+bug/862595

Revision history for this message
Kapil Thangavelu (hazmat) wrote :

I did some testing today on a slightly easier scenario, namely killing zookeeper for extended periods when there is no active hook execution. The agents go into polling mode, trying to reconnect to zk every few seconds. For standalone zk, even after extended periods, when the server is brought back the existing agent sessions are still active, so the agents reconnect and everything continues as normal.
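For illustration, the session semantics at work here can be sketched with the kazoo client library (an assumption for the example's sake; pyjuju itself talks to zookeeper through txzookeeper). A zookeeper session outlives brief TCP disconnects, so as long as the server returns within the session timeout, the agent's watches and ephemeral presence nodes survive:

    # Illustrative only: kazoo is not the library pyjuju used; this just
    # demonstrates the reconnect/session behaviour described above.
    import time

    from kazoo.client import KazooClient
    from kazoo.handlers.threading import KazooTimeoutError
    from kazoo.protocol.states import KazooState

    def on_state_change(state):
        if state == KazooState.SUSPENDED:
            # TCP connection dropped, but the server-side session may still
            # be alive; the client keeps retrying in the background.
            print("suspended: retrying in the background")
        elif state == KazooState.CONNECTED:
            # Reconnected within the session timeout: watches and ephemeral
            # (presence) nodes are intact, as observed above.
            print("reconnected: session and watches intact")
        elif state == KazooState.LOST:
            # Session expired; the agent must re-register its presence.
            print("session lost: must re-establish state")

    client = KazooClient(hosts="127.0.0.1:2181", timeout=10.0)
    client.add_listener(on_state_change)

    # Poll until the server is reachable instead of failing on a brief
    # outage, mirroring the agents' reconnect-every-few-seconds behaviour.
    while True:
        try:
            client.start(timeout=15)
            break
        except KazooTimeoutError:
            time.sleep(5)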

Changed in juju:
milestone: none → florence
description: updated
Changed in juju:
assignee: nobody → Kapil Thangavelu (hazmat)
Changed in juju:
status: Confirmed → In Progress
Changed in juju:
status: In Progress → Triaged
status: Triaged → Fix Released