juju relation-list doesn't report full units list when unit is down

Bug #1267913 reported by Yolanda Robla
This bug affects 1 person
Affects             Status     Importance  Assigned to  Milestone
juju-core           Won't Fix  High        Unassigned
juju-core (Ubuntu)  Won't Fix  High        Unassigned

Bug Description

While testing the rabbit active/active charm, I found an issue. I add some clustered rabbit units, for example 4 units. When I list all the relation members with relation-list I get the correct list:
"INFO" "got units: ['rabbitmq-server/0', 'rabbitmq-server/1', 'rabbitmq-server/2']"]

Then I stop the rabbitmq-server/0 unit uncleanly, with a nova stop. After that I make the same call and receive:

"INFO" "got units: ['rabbitmq-server/0']"

I can reproduce that every time I start/stop the instance. That's an issue for HA because, if the first unit is stopped, I'm not able to add more units and cluster them with the existing ones.
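(For reference, the peer list above comes from the relation-list hook tool. A minimal sketch of that query, assuming a Python hook and a peer relation named "cluster"; this is not the actual rabbitmq-server charm code, but relation-ids and relation-list are the standard juju hook tools available inside a hook environment:)

#!/usr/bin/env python
# Minimal sketch, assuming a Python hook and a peer relation named "cluster".
# Not the actual rabbitmq-server charm code.
import subprocess

def cluster_peers():
    """Return the unit names currently visible on the cluster peer relation."""
    peers = []
    for rid in subprocess.check_output(
            ['relation-ids', 'cluster'], universal_newlines=True).split():
        peers.extend(subprocess.check_output(
            ['relation-list', '-r', rid], universal_newlines=True).split())
    return peers

print(cluster_peers())  # e.g. ['rabbitmq-server/0', 'rabbitmq-server/1', ...]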

Revision history for this message
Yolanda Robla (yolanda.robla) wrote :

Problem reduced to the following:

When the first unit is up and a relation is added, relation-list stops at the first failing unit, so it doesn't list the rest.

That is a problem for HA because, if for some reason the first unit is lost, we are unable to add more units to the service; the only way to fix it is to destroy the service and recreate it.

Revision history for this message
James Page (james-page) wrote :

ubuntu@james-page-bastion:~/charms⟫ juju status
verbose is deprecated with the current meaning, use show-log
2014-01-16 12:11:18 INFO juju.provider.openstack provider.go:116 opening environment "jamespage-serverstack"
2014-01-16 12:11:18 INFO juju.state open.go:68 opening state; mongo addresses: ["10.5.0.29:37017"]; entity ""
2014-01-16 12:11:18 INFO juju.state open.go:106 connection established
environment: jamespage-serverstack
machines:
  "0":
    agent-state: started
    agent-version: 1.16.5
    dns-name: 10.5.0.29
    instance-id: d7509837-8a15-4522-bda3-2b9832bff96a
    instance-state: ACTIVE
    series: precise
    hardware: arch=amd64 cpu-cores=1 mem=900M
  "1":
    agent-state: started
    agent-version: 1.16.5
    instance-id: 53bc707c-6309-46f7-a832-95da15a36e19
    instance-state: missing
    series: precise
    hardware: arch=amd64 cpu-cores=1 mem=900M
  "2":
    agent-state: started
    agent-version: 1.16.5
    dns-name: 10.5.0.31
    instance-id: baaccb7a-c590-436e-bd04-3637074b2f9d
    instance-state: ACTIVE
    series: precise
    hardware: arch=amd64 cpu-cores=1 mem=900M
  "3":
    agent-state: started
    agent-version: 1.16.5
    dns-name: 10.5.0.32
    instance-id: 5e20f3c2-6dfe-4b34-8155-c3e02ce737c3
    instance-state: ACTIVE
    series: precise
    hardware: arch=amd64 cpu-cores=1 mem=900M
services:
  rabbitmq-server:
    charm: cs:precise/rabbitmq-server-17
    exposed: false
    relations:
      cluster:
      - rabbitmq-server
    units:
      rabbitmq-server/0:
        agent-state: started
        agent-version: 1.16.5 "2":
    agent-state: started
    agent-version: 1.16.5
    dns-name: 10.5.0.31
    instance-id: baaccb7a-c590-436e-bd04-3637074b2f9d
    instance-state: ACTIVE
    series: precise
    hardware: arch=amd64 cpu-cores=1 mem=900M
  "3":
    agent-state: started
    agent-version: 1.16.5
    dns-name: 10.5.0.32
    instance-id: 5e20f3c2-6dfe-4b34-8155-c3e02ce737c3
    instance-state: ACTIVE
    series: precise
    hardware: arch=amd64 cpu-cores=1 mem=900M
services:
  rabbitmq-server:
    charm: cs:precise/rabbitmq-server-17
    exposed: false
    relations:
      cluster:
      - rabbitmq-server
    units:
      rabbitmq-server/0:
        agent-state: started
        agent-version: 1.16.5
        machine: "1"
        open-ports:
        - 5672/tcp
        public-address: 10.5.0.30
      rabbitmq-server/1:
        agent-state: started
        agent-version: 1.16.5
        machine: "2"
        open-ports:
        - 5672/tcp
        public-address: 10.5.0.31
      rabbitmq-server/2:
        agent-state: error
        agent-state-info: 'hook failed: "cluster-relation-changed"'
        agent-version: 1.16.5
        machine: "3"
        open-ports:
        - 5672/tcp
        public-address: 10.5.0.32
2014-01-16 12:11:18 INFO juju supercommand.go:286 command finished

        machine: "1"
        open-ports:
        - 5672/tcp
        public-address: 10.5.0.30
      rabbitmq-server/1:
        agent-state: started
        agent-version: 1.16.5
        machine: "2"
        open-ports:
        - 5672/tcp
        public-address: 10.5.0.31
      rabbitmq-server/2:
        agent-state: error
        agent-state-in...

Read more...

Revision history for this message
James Page (james-page) wrote :

Confirmed; it also looks odd that the agent state remains started despite the fact that the instance is actually stopped.

Maybe that is related.

Changed in juju (Ubuntu):
status: New → Confirmed
importance: Undecided → High
no longer affects: juju (Ubuntu)
Changed in juju-core (Ubuntu):
status: New → Confirmed
importance: Undecided → Critical
Revision history for this message
James Page (james-page) wrote :

Details:

1.16.5 release of juju-core, openstack provider.

Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.17.1
Revision history for this message
William Reade (fwereade) wrote :

To be clear: none of this is unexpected (apart from the distinct agent-state issue noted by jamespage, which looks like lp:1205451). Juju communicates state changes to units as a series of single-change deltas, effectively; when a new unit comes up in a relation with several others, it will first see 0, then 1, then 2, then... related units. And so, really, this bug reduces to "when I tell juju there's an error, it acts like there's an error", and that's a straight-up WONTFIX in juju-core.
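(Illustrative sketch only, assuming a Python hook; the names are not from the real charm. Each cluster-relation-changed run may see only part of the eventual peer list, and exiting non-zero at that point is what puts the unit into the error state described here:)

#!/usr/bin/env python
# hooks/cluster-relation-changed (illustrative sketch, not the real charm code)
import subprocess
import sys

def visible_peers():
    peers = []
    for rid in subprocess.check_output(
            ['relation-ids', 'cluster'], universal_newlines=True).split():
        peers.extend(subprocess.check_output(
            ['relation-list', '-r', rid], universal_newlines=True).split())
    return peers

peers = visible_peers()
subprocess.check_call(['juju-log', 'got units: %s' % peers])
if not peers:
    # Exiting non-zero here marks the unit as errored and halts further hook
    # delivery; exiting 0 and configuring standalone instead risks the
    # split-brain problem discussed below.
    sys.exit(1)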

BUT if you *didn't* return an error in that situation, and you really were in a temporarily failed active/active configuration, you'd also have problems, because you can't reasonably configure and start that unit without splitting the service's brain. We can mitigate this situation in a number of ways:

* we could restore the pyjuju behaviour, and just not inform you of units that are not really present. This won't work, because you'll just configure in standalone mode, and thus split-brain your intended active/active deployment.

* we could expose some independent mechanism by which a unit could report whether it's actually working or not; this would allow you to leave rabbit unconfigured and exit the hook without error, having made sure to set the "not actually working" flag. This'd (1) give users a way to observe this condition from outside, and (2) allow the unit to recover once more units came online (by having set the "not working" flag, rather than setting a unit error state).

* we could add new hooks: let's strawman "<name>-relation-idle", which would fire whenever one of the named relation's hook queues emptied. This would let you defer *all* relation setup work until the whole picture was available to you -- at the cost of maybe waiting a long time before it actually ran -- or at least let you avoid putting the unit in an error state before you're really sure it's in one (because you always know you'll have at least one more hook in which to correct yourself); a sketch of such a hook follows below.
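(Purely as a sketch of that strawman: the hook name and its firing semantics below are hypothetical, since juju provides no relation-idle hook, and join_cluster is a placeholder:)

#!/usr/bin/env python
# hooks/cluster-relation-idle  (HYPOTHETICAL: juju does not provide this hook;
# the strawman above is that it would fire once the cluster relation's hook
# queue has drained, i.e. once every currently-known peer has been seen)
import subprocess

def all_known_peers():
    peers = []
    for rid in subprocess.check_output(
            ['relation-ids', 'cluster'], universal_newlines=True).split():
        peers.extend(subprocess.check_output(
            ['relation-list', '-r', rid], universal_newlines=True).split())
    return peers

def join_cluster(peers):
    # Placeholder for the real clustering work (e.g. rabbitmqctl join_cluster);
    # deliberately left abstract in this sketch.
    print('clustering with: %s' % ', '.join(peers))

join_cluster(all_known_peers())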

Both the second and third options have some glimmers of value to them, but even together I'm not sure they're a complete solution. I'd like to hear your thoughts.

Martin Packman (gz)
Changed in juju-core:
milestone: 1.17.1 → 1.18.0
James Page (james-page)
Changed in juju-core (Ubuntu):
importance: Critical → High
Revision history for this message
William Reade (fwereade) wrote :

I'm closing this WONTFIX and have opened lp:1312173 to track readiness-reporting.

Changed in juju-core:
status: Triaged → Won't Fix
milestone: 1.20.0 → none
Changed in juju-core (Ubuntu):
status: Confirmed → Won't Fix