juju relation-list doesn't report full units list when unit is down

Bug #1267913 reported by Yolanda Robla
This bug affects 1 person
Affects             Status     Importance  Assigned to  Milestone
juju-core           Won't Fix  High        Unassigned
juju-core (Ubuntu)  Won't Fix  High        Unassigned

Bug Description

While testing the rabbit active/active charm, I found an issue. I add some clustered rabbit units, for example 4 units. When I list all the relation members with relation-list I get the correct list:
"INFO" "got units: ['rabbitmq-server/0', 'rabbitmq-server/1', 'rabbitmq-server/2']"]

Then I stop the rabbitmq-server/0 unit uncleanly, with a nova stop. After that I make the same call and receive:

"INFO" "got units: ['rabbitmq-server/0']"

I can reproduce that every time I start/stop the instance. That's an issue for HA because, if the first unit is stopped, I'm not able to add more units and cluster them with the existing ones.
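(For reference, the peer list above comes from the relation-list hook tool. A minimal sketch of that query, assuming a Python hook and a peer relation named "cluster"; this is not the actual rabbitmq-server charm code, but relation-ids and relation-list are the standard juju hook tools available inside a hook environment:)

#!/usr/bin/env python
# Minimal sketch, assuming a Python hook and a peer relation named "cluster".
# Not the actual rabbitmq-server charm code.
import subprocess

def cluster_peers():
    """Return the unit names currently visible on the cluster peer relation."""
    peers = []
    for rid in subprocess.check_output(
            ['relation-ids', 'cluster'], universal_newlines=True).split():
        peers.extend(subprocess.check_output(
            ['relation-list', '-r', rid], universal_newlines=True).split())
    return peers

print(cluster_peers())  # e.g. ['rabbitmq-server/0', 'rabbitmq-server/1', ...]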

Revision history for this message
Yolanda Robla (yolanda.robla) wrote :

Problem reduced to the following:

When the first unit is up and a relation is added, relation-list stops at the first failing unit, so it doesn't list the rest.

That is a problem for HA because, if for some reason the first unit is lost, we are unable to add more units to the service; the only way to fix it is to destroy the service and recreate it.

Revision history for this message
James Page (james-page) wrote :

ubuntu@james-page-bastion:~/charms⟫ juju status
verbose is deprecated with the current meaning, use show-log
2014-01-16 12:11:18 INFO juju.provider.openstack provider.go:116 opening environment "jamespage-serverstack"
2014-01-16 12:11:18 INFO juju.state open.go:68 opening state; mongo addresses: ["10.5.0.29:37017"]; entity ""
2014-01-16 12:11:18 INFO juju.state open.go:106 connection established
environment: jamespage-serverstack
machines:
  "0":
    agent-state: started
    agent-version: 1.16.5
    dns-name: 10.5.0.29
    instance-id: d7509837-8a15-4522-bda3-2b9832bff96a
    instance-state: ACTIVE
    series: precise
    hardware: arch=amd64 cpu-cores=1 mem=900M
  "1":
    agent-state: started
    agent-version: 1.16.5
    instance-id: 53bc707c-6309-46f7-a832-95da15a36e19
    instance-state: missing
    series: precise
    hardware: arch=amd64 cpu-cores=1 mem=900M
  "2":
    agent-state: started
    agent-version: 1.16.5
    dns-name: 10.5.0.31
    instance-id: baaccb7a-c590-436e-bd04-3637074b2f9d
    instance-state: ACTIVE
    series: precise
    hardware: arch=amd64 cpu-cores=1 mem=900M
  "3":
    agent-state: started
    agent-version: 1.16.5
    dns-name: 10.5.0.32
    instance-id: 5e20f3c2-6dfe-4b34-8155-c3e02ce737c3
    instance-state: ACTIVE
    series: precise
    hardware: arch=amd64 cpu-cores=1 mem=900M
services:
  rabbitmq-server:
    charm: cs:precise/rabbitmq-server-17
    exposed: false
    relations:
      cluster:
      - rabbitmq-server
    units:
      rabbitmq-server/0:
        agent-state: started
        agent-version: 1.16.5 "2":
    agent-state: started
    agent-version: 1.16.5
    dns-name: 10.5.0.31
    instance-id: baaccb7a-c590-436e-bd04-3637074b2f9d
    instance-state: ACTIVE
    series: precise
    hardware: arch=amd64 cpu-cores=1 mem=900M
  "3":
    agent-state: started
    agent-version: 1.16.5
    dns-name: 10.5.0.32
    instance-id: 5e20f3c2-6dfe-4b34-8155-c3e02ce737c3
    instance-state: ACTIVE
    series: precise
    hardware: arch=amd64 cpu-cores=1 mem=900M
services:
  rabbitmq-server:
    charm: cs:precise/rabbitmq-server-17
    exposed: false
    relations:
      cluster:
      - rabbitmq-server
    units:
      rabbitmq-server/0:
        agent-state: started
        agent-version: 1.16.5
        machine: "1"
        open-ports:
        - 5672/tcp
        public-address: 10.5.0.30
      rabbitmq-server/1:
        agent-state: started
        agent-version: 1.16.5
        machine: "2"
        open-ports:
        - 5672/tcp
        public-address: 10.5.0.31
      rabbitmq-server/2:
        agent-state: error
        agent-state-info: 'hook failed: "cluster-relation-changed"'
        agent-version: 1.16.5
        machine: "3"
        open-ports:
        - 5672/tcp
        public-address: 10.5.0.32
2014-01-16 12:11:18 INFO juju supercommand.go:286 command finished

        machine: "1"
        open-ports:
        - 5672/tcp
        public-address: 10.5.0.30
      rabbitmq-server/1:
        agent-state: started
        agent-version: 1.16.5
        machine: "2"
        open-ports:
        - 5672/tcp
        public-address: 10.5.0.31
      rabbitmq-server/2:
        agent-state: error
        agent-state-in...

Read more...

Revision history for this message
James Page (james-page) wrote :

Confirmed; it also looks odd that the agent state remains started despite the fact that the instance is actually stopped.

Maybe that is related.

Changed in juju (Ubuntu):
status: New → Confirmed
importance: Undecided → High
no longer affects: juju (Ubuntu)
Changed in juju-core (Ubuntu):
status: New → Confirmed
importance: Undecided → Critical
Revision history for this message
James Page (james-page) wrote :

Details:

1.16.5 release of juju-core, openstack provider.

Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.17.1
Revision history for this message
William Reade (fwereade) wrote :

To be clear: none of this is unexpected (apart from the distinct agent-state issue noted by jamespage, which looks like lp:1205451). Juju communicates state changes to units as a series of single-change deltas, effectively; when a new unit comes up in a relation with several others, it will first see 0, then 1, then 2, then... related units. And so, really, this bug reduces to "when I tell juju there's an error, it acts like there's an error", and that's a straight-up WONTFIX in juju-core.
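(Illustrative sketch only, assuming a Python hook; the names are not from the real charm. Each cluster-relation-changed run may see only part of the eventual peer list, and exiting non-zero at that point is what puts the unit into the error state described here:)

#!/usr/bin/env python
# hooks/cluster-relation-changed (illustrative sketch, not the real charm code)
import subprocess
import sys

def visible_peers():
    peers = []
    for rid in subprocess.check_output(
            ['relation-ids', 'cluster'], universal_newlines=True).split():
        peers.extend(subprocess.check_output(
            ['relation-list', '-r', rid], universal_newlines=True).split())
    return peers

peers = visible_peers()
subprocess.check_call(['juju-log', 'got units: %s' % peers])
if not peers:
    # Exiting non-zero here marks the unit as errored and halts further hook
    # delivery; exiting 0 and configuring standalone instead risks the
    # split-brain problem discussed below.
    sys.exit(1)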

BUT if you *didn't* return an error in that situation, and you really were in a temporarily failed active/active configuration, you'd also have problems, because you can't reasonably configure and start that unit without splitting the service's brain. We can mitigate this situation in a number of ways:

* we could restore the pyjuju behaviour, and just not inform you of units that are not really present. This won't work, because you'll just configure in standalone mode, and thus split-brain your intended active/active deployment.

* we could expose some independent mechanism by which a unit could report whether it's actually working or not; this would allow you to leave rabbit unconfigured and exit the hook without error, having made sure to set the "not actually working" flag. This'd (1) give users a way to observe this condition from outside, and (2) allow the unit to recover once more units came online (by having set the "not working" flag, rather than setting a unit error state).

* we could add new hooks: let's strawman "<name>-relation-idle", which would fire whenever one of the named relation's hook queues emptied. This would let you defer *all* relation setup work until the whole picture was available to you -- at the cost of maybe waiting a long time before it actually ran -- or at least let you avoid putting the unit in an error state before you're really sure it's in one (because you always know you'll have at least one more hook in which to correct yourself); a sketch of such a hook follows below.
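(Purely as a sketch of that strawman: the hook name and its firing semantics below are hypothetical, since juju provides no relation-idle hook, and join_cluster is a placeholder:)

#!/usr/bin/env python
# hooks/cluster-relation-idle  (HYPOTHETICAL: juju does not provide this hook;
# the strawman above is that it would fire once the cluster relation's hook
# queue has drained, i.e. once every currently-known peer has been seen)
import subprocess

def all_known_peers():
    peers = []
    for rid in subprocess.check_output(
            ['relation-ids', 'cluster'], universal_newlines=True).split():
        peers.extend(subprocess.check_output(
            ['relation-list', '-r', rid], universal_newlines=True).split())
    return peers

def join_cluster(peers):
    # Placeholder for the real clustering work (e.g. rabbitmqctl join_cluster);
    # deliberately left abstract in this sketch.
    print('clustering with: %s' % ', '.join(peers))

join_cluster(all_known_peers())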

Both the second and third options have some glimmers of value to them, but even together I'm not sure they're a complete solution. I'd like to hear your thoughts.

Martin Packman (gz)
Changed in juju-core:
milestone: 1.17.1 → 1.18.0
James Page (james-page)
Changed in juju-core (Ubuntu):
importance: Critical → High
Revision history for this message
William Reade (fwereade) wrote :

I'm closing this WONTFIX and have opened lp:1312173 to track readiness-reporting.

Changed in juju-core:
status: Triaged → Won't Fix
milestone: 1.20.0 → none
Changed in juju-core (Ubuntu):
status: Confirmed → Won't Fix