[17.08+][ocata+] Unit has peers, but RabbitMQ not clustered

Bug #1730709 reported by Dmitrii Shcherbakov
This bug affects 4 people
Affects: OpenStack RabbitMQ Server Charm
Status: Confirmed
Importance: Low
Assigned to: Unassigned

Bug Description

On a fresh 17.08 Ocata deployment, rabbitmq-server used to cluster just fine, but I was able to get into a state where it does not.

All units "see each other" via relation data but a cluster is not formed. Update status sets an appropriate status message: "Unit has peers, but RabbitMQ not clustered"

http://paste.ubuntu.com/25911855/

Looking at .unit-state.db for rabbitmq-server/2, which is the leader unit, I do not see initial_client_update_done set (see https://review.openstack.org/#/c/511511/1/hooks/rabbitmq_server_relations.py):

sqlite> select * from kv;
charm_revisions|["150"]
env|{"CHARM_DIR": "/var/lib/juju/agents/unit-rabbitmq-server-2/charm", "JUJU_MACHINE_ID": "12/lxd/8", "JUJU_METER_INFO": "not set", "JUJU_AGENT_SOCKET": "@/var/lib/juju/agents/unit-rabbitmq-server-2/agent.socket", "JUJU_METER_STATUS": "AMBER", "no_proxy": "127.0.0.1,::1,localhost", "DEBIAN_FRONTEND": "noninteractive", "JUJU_UNIT_NAME": "rabbitmq-server/2", "JUJU_CHARM_DIR": "/var/lib/juju/agents/unit-rabbitmq-server-2/charm", "JUJU_SLA": "unsupported", "NO_PROXY": "127.0.0.1,::1,localhost", "JUJU_API_ADDRESSES": "10.30.20.42:17070", "JUJU_MODEL_UUID": "8580a94a-5a38-44db-8375-98e4a3ada6ad", "JUJU_AVAILABILITY_ZONE": "z03", "JUJU_CONTEXT_ID": "rabbitmq-server/2-update-status-6925875850199507958", "JUJU_MODEL_NAME": "openstack", "PATH": "/var/lib/juju/tools/unit-rabbitmq-server-2:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin", "JUJU_PRINCIPAL_UNIT": "", "APT_LISTCHANGES_FRONTEND": "none"}
unit|"rabbitmq-server/2"
relid|null
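
For context, a minimal sketch of how such a flag is typically read and set via charmhelpers' unit kv store (the same store backing .unit-state.db); only the key name comes from the review above, the guarded block is illustrative:

    from charmhelpers.core import unitdata

    db = unitdata.kv()
    if not db.get('initial_client_update_done'):
        # ... perform the initial client update here (illustrative) ...
        db.set('initial_client_update_done', True)
        db.flush()  # persist to .unit-state.db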

# sqlite> select * from hooks;
# for rabbitmq-server/2
http://paste.ubuntu.com/25911866/

ubuntu@juju-ada6ad-12-lxd-8:~$ sudo rabbitmqctl cluster_status
Cluster status of node 'rabbit@juju-ada6ad-12-lxd-8' ...
[{nodes,[{disc,['rabbit@juju-ada6ad-12-lxd-8']}]},
 {running_nodes,['rabbit@juju-ada6ad-12-lxd-8']},
 {cluster_name,<<"rabbit@juju-ada6ad-12-lxd-8">>},
 {partitions,[]}]

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

After redeploying rabbitmq-server, it clustered just fine:

juju remove-application rabbitmq-server

cat ../rabbitmq.yaml
rabbitmq-server:
  source: cloud:xenial-ocata
  min-cluster-size: 3
  cluster-partition-handling: pause_minority

juju deploy rabbitmq-server --config ../rabbitmq.yaml -n 3 --to lxd:0,lxd:1,lxd:12 --bind "space-oam amqp=space-oam cluster=space-oam"

juju status rabbitmq-server
Unit Workload Agent Machine Public address Ports Message
rabbitmq-server/3* active idle 0/lxd/9 10.30.21.164 5672/tcp Unit is ready and clustered
rabbitmq-server/4 active idle 1/lxd/9 10.30.21.104 5672/tcp Unit is ready and clustered
rabbitmq-server/5 active idle 12/lxd/9 10.30.21.119 5672/tcp Unit is ready and clustered

Machine State DNS Inst id Series AZ Message
0 started 103.77.105.129 7nc46s xenial z01 Deployed
0/lxd/9 started 10.30.21.164 juju-ada6ad-0-lxd-9 xenial z01 Container started
1 started 103.77.105.130 hxgx7c xenial z02 Deployed
1/lxd/9 started 10.30.21.104 juju-ada6ad-1-lxd-9 xenial z02 Container started
12 started 103.77.105.132 prbbg4 xenial z03 Deployed
12/lxd/9 started 10.30.21.119 juju-ada6ad-12-lxd-9 xenial z03 Container started

initial_client_update_done is now set to true as it should be:

root@juju-ada6ad-0-lxd-9:/var/lib/juju/agents/unit-rabbitmq-server-3/charm# sqlite3 .unit-state.db

sqlite> select * from kv;
charm_revisions|["150"]
env|{"JUJU_MACHINE_ID": "0/lxd/9", "NO_PROXY": "127.0.0.1,::1,localhost", "JUJU_CHARM_DIR": "/var/lib/juju/agents/unit-rabbitmq-server-3/charm", "JUJU_API_ADDRESSES": "10.30.20.42:17070", "JUJU_CONTEXT_ID": "rabbitmq-server/3-cluster-relation-changed-3592942888485342330", "PATH": "/var/lib/juju/tools/unit-rabbitmq-server-3:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin", "JUJU_MODEL_UUID": "8580a94a-5a38-44db-8375-98e4a3ada6ad", "JUJU_METER_INFO": "not set", "JUJU_AGENT_SOCKET": "@/var/lib/juju/agents/unit-rabbitmq-server-3/agent.socket", "no_proxy": "127.0.0.1,::1,localhost", "JUJU_METER_STATUS": "AMBER", "DEBIAN_FRONTEND": "noninteractive", "JUJU_UNIT_NAME": "rabbitmq-server/3", "JUJU_RELATION": "cluster", "JUJU_AVAILABILITY_ZONE": "z01", "JUJU_REMOTE_UNIT": "rabbitmq-server/4", "JUJU_PRINCIPAL_UNIT": "", "APT_LISTCHANGES_FRONTEND": "none", "CHARM_DIR": "/var/lib/juju/agents/unit-rabbitmq-server-3/charm", "JUJU_SLA": "unsupported", "JUJU_MODEL_NAME": "openstack", "JUJU_RELATION_ID": "cluster:228"}
unit|"rabbitmq-server/3"
relid|"cluster:228"
initial_client_update_done|true

Revision history for this message
James Page (james-page) wrote :

Dmitrii

I think this might be due to the leadership changing in the rmq application part way through the install/deployment, which causes the cluster to never actually form (as somehow all units think some other unit is going to seed the cluster).

We've seen this happen on deployments with high loading during the deployment of OpenStack.
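
One hedged sketch of how a charm could make seeding robust against that, using charmhelpers' leadership APIs (the function and settings key below are illustrative, not the charm's actual code):

    from charmhelpers.core.hookenv import (
        is_leader, leader_get, leader_set, local_unit,
    )

    def ensure_cluster_seeded():
        # Record the seeding decision in leader settings, which survive a
        # leadership change, instead of in per-unit state.
        if leader_get('cluster-seeded'):
            return  # already seeded; non-leaders simply join
        if is_leader():
            # ... initialise the RabbitMQ cluster on this unit ...
            leader_set({'cluster-seeded': local_unit()})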

Changed in charm-rabbitmq-server:
status: New → Incomplete
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

I will gather more logs if I encounter this again and check for unit leader elections in the uniter logs.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We hit this today in our test lab. crashdump attached.

bundle: http://paste.ubuntu.com/26495985/

Changed in charm-rabbitmq-server:
status: Incomplete → New
tags: added: cdo-qa cdo-qa-blocker cpe-foundations
summary: - [17.08][ocata] Unit has peers, but RabbitMQ not clustered
+ [17.08+][ocata+] Unit has peers, but RabbitMQ not clustered
tags: added: foundations-engine
removed: cpe-foundations
Changed in charm-rabbitmq-server:
status: New → Confirmed
Changed in charm-rabbitmq-server:
importance: Undecided → Low
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

In the CI report from #7, I see that initial_client_update_done == true on both the problematic minion unit and the leader unit, so I am no longer sure my initial comment in the bug description is valid (especially considering that this value is set to true on the first client update):

https://paste.ubuntu.com/p/z8d6qYGW2V/

Looking further:

1) rabbitmq-server/0 was ready at 20:24:57:

2020-05-16 20:24:57 INFO juju-log Unit is ready

2) A few seconds after that the stop hook was fired:

juju-crashdump-4009e8cc-5c02-4ce0-bf61-227517302e4b/3/baremetal/var/lib/juju/agents/unit-rabbitmq-server-0/charm

sqlite> select * from hooks;
# ...
125|hooks/stop|2020-05-16T20:25:06.088873
126|hooks/stop|2020-05-16T20:25:07.176063
127|hooks/stop|2020-05-16T20:25:07.178036

While the juju status YAML shows:

      rabbitmq-server/0:
        workload-status:
          current: waiting
          message: Unit has peers, but RabbitMQ not clustered
          since: 16 May 2020 20:25:08Z
        juju-status:
          current: idle
          since: 16 May 2020 20:25:09Z
          version: 2.7.6
        machine: "3"

So we have:

2020-05-16 20:24:57 INFO juju-log Unit is ready

# stop hook -> calls leave_cluster which does stop_app, reset, start_app and logs a bunch of messages about the startup after reset, and then finally logs "Successfully left cluster gracefully"

2020-05-16 20:25:00 DEBUG juju-log Running ['/usr/sbin/rabbitmqctl', 'stop_app']
2020-05-16 20:25:01 DEBUG juju-log Running ['/usr/sbin/rabbitmqctl', 'reset']
2020-05-16 20:25:03 DEBUG juju-log Running ['/usr/sbin/rabbitmqctl', 'start_app']
2020-05-16 20:25:04 DEBUG juju-log Waiting for rabbitmq app to start: /<email address hidden>
2020-05-16 20:25:04 DEBUG juju-log Running ['timeout', '180', '/usr/sbin/rabbitmqctl', 'wait', '/<email address hidden>']
2020-05-16 20:25:05 DEBUG juju-log Confirmed rabbitmq app is running
2020-05-16 20:25:05 INFO juju-log Successfully left cluster gracefully.
2020-05-16 20:25:05 DEBUG juju-log Calculating erl vm io thread pool size based on num_cpus=2 and multiplier=24
2020-05-16 20:25:05 DEBUG juju-log erl vm io thread pool size = 48 (capped=False)
2020-05-16 20:25:07 DEBUG juju-log Checking for minimum of 2 peer units
2020-05-16 20:25:07 INFO juju-log Sufficient number of peer units to form cluster 2
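
For reference, a minimal sketch of that leave_cluster sequence (command names taken from the logs above; the helper itself is illustrative, not the charm's code):

    import subprocess

    def leave_cluster():
        # Mirrors the logged sequence: stop the app, reset its state
        # (leaving the cluster), then start the app again standalone.
        for cmd in ('stop_app', 'reset', 'start_app'):
            subprocess.check_call(['/usr/sbin/rabbitmqctl', cmd])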

3) Per the test code, the unit is not actually removed from the model; instead, `juju run --unit <unit> hooks/stop` is invoked:

https://github.com/openstack-charmers/zaza-openstack-tests/blob/08e42db7c3803b2c833e8d8e83791d6e46e8e2ee/zaza/openstack/charm_tests/rabbitmq_server/tests.py#L359-L363

The "removal" test runs first:

https://openstack-ci-reports.ubuntu.com/artifacts/test_charm_pipeline_func_full/openstack/charm-rabbitmq-server/725909/5/5729/test_charm_func_full_8878/func.txt
2020-05-16 20:23:43 [INFO] test_921_remove_unit (zaza.openstack.charm_tests.rabbitmq_server.tests.RmqTests)
2020-05-16 20:23:43 [INFO] Test if unit cleans up when removed from Rmq cluster.

And succeeds by printing "OK" (logging.info('OK') at the end of the test case code):

2020-05-16 20:25:18 [INFO] OK

And then aft...

