[17.08+][ocata+] Unit has peers, but RabbitMQ not clustered

Bug #1730709 reported by Dmitrii Shcherbakov
This bug affects 4 people
Affects: OpenStack RabbitMQ Server Charm
Status: Confirmed
Importance: Low
Assigned to: Unassigned

Bug Description

On a fresh 17.08 Ocata deployment, rabbitmq-server used to cluster just fine, but I was able to get into a state where it does not.

All units "see each other" via relation data but a cluster is not formed. Update status sets an appropriate status message: "Unit has peers, but RabbitMQ not clustered"

http://paste.ubuntu.com/25911855/

Looking at .unit-state.db for rabbitmq-server/2, which is the leader unit, I do not see initial_client_update_done set (see https://review.openstack.org/#/c/511511/1/hooks/rabbitmq_server_relations.py):

sqlite> select * from kv;
charm_revisions|["150"]
env|{"CHARM_DIR": "/var/lib/juju/agents/unit-rabbitmq-server-2/charm", "JUJU_MACHINE_ID": "12/lxd/8", "JUJU_METER_INFO": "not set", "JUJU_AGENT_SOCKET": "@/var/lib/juju/agents/unit-rabbitmq-server-2/agent.socket", "JUJU_METER_STATUS": "AMBER", "no_proxy": "127.0.0.1,::1,localhost", "DEBIAN_FRONTEND": "noninteractive", "JUJU_UNIT_NAME": "rabbitmq-server/2", "JUJU_CHARM_DIR": "/var/lib/juju/agents/unit-rabbitmq-server-2/charm", "JUJU_SLA": "unsupported", "NO_PROXY": "127.0.0.1,::1,localhost", "JUJU_API_ADDRESSES": "10.30.20.42:17070", "JUJU_MODEL_UUID": "8580a94a-5a38-44db-8375-98e4a3ada6ad", "JUJU_AVAILABILITY_ZONE": "z03", "JUJU_CONTEXT_ID": "rabbitmq-server/2-update-status-6925875850199507958", "JUJU_MODEL_NAME": "openstack", "PATH": "/var/lib/juju/tools/unit-rabbitmq-server-2:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin", "JUJU_PRINCIPAL_UNIT": "", "APT_LISTCHANGES_FRONTEND": "none"}
unit|"rabbitmq-server/2"
relid|null
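
For context, a minimal sketch of how such a flag is typically read and set via charmhelpers' unit kv store (the same store backing .unit-state.db); only the key name comes from the review above, the guarded block is illustrative:

    from charmhelpers.core import unitdata

    db = unitdata.kv()
    if not db.get('initial_client_update_done'):
        # ... perform the initial client update here (illustrative) ...
        db.set('initial_client_update_done', True)
        db.flush()  # persist to .unit-state.db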

# sqlite> select * from hooks;
# for rabbitmq-server/2
http://paste.ubuntu.com/25911866/

ubuntu@juju-ada6ad-12-lxd-8:~$ sudo rabbitmqctl cluster_status
Cluster status of node 'rabbit@juju-ada6ad-12-lxd-8' ...
[{nodes,[{disc,['rabbit@juju-ada6ad-12-lxd-8']}]},
 {running_nodes,['rabbit@juju-ada6ad-12-lxd-8']},
 {cluster_name,<<"rabbit@juju-ada6ad-12-lxd-8">>},
 {partitions,[]}]

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

After redeploying rabbitmq-server, it clustered just fine:

juju remove-application rabbitmq-server

cat ../rabbitmq.yaml
rabbitmq-server:
  source: cloud:xenial-ocata
  min-cluster-size: 3
  cluster-partition-handling: pause_minority

juju deploy rabbitmq-server --config ../rabbitmq.yaml -n 3 --to lxd:0,lxd:1,lxd:12 --bind "space-oam amqp=space-oam cluster=space-oam"

juju status rabbitmq-server
Unit Workload Agent Machine Public address Ports Message
rabbitmq-server/3* active idle 0/lxd/9 10.30.21.164 5672/tcp Unit is ready and clustered
rabbitmq-server/4 active idle 1/lxd/9 10.30.21.104 5672/tcp Unit is ready and clustered
rabbitmq-server/5 active idle 12/lxd/9 10.30.21.119 5672/tcp Unit is ready and clustered

Machine State DNS Inst id Series AZ Message
0 started 103.77.105.129 7nc46s xenial z01 Deployed
0/lxd/9 started 10.30.21.164 juju-ada6ad-0-lxd-9 xenial z01 Container started
1 started 103.77.105.130 hxgx7c xenial z02 Deployed
1/lxd/9 started 10.30.21.104 juju-ada6ad-1-lxd-9 xenial z02 Container started
12 started 103.77.105.132 prbbg4 xenial z03 Deployed
12/lxd/9 started 10.30.21.119 juju-ada6ad-12-lxd-9 xenial z03 Container started

initial_client_update_done is now set to true as it should be:

root@juju-ada6ad-0-lxd-9:/var/lib/juju/agents/unit-rabbitmq-server-3/charm# sqlite3 .unit-state.db

sqlite> select * from kv;
charm_revisions|["150"]
env|{"JUJU_MACHINE_ID": "0/lxd/9", "NO_PROXY": "127.0.0.1,::1,localhost", "JUJU_CHARM_DIR": "/var/lib/juju/agents/unit-rabbitmq-server-3/charm", "JUJU_API_ADDRESSES": "10.30.20.42:17070", "JUJU_CONTEXT_ID": "rabbitmq-server/3-cluster-relation-changed-3592942888485342330", "PATH": "/var/lib/juju/tools/unit-rabbitmq-server-3:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin", "JUJU_MODEL_UUID": "8580a94a-5a38-44db-8375-98e4a3ada6ad", "JUJU_METER_INFO": "not set", "JUJU_AGENT_SOCKET": "@/var/lib/juju/agents/unit-rabbitmq-server-3/agent.socket", "no_proxy": "127.0.0.1,::1,localhost", "JUJU_METER_STATUS": "AMBER", "DEBIAN_FRONTEND": "noninteractive", "JUJU_UNIT_NAME": "rabbitmq-server/3", "JUJU_RELATION": "cluster", "JUJU_AVAILABILITY_ZONE": "z01", "JUJU_REMOTE_UNIT": "rabbitmq-server/4", "JUJU_PRINCIPAL_UNIT": "", "APT_LISTCHANGES_FRONTEND": "none", "CHARM_DIR": "/var/lib/juju/agents/unit-rabbitmq-server-3/charm", "JUJU_SLA": "unsupported", "JUJU_MODEL_NAME": "openstack", "JUJU_RELATION_ID": "cluster:228"}
unit|"rabbitmq-server/3"
relid|"cluster:228"
initial_client_update_done|true

Revision history for this message
James Page (james-page) wrote :

Dmitrii

I think this might be due to the leadership changing in the rmq application part way through the install/deployment, which causes the cluster to never actually form (as somehow all units think some other unit is going to seed the cluster).

We've seen this happen on deployments with high loading during the deployment of OpenStack.
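
One hedged sketch of how a charm could make seeding robust against that, using charmhelpers' leadership APIs (the function and settings key below are illustrative, not the charm's actual code):

    from charmhelpers.core.hookenv import (
        is_leader, leader_get, leader_set, local_unit,
    )

    def ensure_cluster_seeded():
        # Record the seeding decision in leader settings, which survive a
        # leadership change, instead of in per-unit state.
        if leader_get('cluster-seeded'):
            return  # already seeded; non-leaders simply join
        if is_leader():
            # ... initialise the RabbitMQ cluster on this unit ...
            leader_set({'cluster-seeded': local_unit()})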

Changed in charm-rabbitmq-server:
status: New → Incomplete
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

I will gather more logs if I encounter this again and check for unit leader elections in the uniter logs.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We hit this today in our test lab. crashdump attached.

bundle: http://paste.ubuntu.com/26495985/

Changed in charm-rabbitmq-server:
status: Incomplete → New
tags: added: cdo-qa cdo-qa-blocker cpe-foundations
summary: - [17.08][ocata] Unit has peers, but RabbitMQ not clustered
+ [17.08+][ocata+] Unit has peers, but RabbitMQ not clustered
tags: added: foundations-engine
removed: cpe-foundations
Changed in charm-rabbitmq-server:
status: New → Confirmed
Changed in charm-rabbitmq-server:
importance: Undecided → Low
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

In the CI report from #7, I see that initial_client_update_done == true on both the problematic minion unit and the leader unit, so I am no longer sure my initial comment in the bug description is valid (especially considering that this value is set to true on the first client update):

https://paste.ubuntu.com/p/z8d6qYGW2V/

Looking further:

1) rabbitmq-server/0 was ready at 20:24:57:

2020-05-16 20:24:57 INFO juju-log Unit is ready

2) A few seconds after that the stop hook was fired:

juju-crashdump-4009e8cc-5c02-4ce0-bf61-227517302e4b/3/baremetal/var/lib/juju/agents/unit-rabbitmq-server-0/charm

sqlite> select * from hooks;
# ...
125|hooks/stop|2020-05-16T20:25:06.088873
126|hooks/stop|2020-05-16T20:25:07.176063
127|hooks/stop|2020-05-16T20:25:07.178036

While the juju status YAML shows:

      rabbitmq-server/0:
        workload-status:
          current: waiting
          message: Unit has peers, but RabbitMQ not clustered
          since: 16 May 2020 20:25:08Z
        juju-status:
          current: idle
          since: 16 May 2020 20:25:09Z
          version: 2.7.6
        machine: "3"

So we have:

2020-05-16 20:24:57 INFO juju-log Unit is ready

# stop hook -> calls leave_cluster which does stop_app, reset, start_app and logs a bunch of messages about the startup after reset, and then finally logs "Successfully left cluster gracefully"

2020-05-16 20:25:00 DEBUG juju-log Running ['/usr/sbin/rabbitmqctl', 'stop_app']
2020-05-16 20:25:01 DEBUG juju-log Running ['/usr/sbin/rabbitmqctl', 'reset']
2020-05-16 20:25:03 DEBUG juju-log Running ['/usr/sbin/rabbitmqctl', 'start_app']
2020-05-16 20:25:04 DEBUG juju-log Waiting for rabbitmq app to start: /<email address hidden>
2020-05-16 20:25:04 DEBUG juju-log Running ['timeout', '180', '/usr/sbin/rabbitmqctl', 'wait', '/<email address hidden>']
2020-05-16 20:25:05 DEBUG juju-log Confirmed rabbitmq app is running
2020-05-16 20:25:05 INFO juju-log Successfully left cluster gracefully.
2020-05-16 20:25:05 DEBUG juju-log Calculating erl vm io thread pool size based on num_cpus=2 and multiplier=24
2020-05-16 20:25:05 DEBUG juju-log erl vm io thread pool size = 48 (capped=False)
2020-05-16 20:25:07 DEBUG juju-log Checking for minimum of 2 peer units
2020-05-16 20:25:07 INFO juju-log Sufficient number of peer units to form cluster 2
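
For reference, a minimal sketch of that leave_cluster sequence (command names taken from the logs above; the helper itself is illustrative, not the charm's code):

    import subprocess

    def leave_cluster():
        # Mirrors the logged sequence: stop the app, reset its state
        # (leaving the cluster), then start the app again standalone.
        for cmd in ('stop_app', 'reset', 'start_app'):
            subprocess.check_call(['/usr/sbin/rabbitmqctl', cmd])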

3) Per the test code, the unit is not actually removed from the model; instead, `juju run --unit <unit> hooks/stop` is invoked:

https://github.com/openstack-charmers/zaza-openstack-tests/blob/08e42db7c3803b2c833e8d8e83791d6e46e8e2ee/zaza/openstack/charm_tests/rabbitmq_server/tests.py#L359-L363

The "removal" test runs first:

https://openstack-ci-reports.ubuntu.com/artifacts/test_charm_pipeline_func_full/openstack/charm-rabbitmq-server/725909/5/5729/test_charm_func_full_8878/func.txt
2020-05-16 20:23:43 [INFO] test_921_remove_unit (zaza.openstack.charm_tests.rabbitmq_server.tests.RmqTests)
2020-05-16 20:23:43 [INFO] Test if unit cleans up when removed from Rmq cluster.

And succeeds by printing "OK" (logging.info('OK') at the end of the test case code):

2020-05-16 20:25:18 [INFO] OK

And then aft...

