In the CI report from #7, I see that initial_client_update_done==true on both the problematic minion unit and the leader unit, so I am no longer sure my initial comment in the bug description is valid (especially since this value is set to true on the first client update):
https://paste.ubuntu.com/p/z8d6qYGW2V/
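(For reference, the pattern described above would look roughly like this; a hypothetical sketch only, assuming the flag lives in the charm-helpers unitdata key/value store — the charm's actual code may differ:

from charmhelpers.core import unitdata

kv = unitdata.kv()
# Written once, on the first client update, and never reset afterwards,
# which is why seeing it as true on both units is not conclusive.
if not kv.get('initial_client_update_done'):
    update_clients()  # hypothetical stand-in for the charm's client update
    kv.set('initial_client_update_done', True)
    kv.flush()
)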
Looking further:
1) rabbitmq-server/0 was ready at 20:24:57:
2020-05-16 20:24:57 INFO juju-log Unit is ready
2) A few seconds after that the stop hook was fired:
juju-crashdump-4009e8cc-5c02-4ce0-bf61-227517302e4b/3/baremetal/var/lib/juju/agents/unit-rabbitmq-server-0/charm
sqlite> select * from hooks;
# ...
125|hooks/stop|2020-05-16T20:25:06.088873
126|hooks/stop|2020-05-16T20:25:07.176063
127|hooks/stop|2020-05-16T20:25:07.178036
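Those rows have the shape of the charm-helpers unitdata store (.unit-state.db inside the charm directory), which records each hook execution in a hooks table; a minimal sketch for pulling the same data out of the crashdump, assuming that file layout:

import sqlite3

# '.unit-state.db' as the file name is an assumption based on the
# table layout above (charm-helpers unitdata).
db = sqlite3.connect('3/baremetal/var/lib/juju/agents/'
                     'unit-rabbitmq-server-0/charm/.unit-state.db')
for seq, hook, date in db.execute(
        'select * from hooks order by date desc limit 5'):
    print(seq, hook, date)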
While juju status yaml shows:
rabbitmq-server/0:
  workload-status:
    current: waiting
    message: Unit has peers, but RabbitMQ not clustered
    since: 16 May 2020 20:25:08Z
  juju-status:
    current: idle
    since: 16 May 2020 20:25:09Z
    version: 2.7.6
  machine: "3"
So we have:
2020-05-16 20:24:57 INFO juju-log Unit is ready
# stop hook -> calls leave_cluster which does stop_app, reset, start_app and logs a bunch of messages about the startup after reset, and then finally logs "Successfully left cluster gracefully"
2020-05-16 20:25:00 DEBUG juju-log Running ['/usr/sbin/rabbitmqctl', 'stop_app']
2020-05-16 20:25:01 DEBUG juju-log Running ['/usr/sbin/rabbitmqctl', 'reset']
2020-05-16 20:25:03 DEBUG juju-log Running ['/usr/sbin/rabbitmqctl', 'start_app']
2020-05-16 20:25:04 DEBUG juju-log Waiting for rabbitmq app to start: /<email address hidden>
2020-05-16 20:25:04 DEBUG juju-log Running ['timeout', '180', '/usr/sbin/rabbitmqctl', 'wait', '/<email address hidden>']
2020-05-16 20:25:05 DEBUG juju-log Confirmed rabbitmq app is running
2020-05-16 20:25:05 INFO juju-log Successfully left cluster gracefully.
2020-05-16 20:25:05 DEBUG juju-log Calculating erl vm io thread pool size based on num_cpus=2 and multiplier=24
2020-05-16 20:25:05 DEBUG juju-log erl vm io thread pool size = 48 (capped=False)
2020-05-16 20:25:07 DEBUG juju-log Checking for minimum of 2 peer units
2020-05-16 20:25:07 INFO juju-log Sufficient number of peer units to form cluster 2
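In other words, the stop hook's leave_cluster path reduces to the rabbitmqctl sequence below; a sketch reconstructed from the log above, using plain subprocess calls rather than the charm's actual helpers:

import subprocess

def leave_cluster(pid_file, timeout=180):
    # Stop the app, wipe this node's cluster membership with 'reset',
    # start it back up standalone, then block until it reports running.
    subprocess.check_call(['/usr/sbin/rabbitmqctl', 'stop_app'])
    subprocess.check_call(['/usr/sbin/rabbitmqctl', 'reset'])
    subprocess.check_call(['/usr/sbin/rabbitmqctl', 'start_app'])
    subprocess.check_call(['timeout', str(timeout),
                           '/usr/sbin/rabbitmqctl', 'wait', pid_file])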
3) Per the test code, a unit is not actually removed from the model - instead `juju run --unit <unit> hooks/stop` is done:
https://github.com/openstack-charmers/zaza-openstack-tests/blob/08e42db7c3803b2c833e8d8e83791d6e46e8e2ee/zaza/openstack/charm_tests/rabbitmq_server/tests.py#L359-L363
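That step boils down to something like the following (a sketch; zaza.model.run_on_unit is zaza's helper for running a command on a unit, though the exact code behind the link may differ):

import zaza.model

# "Remove" the unit by firing its stop hook in place; the unit itself
# stays in the Juju model the whole time.
zaza.model.run_on_unit('rabbitmq-server/0', 'hooks/stop')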
The "removal" test runs first:
https://openstack-ci-reports.ubuntu.com/artifacts/test_charm_pipeline_func_full/openstack/charm-rabbitmq-server/725909/5/5729/test_charm_func_full_8878/func.txt
2020-05-16 20:23:43 [INFO] test_921_remove_unit (zaza.openstack.charm_tests.rabbitmq_server.tests.RmqTests)
2020-05-16 20:23:43 [INFO] Test if unit cleans up when removed from Rmq cluster.
And succeeds by printing "OK" (logging.info('OK') at the end of the test case code):
2020-05-16 20:25:18 [INFO] OK
And then after that we have a failure of the "pause" test case, which expects a unit to be "active", not "waiting", because the "test_921_remove_unit" test case does not clean up after itself:
2020-05-16 20:25:18 [INFO] FAIL: test_910_pause_and_resume (zaza.openstack.charm_tests.rabbitmq_server.tests.RmqTests)
2020-05-16 20:25:18 [INFO] The services can be paused and resumed.
2020-05-16 20:25:18 [INFO] ----------------------------------------------------------------------
2020-05-16 20:25:18 [INFO] Traceback (most recent call last):
2020-05-16 20:25:18 [INFO]   File "/tmp/tmp.AdGtY3fqac/func/lib/python3.5/site-packages/zaza/openstack/charm_tests/rabbitmq_server/tests.py", line 279, in test_910_pause_and_resume
2020-05-16 20:25:18 [INFO]     assert unit.workload_status == "active"
2020-05-16 20:25:18 [INFO] AssertionError
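The failing check amounts to the following (a sketch; zaza.model.get_units and libjuju's Unit.workload_status are the assumed API here):

import zaza.model

# test_910 expects every unit to report "active"; after test_921 the
# stopped unit is still "waiting" ("Unit has peers, but RabbitMQ not
# clustered"), so the assertion trips.
for unit in zaza.model.get_units('rabbitmq-server'):
    assert unit.workload_status == "active"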
So I think the failure in #7 is a test ordering issue: whenever test_921_remove_unit runs last among the func tests for a given model, there are no errors.