OpenStack RabbitMQ Server Charm

Comment 8 for bug 1730709

Dmitrii Shcherbakov (dmitriis) wrote on 2020-05-18:

In the CI report from #7, I see that initial_client_update_done==true on both the problematic minion unit and the leader unit, so I am not sure my initial comment in the bug description is still valid (especially since this value is set to true on the first client update):

https://paste.ubuntu.com/p/z8d6qYGW2V/

Looking further:

1) rabbitmq-server/0 was ready at 20:24:57:

2020-05-16 20:24:57 INFO juju-log Unit is ready

2) A few seconds after that the stop hook was fired:

juju-crashdump-4009e8cc-5c02-4ce0-bf61-227517302e4b/3/baremetal/var/lib/juju/agents/unit-rabbitmq-server-0/charm

sqlite> select * from hooks;
# ...
125|hooks/stop|2020-05-16T20:25:06.088873
126|hooks/stop|2020-05-16T20:25:07.176063
127|hooks/stop|2020-05-16T20:25:07.178036
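
For anyone who wants to repeat the query above from a script, here is a minimal sketch. The column names (version, hook, date) are my assumption based on the three-column output shown above, and the in-memory database only stands in for the unit's state database from the crashdump; the rows are the ones pasted above.

```python
import sqlite3


def stop_hook_runs(conn):
    """Return (version, hook, date) rows for recorded stop hook runs."""
    cur = conn.execute(
        "SELECT version, hook, date FROM hooks WHERE hook = ? ORDER BY version",
        ("hooks/stop",),
    )
    return cur.fetchall()


def demo_connection():
    """Build an in-memory stand-in for the unit state database.

    The schema is an assumption inferred from the sqlite output; the rows
    are copied from that output.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hooks (version INTEGER PRIMARY KEY, hook TEXT, date TEXT)"
    )
    conn.executemany(
        "INSERT INTO hooks VALUES (?, ?, ?)",
        [
            (125, "hooks/stop", "2020-05-16T20:25:06.088873"),
            (126, "hooks/stop", "2020-05-16T20:25:07.176063"),
            (127, "hooks/stop", "2020-05-16T20:25:07.178036"),
        ],
    )
    return conn


if __name__ == "__main__":
    for row in stop_hook_runs(demo_connection()):
        print("|".join(str(col) for col in row))
```

Against a real crashdump, the same function could be pointed at a connection opened on the database file inside the charm directory listed above.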

While juju status yaml shows:

      rabbitmq-server/0:
        workload-status:
          current: waiting
          message: Unit has peers, but RabbitMQ not clustered
          since: 16 May 2020 20:25:08Z
        juju-status:
          current: idle
          since: 16 May 2020 20:25:09Z
          version: 2.7.6
        machine: "3"

So we have:

2020-05-16 20:24:57 INFO juju-log Unit is ready

# stop hook -> calls leave_cluster which does stop_app, reset, start_app and logs a bunch of messages about the startup after reset, and then finally logs "Successfully left cluster gracefully"

2020-05-16 20:25:00 DEBUG juju-log Running ['/usr/sbin/rabbitmqctl', 'stop_app']
2020-05-16 20:25:01 DEBUG juju-log Running ['/usr/sbin/rabbitmqctl', 'reset']
2020-05-16 20:25:03 DEBUG juju-log Running ['/usr/sbin/rabbitmqctl', 'start_app']
2020-05-16 20:25:04 DEBUG juju-log Waiting for rabbitmq app to start: /var/lib/rabbitmq/mnesia/rabbit@juju-a575fe-zaza-f6ef0b8bad38-3.pid
2020-05-16 20:25:04 DEBUG juju-log Running ['timeout', '180', '/usr/sbin/rabbitmqctl', 'wait', '/var/lib/rabbitmq/mnesia/rabbit@juju-a575fe-zaza-f6ef0b8bad38-3.pid']
2020-05-16 20:25:05 DEBUG juju-log Confirmed rabbitmq app is running
2020-05-16 20:25:05 INFO juju-log Successfully left cluster gracefully.
2020-05-16 20:25:05 DEBUG juju-log Calculating erl vm io thread pool size based on num_cpus=2 and multiplier=24
2020-05-16 20:25:05 DEBUG juju-log erl vm io thread pool size = 48 (capped=False)
2020-05-16 20:25:07 DEBUG juju-log Checking for minimum of 2 peer units
2020-05-16 20:25:07 INFO juju-log Sufficient number of peer units to form cluster 2
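
The command sequence in the log above can be sketched roughly as follows. This is not the charm's actual leave_cluster code, only a reconstruction of the commands that appear in the log; the function names and the injectable `run` parameter are my own, added so the sequence can be inspected without touching a real broker.

```python
import subprocess

RABBITMQCTL = "/usr/sbin/rabbitmqctl"


def leave_cluster_commands(pid_file, timeout_s=180):
    """List the commands seen in the log, in the order they were run."""
    return [
        [RABBITMQCTL, "stop_app"],   # stop the RabbitMQ application
        [RABBITMQCTL, "reset"],      # reset the node, dropping cluster membership
        [RABBITMQCTL, "start_app"],  # start again as a standalone node
        # block until the app is confirmed running, bounded by `timeout`
        ["timeout", str(timeout_s), RABBITMQCTL, "wait", pid_file],
    ]


def leave_cluster(pid_file, run=subprocess.check_call):
    """Run the sequence; `run` is injectable (hypothetical) for dry runs."""
    for cmd in leave_cluster_commands(pid_file):
        run(cmd)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    # The pid file path here is a placeholder, not the one from the log.
    leave_cluster("/tmp/example.pid", run=print)
```

The point is that after this sequence the node is up but standalone, which matches the "Unit has peers, but RabbitMQ not clustered" status shown earlier.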

3) Per the test code, the unit is not actually removed from the model; instead the test runs `juju run --unit <unit> hooks/stop`:

https://github.com/openstack-charmers/zaza-openstack-tests/blob/08e42db7c3803b2c833e8d8e83791d6e46e8e2ee/zaza/openstack/charm_tests/rabbitmq_server/tests.py#L359-L363

The "removal" test runs first:

https://openstack-ci-reports.ubuntu.com/artifacts/test_charm_pipeline_func_full/openstack/charm-rabbitmq-server/725909/5/5729/test_charm_func_full_8878/func.txt
2020-05-16 20:23:43 [INFO] test_921_remove_unit (zaza.openstack.charm_tests.rabbitmq_server.tests.RmqTests)
2020-05-16 20:23:43 [INFO] Test if unit cleans up when removed from Rmq cluster.

And succeeds by printing "OK" (logging.info('OK') at the end of the test case code):

2020-05-16 20:25:18 [INFO] OK

And then after that we have a failure of the "pause" test case, which expects the unit to be "active" rather than "waiting"; it fails because the "test_921_remove_unit" test case does not clean up after itself:

2020-05-16 20:25:18 [INFO] FAIL: test_910_pause_and_resume (zaza.openstack.charm_tests.rabbitmq_server.tests.RmqTests)
2020-05-16 20:25:18 [INFO] The services can be paused and resumed.
2020-05-16 20:25:18 [INFO] ----------------------------------------------------------------------
2020-05-16 20:25:18 [INFO] Traceback (most recent call last):
2020-05-16 20:25:18 [INFO]   File "/tmp/tmp.AdGtY3fqac/func/lib/python3.5/site-packages/zaza/openstack/charm_tests/rabbitmq_server/tests.py", line 279, in test_910_pause_and_resume
2020-05-16 20:25:18 [INFO]     assert unit.workload_status == "active"
2020-05-16 20:25:18 [INFO] AssertionError

So I think the failure in #7 is a test ordering issue: whenever test_921_remove_unit runs at the end of all func tests for a given model, I see no errors.
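
To illustrate the ordering dependence, here is a small self-contained simulation. It does not use zaza or Juju: FakeUnit and the three test functions are hypothetical stand-ins that only model the workload status, and the cleanup variant is one possible way the "removal" test could restore state, not a proposed patch.

```python
class FakeUnit:
    """Hypothetical stand-in for a deployed unit shared by all tests."""

    def __init__(self):
        self.workload_status = "active"


def remove_unit(unit):
    # Models running hooks/stop: the unit leaves the cluster and stays "waiting".
    unit.workload_status = "waiting"


def remove_unit_with_cleanup(unit):
    try:
        unit.workload_status = "waiting"
    finally:
        # Stands in for re-clustering the unit and waiting for it to settle.
        unit.workload_status = "active"


def pause_and_resume(unit):
    # Same precondition as the assertion in the traceback above.
    assert unit.workload_status == "active"


def run(tests):
    """Run tests in the given order against one shared unit."""
    unit = FakeUnit()
    results = {}
    for test in tests:
        try:
            test(unit)
            results[test.__name__] = "OK"
        except AssertionError:
            results[test.__name__] = "FAIL"
    return results


if __name__ == "__main__":
    print(run([remove_unit, pause_and_resume]))
    print(run([pause_and_resume, remove_unit]))
    print(run([remove_unit_with_cleanup, pause_and_resume]))
```

Only the first ordering fails: the removal step leaves the shared unit in "waiting" and the pause test then trips on its precondition. Running the removal step last, or having it restore the unit before returning, avoids the failure in this model.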
