Intermittent test failure for test_901_remove_unit

Bug #1701061 reported by Frode Nordahl
12
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack RabbitMQ Server Charm
Triaged
Low
Unassigned

Bug Description

https://openstack-ci-reports.ubuntu.com/artifacts/test_charm_pipeline_amulet_full/openstack/charm-rabbitmq-server/474958/12/278/test_charm_amulet_full_261/amulet-full.txt

DEBUG:runner:2017-06-16 23:44:48,586 test_901_remove_unit DEBUG: Checking that units correctly clean up after themselves on unit removal...
DEBUG:runner:2017-06-16 23:44:48,586 _configure_services INFO: OpenStackAmuletDeployment: configure services
DEBUG:runner:2017-06-16 23:48:03,323 _auto_wait_for_status INFO: Waiting for extended status on units...
DEBUG:runner:2017-06-16 23:48:03,328 _auto_wait_for_status DEBUG: Custom extended status wait match: ^Unit is ready and clustered$
DEBUG:runner:2017-06-16 23:48:03,328 _auto_wait_for_status DEBUG: Waiting up to 1200s for extended status on services: ['rabbitmq-server']
DEBUG:runner:2017-06-16 23:48:03,749 _auto_wait_for_status INFO: OK
DEBUG:runner:2017-06-16 23:48:19,423 _auto_wait_for_status INFO: Waiting for extended status on units...
DEBUG:runner:2017-06-16 23:48:19,424 _auto_wait_for_status DEBUG: Custom extended status wait match: ^Unit is ready and clustered$
DEBUG:runner:2017-06-16 23:48:19,424 _auto_wait_for_status DEBUG: Waiting up to 1200s for extended status on services: ['rabbitmq-server']
DEBUG:runner:2017-06-16 23:48:19,749 _auto_wait_for_status INFO: OK
DEBUG:runner:2017-06-16 23:48:29,193 get_unit_hostnames DEBUG: Unit host names: {'rabbitmq-server/6': 'juju-4d268d-auto-osci-sv06-15', 'rabbitmq-server/7': 'juju-4d268d-auto-osci-sv06-16'}
DEBUG:runner:2017-06-16 23:48:34,445 run_cmd_unit DEBUG: rabbitmq-server/6 `rabbitmqctl cluster_status` command returned 0 (OK)
DEBUG:runner:2017-06-16 23:48:34,446 get_rmq_cluster_status DEBUG: rabbitmq-server/6 cluster_status:
DEBUG:runner:Cluster status of node 'rabbit@juju-4d268d-auto-osci-sv06-15' ...
DEBUG:runner:[{nodes,[{disc,['rabbit@juju-4d268d-auto-osci-sv06-15',
DEBUG:runner: 'rabbit@juju-4d268d-auto-osci-sv06-16',
DEBUG:runner: 'rabbit@juju-4d268d-auto-osci-sv06-17']}]},
DEBUG:runner: {running_nodes,['rabbit@juju-4d268d-auto-osci-sv06-16',
DEBUG:runner: 'rabbit@juju-4d268d-auto-osci-sv06-17',
DEBUG:runner: 'rabbit@juju-4d268d-auto-osci-sv06-15']},
DEBUG:runner: {cluster_name,<<"rabbit@juju-4d268d-auto-osci-sv06-17">>},
DEBUG:runner: {partitions,[]}]
DEBUG:runner:2017-06-16 23:48:39,713 run_cmd_unit DEBUG: rabbitmq-server/7 `rabbitmqctl cluster_status` command returned 0 (OK)
DEBUG:runner:2017-06-16 23:48:39,713 get_rmq_cluster_status DEBUG: rabbitmq-server/7 cluster_status:
DEBUG:runner:Cluster status of node 'rabbit@juju-4d268d-auto-osci-sv06-16' ...
DEBUG:runner:[{nodes,[{disc,['rabbit@juju-4d268d-auto-osci-sv06-15',
DEBUG:runner: 'rabbit@juju-4d268d-auto-osci-sv06-16']}]},
DEBUG:runner: {running_nodes,['rabbit@juju-4d268d-auto-osci-sv06-15',
DEBUG:runner: 'rabbit@juju-4d268d-auto-osci-sv06-16']},
DEBUG:runner: {cluster_name,<<"rabbit@juju-4d268d-auto-osci-sv06-17">>},
DEBUG:runner: {partitions,[]}]
DEBUG:runner:['Cluster registration check failed on rabbitmq-server/6: rabbit@juju-4d268d-auto-osci-sv06-17 should not be registered with RabbitMQ after unit removal.\n']
DEBUG:runner:Exit Code: 1
DEBUG:bundletester.utils:Updating JUJU_MODEL: "auto-osci-sv06:admin/auto-osci-sv06" -> ""
DEBUG:bundletester.fetchers:git rev-parse HEAD: 46a0b0760e236fc4ea7909f59d361a925838cd6f

ERROR: InvocationError: '/var/lib/jenkins/checkout/0/rabbitmq-server/.tox/func27/bin/bundletester -vl DEBUG -r json -o func-results.json --test-pattern gate-* --no-destroy'
___________________________________ summary ____________________________________
ERROR: func27: commands failed

The test will happily remove the leader unit or the unit that formed and gave name to the cluster. In such an event the test will fail.

It is debatable whether the remove code should take this into consideration and try to update the cluster name or not.

In the mean time we could improve the test to avoid removing the unit that gave name to the cluster so we don't get intermittent failures.

Tags: sts
Revision history for this message
Frode Nordahl (fnordahl) wrote :

The root cause may be that the registration check and or regex is too wide and matches on the cluster_name row? That check should really just hit on nodes listed after "nodes". I'll dig into that too and see what shows up.

Felipe Reyes (freyes)
tags: added: sts
Revision history for this message
Frode Nordahl (fnordahl) wrote :
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Hi Frode, there are actually two issues at play here. The amulet error that you've pasted above is a result of a race condition whereby the removed unit reference has not yet been removed from rabbitmq cluster_status at the time amulet does the poll. It does however get removed eventually so its really just a timing thing. The second issue, which you allude to in your comments above is potentially more serious and is the fact that if we remove the unit that was set as the leader_node_hostname which is represented in "cluster_name", that name is never updated to be that of the new leader and that in turn likely messes with the code that you reference above which was added as a fix for bug 1691510.

Revision history for this message
Frode Nordahl (fnordahl) wrote :

DEBUG:runner:Cluster status of node 'rabbit@juju-4d268d-auto-osci-sv06-16' ...
DEBUG:runner:[{nodes,[{disc,['rabbit@juju-4d268d-auto-osci-sv06-15',
DEBUG:runner: 'rabbit@juju-4d268d-auto-osci-sv06-16']}]},
DEBUG:runner: {running_nodes,['rabbit@juju-4d268d-auto-osci-sv06-15',
DEBUG:runner: 'rabbit@juju-4d268d-auto-osci-sv06-16']},
DEBUG:runner: {cluster_name,<<"rabbit@juju-4d268d-auto-osci-sv06-17">>},
DEBUG:runner: {partitions,[]}]
DEBUG:runner:['Cluster registration check failed on rabbitmq-server/6: rabbit@juju-4d268d-auto-osci-sv06-17 should not be registered with RabbitMQ after unit removal.\n']
DEBUG:runner:Exit Code: 1

As you can see here the test itself is clearly in error mistaking value of cluster_name for a node. Making the test sleep will not change that, so it's passing is random with regards to which node was removed.

So should do two things:
1) Update the functional test to not mistake cluster_name as a node in the cluster
2) Evaluate thoroughly whether we should update cluster_name on leader change

I believe 2) is not really necessary and I avoided updating the cluster name in the fixes for bug 1679449 since I did not have full view of what side effects that could have at the time.

As for bug 1691510 and the related fixes, it is not clear to me how that code is using cluster_name for anything?

Revision history for this message
Frode Nordahl (fnordahl) wrote :

The code used in the test was stolen from get_rmq_cluster_running_nodes function in charm-helpers (charmhelpers/contrib/openstack/amulet/utils.py)

So that means this bug exists there too.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

I think this test proves the contrary - http://pastebin.ubuntu.com/25017990/ i.e. that the amulet tests are not picking up the cluster_name.

Note that this is using the /next charms.

Revision history for this message
Frode Nordahl (fnordahl) wrote :

Right you are, now that is what you get for not quoting the whole context of the log output:
DEBUG:runner:2017-06-16 23:48:29,193 get_unit_hostnames DEBUG: Unit host names: {'rabbitmq-server/6': 'juju-4d268d-auto-osci-sv06-15', 'rabbitmq-server/7': 'juju-4d268d-auto-osci-sv06-16'}
DEBUG:runner:2017-06-16 23:48:34,445 run_cmd_unit DEBUG: rabbitmq-server/6 `rabbitmqctl cluster_status` command returned 0 (OK)
DEBUG:runner:2017-06-16 23:48:34,446 get_rmq_cluster_status DEBUG: rabbitmq-server/6 cluster_status:
DEBUG:runner:Cluster status of node 'rabbit@juju-4d268d-auto-osci-sv06-15' ...
DEBUG:runner:[{nodes,[{disc,['rabbit@juju-4d268d-auto-osci-sv06-15',
DEBUG:runner: 'rabbit@juju-4d268d-auto-osci-sv06-16',
DEBUG:runner: 'rabbit@juju-4d268d-auto-osci-sv06-17']}]},
DEBUG:runner: {running_nodes,['rabbit@juju-4d268d-auto-osci-sv06-16',
DEBUG:runner: 'rabbit@juju-4d268d-auto-osci-sv06-17',
DEBUG:runner: 'rabbit@juju-4d268d-auto-osci-sv06-15']},
DEBUG:runner: {cluster_name,<<"rabbit@juju-4d268d-auto-osci-sv06-17">>},
DEBUG:runner: {partitions,[]}]
DEBUG:runner:2017-06-16 23:48:39,713 run_cmd_unit DEBUG: rabbitmq-server/7 `rabbitmqctl cluster_status` command returned 0 (OK)
DEBUG:runner:2017-06-16 23:48:39,713 get_rmq_cluster_status DEBUG: rabbitmq-server/7 cluster_status:
DEBUG:runner:Cluster status of node 'rabbit@juju-4d268d-auto-osci-sv06-16' ...
DEBUG:runner:[{nodes,[{disc,['rabbit@juju-4d268d-auto-osci-sv06-15',
DEBUG:runner: 'rabbit@juju-4d268d-auto-osci-sv06-16']}]},
DEBUG:runner: {running_nodes,['rabbit@juju-4d268d-auto-osci-sv06-15',
DEBUG:runner: 'rabbit@juju-4d268d-auto-osci-sv06-16']},
DEBUG:runner: {cluster_name,<<"rabbit@juju-4d268d-auto-osci-sv06-17">>},
DEBUG:runner: {partitions,[]}]
DEBUG:runner:['Cluster registration check failed on rabbitmq-server/6: rabbit@juju-4d268d-auto-osci-sv06-17 should not be registered with RabbitMQ after unit removal.\n']
DEBUG:runner:Exit Code: 1

Revision history for this message
Frode Nordahl (fnordahl) wrote :

But still, I am not quite sure I see how not updating the RabbitMQ cluster_name relates to the changes made to fix bug 1691510.

As for the test, sleeping 10 secs is probably excessive. We could do a quick polling test across sentries looking for equal number of nodes listed before running the check or something like that.

Revision history for this message
Felipe Reyes (freyes) wrote : Re: [Bug 1701061] Re: Intermittent test failure for test_901_remove_unit

On Tue, 04 Jul 2017 09:21:25 -0000
Frode Nordahl <email address hidden> wrote:

> DEBUG:runner:Cluster status of node
> 'rabbit@juju-4d268d-auto-osci-sv06-16' ...
> DEBUG:runner:[{nodes,[{disc,['rabbit@juju-4d268d-auto-osci-sv06-15',
> DEBUG:runner:
> 'rabbit@juju-4d268d-auto-osci-sv06-16']}]}, DEBUG:runner:
> {running_nodes,['rabbit@juju-4d268d-auto-osci-sv06-15',
> DEBUG:runner:
> 'rabbit@juju-4d268d-auto-osci-sv06-16']}, DEBUG:runner:
> {cluster_name,<<"rabbit@juju-4d268d-auto-osci-sv06-17">>},
> DEBUG:runner: {partitions,[]}] DEBUG:runner:['Cluster registration
> check failed on rabbitmq-server/6:
> rabbit@juju-4d268d-auto-osci-sv06-17 should not be registered with
> RabbitMQ after unit removal.\n'] DEBUG:runner:Exit Code: 1
>
> As you can see here the test itself is clearly in error mistaking
> value of cluster_name for a node. Making the test sleep will not
> change that, so it's passing is random with regards to which node was
> removed.
>
> So should do two things:
> 1) Update the functional test to not mistake cluster_name as a node
> in the cluster 2) Evaluate thoroughly whether we should update
> cluster_name on leader change
>
> I believe 2) is not really necessary and I avoided updating the
> cluster name in the fixes for bug 1679449 since I did not have full
> view of what side effects that could have at the time.
>
> As for bug 1691510 and the related fixes, it is not clear to me how
> that code is using cluster_name for anything?

I was reading the code and I couldn't find anything using the
cluster_name besides testing code. So I think we could just set the
cluster to any string (e.g. rabbitmqctl set_cluster_name juju) and stop
caring about it, there is no reason to keep synced with the current
juju leader.

Revision history for this message
Frode Nordahl (fnordahl) wrote :

Right, we should definitely do something about the cluster name other than leaving it at the value set when the cluster was first created.

Either update to new leader name or a name not connected to any hostname as @freyes suggests.

I'll start by adding a functional test to verify what actually happens when we add a unit to the cluster after the fact of removing the current leader and go from there.

Revision history for this message
Frode Nordahl (fnordahl) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/#/c/482887/

Changed in charm-rabbitmq-server:
assignee: nobody → Frode Nordahl (fnordahl)
James Page (james-page)
Changed in charm-rabbitmq-server:
status: New → In Progress
importance: Undecided → Low
milestone: none → 18.02
Ryan Beisner (1chb1n)
Changed in charm-rabbitmq-server:
milestone: 18.02 → 18.05
James Page (james-page)
Changed in charm-rabbitmq-server:
milestone: 18.05 → 18.08
James Page (james-page)
Changed in charm-rabbitmq-server:
milestone: 18.08 → 18.11
James Page (james-page)
Changed in charm-rabbitmq-server:
milestone: 18.11 → 19.04
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-rabbitmq-server (master)

Change abandoned by Frode Nordahl (<email address hidden>) on branch: master
Review: https://review.openstack.org/482887
Reason: Marking this review as abandoned due to lack of activity in the past 6 months.

David Ames (thedac)
Changed in charm-rabbitmq-server:
milestone: 19.04 → 19.07
David Ames (thedac)
Changed in charm-rabbitmq-server:
milestone: 19.07 → 19.10
David Ames (thedac)
Changed in charm-rabbitmq-server:
milestone: 19.10 → 20.01
James Page (james-page)
Changed in charm-rabbitmq-server:
milestone: 20.01 → 20.05
David Ames (thedac)
Changed in charm-rabbitmq-server:
milestone: 20.05 → 20.08
James Page (james-page)
Changed in charm-rabbitmq-server:
milestone: 20.08 → none
Frode Nordahl (fnordahl)
Changed in charm-rabbitmq-server:
status: In Progress → Triaged
assignee: Frode Nordahl (fnordahl) → nobody
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.