adding unit to existing rabbitmq cluster causes some clients to fail

Bug #1796886 reported by Edward Hope-Morley
Affects: OpenStack RabbitMQ Server Charm
Status: Fix Released
Importance: High
Assigned to: Andrew McLeod

Bug Description

I have deployed an OpenStack (Ocata) cloud with a 3-node rabbitmq cluster and performed a number of VM creates etc. to check that RPC is working; all is good. I then add a unit to rabbitmq and notice that, once it is up and running, the rabbitmq cluster is healthy but some of my clients, e.g. nova-compute and neutron-metering-agent, have died. Checking the logs at both ends, I see that the clients were restarted (when their .conf files were updated with the new hosts) and tried to connect to the newly added rabbit unit, BUT it seems they did so before the new rabbit unit was fully synced/mirrored: the clients get an authentication failure and never retry, so they sit there unusable until they are manually restarted. They are consequently also reported as down in agent/service listings.

nova-compute.log error - https://pastebin.ubuntu.com/p/CKYBJPVnPK/
neutron-metering-agent.log error - https://pastebin.ubuntu.com/p/kjYFnhXdrp/
neutron-openvswitch-agent.log ... (they are all the same)

newly added rabbitmq unit's log from around the time of these errors - https://pastebin.ubuntu.com/p/C26KH69JBZ/ - note that these errors all occur before the log entries indicating that the cluster is syncing

Changed in charm-rabbitmq-server:
status: New → Incomplete
status: Incomplete → New
importance: Undecided → High
milestone: none → 18.11
description: updated
David Ames (thedac)
Changed in charm-rabbitmq-server:
milestone: 18.11 → 19.04
Revision history for this message
David Ames (thedac) wrote :

Let's add more gates in rabbit_utils.client_node_is_ready() to make sure the node has joined the cluster and there is a reasonable expectation it has its data.
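
A minimal sketch of the kind of extra gating being suggested, assuming charm-helpers' leader_get() and using a deliberately crude rabbitmqctl-based membership check; the 'cluster-ready' leader-settings key is hypothetical and this is not the charm's actual implementation:

```
# Sketch only: refuse to report "client ready" until this unit is part of the
# running cluster and the leader has published the shared state.
import subprocess

from charmhelpers.core.hookenv import leader_get


def node_is_clustered():
    """Crude illustration: does 'rabbitmqctl cluster_status' report running nodes?"""
    try:
        out = subprocess.check_output(['rabbitmqctl', 'cluster_status'],
                                      universal_newlines=True)
    except subprocess.CalledProcessError:
        return False
    return 'running_nodes' in out


def client_node_is_ready():
    # 'cluster-ready' is a hypothetical flag the leader would set once vhosts
    # and users have been created on all members.
    return node_is_clustered() and bool(leader_get('cluster-ready'))
```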

Changed in charm-rabbitmq-server:
status: New → Confirmed
status: Confirmed → Triaged
Changed in charm-rabbitmq-server:
assignee: nobody → Chris Sanders (chris.sanders)
assignee: Chris Sanders (chris.sanders) → nobody
Changed in charm-rabbitmq-server:
assignee: nobody → Alok G Singh (alephnull-b)
Revision history for this message
Alok G Singh (alephnull-b) wrote :

Cannot reproduce. My test setup of course does not have much load, so I simulated some AMQP traffic by invoking `openstack compute service list` repeatedly in parallel, but there is no sign of the reported problem.

While clustering, the charm fails and retries and finally succeeds in all my tests.

@hopem, could you provide a test case that would help isolate the problem? I'll mark this Incomplete for now.

Changed in charm-rabbitmq-server:
status: Triaged → Incomplete
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Alok, I was probably too hasty in suggesting this was incomplete, as David has indicated where the fix should probably go. It's possibly best if a fix is made first, and Ed can then test it on his setup?

Changed in charm-rabbitmq-server:
status: Incomplete → Triaged
Revision history for this message
Alok G Singh (alephnull-b) wrote :

I had been trying to reproduce this with openstack-on-LXD. hopem suggested that KVM might be a better bet for reproducing it. From IRC:

<dosaboy> i think all i did was deploy 3 units, with min-cluster-size=3
<dosaboy> then i added a unit
<dosaboy> and it pretty consistently caused failures
<dosaboy> and im deploying on openstack
<dosaboy> any client was having issues
<dosaboy> not specific to any one service

@ajkavanagh, could it be as simple as calling rabbit_utils.cluster_ready() in rabbit_utils.client_node_is_ready()?

Revision history for this message
Edward Hope-Morley (hopem) wrote :

@alephnull-b here is a bundle, as requested, that you can use to reproduce this issue - https://paste.ubuntu.com/p/VvGVfGh8PQ/

I have just re-tried this myself (using KVM) and can confirm the issue is still there, i.e. I deployed the bundle (including 3 units of rabbit) and waited for everything to be ready. I checked that no services were in error. I then ran 'juju add-unit rabbitmq-server', and once the new unit was up, sure enough one of the rabbit clients in my deployment went to error state with 'AccessRefused: (0, 0): (403) ACCESS_REFUSED'. This time it was neutron-metering-agent, but I have found it is quite random which clients fail to connect.

Hope that helps.

Revision history for this message
Alok G Singh (alephnull-b) wrote :

It appears that the root cause of this behaviour might be elsewhere. Even before the rabbitmq cluster is formed, the amqp-relation-changed hook fires on the clients of rabbitmq-server:amqp because the relation already exists. This causes the newly added rmq node's IP to propagate to the clients' configuration, triggering errors of the form:

AccessRefused: (0, 0): (403) ACCESS_REFUSED - Login was refused using authentication mechanism AMQPLAIN. For details see the broker logfile

Note that this analysis is applicable only for this type of error. I have not seen any other error on any of the clients.

For the nova-compute charm, the code in question is at https://github.com/openstack/charm-nova-compute/blob/master/hooks/nova_compute_hooks.py#L279

`rabbit_utils.client_is_ready()` will not help in fixing this. After some discussion with lathiat, we could implement a flag on the amqp relation to indicate that the peer is ready. The following clients will need to be modified to check for this "completion flag" before writing configuration files (a minimal sketch of such a check follows the list below):

ceilometer
cinder
designate
glance
heat
neutron-api
neutron-gateway
neutron-openvswitch
nova-cloud-controller
nova-compute

This is a little complicated to implement and care should be taken to ensure that the implementation is backwards-compatible.
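
A minimal sketch of the client-side half of this idea, assuming charm-helpers' hookenv API; the 'client_ready' relation key and render_amqp_config() helper are hypothetical, not anything the charms currently implement:

```
# Sketch: a consuming charm's amqp-relation-changed hook defers writing its
# config until the rabbit unit has explicitly signalled readiness.
from charmhelpers.core.hookenv import log, relation_get


def render_amqp_config():
    """Placeholder for the charm's real 'render config and restart services' step."""


def amqp_changed(relation_id=None, remote_unit=None):
    # 'client_ready' is a hypothetical flag the rabbit unit would set only
    # after it has joined the cluster and its users/vhosts exist.
    if not relation_get('client_ready', rid=relation_id, unit=remote_unit):
        log('%s has not signalled readiness; deferring config render' % remote_unit)
        return
    render_amqp_config()
```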

The other half of the bug is that even _after_ the cluster is formed correctly, the clients do not reconnect and stay failed. Errors like the below are seen:

```
2019-02-14 08:19:16.464 26635 ERROR neutron.common.rpc [req-a335630e-c214-4e52-8c71-d145c73ed2bc - - - - -] Timeout in RPC method tunnel_sync. Waiting for 34 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.
2019-02-14 08:19:16.465 26635 WARNING neutron.common.rpc [req-a335630e-c214-4e52-8c71-d145c73ed2bc - - - - -] Increasing timeout for tunnel_sync calls to 120 seconds. Restart the agent to restore it to the default value.
```

The service does _not_ come back up and instances cannot be launched and connected to.

This part of the bug is probably better fixed in oslo_messaging as the config files pointing to rabbitmq are correctly generated.
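
For reference, oslo.messaging's RPC client does expose a bounded-retry knob; something along these lines is roughly what "handle retries better" would mean on the client side. This is purely illustrative (the topic name and retry count are made up) and is not presented as a fix for the stuck agents described above:

```
# Illustrative only: an RPC client that retries transport errors a bounded
# number of times instead of failing on the first error.
import oslo_messaging
from oslo_config import cfg

conf = cfg.CONF
transport = oslo_messaging.get_rpc_transport(conf)  # reads transport_url from config
target = oslo_messaging.Target(topic='example_topic', version='1.0')
client = oslo_messaging.RPCClient(transport, target, retry=3)
```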

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

I think the interesting part of the bug is: "the clients get an authentication failure and never retry thus they sit there unusable until they are manually restarted."

This seems to imply that the config for the amqp client (oslo) hasn't been updated, or that the service wasn't restarted to pick it up, i.e. if they sit in a failed state then there is still something wrong in the client code (potentially) or in the data provided to the client. If the cluster comes up and all of the configs on the clients are correct and it *still* isn't working, then it implies that the client wasn't restarted.

Revision history for this message
Alok G Singh (alephnull-b) wrote :

Agreed @ajkavanagh.

I attempted to reproduce this on bionic-queens. After expanding the rmq cluster, the following services died and had to be restarted:

cinder/0:
- cinder-scheduler
- apache2

neutron-gateway/0:
- neutron-metering-agent

nova-cloud-controller/0:
- nova-scheduler

Until these services were restarted, spinning up new instances would end in

```
Timed out waiting for a reply to message ID 38611b56e5c84bc49fe689e45dad0e8a

  File "/usr/lib/python2.7/dist-packages/nova/conductor/manager.py", line 1120, in schedule_and_build_instances
    instance_uuids, return_alternates=True)
  File "/usr/lib/python2.7/dist-packages/nova/conductor/manager.py", line 717, in _schedule_instances
    return_alternates=return_alternates)
  File "/usr/lib/python2.7/dist-packages/nova/scheduler/utils.py", line 787, in wrapped
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/__init__.py", line 53, in select_destinations
    instance_uuids, return_objects, return_alternates)
  File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/__init__.py", line 37, in __run_method
    return getattr(self.instance, __name)(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/query.py", line 42, in select_destinations
    instance_uuids, return_objects, return_alternates)
  File "/usr/lib/python2.7/dist-packages/nova/scheduler/rpcapi.py", line 158, in select_destinations
    return cctxt.call(ctxt, 'select_destinations', **msg_args)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 174, in call
    retry=self.retry)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 131, in _send
    timeout=timeout, retry=retry)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 559, in send
    retry=retry)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 548, in _send
    result = self._waiter.wait(msg_id, timeout)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 440, in wait
    message = self.waiters.get(msg_id, timeout=timeout)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 328, in get
    'to message ID %s' % msg_id)
```

After restarting the above services, instances can once again be scheduled and accessed.

Revision history for this message
Alok G Singh (alephnull-b) wrote :

On bionic-rocky, the behaviour is different. After expanding the rmq cluster, spawning instances fails with the error:

```
Build of instance 5fc7fe89-854d-459d-87fc-cfcab8c36645 aborted: Failed to allocate the network(s), not rescheduling.
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 1929, in _do_build_and_run_instance
    filter_properties, request_spec)
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 2186, in _build_and_run_instance
    reason=msg)
```

After restarting the following services

neutron-gateway/0
- neutron-metering-agent
- neutron-lbaasv2-agent
- neutron-dhcp-agent
- neutron-metadata-agent
- neutron-openvswitch-agent

instances still would not spawn with the same error as above. The root cause appears to be:
```
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent [req-4f7fa2c5-6169-43a7-b713-3192511d902e - - - - -] Unable to reload_allocations dhcp for 76049129-b69a-46f4-ba03-411ce03ea8ac.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 20f1d8ce5d7747d38237ac6f0716e508
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 332, in get
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent return self._queues[msg_id].get(block=True, timeout=timeout)
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3/dist-packages/eventlet/queue.py", line 313, in get
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent return waiter.wait()
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3/dist-packages/eventlet/queue.py", line 141, in wait
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent return get_hub().switch()
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3/dist-packages/eventlet/hubs/hub.py", line 294, in switch
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent return self.greenlet.switch()
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent queue.Empty
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent During handling of the above exception, another exception occurred:
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3/dist-packages/neutron/agent/dhcp/agent.py", line 144, in call_driver
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent getattr(driver, action)(**action_kwargs)
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3/dist-packages/neutron/agent/linux/dhcp.py", line 511, in reload_allocations
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent self.disable()
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3/dist-packages/neutron/agent/linux...
```


David Ames (thedac)
Changed in charm-rabbitmq-server:
milestone: 19.04 → 19.07
David Ames (thedac)
Changed in charm-rabbitmq-server:
milestone: 19.07 → 19.10
David Ames (thedac)
Changed in charm-rabbitmq-server:
milestone: 19.10 → 20.01
Changed in charm-rabbitmq-server:
assignee: Alok G Singh (alephnull-b) → nobody
James Page (james-page)
Changed in charm-rabbitmq-server:
milestone: 20.01 → 20.05
Revision history for this message
Chris Sanders (chris.sanders) wrote :

I'm subscribing ~field-high as this appears to have slipped several releases, and it is going to impact a major re-architecture/migration event we have planned in the near future. If this lands in 20.05 we'll be fine; if it's later, it could be problematic.

David Ames (thedac)
Changed in charm-rabbitmq-server:
milestone: 20.05 → 20.08
Revision history for this message
David Ames (thedac) wrote :

I have been asked to rule out this bug https://bugs.launchpad.net/charm-rabbitmq-server/+bug/1818260 as a possible culprit.

If the following log message is found on the leader node in one of the failed scenarios then it would be a duplicate.

 "check_cluster_memberships(): '<NODENAME>' in nodes but not in charm relations or running_nodes, telling RabbitMQ to forget about it"

Based on my reading of the bug I don't think that is likely.

Further TRIAGE:

I still think the root problem is that the node newly joining the cluster (or something else in the charm) sets its IP address on the amqp relation before it is ready.

As I mentioned in Comment #1, I think this can be resolved in rabbit_utils.client_node_is_ready(). It could, for example, check that all of the users/passwords in leader-settings are accessible before setting amqp relation data.
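
A minimal sketch of the kind of check being described, assuming charm-helpers' leader_get() and a '_password' key convention in leader-settings (both assumptions for illustration, not the charm's actual data layout):

```
# Sketch: only publish amqp relation data once every user recorded in
# leader-settings actually exists on the local broker.
import subprocess

from charmhelpers.core.hookenv import leader_get


def local_broker_has_all_users():
    # Usernames are assumed to be stored as '<user>_password' keys.
    expected = {key[:-len('_password')]
                for key in (leader_get() or {})
                if key.endswith('_password')}
    try:
        out = subprocess.check_output(['rabbitmqctl', '-q', 'list_users'],
                                      universal_newlines=True)
    except subprocess.CalledProcessError:
        return False
    present = {line.split()[0] for line in out.splitlines() if line.strip()}
    return expected.issubset(present)
```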

Alok in Comment #6 is describing the symptom, not the root cause. We must stop the new node's IP from being set on the relation until it is ready to handle requests.

I agree with everyone that the oslo.messaging client code should probably be more robust and handle retries better, but we have more control over the rabbitmq-server charm.

Revision history for this message
Andrew McLeod (admcleod) wrote :

After the fix for the bug mentioned in #11 was merged, I retested and the problem still exists. Continuing to triage.

Revision history for this message
Andrew McLeod (admcleod) wrote :

I tried waiting for more things:

```
        if (leader_get(attribute='{}_password'.format(rid)) and
            leader_get(attribute='{}_private-address'.format(rid)) and
            leader_get(attribute='{}_resource_params'.format(rid)) and
            leader_get(attribute='{}_resources'.format(rid))):
```

This worked once (i.e. for one new unit), but not for a second new unit.

Revision history for this message
Andrew McLeod (admcleod) wrote :

It appears that a restart is issued to rabbit and then the client connects. In the situations where it fails, the login failure occurs after the restart command has been issued but before the server has actually stopped.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Just tested the next charms on bionic-stein and can confirm that I hit the error simply by going from a single unit to 3 units (cluster_count: 3).

Revision history for this message
Nicolas Bock (nicolasbock) wrote :

Can confirm with focal/ussuri. Adding a 4th rabbitmq-server breaks nova-compute.

Revision history for this message
Andrew McLeod (admcleod) wrote :

I've tried testing this with different settings for min-cluster-size and have noticed that if I increase the config to n+1 and then add a new node, I don't have any issues with the service going down.

Similarly, it appears that if min-cluster-size=0 (or presumably unset, though I didn't confirm that), I also don't see the problem.

Andrew McLeod (admcleod)
Changed in charm-rabbitmq-server:
assignee: nobody → Andrew McLeod (admcleod)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (master)

Reviewed: https://review.opendev.org/737717
Committed: https://git.openstack.org/cgit/openstack/charm-rabbitmq-server/commit/?id=20f865c19fb1ca296dd6b574da110d9d2c1be91c
Submitter: Zuul
Branch: master

commit 20f865c19fb1ca296dd6b574da110d9d2c1be91c
Author: Liam Young <email address hidden>
Date: Wed Jun 24 09:13:28 2020 +0000

    Do not set client data if not yet clustered

    If a unit is not yet clustered then do not send information to
    the rabbit clients because the unit is not yet ready for
    connections. A fix is also needed on the client side *1 to
    gate on receiving data before adding rabbit unit to list
    of hosts.

    *1 https://github.com/juju/charm-helpers/pull/482

    Closes-Bug: #1796886
    Change-Id: Ib7554bf22356803feb2637a3db3478be575ba350
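
A minimal sketch of the client-side gate referenced above (*1): when building the list of rabbit hosts, skip any unit that has not yet sent credentials over the relation. The helper names follow charm-helpers' hookenv API, but the function itself is illustrative and is not the merged charm-helpers change:

```
# Sketch: only rabbit units that have completed setup publish a password, so
# unclustered or still-initialising units never make it into transport_url.
from charmhelpers.core.hookenv import related_units, relation_get


def ready_rabbit_hosts(relation_id):
    hosts = []
    for unit in related_units(relation_id):
        rdata = relation_get(rid=relation_id, unit=unit) or {}
        if rdata.get('password') and rdata.get('private-address'):
            hosts.append(rdata['private-address'])
    return sorted(hosts)
```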

Changed in charm-rabbitmq-server:
status: In Progress → Fix Committed
Revision history for this message
Edward Hope-Morley (hopem) wrote :

I'm afraid this is still not fixing the problem for me. I did the following:

  * deployed bionic-stein using cs:~openstack-charmers-next/rabbitmq-server-372 (which includes this patch)
  * deployed 3 units of rabbit and set cluster-count=3
  * all worked fine
  * then set cluster-count=4 and waited for all units' status to be updated to "waiting ..."
  * added one unit of rabbit
  * most of my neutron and nova agents went to error state and I got:

neutron-gateway/0* blocked idle 5 10.5.0.68 Services not running that should be: neutron-lbaasv2-agent, neutron-l3-agent
nova-cloud-controller/0* blocked idle 6 10.5.0.9 8774/tcp,8775/tcp,8778/tcp Services not running that should be: nova-scheduler
nova-compute/0* active idle 7 10.5.0.36 Unit is ready
  neutron-openvswitch/0* blocked idle 10.5.0.36 Services not running that should be: neutron-dhcp-agent
rabbitmq-server/0 active idle 8 10.5.0.61 5672/tcp Unit is ready and clustered
rabbitmq-server/1 active idle 9 10.5.0.70 5672/tcp Unit is ready and clustered
rabbitmq-server/2* active idle 10 10.5.0.12 5672/tcp Unit is ready and clustered
rabbitmq-server/3 active idle 11 10.5.0.56 5672/tcp Unit is ready and clustered

Revision history for this message
Edward Hope-Morley (hopem) wrote :

My bad, I missed that the client-side patches have not landed yet. I will re-test with those included.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Having now tested with the updated clients I think I have found another issue. On my nova-compute host I see that the client config does not contain the address of the new unit (rabbitmq-server/4):

rabbitmq-server/0 active idle 8 10.5.0.61 5672/tcp Unit is ready and clustered
rabbitmq-server/1 active idle 9 10.5.0.70 5672/tcp Unit is ready and clustered
rabbitmq-server/2* active idle 10 10.5.0.12 5672/tcp Unit is ready and clustered
rabbitmq-server/3 active idle 11 10.5.0.56 5672/tcp Unit is ready and clustered
rabbitmq-server/4 active idle 12 10.5.0.17 5672/tcp Unit is ready and clustered

root@juju-c0abfd-1796886-7:~# grep transport_url /etc/neutron/neutron.conf | tail -n 1
transport_url = rabbit://neutron:LjxTkyxMH4MM8pZkscKGfL66LKWZK2XFT47NPCkFbnffgd8zsL9dT9yCngCcwgyj@10.5.0.12:5672,neutron:LjxTkyxMH4MM8pZkscKGfL66LKWZK2XFT47NPCkFbnffgd8zsL9dT9yCngCcwgyj@10.5.0.56:5672,neutron:LjxTkyxMH4MM8pZkscKGfL66LKWZK2XFT47NPCkFbnffgd8zsL9dT9yCngCcwgyj@10.5.0.61:5672,neutron:LjxTkyxMH4MM8pZkscKGfL66LKWZK2XFT47NPCkFbnffgd8zsL9dT9yCngCcwgyj@10.5.0.70:5672/openstack
root@juju-c0abfd-1796886-7:~# grep transport_url /etc/neutron/neutron.conf | tail -n 1| grep 10.5.0.17

Revision history for this message
Andrew McLeod (admcleod) wrote :

This review should address #22: https://review.opendev.org/#/c/738219/

Revision history for this message
Edward Hope-Morley (hopem) wrote :

To close the loop: all of the patches that landed have resolved the issue for me. As long as I set min-cluster-size to the targeted size prior to adding units, I do not hit the problem anymore.

Revision history for this message
James Page (james-page) wrote :

Due to the wide-impacting nature of the fix for this issue, it will not be backported to the stable charms but will be included in the upcoming 20.08 charm release.

Changed in charm-rabbitmq-server:
status: Fix Committed → Fix Released