adding unit to existing rabbitmq cluster causes some clients to fail

Bug #1796886 reported by Edward Hope-Morley
Affects: OpenStack RabbitMQ Server Charm
Status: Fix Released
Importance: High
Assigned to: Andrew McLeod

Bug Description

I have deployed an OpenStack (Ocata) cloud with a 3-node rabbitmq cluster and performed a number of VM creates etc. to check that RPC is working; all is good. I then add a unit to rabbitmq and notice that, once it is up and running, the rabbitmq cluster is healthy but some of my clients, e.g. nova-compute and neutron-metering-agent, have died. Checking the logs at both ends, I see that the clients were restarted (when their .conf files were updated with the new hosts) and tried to connect to the newly added rabbit unit, BUT it seems they did so before the new rabbit unit was fully synced/mirrored: the clients get an authentication failure and never retry, so they sit there unusable until they are manually restarted. They are consequently also reported as down in agent/service listings.

nova-compute.log error - https://pastebin.ubuntu.com/p/CKYBJPVnPK/
neutron-metering-agent.log error - https://pastebin.ubuntu.com/p/kjYFnhXdrp/
neutron-openvswitch-agent.log ... (they are all the same)

newly added rabbitmq unit's log from around the time of these errors - https://pastebin.ubuntu.com/p/C26KH69JBZ/ - note that these errors all occur before the log entries indicating that the cluster is syncing

Changed in charm-rabbitmq-server:
status: New → Incomplete
status: Incomplete → New
importance: Undecided → High
milestone: none → 18.11
description: updated
David Ames (thedac)
Changed in charm-rabbitmq-server:
milestone: 18.11 → 19.04
Revision history for this message
David Ames (thedac) wrote :

Let's add more gates in rabbit_utils.client_node_is_ready() to make sure the node has joined the cluster and there is a reasonable expectation it has its data.
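
A minimal sketch of the kind of extra gating being suggested, assuming charm-helpers' leader_get() and using a deliberately crude rabbitmqctl-based membership check; the 'cluster-ready' leader-settings key is hypothetical and this is not the charm's actual implementation:

```
# Sketch only: refuse to report "client ready" until this unit is part of the
# running cluster and the leader has published the shared state.
import subprocess

from charmhelpers.core.hookenv import leader_get


def node_is_clustered():
    """Crude illustration: does 'rabbitmqctl cluster_status' report running nodes?"""
    try:
        out = subprocess.check_output(['rabbitmqctl', 'cluster_status'],
                                      universal_newlines=True)
    except subprocess.CalledProcessError:
        return False
    return 'running_nodes' in out


def client_node_is_ready():
    # 'cluster-ready' is a hypothetical flag the leader would set once vhosts
    # and users have been created on all members.
    return node_is_clustered() and bool(leader_get('cluster-ready'))
```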

Changed in charm-rabbitmq-server:
status: New → Confirmed
status: Confirmed → Triaged
Changed in charm-rabbitmq-server:
assignee: nobody → Chris Sanders (chris.sanders)
assignee: Chris Sanders (chris.sanders) → nobody
Changed in charm-rabbitmq-server:
assignee: nobody → Alok G Singh (alephnull-b)
Revision history for this message
Alok G Singh (alephnull-b) wrote :

Cannot reproduce. My test setup of course does not have much load, so I simulated some AMQP traffic by invoking `openstack compute service list` repeatedly in parallel, but there is no sign of the reported problem.

While clustering, the charm fails and retries and finally succeeds in all my tests.

@hopem, could you provide a test case that would help isolate the problem? I'll mark this Incomplete for now.

Changed in charm-rabbitmq-server:
status: Triaged → Incomplete
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Alok, I was probably too hasty in suggesting this was incomplete, as David has indicated where the fix should probably go. It's possibly best if a fix is made first, and Ed can then test it on his setup?

Changed in charm-rabbitmq-server:
status: Incomplete → Triaged
Revision history for this message
Alok G Singh (alephnull-b) wrote :

I had been trying to reproduce this with openstack-on-LXD. hopem suggested that KVM might be a better bet for reproducing it. From IRC:

<dosaboy> i think all i did was deploy 3 units, with min-cluster-size=3
<dosaboy> then i added a unit
<dosaboy> and it pretty consistently caused failures
<dosaboy> and im deploying on openstack
<dosaboy> any client was having issues
<dosaboy> not specific to any one service

@ajkavanagh, could it be as simple as calling rabbit_utils.cluster_ready() in rabbit_utils.client_node_is_ready()?

Revision history for this message
Edward Hope-Morley (hopem) wrote :

@alephnull-b here is a bundle, as requested, that you can use to reproduce this issue - https://paste.ubuntu.com/p/VvGVfGh8PQ/

I have just re-tried this myself (using KVM) and can confirm the issue is still there, i.e. I deployed the bundle (including 3 units of rabbit) and waited for everything to be ready. I checked that no services were in error. I then ran 'juju add-unit rabbitmq-server', and once the new unit was up, sure enough one of the rabbit clients in my deployment went to error state with 'AccessRefused: (0, 0): (403) ACCESS_REFUSED'. This time it was neutron-metering-agent, but I have found it is quite random which clients fail to connect.

Hope that helps.

Revision history for this message
Alok G Singh (alephnull-b) wrote :

It appears that the root cause of this behaviour might be elsewhere. Even before the rabbitmq cluster is formed, the amqp-relation-changed hook fires on the clients of rabbitmq-server:amqp because the relation already exists. This causes the newly added rmq node's IP to propagate to the clients' configuration, triggering errors of the form:

AccessRefused: (0, 0): (403) ACCESS_REFUSED - Login was refused using authentication mechanism AMQPLAIN. For details see the broker logfile

Note that this analysis is applicable only for this type of error. I have not seen any other error on any of the clients.

For the nova-compute charm, the code in question is at https://github.com/openstack/charm-nova-compute/blob/master/hooks/nova_compute_hooks.py#L279

`rabbit_utils.client_is_ready()` will not help in fixing this. After some discussion with lathiat, we could implement a flag on the amqp relation to indicate that the peer is ready. The following clients will need to be modified to check for this "completion flag" before writing configuration files (a minimal sketch of such a check follows the list below):

ceilometer
cinder
designate
glance
heat
neutron-api
neutron-gateway
neutron-openvswitch
nova-cloud-controller
nova-compute

This is a little complicated to implement and care should be taken to ensure that the implementation is backwards-compatible.
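
A minimal sketch of the client-side half of this idea, assuming charm-helpers' hookenv API; the 'client_ready' relation key and render_amqp_config() helper are hypothetical, not anything the charms currently implement:

```
# Sketch: a consuming charm's amqp-relation-changed hook defers writing its
# config until the rabbit unit has explicitly signalled readiness.
from charmhelpers.core.hookenv import log, relation_get


def render_amqp_config():
    """Placeholder for the charm's real 'render config and restart services' step."""


def amqp_changed(relation_id=None, remote_unit=None):
    # 'client_ready' is a hypothetical flag the rabbit unit would set only
    # after it has joined the cluster and its users/vhosts exist.
    if not relation_get('client_ready', rid=relation_id, unit=remote_unit):
        log('%s has not signalled readiness; deferring config render' % remote_unit)
        return
    render_amqp_config()
```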

The other half of the bug is that even _after_ the cluster is formed correctly, the clients do not reconnect and stay failed. Errors like the below are seen:

```
2019-02-14 08:19:16.464 26635 ERROR neutron.common.rpc [req-a335630e-c214-4e52-8c71-d145c73ed2bc - - - - -] Timeout in RPC method tunnel_sync. Waiting for 34 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.
2019-02-14 08:19:16.465 26635 WARNING neutron.common.rpc [req-a335630e-c214-4e52-8c71-d145c73ed2bc - - - - -] Increasing timeout for tunnel_sync calls to 120 seconds. Restart the agent to restore it to the default value.
```

The service does _not_ come back up and instances cannot be launched and connected to.

This part of the bug is probably better fixed in oslo_messaging as the config files pointing to rabbitmq are correctly generated.
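
For reference, oslo.messaging's RPC client does expose a bounded-retry knob; something along these lines is roughly what "handle retries better" would mean on the client side. This is purely illustrative (the topic name and retry count are made up) and is not presented as a fix for the stuck agents described above:

```
# Illustrative only: an RPC client that retries transport errors a bounded
# number of times instead of failing on the first error.
import oslo_messaging
from oslo_config import cfg

conf = cfg.CONF
transport = oslo_messaging.get_rpc_transport(conf)  # reads transport_url from config
target = oslo_messaging.Target(topic='example_topic', version='1.0')
client = oslo_messaging.RPCClient(transport, target, retry=3)
```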

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

I think the interesting part of the bug is: "the clients get an authentication failure and never retry thus they sit there unusable until they are manually restarted."

This seems to imply that the config for the amqp client (oslo) hasn't been updated, or that the service wasn't restarted to pick it up, i.e. if they sit in a failed state then there is still something wrong in the client code (potentially) or in the data provided to the client. If the cluster comes up and all of the configs on the clients are correct and it *still* isn't working, then it implies that the client wasn't restarted.

Revision history for this message
Alok G Singh (alephnull-b) wrote :

Agreed @ajkavanagh.

I attempted to reproduce this on bionic-queens. After expanding the rmq cluster, the following services died and had to be restarted:

cinder/0:
- cinder-scheduler
- apache2

neutron-gateway/0:
- neutron-metering-agent

nova-cloud-controller/0:
- nova-scheduler

Until these services were restarted, spinning up new instances would end in

```
Timed out waiting for a reply to message ID 38611b56e5c84bc49fe689e45dad0e8a

  File "/usr/lib/python2.7/dist-packages/nova/conductor/manager.py", line 1120, in schedule_and_build_instances
    instance_uuids, return_alternates=True)
  File "/usr/lib/python2.7/dist-packages/nova/conductor/manager.py", line 717, in _schedule_instances
    return_alternates=return_alternates)
  File "/usr/lib/python2.7/dist-packages/nova/scheduler/utils.py", line 787, in wrapped
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/__init__.py", line 53, in select_destinations
    instance_uuids, return_objects, return_alternates)
  File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/__init__.py", line 37, in __run_method
    return getattr(self.instance, __name)(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/query.py", line 42, in select_destinations
    instance_uuids, return_objects, return_alternates)
  File "/usr/lib/python2.7/dist-packages/nova/scheduler/rpcapi.py", line 158, in select_destinations
    return cctxt.call(ctxt, 'select_destinations', **msg_args)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 174, in call
    retry=self.retry)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 131, in _send
    timeout=timeout, retry=retry)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 559, in send
    retry=retry)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 548, in _send
    result = self._waiter.wait(msg_id, timeout)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 440, in wait
    message = self.waiters.get(msg_id, timeout=timeout)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 328, in get
    'to message ID %s' % msg_id)
```

After restarting the above services, instances can once again be scheduled and accessed.

Revision history for this message
Alok G Singh (alephnull-b) wrote :

On bionic-rocky, the behaviour is different. After expanding the rmq cluster, spawning instances fails with the error:

```
Build of instance 5fc7fe89-854d-459d-87fc-cfcab8c36645 aborted: Failed to allocate the network(s), not rescheduling.
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 1929, in _do_build_and_run_instance
    filter_properties, request_spec)
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 2186, in _build_and_run_instance
    reason=msg)
```

After restarting the following services

neutron-gateway/0
- neutron-metering-agent
- neutron-lbaasv2-agent
- neutron-dhcp-agent
- neutron-metadata-agent
- neutron-openvswitch-agent

instances still would not spawn with the same error as above. The root cause appears to be:
```
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent [req-4f7fa2c5-6169-43a7-b713-3192511d902e - - - - -] Unable to reload_allocations dhcp for 76049129-b69a-46f4-ba03-411ce03ea8ac.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 20f1d8ce5d7747d38237ac6f0716e508
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 332, in get
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent return self._queues[msg_id].get(block=True, timeout=timeout)
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3/dist-packages/eventlet/queue.py", line 313, in get
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent return waiter.wait()
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3/dist-packages/eventlet/queue.py", line 141, in wait
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent return get_hub().switch()
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3/dist-packages/eventlet/hubs/hub.py", line 294, in switch
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent return self.greenlet.switch()
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent queue.Empty
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent During handling of the above exception, another exception occurred:
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3/dist-packages/neutron/agent/dhcp/agent.py", line 144, in call_driver
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent getattr(driver, action)(**action_kwargs)
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3/dist-packages/neutron/agent/linux/dhcp.py", line 511, in reload_allocations
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent self.disable()
2019-02-15 06:28:15.854 13439 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3/dist-packages/neutron/agent/linux...
```


David Ames (thedac)
Changed in charm-rabbitmq-server:
milestone: 19.04 → 19.07
David Ames (thedac)
Changed in charm-rabbitmq-server:
milestone: 19.07 → 19.10
David Ames (thedac)
Changed in charm-rabbitmq-server:
milestone: 19.10 → 20.01
Changed in charm-rabbitmq-server:
assignee: Alok G Singh (alephnull-b) → nobody
James Page (james-page)
Changed in charm-rabbitmq-server:
milestone: 20.01 → 20.05
Revision history for this message
Chris Sanders (chris.sanders) wrote :

I'm subscribing ~field-high as this appears to have slipped several releases, and it is going to impact a major re-architecture/migration event we have planned in the near future. If this lands in 20.05 we'll be fine; if it's later, it could be problematic.

David Ames (thedac)
Changed in charm-rabbitmq-server:
milestone: 20.05 → 20.08
Revision history for this message
David Ames (thedac) wrote :

I have been asked to rule out this bug https://bugs.launchpad.net/charm-rabbitmq-server/+bug/1818260 as a possible culprit.

If the following log message is found on the leader node in one of the failed scenarios then it would be a duplicate.

 "check_cluster_memberships(): '<NODENAME>' in nodes but not in charm relations or running_nodes, telling RabbitMQ to forget about it"

Based on my reading of the bug I don't think that is likely.

Further TRIAGE:

I still think the root problem is that the node newly joining the cluster (or something else in the charm) sets its IP address on the amqp relation before it is ready.

As I mentioned in Comment #1, I think this can be resolved in rabbit_utils.client_node_is_ready(). It could, for example, check that all of the users/passwords in leader-settings are accessible before setting amqp relation data.
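
A minimal sketch of the kind of check being described, assuming charm-helpers' leader_get() and a '_password' key convention in leader-settings (both assumptions for illustration, not the charm's actual data layout):

```
# Sketch: only publish amqp relation data once every user recorded in
# leader-settings actually exists on the local broker.
import subprocess

from charmhelpers.core.hookenv import leader_get


def local_broker_has_all_users():
    # Usernames are assumed to be stored as '<user>_password' keys.
    expected = {key[:-len('_password')]
                for key in (leader_get() or {})
                if key.endswith('_password')}
    try:
        out = subprocess.check_output(['rabbitmqctl', '-q', 'list_users'],
                                      universal_newlines=True)
    except subprocess.CalledProcessError:
        return False
    present = {line.split()[0] for line in out.splitlines() if line.strip()}
    return expected.issubset(present)
```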

Alok in Comment #6 is describing the symptom, not the root cause. We must stop the new node's IP from being set on the relation until it is ready to handle requests.

I agree with everyone that the oslo.messaging client code should probably be more robust and handle retries better, but we have more control over the rabbitmq-server charm.

Revision history for this message
Andrew McLeod (admcleod) wrote :

After the fix for the bug mentioned in #11 was merged, I retested and the problem still exists. Continuing to triage.

Revision history for this message
Andrew McLeod (admcleod) wrote :

I tried waiting for more things:

```
        if (leader_get(attribute='{}_password'.format(rid)) and
            leader_get(attribute='{}_private-address'.format(rid)) and
            leader_get(attribute='{}_resource_params'.format(rid)) and
            leader_get(attribute='{}_resources'.format(rid))):
```

This worked once (i.e. for one new unit), but not for a second new unit.

Revision history for this message
Andrew McLeod (admcleod) wrote :

It appears that a restart is issued to rabbit and then the client connects. In the situations where it fails, the login failure occurs after the restart command has been issued but before the server has actually stopped.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Just tested the next charms on bionic-stein and can confirm that I hit the error simply by going from a single unit to 3 units (cluster_count: 3).

Revision history for this message
Nicolas Bock (nicolasbock) wrote :

Can confirm with focal/ussuri. Adding a 4th rabbitmq-server breaks nova-compute.

Revision history for this message
Andrew McLeod (admcleod) wrote :

I've tried testing this with different settings for min-cluster-size and have noticed that if I increase the config to n+1 and then add a new node, I don't have any issues with the service going down.

Similarly, it appears that if min-cluster-size=0 (or presumably unset, though I didn't confirm that), I also don't see the problem.

Andrew McLeod (admcleod)
Changed in charm-rabbitmq-server:
assignee: nobody → Andrew McLeod (admcleod)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (master)

Reviewed: https://review.opendev.org/737717
Committed: https://git.openstack.org/cgit/openstack/charm-rabbitmq-server/commit/?id=20f865c19fb1ca296dd6b574da110d9d2c1be91c
Submitter: Zuul
Branch: master

commit 20f865c19fb1ca296dd6b574da110d9d2c1be91c
Author: Liam Young <email address hidden>
Date: Wed Jun 24 09:13:28 2020 +0000

    Do not set client data if not yet clustered

    If a unit is not yet clustered then do not send information to
    the rabbit clients because the unit is not yet ready for
    connections. A fix is also needed on the client side *1 to
    gate on receiving data before adding rabbit unit to list
    of hosts.

    *1 https://github.com/juju/charm-helpers/pull/482

    Closes-Bug: #1796886
    Change-Id: Ib7554bf22356803feb2637a3db3478be575ba350
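
A minimal sketch of the client-side gate referenced above (*1): when building the list of rabbit hosts, skip any unit that has not yet sent credentials over the relation. The helper names follow charm-helpers' hookenv API, but the function itself is illustrative and is not the merged charm-helpers change:

```
# Sketch: only rabbit units that have completed setup publish a password, so
# unclustered or still-initialising units never make it into transport_url.
from charmhelpers.core.hookenv import related_units, relation_get


def ready_rabbit_hosts(relation_id):
    hosts = []
    for unit in related_units(relation_id):
        rdata = relation_get(rid=relation_id, unit=unit) or {}
        if rdata.get('password') and rdata.get('private-address'):
            hosts.append(rdata['private-address'])
    return sorted(hosts)
```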

Changed in charm-rabbitmq-server:
status: In Progress → Fix Committed
Revision history for this message
Edward Hope-Morley (hopem) wrote :

I'm afraid this is still not fixing the problem for me. I did the following:

  * deployed bionic-stein using cs:~openstack-charmers-next/rabbitmq-server-372 (which includes this patch)
  * deployed 3 units of rabbit and set cluster-count=3
  * all worked fine
  * then set cluster-count=4 and waited for all units' status to be updated to "waiting ..."
  * added one unit of rabbit
  * most of my neutron and nova agents went to error state and I got:

neutron-gateway/0* blocked idle 5 10.5.0.68 Services not running that should be: neutron-lbaasv2-agent, neutron-l3-agent
nova-cloud-controller/0* blocked idle 6 10.5.0.9 8774/tcp,8775/tcp,8778/tcp Services not running that should be: nova-scheduler
nova-compute/0* active idle 7 10.5.0.36 Unit is ready
  neutron-openvswitch/0* blocked idle 10.5.0.36 Services not running that should be: neutron-dhcp-agent
rabbitmq-server/0 active idle 8 10.5.0.61 5672/tcp Unit is ready and clustered
rabbitmq-server/1 active idle 9 10.5.0.70 5672/tcp Unit is ready and clustered
rabbitmq-server/2* active idle 10 10.5.0.12 5672/tcp Unit is ready and clustered
rabbitmq-server/3 active idle 11 10.5.0.56 5672/tcp Unit is ready and clustered

Revision history for this message
Edward Hope-Morley (hopem) wrote :

My bad, I missed that the client-side patches have not landed yet. I will re-test with those included.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Having now tested with the updated clients I think I have found another issue. On my nova-compute host I see that the client config does not contain the address of the new unit (rabbitmq-server/4):

rabbitmq-server/0 active idle 8 10.5.0.61 5672/tcp Unit is ready and clustered
rabbitmq-server/1 active idle 9 10.5.0.70 5672/tcp Unit is ready and clustered
rabbitmq-server/2* active idle 10 10.5.0.12 5672/tcp Unit is ready and clustered
rabbitmq-server/3 active idle 11 10.5.0.56 5672/tcp Unit is ready and clustered
rabbitmq-server/4 active idle 12 10.5.0.17 5672/tcp Unit is ready and clustered

root@juju-c0abfd-1796886-7:~# grep transport_url /etc/neutron/neutron.conf | tail -n 1
transport_url = rabbit://neutron:LjxTkyxMH4MM8pZkscKGfL66LKWZK2XFT47NPCkFbnffgd8zsL9dT9yCngCcwgyj@10.5.0.12:5672,neutron:LjxTkyxMH4MM8pZkscKGfL66LKWZK2XFT47NPCkFbnffgd8zsL9dT9yCngCcwgyj@10.5.0.56:5672,neutron:LjxTkyxMH4MM8pZkscKGfL66LKWZK2XFT47NPCkFbnffgd8zsL9dT9yCngCcwgyj@10.5.0.61:5672,neutron:LjxTkyxMH4MM8pZkscKGfL66LKWZK2XFT47NPCkFbnffgd8zsL9dT9yCngCcwgyj@10.5.0.70:5672/openstack
root@juju-c0abfd-1796886-7:~# grep transport_url /etc/neutron/neutron.conf | tail -n 1| grep 10.5.0.17

Revision history for this message
Andrew McLeod (admcleod) wrote :

This review should address #22: https://review.opendev.org/#/c/738219/

Revision history for this message
Edward Hope-Morley (hopem) wrote :

To close the loop: all of the patches that landed have resolved the issue for me. As long as I set min-cluster-size to the targeted size prior to adding units, I do not hit the problem anymore.

Revision history for this message
James Page (james-page) wrote :

Due to the wide-impacting nature of the fix for this issue, it will not be backported to the stable charms but will be included in the upcoming 20.08 charm release.

Changed in charm-rabbitmq-server:
status: Fix Committed → Fix Released