kolla-ansible in-series upgrade in Antelope leaves rabbitmq in unstable state

Bug #2054844 reported by Shrishail Kariyappanavar
This bug affects 1 person
Affects        Status  Importance  Assigned to  Milestone
kolla-ansible  New     Undecided   Unassigned

Bug Description

While attempting an in-series upgrade (using kolla deploy) with a new set of images, we consistently saw rabbitmq end up in an unstable state after the deploy. The most visible impact is between neutron-server and the neutron agents: some or all agents cannot reach neutron-server, which then declares them dead.

I am trying to go from 2023.1-cad045b26-20231101 to 2023.1-95b7c30cf-20240222.
I have also hit the issue when going from 2023.1-cad045b26-20231101 to 2023.1-cad045b26-<different-date> with some local changes for glance-api.

As a workaround, I stopped all rabbitmq containers first and then started them one by one. After editing the deploy steps to use this logic, I have not seen the issue again.
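
For anyone who wants to try the same workaround by hand, here is a minimal sketch of the ordering described above: stop every rabbitmq container, then start them one at a time and wait for each node before moving on. It only builds the list of commands and does not run anything. The helper name `restart_plan`, the use of ssh and docker, and the container name `rabbitmq` are assumptions for illustration. The start order (reverse of the stop order, so the last node stopped comes up first) is a choice of this sketch and is not something specified in the report.

```python
from typing import List


def restart_plan(hosts: List[str], container: str = "rabbitmq") -> List[str]:
    """Return shell commands that stop every broker, then start them one at a time.

    Hypothetical helper: ssh, docker and the container name are assumptions.
    """
    if not hosts:
        raise ValueError("at least one rabbitmq host is required")

    plan: List[str] = []

    # Phase 1: stop the container on every host before starting any of them.
    for host in hosts:
        plan.append(f"ssh {host} docker stop {container}")

    # Phase 2: start the nodes one by one, last stopped first (assumed order),
    # and wait for each node to finish booting before touching the next one.
    for host in reversed(hosts):
        plan.append(f"ssh {host} docker start {container}")
        plan.append(f"ssh {host} docker exec {container} rabbitmqctl await_startup")

    return plan


if __name__ == "__main__":
    # Host names taken from the cluster status in this report.
    for command in restart_plan(["sandbox-1001", "sandbox-1002", "sandbox-1003"]):
        print(command)
```

Printing the plan rather than executing it keeps the sketch safe to run anywhere; the commands can be reviewed and then pasted or wired into the deploy steps.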

Adding some details from neutron-server, neutron-l3-agent and rabbitmq.

root@5ebbf78d5d16:/# openstack network agent list
+--------------------------------------+---------------------------+----------------------------+-------------------+-------+-------+---------------------------+
| ID | Agent Type | Host | Availability Zone | Alive | State | Binary |
+--------------------------------------+---------------------------+----------------------------+-------------------+-------+-------+---------------------------+
| 09f2626a-7793-4329-a453-bb6338247a92 | Metadata agent | sandbox-1003 | None | XXX | UP | neutron-metadata-agent |
| 21650b7b-f4e5-4fc9-a7c3-f6fe082c4962 | BGP dynamic routing agent | sandbox-1003 | None | XXX | UP | neutron-bgp-dragent |
| 2a69a575-9035-4db2-bd29-641c792825d5 | Open vSwitch agent | sandbox-1002 | None | XXX | UP | neutron-openvswitch-agent |
| 41519097-2a99-480b-92d5-35aca78a0bc7 | L3 agent | sandbox-1003 | nova | XXX | UP | neutron-l3-agent |
| 49dd98fd-d123-4b60-b6f8-fa689368cf19 | NIC Switch agent | sandbox-1006 | None | XXX | UP | neutron-sriov-nic-agent |
| 6138cdd4-0972-41fa-baf7-c23442c1fff3 | Open vSwitch agent | sandbox-1006 | None | XXX | UP | neutron-openvswitch-agent |
| 638fe504-a5f1-49e0-be64-d311d7cb9749 | Metadata agent | sandbox-1002 | None | XXX | UP | neutron-metadata-agent |
| 64d0283e-aa50-4f97-8313-4987443c3d67 | L3 agent | sandbox-1001 | nova | XXX | UP | neutron-l3-agent |
| 700513c7-21dd-4ca2-8b2c-4d69195377bd | NIC Switch agent | sandbox-1004 | None | XXX | UP | neutron-sriov-nic-agent |
| 785b4321-e685-4204-9ccd-46e0d61809a6 | BGP dynamic routing agent | sandbox-1001 | None | XXX | UP | neutron-bgp-dragent |
| 84d49924-7da4-4675-861d-2ac7e5ad7a28 | NIC Switch agent | sandbox-1005 | None | XXX | UP | neutron-sriov-nic-agent |
| 8cc54104-9ef8-4b2e-be43-b4c1bf6d9d9d | Open vSwitch agent | sandbox-1001 | None | XXX | UP | neutron-openvswitch-agent |
| a8ba3583-19f9-487f-a9a3-504f3ad3aea5 | Metadata agent | sandbox-1001 | None | XXX | UP | neutron-metadata-agent |
| aa1b4184-d6e5-4913-8083-e53455f19abc | BGP dynamic routing agent | sandbox-1002 | None | XXX | UP | neutron-bgp-dragent |
| c85e6021-5bcb-496a-a4cc-4944955687c0 | DHCP agent | sandbox-1003 | nova | XXX | UP | neutron-dhcp-agent |
| d61bec18-bfe0-44a6-bd86-a154d4450c97 | DHCP agent | sandbox-1001 | nova | XXX | UP | neutron-dhcp-agent |
| d7bb5ec9-85b9-4897-a1be-d8572f9128f3 | DHCP agent | sandbox-1002 | nova | XXX | UP | neutron-dhcp-agent |
| d9dadeda-0f44-4c43-be83-8408bf75e9b4 | L3 agent | sandbox-1002 | nova | XXX | UP | neutron-l3-agent |
| dae82255-73a2-4dc7-8045-3f04047f953e | Open vSwitch agent | sandbox-1003 | None | XXX | UP | neutron-openvswitch-agent |
| f446695a-9533-4b89-97a2-6a0f367a5fbd | Open vSwitch agent | sandbox-1004 | None | XXX | UP | neutron-openvswitch-agent |
| f8765d4b-a2e3-438d-a915-3bc59f5ed3f6 | Open vSwitch agent | sandbox-1005 | None | XXX | UP | neutron-openvswitch-agent |
+--------------------------------------+---------------------------+----------------------------+-------------------+-------+-------+---------------------------+
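
To check this state quickly after a deploy, the agent list can be summarised per host. The sketch below assumes the rows come from `openstack network agent list -f json` and are keyed by the column headers shown in the table above; it accepts either a boolean `Alive` value or the table markers (`:-)` / `XXX`). The function names are hypothetical, and the sample rows are copied from the first lines of the table.

```python
from collections import Counter
from typing import Any, Dict, List


def is_alive(value: Any) -> bool:
    """Interpret the Alive column: a boolean, or the table markers ':-)' / 'XXX'."""
    if isinstance(value, bool):
        return value
    return str(value).strip() == ":-)"


def dead_agents_by_host(rows: List[Dict[str, Any]]) -> Counter:
    """Count agents per host whose Alive column says they are not alive."""
    return Counter(row["Host"] for row in rows if not is_alive(row["Alive"]))


if __name__ == "__main__":
    # Sample rows taken from the table above (only the columns used here).
    sample = [
        {"Agent Type": "Metadata agent", "Host": "sandbox-1003", "Alive": "XXX"},
        {"Agent Type": "Open vSwitch agent", "Host": "sandbox-1002", "Alive": "XXX"},
        {"Agent Type": "L3 agent", "Host": "sandbox-1003", "Alive": "XXX"},
    ]
    for host, count in sorted(dead_agents_by_host(sample).items()):
        print(f"{host}: {count} dead agent(s)")
```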

neutron-server logs:

2024-02-23 17:27:51.838 1025 WARNING neutron.db.agents_db [None req-7722a349-d54f-4b84-bfda-eeef570c0c63 - - - - - -] Agent healthcheck: found 21 dead agents out of 21:
                Type Last heartbeat host
      Metadata agent 2024-02-23 03:58:00 sandbox-1003
BGP dynamic routing agent 2024-02-23 03:58:03 sandbox-1003
  Open vSwitch agent 2024-02-23 03:57:17 sandbox-1002
            L3 agent 2024-02-23 03:57:24 sandbox-1003
    NIC Switch agent 2024-02-23 03:57:54 sandbox-1006
  Open vSwitch agent 2024-02-23 03:57:46 sandbox-1006
      Metadata agent 2024-02-23 03:58:00 sandbox-1002
            L3 agent 2024-02-23 03:57:55 sandbox-1001
    NIC Switch agent 2024-02-23 03:57:24 sandbox-1004
BGP dynamic routing agent 2024-02-23 03:58:02 sandbox-1001
    NIC Switch agent 2024-02-23 03:57:56 sandbox-1005
  Open vSwitch agent 2024-02-23 03:57:47 sandbox-1001
      Metadata agent 2024-02-23 03:58:01 sandbox-1001
BGP dynamic routing agent 2024-02-23 03:58:02 sandbox-1002
          DHCP agent 2024-02-23 03:57:45 sandbox-1003
          DHCP agent 2024-02-23 03:57:45 sandbox-1001
          DHCP agent 2024-02-23 03:57:45 sandbox-1002
            L3 agent 2024-02-23 03:57:23 sandbox-1002
  Open vSwitch agent 2024-02-23 03:57:17 sandbox-1003
  Open vSwitch agent 2024-02-23 03:57:46 sandbox-1004
  Open vSwitch agent 2024-02-23 03:57:46 sandbox-1005
2024-02-23 17:27:51.885 1025 WARNING neutron.db.agentschedulers_db [None req-53fb4a11-1c42-49b2-bd33-a15168410fff - - - - - -] Agent d61bec18-bfe0-44a6-bd86-a154d4450c97 is down. Type: DHCP agent, host: sandbox-1001, heartbeat: 2024-02-23 03:57:45
2024-02-23 17:27:51.889 1025 WARNING neutron.db.agentschedulers_db [None req-53fb4a11-1c42-49b2-bd33-a15168410fff - - - - - -] Agent d7bb5ec9-85b9-4897-a1be-d8572f9128f3 is down. Type: DHCP agent, host: sandbox-1002, heartbeat: 2024-02-23 03:57:45
2024-02-23 17:27:51.892 1025 WARNING neutron.db.agentschedulers_db [None req-53fb4a11-1c42-49b2-bd33-a15168410fff - - - - - -] Agent c85e6021-5bcb-496a-a4cc-4944955687c0 is down. Type: DHCP agent, host: sandbox-1003, heartbeat: 2024-02-23 03:57:45
2024-02-23 17:27:51.895 1025 WARNING neutron.db.agentschedulers_db [None req-53fb4a11-1c42-49b2-bd33-a15168410fff - - - - - -] No DHCP agents available, skipping rescheduling
2024-02-23 17:27:52.524 1015 INFO neutron.wsgi [None req-8c7530f6-6297-4aa9-919d-6c178a59684f 3c6d7d854110451eaafcac1a84d61ba4 bf32d504d5c7413b9bb89007389d3f1d - - default default] 169.254.101.12,127.0.0.1 "GET /v2.0/routers HTTP/1.1" status: 200 len: 1668 time: 0.1564512
2024-02-23 17:27:52.788 1015 INFO neutron.wsgi [None req-dacfaf45-d63f-40e1-8dd4-48bfe93d75bf 3c6d7d854110451eaafcac1a84d61ba4 bf32d504d5c7413b9bb89007389d3f1d - - default default] 169.254.101.12,127.0.0.1 "GET /v2.0/network-ip-availabilities HTTP/1.1" status: 200 len: 3060 time: 0.0114617
2024-02-23 17:27:52.876 1015 INFO neutron.wsgi [None req-88212439-8200-43d8-ba5f-b0be15862a38 3c6d7d854110451eaafcac1a84d61ba4 bf32d504d5c7413b9bb89007389d3f1d - - default default] 169.254.101.12,127.0.0.1 "GET /v2.0/floatingips HTTP/1.1" status: 200 len: 193 time: 0.0225327
2024-02-23 17:27:53.008 1015 INFO neutron

neutron-l3-agent logs:

2024-02-23 16:30:08.306 1055 WARNING neutron.agent.l3.agent [None req-812757eb-1f4a-46c7-9a78-9939dd5901e5 - - - - - -] l3-agent cannot contact neutron server to retrieve HA router count. Check connectivity to neutron server. Retrying... Detailed message: Timed out waiting for a reply to message ID 1d634d394abf468b8475b741bd9b27a8.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 1d634d394abf468b8475b741bd9b27a8
2024-02-23 16:40:08.309 1055 ERROR neutron_lib.rpc [None req-812757eb-1f4a-46c7-9a78-9939dd5901e5 - - - - - -] Timeout in RPC method get_host_ha_router_count. Waiting for 15 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID fad31618c5294f6a9e2f1c37d442c959
2024-02-23 16:40:22.925 1055 WARNING neutron.agent.l3.agent [None req-812757eb-1f4a-46c7-9a78-9939dd5901e5 - - - - - -] l3-agent cannot contact neutron server to retrieve HA router count. Check connectivity to neutron server. Retrying... Detailed message: Timed out waiting for a reply to message ID fad31618c5294f6a9e2f1c37d442c959.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID fad31618c5294f6a9e2f1c37d442c959
2024-02-23 16:50:22.933 1055 ERROR neutron_lib.rpc [None req-812757eb-1f4a-46c7-9a78-9939dd5901e5 - - - - - -] Timeout in RPC method get_host_ha_router_count. Waiting for 9 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 9fbb1e107e56464883a7e32dae311967
2024-02-23 16:50:32.349 1055 WARNING neutron.agent.l3.agent [None req-812757eb-1f4a-46c7-9a78-9939dd5901e5 - - - - - -] l3-agent cannot contact neutron server to retrieve HA router count. Check connectivity to neutron server. Retrying... Detailed message: Timed out waiting for a reply to message ID 9fbb1e107e56464883a7e32dae311967.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 9fbb1e107e56464883a7e32dae311967
2024-02-23 17:00:32.353 1055 ERROR neutron_lib.rpc [None req-812757eb-1f4a-46c7-9a78-9939dd5901e5 - - - - - -] Timeout in RPC method get_host_ha_router_count. Waiting for 28 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 38d94fd2657b479b8980aafca6813de7
2024-02-23 17:01:00.725 1055 WARNING neutron.agent.l3.agent [None req-812757eb-1f4a-46c7-9a78-9939dd5901e5 - - - - - -] l3-agent cannot contact neutron server to retrieve HA router count. Check connectivity to neutron server. Retrying... Detailed message: Timed out waiting for a reply to message ID 38d94fd2657b479b8980aafca6813de7.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 38d94fd2657b479b8980aafca6813de7
2024-02-23 17:11:00.732 1055 ERROR neutron_lib.rpc [None req-812757eb-1f4a-46c7-9a78-9939dd5901e5 - - - - - -] Timeout in RPC method get_host_ha_router_count. Waiting for 60 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 38deafbd33254872bc2d955a01cce2c9
2024-02-23 17:12:00.690 1055 WARNING neutron.agent.l3.agent [None req-812757eb-1f4a-46c7-9a78-9939dd5901e5 - - - - - -] l3-agent cannot contact neutron server to retrieve HA router count. Check connectivity to neutron server. Retrying... Detailed message: Timed out waiting for a reply to message ID 38deafbd33254872bc2d955a01cce2c9.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 38deafbd33254872bc2d955a01cce2c9
2024-02-23 17:22:00.694 1055 ERROR neutron_lib.rpc [None req-812757eb-1f4a-46c7-9a78-9939dd5901e5 - - - - - -] Timeout in RPC method get_host_ha_router_count. Waiting for 58 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 8d206bc20222443c81525fdd93d276eb
2024-02-23 17:22:58.324 1055 WARNING neutron.agent.l3.agent [None req-812757eb-1f4a-46c7-9a78-9939dd5901e5 - - - - - -] l3-agent cannot contact neutron server to retrieve HA router count. Check connectivity to neutron server. Retrying... Detailed message: Timed out waiting for a reply to message ID 8d206bc20222443c81525fdd93d276eb.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 8d206bc20222443c81525fdd93d276eb

rabbitmq cluster status:

(rabbitmq)[rabbitmq@sandbox-1001 /]$ rabbitmqctl cluster_status
Cluster status of node rabbit@sandbox-1001 ...
Basics

Cluster name: rabbit@sandbox-1001
Total CPU cores available cluster-wide: 30

Disk Nodes

rabbit@sandbox-1001
rabbit@sandbox-1002
rabbit@sandbox-1003

Running Nodes

rabbit@sandbox-1001
rabbit@sandbox-1002
rabbit@sandbox-1003

Versions

rabbit@sandbox-1001: RabbitMQ 3.11.28 on Erlang 25.3.2.9
rabbit@sandbox-1002: RabbitMQ 3.11.28 on Erlang 25.3.2.9
rabbit@sandbox-1003: RabbitMQ 3.11.28 on Erlang 25.3.2.9

CPU Cores

Node: rabbit@sandbox-1001, available CPU cores: 10
Node: rabbit@sandbox-1002, available CPU cores: 10
Node: rabbit@sandbox-1003, available CPU cores: 10

Maintenance status

Node: rabbit@sandbox-1001, status: not under maintenance
Node: rabbit@sandbox-1002, status: not under maintenance
Node: rabbit@sandbox-1003, status: not under maintenance

Alarms

(none)

Network Partitions

(none)

Listeners

Node: rabbit@sandbox-1001, interface: 169.254.101.11, port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@sandbox-1001, interface: 169.254.101.11, port: 15692, protocol: http/prometheus, purpose: Prometheus exporter API over HTTP
Node: rabbit@sandbox-1001, interface: 169.254.101.11, port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@sandbox-1001, interface: 169.254.101.11, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@sandbox-1002, interface: 169.254.101.12, port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@sandbox-1002, interface: 169.254.101.12, port: 15692, protocol: http/prometheus, purpose: Prometheus exporter API over HTTP
Node: rabbit@sandbox-1002, interface: 169.254.101.12, port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@sandbox-1002, interface: 169.254.101.12, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@sandbox-1003, interface: 169.254.101.13, port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@sandbox-1003, interface: 169.254.101.13, port: 15692, protocol: http/prometheus, purpose: Prometheus exporter API over HTTP
Node: rabbit@sandbox-1003, interface: 169.254.101.13, port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@sandbox-1003, interface: 169.254.101.13, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0

Feature flags

Flag: classic_mirrored_queue_version, state: enabled
Flag: classic_queue_type_delivery_support, state: enabled
Flag: direct_exchange_routing_v2, state: enabled
Flag: drop_unroutable_metric, state: enabled
Flag: empty_basic_get_metric, state: enabled
Flag: feature_flags_v2, state: enabled
Flag: implicit_default_bindings, state: enabled
Flag: listener_records_in_ets, state: enabled
Flag: maintenance_mode_status, state: enabled
Flag: quorum_queue, state: enabled
Flag: stream_queue, state: enabled
Flag: stream_sac_coordinator_unblock_group, state: enabled
Flag: stream_single_active_consumer, state: enabled
Flag: tracking_records_in_ets, state: enabled
Flag: user_limits, state: enabled
Flag: virtual_host_metadata, state: enabled
