While attempting an in-series upgrade (using kolla deploy) with a new set of images, we consistently saw RabbitMQ end up in an unstable state after the deploy. The direct impact is generally on communication between neutron-server and the neutron agents: some or all agents cannot reach neutron-server and are therefore declared dead by it.
I am trying to go from 2023.1-cad045b26-20231101 to 2023.1-95b7c30cf-20240222.
I have also hit the issue when going from 2023.1-cad045b26-20231101 to 2023.1-cad045b26-<different-date> with some local changes for glance-api.
As a workaround, I stopped all rabbitmq containers first and then started them one by one. I edited the deploy steps to use this logic and have not seen the issue since.
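For reference, the workaround order can be sketched roughly as below. This is only an illustration, not the actual change made to the deploy steps: the host names come from the cluster status further down, while the container name "rabbitmq" and the use of ssh plus docker are assumptions.

```python
import subprocess

# Hosts taken from the cluster status in this report. The container name
# "rabbitmq" and the ssh + docker invocation are assumptions.
HOSTS = ["sandbox-1001", "sandbox-1002", "sandbox-1003"]
CONTAINER = "rabbitmq"


def restart_plan(hosts, container=CONTAINER):
    """Return the commands in order: stop on every host, then start one by one."""
    stops = [["ssh", host, "docker", "stop", container] for host in hosts]
    starts = [["ssh", host, "docker", "start", container] for host in hosts]
    return stops + starts


def run_plan(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in plan:
        print(" ".join(cmd))
        if not dry_run:
            # Blocks until the command returns, so the starts stay sequential.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_plan(restart_plan(HOSTS))  # dry run: only prints the commands
```

The default is a dry run, so the ordering can be reviewed before anything is stopped. A real script would probably also wait for each node to finish booting before starting the next one.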
Adding some details from neutron-server, neutron-l3-agent and rabbitmq.
root@5ebbf78d5d16:/# openstack network agent list
+--------------------------------------+---------------------------+----------------------------+-------------------+-------+-------+---------------------------+
| ID | Agent Type | Host | Availability Zone | Alive | State | Binary |
+--------------------------------------+---------------------------+----------------------------+-------------------+-------+-------+---------------------------+
| 09f2626a-7793-4329-a453-bb6338247a92 | Metadata agent | sandbox-1003 | None | XXX | UP | neutron-metadata-agent |
| 21650b7b-f4e5-4fc9-a7c3-f6fe082c4962 | BGP dynamic routing agent | sandbox-1003 | None | XXX | UP | neutron-bgp-dragent |
| 2a69a575-9035-4db2-bd29-641c792825d5 | Open vSwitch agent | sandbox-1002 | None | XXX | UP | neutron-openvswitch-agent |
| 41519097-2a99-480b-92d5-35aca78a0bc7 | L3 agent | sandbox-1003 | nova | XXX | UP | neutron-l3-agent |
| 49dd98fd-d123-4b60-b6f8-fa689368cf19 | NIC Switch agent | sandbox-1006 | None | XXX | UP | neutron-sriov-nic-agent |
| 6138cdd4-0972-41fa-baf7-c23442c1fff3 | Open vSwitch agent | sandbox-1006 | None | XXX | UP | neutron-openvswitch-agent |
| 638fe504-a5f1-49e0-be64-d311d7cb9749 | Metadata agent | sandbox-1002 | None | XXX | UP | neutron-metadata-agent |
| 64d0283e-aa50-4f97-8313-4987443c3d67 | L3 agent | sandbox-1001 | nova | XXX | UP | neutron-l3-agent |
| 700513c7-21dd-4ca2-8b2c-4d69195377bd | NIC Switch agent | sandbox-1004 | None | XXX | UP | neutron-sriov-nic-agent |
| 785b4321-e685-4204-9ccd-46e0d61809a6 | BGP dynamic routing agent | sandbox-1001 | None | XXX | UP | neutron-bgp-dragent |
| 84d49924-7da4-4675-861d-2ac7e5ad7a28 | NIC Switch agent | sandbox-1005 | None | XXX | UP | neutron-sriov-nic-agent |
| 8cc54104-9ef8-4b2e-be43-b4c1bf6d9d9d | Open vSwitch agent | sandbox-1001 | None | XXX | UP | neutron-openvswitch-agent |
| a8ba3583-19f9-487f-a9a3-504f3ad3aea5 | Metadata agent | sandbox-1001 | None | XXX | UP | neutron-metadata-agent |
| aa1b4184-d6e5-4913-8083-e53455f19abc | BGP dynamic routing agent | sandbox-1002 | None | XXX | UP | neutron-bgp-dragent |
| c85e6021-5bcb-496a-a4cc-4944955687c0 | DHCP agent | sandbox-1003 | nova | XXX | UP | neutron-dhcp-agent |
| d61bec18-bfe0-44a6-bd86-a154d4450c97 | DHCP agent | sandbox-1001 | nova | XXX | UP | neutron-dhcp-agent |
| d7bb5ec9-85b9-4897-a1be-d8572f9128f3 | DHCP agent | sandbox-1002 | nova | XXX | UP | neutron-dhcp-agent |
| d9dadeda-0f44-4c43-be83-8408bf75e9b4 | L3 agent | sandbox-1002 | nova | XXX | UP | neutron-l3-agent |
| dae82255-73a2-4dc7-8045-3f04047f953e | Open vSwitch agent | sandbox-1003 | None | XXX | UP | neutron-openvswitch-agent |
| f446695a-9533-4b89-97a2-6a0f367a5fbd | Open vSwitch agent | sandbox-1004 | None | XXX | UP | neutron-openvswitch-agent |
| f8765d4b-a2e3-438d-a915-3bc59f5ed3f6 | Open vSwitch agent | sandbox-1005 | None | XXX | UP | neutron-openvswitch-agent |
+--------------------------------------+---------------------------+----------------------------+-------------------+-------+-------+---------------------------+
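To check this state after a deploy without reading the table by eye, the output above can be parsed with a few lines of Python. This is a minimal sketch that works on the plain table text as pasted here; the function names are made up for illustration.

```python
SAMPLE = """\
+----+------------+------+-------------------+-------+-------+--------+
| ID | Agent Type | Host | Availability Zone | Alive | State | Binary |
+----+------------+------+-------------------+-------+-------+--------+
| 41519097-2a99-480b-92d5-35aca78a0bc7 | L3 agent | sandbox-1003 | nova | XXX | UP | neutron-l3-agent |
| c85e6021-5bcb-496a-a4cc-4944955687c0 | DHCP agent | sandbox-1003 | nova | XXX | UP | neutron-dhcp-agent |
+----+------------+------+-------------------+-------+-------+--------+
"""


def parse_agent_table(text):
    """Parse the ASCII table printed by `openstack network agent list`."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines and anything that is not a table row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first row holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def dead_agents(rows):
    """Return the rows whose Alive column shows XXX, as in the output above."""
    return [row for row in rows if row["Alive"] == "XXX"]


if __name__ == "__main__":
    for agent in dead_agents(parse_agent_table(SAMPLE)):
        print(agent["Host"], agent["Binary"])
```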
Neutron-server logs:
2024-02-23 17:27:51.838 1025 WARNING neutron.db.agents_db [None req-7722a349-d54f-4b84-bfda-eeef570c0c63 - - - - - -] Agent healthcheck: found 21 dead agents out of 21:
Type Last heartbeat host
Metadata agent 2024-02-23 03:58:00 sandbox-1003
BGP dynamic routing agent 2024-02-23 03:58:03 sandbox-1003
Open vSwitch agent 2024-02-23 03:57:17 sandbox-1002
L3 agent 2024-02-23 03:57:24 sandbox-1003
NIC Switch agent 2024-02-23 03:57:54 sandbox-1006
Open vSwitch agent 2024-02-23 03:57:46 sandbox-1006
Metadata agent 2024-02-23 03:58:00 sandbox-1002
L3 agent 2024-02-23 03:57:55 sandbox-1001
NIC Switch agent 2024-02-23 03:57:24 sandbox-1004
BGP dynamic routing agent 2024-02-23 03:58:02 sandbox-1001
NIC Switch agent 2024-02-23 03:57:56 sandbox-1005
Open vSwitch agent 2024-02-23 03:57:47 sandbox-1001
Metadata agent 2024-02-23 03:58:01 sandbox-1001
BGP dynamic routing agent 2024-02-23 03:58:02 sandbox-1002
DHCP agent 2024-02-23 03:57:45 sandbox-1003
DHCP agent 2024-02-23 03:57:45 sandbox-1001
DHCP agent 2024-02-23 03:57:45 sandbox-1002
L3 agent 2024-02-23 03:57:23 sandbox-1002
Open vSwitch agent 2024-02-23 03:57:17 sandbox-1003
Open vSwitch agent 2024-02-23 03:57:46 sandbox-1004
Open vSwitch agent 2024-02-23 03:57:46 sandbox-1005
2024-02-23 17:27:51.885 1025 WARNING neutron.db.agentschedulers_db [None req-53fb4a11-1c42-49b2-bd33-a15168410fff - - - - - -] Agent d61bec18-bfe0-44a6-bd86-a154d4450c97 is down. Type: DHCP agent, host: sandbox-1001, heartbeat: 2024-02-23 03:57:45
2024-02-23 17:27:51.889 1025 WARNING neutron.db.agentschedulers_db [None req-53fb4a11-1c42-49b2-bd33-a15168410fff - - - - - -] Agent d7bb5ec9-85b9-4897-a1be-d8572f9128f3 is down. Type: DHCP agent, host: sandbox-1002, heartbeat: 2024-02-23 03:57:45
2024-02-23 17:27:51.892 1025 WARNING neutron.db.agentschedulers_db [None req-53fb4a11-1c42-49b2-bd33-a15168410fff - - - - - -] Agent c85e6021-5bcb-496a-a4cc-4944955687c0 is down. Type: DHCP agent, host: sandbox-1003, heartbeat: 2024-02-23 03:57:45
2024-02-23 17:27:51.895 1025 WARNING neutron.db.agentschedulers_db [None req-53fb4a11-1c42-49b2-bd33-a15168410fff - - - - - -] No DHCP agents available, skipping rescheduling
2024-02-23 17:27:52.524 1015 INFO neutron.wsgi [None req-8c7530f6-6297-4aa9-919d-6c178a59684f 3c6d7d854110451eaafcac1a84d61ba4 bf32d504d5c7413b9bb89007389d3f1d - - default default] 169.254.101.12,127.0.0.1 "GET /v2.0/routers HTTP/1.1" status: 200 len: 1668 time: 0.1564512
2024-02-23 17:27:52.788 1015 INFO neutron.wsgi [None req-dacfaf45-d63f-40e1-8dd4-48bfe93d75bf 3c6d7d854110451eaafcac1a84d61ba4 bf32d504d5c7413b9bb89007389d3f1d - - default default] 169.254.101.12,127.0.0.1 "GET /v2.0/network-ip-availabilities HTTP/1.1" status: 200 len: 3060 time: 0.0114617
2024-02-23 17:27:52.876 1015 INFO neutron.wsgi [None req-88212439-8200-43d8-ba5f-b0be15862a38 3c6d7d854110451eaafcac1a84d61ba4 bf32d504d5c7413b9bb89007389d3f1d - - default default] 169.254.101.12,127.0.0.1 "GET /v2.0/floatingips HTTP/1.1" status: 200 len: 193 time: 0.0225327
2024-02-23 17:27:53.008 1015 INFO neutron
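Note that all 21 last heartbeats in the healthcheck above fall between 03:57:17 and 03:58:03, while the check itself ran at 17:27:51. A small sketch of that arithmetic, using only timestamps from the log:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def heartbeat_gap(check_time, last_heartbeat):
    """Time elapsed between an agent's last heartbeat and the healthcheck."""
    return datetime.strptime(check_time, FMT) - datetime.strptime(last_heartbeat, FMT)


if __name__ == "__main__":
    check = "2024-02-23 17:27:51"
    # Earliest and latest heartbeat in the healthcheck listing.
    print(heartbeat_gap(check, "2024-02-23 03:57:17"))  # 13:30:34
    print(heartbeat_gap(check, "2024-02-23 03:58:03"))  # 13:29:48
```

So every agent stopped reporting within the same 46-second window, roughly 13.5 hours before this log was captured, and none of them recovered on its own.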
neutron-l3-agent logs:
2024-02-23 16:30:08.306 1055 WARNING neutron.agent.l3.agent [None req-812757eb-1f4a-46c7-9a78-9939dd5901e5 - - - - - -] l3-agent cannot contact neutron server to retrieve HA router count. Check connectivity to neutron server. Retrying... Detailed message: Timed out waiting for a reply to message ID 1d634d394abf468b8475b741bd9b27a8.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 1d634d394abf468b8475b741bd9b27a8
2024-02-23 16:40:08.309 1055 ERROR neutron_lib.rpc [None req-812757eb-1f4a-46c7-9a78-9939dd5901e5 - - - - - -] Timeout in RPC method get_host_ha_router_count. Waiting for 15 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID fad31618c5294f6a9e2f1c37d442c959
2024-02-23 16:40:22.925 1055 WARNING neutron.agent.l3.agent [None req-812757eb-1f4a-46c7-9a78-9939dd5901e5 - - - - - -] l3-agent cannot contact neutron server to retrieve HA router count. Check connectivity to neutron server. Retrying... Detailed message: Timed out waiting for a reply to message ID fad31618c5294f6a9e2f1c37d442c959.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID fad31618c5294f6a9e2f1c37d442c959
2024-02-23 16:50:22.933 1055 ERROR neutron_lib.rpc [None req-812757eb-1f4a-46c7-9a78-9939dd5901e5 - - - - - -] Timeout in RPC method get_host_ha_router_count. Waiting for 9 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 9fbb1e107e56464883a7e32dae311967
2024-02-23 16:50:32.349 1055 WARNING neutron.agent.l3.agent [None req-812757eb-1f4a-46c7-9a78-9939dd5901e5 - - - - - -] l3-agent cannot contact neutron server to retrieve HA router count. Check connectivity to neutron server. Retrying... Detailed message: Timed out waiting for a reply to message ID 9fbb1e107e56464883a7e32dae311967.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 9fbb1e107e56464883a7e32dae311967
2024-02-23 17:00:32.353 1055 ERROR neutron_lib.rpc [None req-812757eb-1f4a-46c7-9a78-9939dd5901e5 - - - - - -] Timeout in RPC method get_host_ha_router_count. Waiting for 28 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 38d94fd2657b479b8980aafca6813de7
2024-02-23 17:01:00.725 1055 WARNING neutron.agent.l3.agent [None req-812757eb-1f4a-46c7-9a78-9939dd5901e5 - - - - - -] l3-agent cannot contact neutron server to retrieve HA router count. Check connectivity to neutron server. Retrying... Detailed message: Timed out waiting for a reply to message ID 38d94fd2657b479b8980aafca6813de7.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 38d94fd2657b479b8980aafca6813de7
2024-02-23 17:11:00.732 1055 ERROR neutron_lib.rpc [None req-812757eb-1f4a-46c7-9a78-9939dd5901e5 - - - - - -] Timeout in RPC method get_host_ha_router_count. Waiting for 60 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 38deafbd33254872bc2d955a01cce2c9
2024-02-23 17:12:00.690 1055 WARNING neutron.agent.l3.agent [None req-812757eb-1f4a-46c7-9a78-9939dd5901e5 - - - - - -] l3-agent cannot contact neutron server to retrieve HA router count. Check connectivity to neutron server. Retrying... Detailed message: Timed out waiting for a reply to message ID 38deafbd33254872bc2d955a01cce2c9.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 38deafbd33254872bc2d955a01cce2c9
2024-02-23 17:22:00.694 1055 ERROR neutron_lib.rpc [None req-812757eb-1f4a-46c7-9a78-9939dd5901e5 - - - - - -] Timeout in RPC method get_host_ha_router_count. Waiting for 58 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 8d206bc20222443c81525fdd93d276eb
2024-02-23 17:22:58.324 1055 WARNING neutron.agent.l3.agent [None req-812757eb-1f4a-46c7-9a78-9939dd5901e5 - - - - - -] l3-agent cannot contact neutron server to retrieve HA router count. Check connectivity to neutron server. Retrying... Detailed message: Timed out waiting for a reply to message ID 8d206bc20222443c81525fdd93d276eb.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 8d206bc20222443c81525fdd93d276eb
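The pattern in the l3-agent log appears to be: a WARNING when the agent retries, then an ERROR from neutron_lib.rpc about ten minutes later when that call times out, then the announced wait before the next WARNING. The sketch below measures the WARNING-to-ERROR interval; the sample lines are abbreviated copies of the first four log lines (the `[...]` parts are elided here, not in the real log), and the function name is made up for illustration.

```python
import re
from datetime import datetime

# Timestamp, PID and level at the start of each oslo.log line.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \d+ (WARNING|ERROR) "
)

SAMPLE = [
    "2024-02-23 16:30:08.306 1055 WARNING neutron.agent.l3.agent [...] l3-agent cannot contact neutron server",
    "2024-02-23 16:40:08.309 1055 ERROR neutron_lib.rpc [...] Timeout in RPC method get_host_ha_router_count.",
    "2024-02-23 16:40:22.925 1055 WARNING neutron.agent.l3.agent [...] l3-agent cannot contact neutron server",
    "2024-02-23 16:50:22.933 1055 ERROR neutron_lib.rpc [...] Timeout in RPC method get_host_ha_router_count.",
]


def rpc_wait_seconds(log_lines):
    """Seconds between each WARNING (retry begins) and the next ERROR (timeout)."""
    waits = []
    retry_started = None
    for line in log_lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if match.group(2) == "WARNING":
            retry_started = stamp
        elif retry_started is not None:
            waits.append(round((stamp - retry_started).total_seconds()))
            retry_started = None
    return waits


if __name__ == "__main__":
    print(rpc_wait_seconds(SAMPLE))  # [600, 600]
```

Each call waits the full 600 seconds and never gets a reply, which suggests the request or its reply is lost on the message bus rather than neutron-server simply being slow.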
rabbitmq cluster status:
(rabbitmq)[rabbitmq@sandbox-1001 /]$ rabbitmqctl cluster_status
Cluster status of node rabbit@sandbox-1001 ...
Basics
Cluster name: rabbit@sandbox-1001
Total CPU cores available cluster-wide: 30
Disk Nodes
rabbit@sandbox-1001
rabbit@sandbox-1002
rabbit@sandbox-1003
Running Nodes
rabbit@sandbox-1001
rabbit@sandbox-1002
rabbit@sandbox-1003
Versions
rabbit@sandbox-1001: RabbitMQ 3.11.28 on Erlang 25.3.2.9
rabbit@sandbox-1002: RabbitMQ 3.11.28 on Erlang 25.3.2.9
rabbit@sandbox-1003: RabbitMQ 3.11.28 on Erlang 25.3.2.9
CPU Cores
Node: rabbit@sandbox-1001, available CPU cores: 10
Node: rabbit@sandbox-1002, available CPU cores: 10
Node: rabbit@sandbox-1003, available CPU cores: 10
Maintenance status
Node: rabbit@sandbox-1001, status: not under maintenance
Node: rabbit@sandbox-1002, status: not under maintenance
Node: rabbit@sandbox-1003, status: not under maintenance
Alarms
(none)
Network Partitions
(none)
Listeners
Node: rabbit@sandbox-1001, interface: 169.254.101.11, port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@sandbox-1001, interface: 169.254.101.11, port: 15692, protocol: http/prometheus, purpose: Prometheus exporter API over HTTP
Node: rabbit@sandbox-1001, interface: 169.254.101.11, port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@sandbox-1001, interface: 169.254.101.11, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@sandbox-1002, interface: 169.254.101.12, port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@sandbox-1002, interface: 169.254.101.12, port: 15692, protocol: http/prometheus, purpose: Prometheus exporter API over HTTP
Node: rabbit@sandbox-1002, interface: 169.254.101.12, port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@sandbox-1002, interface: 169.254.101.12, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@sandbox-1003, interface: 169.254.101.13, port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@sandbox-1003, interface: 169.254.101.13, port: 15692, protocol: http/prometheus, purpose: Prometheus exporter API over HTTP
Node: rabbit@sandbox-1003, interface: 169.254.101.13, port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@sandbox-1003, interface: 169.254.101.13, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Feature flags
Flag: classic_mirrored_queue_version, state: enabled
Flag: classic_queue_type_delivery_support, state: enabled
Flag: direct_exchange_routing_v2, state: enabled
Flag: drop_unroutable_metric, state: enabled
Flag: empty_basic_get_metric, state: enabled
Flag: feature_flags_v2, state: enabled
Flag: implicit_default_bindings, state: enabled
Flag: listener_records_in_ets, state: enabled
Flag: maintenance_mode_status, state: enabled
Flag: quorum_queue, state: enabled
Flag: stream_queue, state: enabled
Flag: stream_sac_coordinator_unblock_group, state: enabled
Flag: stream_single_active_consumer, state: enabled
Flag: tracking_records_in_ets, state: enabled
Flag: user_limits, state: enabled
Flag: virtual_host_metadata, state: enabled
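Worth pointing out: the cluster status above looks healthy (all three disk nodes running, no alarms, no network partitions) even though RPC is failing. A minimal sketch of that check against the text output as pasted here; the section titles are copied from the output, the function names are made up for illustration.

```python
# Section titles exactly as they appear in the cluster_status output above.
HEADINGS = {
    "Basics", "Disk Nodes", "Running Nodes", "Versions", "CPU Cores",
    "Maintenance status", "Alarms", "Network Partitions", "Listeners",
    "Feature flags",
}

SAMPLE = """\
Disk Nodes
rabbit@sandbox-1001
rabbit@sandbox-1002
rabbit@sandbox-1003
Running Nodes
rabbit@sandbox-1001
rabbit@sandbox-1002
rabbit@sandbox-1003
Alarms
(none)
Network Partitions
(none)
"""


def parse_cluster_status(text):
    """Group the lines of `rabbitmqctl cluster_status` under their section title."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line in HEADINGS:
            current = line
            sections[current] = []
        elif current is not None and line and line != "(none)":
            sections[current].append(line)
    return sections


def looks_healthy(sections):
    """True when every disk node is running and no alarm or partition is listed."""
    disk = set(sections.get("Disk Nodes", []))
    running = set(sections.get("Running Nodes", []))
    return (
        bool(disk)
        and disk == running
        and not sections.get("Alarms")
        and not sections.get("Network Partitions")
    )


if __name__ == "__main__":
    print(looks_healthy(parse_cluster_status(SAMPLE)))
```

Since this check passes in the failing state, cluster membership alone is not a reliable signal for this issue; the agent list and the RPC timeouts are what reveal it.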