L3-agent queue is processed by single worker

Bug #1494416 reported by Dmitry Mescheryakov on 2015-09-10
This bug affects 2 people
Affects              Importance   Assigned to
Mirantis OpenStack   Critical     Eugene Nikanorov
7.0.x                Critical     Eugene Nikanorov
8.0.x                Critical     Eugene Nikanorov

Bug Description

Steps to reproduce:
1. Deploy MOS with Neutron with DVR enabled
2. Restart all l3 agents at once. You can use the following one-liner on the master node for that:
fuel nodes | grep comp | awk '{ print $1; }' | xargs -I@ ssh node-@ initctl restart neutron-l3-agent

With some probability all agents will go down (as displayed by 'neutron agent-list | grep L3'). The issue does not heal by itself over time.

So far this has been reproduced only on a 200-node environment, with around 200 l3 agents running on compute nodes. It might not be reproducible on smaller environments.

======== Other symptoms

neutron-server logs on all three controllers are full of errors like this:
http://paste.openstack.org/show/455396/

Also if one executes
rabbitmqctl list_queues messages consumers name
it can be seen that queue 'q-l3-plugin' is full of messages.

======== RCA
1. The l3 agent periodically makes RPC calls to neutron-server.
2. If an agent is restarted after it has sent an RPC request but before it has received the reply, neutron-server has to send the reply to a reply queue that no longer exists. The queue disappears because a restarted l3 agent creates a new reply queue with a different name, while the old queue is removed because it has the auto-delete flag set.
3. oslo.messaging spends 60 seconds trying to send a message to a non-existent queue. After that the message is discarded.
4. After start, the l3 agent makes an initial RPC call to neutron-server. If the call goes unanswered 5 times in a row (that is, after 5 minutes), the l3 agent dies with a critical error and is respawned by systemd.
5. Assume the following situation: the 'q-l3-plugin' queue is full of 'old' RPC requests from agents that have already died. It takes one neutron-server thread at least 60 seconds to process one such message (the 60 seconds are spent trying to send the reply to a non-existent queue). In that case new requests from an l3 agent are not processed within 5 minutes, so the agent dies and restarts, having just contributed 5 more messages with an invalid reply queue to 'q-l3-plugin'. When the number of agents is greater than the number of Neutron threads processing requests, the issue never resolves by itself, because the l3 agents produce more messages than neutron-server can process. The 'q-l3-plugin' queue grows constantly.
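
The arithmetic behind step 5 can be sketched as a rough worst-case model. This is an illustration, not measured data: it assumes every queued request has a dead reply queue (so each one costs a thread the full 60 seconds) and every agent adds 5 requests per 5-minute restart cycle. The function name and its parameters are made up for this sketch; the 200 agents and 3 consumers (1 per controller) are the figures reported in this bug.

```python
def backlog_growth_per_minute(agents, workers, reply_timeout_s=60,
                              calls_per_cycle=5, cycle_s=300):
    """Net change in 'q-l3-plugin' depth per minute in the stuck state.

    Assumption: every message in the queue has a non-existent reply
    queue, so one worker finishes one message per reply_timeout_s.
    """
    # Each agent sends calls_per_cycle requests every cycle_s seconds.
    produced = agents * calls_per_cycle * 60.0 / cycle_s
    # Each worker is blocked for reply_timeout_s per message.
    consumed = workers * 60.0 / reply_timeout_s
    return produced - consumed


# 200 l3 agents against 3 consumers: the queue keeps growing.
print(backlog_growth_per_minute(agents=200, workers=3))
# As many consumers as agents: the backlog stops growing in this model.
print(backlog_growth_per_minute(agents=200, workers=200))
```

With the default values each agent produces one message per minute and each worker consumes one per minute, so in this model the backlog grows whenever there are more agents than workers, which matches the condition stated in step 5.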

Changed in mos:
assignee: nobody → MOS Oslo (mos-oslo)
Eugene Nikanorov (enikanorov) wrote :

RCA shows that only 1 process out of all configured neutron-server processes was listening to q-l3-plugin.
That is not enough for DVR environments with lots of L3 agents.

Changed in mos:
assignee: MOS Oslo (mos-oslo) → Eugene Nikanorov (enikanorov)
status: New → Confirmed
importance: Undecided → Critical
Changed in mos:
milestone: none → 7.0
tags: added: scale
Ilya Shakhat (shakhat) wrote :

Bug is renamed to be more specific and to help QA verify it easily.

summary: - Mass restarting of l3 agents brings down Neutron
+ L3-agent queue is processed by single worker
Changed in mos:
status: Confirmed → In Progress
Dmitry Mescheryakov (dmitrymex) wrote :

As Eugene pointed out, the q-l3-plugin queue overflowed because only one neutron-server process per controller consumes from it, while there are 16 worker processes per controller. The Neutron team fixed this: now all workers consume from q-l3-plugin. System CPU is now the limiting factor: if one restarts all l3 agents, it can be seen that CPU on the controllers is heavily used by rabbitmq, neutron-server and mysql, and that the current capacity is still not enough to process all requests from the l3 agents in time.

Neutron team continues investigating the issue.

Related fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Eugene Nikanorov <email address hidden>
Review: https://review.fuel-infra.org/11510

Related fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Oleg Bondarev <email address hidden>
Review: https://review.fuel-infra.org/11513

Related fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Oleg Bondarev <email address hidden>
Review: https://review.fuel-infra.org/11515

Jay Pipes (jaypipes) wrote :

Do we have an upstream bug logged somewhere for this?

Reviewed: https://review.fuel-infra.org/11510
Submitter: mos-infra-ci <>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: 74106165210627a5aa5ba329c9a3e67360d92f32
Author: Eugene Nikanorov <email address hidden>
Date: Tue Sep 15 10:22:12 2015

Fix neutron-server scalability

1) separate workers and separate topic for state reports.
2) make all rpc workers to consume from q-l3-plugin topic.

Closes-Bug: #1494416
Change-Id: I21c3953380d28c1a252a92f334c7ee6460ed96af

Changed in mos:
status: In Progress → Fix Committed

Reviewed: https://review.fuel-infra.org/11515
Submitter: mos-infra-ci <>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: dc884791dba2717a8b02da97f4c380ef3ecbda6d
Author: Oleg Bondarev <email address hidden>
Date: Tue Sep 15 11:56:02 2015

Do not update ACTIVE ports back to BUILD status

Status update (ACTIVE-BUILD-ACTIVE) may trigger a bunch of
unneeded RPC communications between neutron server and l3
dvr agents which may overload server fatally.
Updated ports will be put in PENDING_BUILD status right after
db update to distinguish real port update and cases when agents
are just restarted and syncing with server.

Related-Bug: #1493732
Related-Bug: #1494416
Change-Id: Ia65b901cb4829d00e829d0b2afbb246860bf0fe5

Reviewed: https://review.fuel-infra.org/11513
Submitter: mos-infra-ci <>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: 46867deb289e88391f1fadeef010b69a535f444a
Author: Oleg Bondarev <email address hidden>
Date: Thu Sep 17 15:05:44 2015

L3 agent: skip routers notifications if fullsync is true

In case l3 agent is about to fullsync there is no point in processing
routers_updated notifications separately.
This should decrease the (unneeded) load on neutron server at high
scale.

Closes-Bug: #1493732
Related-Bug: #1494416
Change-Id: Ic20b767f34903e9bf14f4616632af3b8698dcebb
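
The idea described in the commit message above ("skip routers notifications if fullsync is true") can be illustrated with a minimal sketch. This is not the code from the review; the class, attribute and method names below are hypothetical and only model the decision the commit message describes.

```python
class AgentSketch:
    """Toy model of an l3 agent's notification handling (hypothetical)."""

    def __init__(self):
        # Assumption: a freshly (re)started agent still has to do a full sync.
        self.fullsync = True
        # Router ids queued for individual processing.
        self.pending = []

    def routers_updated(self, router_ids):
        if self.fullsync:
            # The upcoming full sync will fetch these routers anyway,
            # so handling them one by one would only add server load.
            return
        self.pending.extend(router_ids)

    def full_sync_done(self):
        self.fullsync = False


agent = AgentSketch()
agent.routers_updated(["router-1"])   # ignored, full sync is pending
agent.full_sync_done()
agent.routers_updated(["router-2"])   # queued for processing
print(agent.pending)
```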

Change abandoned by Eugene Nikanorov <email address hidden> on branch: openstack-ci/fuel-7.0/2015.1.0
Review: https://review.fuel-infra.org/11431

Ilya Shakhat (shakhat) wrote :

Steps to verify:
1. Check the number of consumers for queue q-l3-plugin.
Before the fix, the number of consumers was 3 (1 per controller); after the fix, it should be the same as for q-plugin.
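
The check above could be scripted roughly as follows. The sketch parses the text printed by 'rabbitmqctl list_queues messages consumers name' (the command quoted in the bug description) and compares consumer counts. The helper names and both sample outputs are invented for illustration; the message counts in particular are not real data, and the figure of 48 consumers assumes 16 workers on each of 3 controllers.

```python
def parse_list_queues(output):
    """Map queue name -> (messages, consumers) from rabbitmqctl output."""
    queues = {}
    for line in output.splitlines():
        parts = line.split()
        # Skip banner lines such as "Listing queues ..." that are not data rows.
        if len(parts) != 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        queues[parts[2]] = (int(parts[0]), int(parts[1]))
    return queues


def consumers_match(output, queue="q-l3-plugin", reference="q-plugin"):
    """True if both queues have the same number of consumers."""
    queues = parse_list_queues(output)
    return queues[queue][1] == queues[reference][1]


# Hypothetical samples: columns are messages, consumers, name.
SAMPLE_BEFORE = "Listing queues ...\n1200\t3\tq-l3-plugin\n0\t48\tq-plugin\n"
SAMPLE_AFTER = "Listing queues ...\n0\t48\tq-l3-plugin\n0\t48\tq-plugin\n"

print(consumers_match(SAMPLE_BEFORE))
print(consumers_match(SAMPLE_AFTER))
```

On a controller, the input string could be captured with subprocess.check_output(["rabbitmqctl", "list_queues", "messages", "consumers", "name"]) and decoded before it is passed to consumers_match.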

tags: added: on-verification
Ksenia Svechnikova (kdemina) wrote :

Verified on HW lab with MOS 7.0 release (#301)

Steps to reproduce:
1. Deploy MOS with Neutron with DVR enabled
2. Restart all l3 agents at once.
3. Check the agent state with 'neutron agent-list | grep L3'

tags: removed: on-verification

Fix proposed to branch: openstack-ci/fuel-8.0/liberty
Change author: Eugene Nikanorov <email address hidden>
Review: https://review.fuel-infra.org/13314

Related fix proposed to branch: openstack-ci/fuel-8.0/liberty
Change author: Oleg Bondarev <email address hidden>
Review: https://review.fuel-infra.org/13320

Related fix proposed to branch: openstack-ci/fuel-8.0/liberty
Change author: Oleg Bondarev <email address hidden>
Review: https://review.fuel-infra.org/13323

Related fix proposed to branch: openstack-ci/fuel-8.0/liberty
Change author: Eugene Nikanorov <email address hidden>
Review: https://review.fuel-infra.org/13701

Reviewed: https://review.fuel-infra.org/13701
Submitter: Pkgs Jenkins <email address hidden>
Branch: openstack-ci/fuel-8.0/liberty

Commit: d4bea2476a44050da2662f1258fb5b0f82c3852b
Author: Eugene Nikanorov <email address hidden>
Date: Mon Nov 9 14:23:35 2015

Use separate queue for agent state reports.

This optimization is needed for big clusters with hundreds
of agents where the spike of activity may trigger a burst
of RPC requests that would prevent neutron-server from processing
agent heart beats in time, triggering resource rescheduling.

This will be further optimized by running dedicated RPC workers
for state reports processing.

Cherry-picked from https://review.openstack.org/#/c/226362/
Related-Bug: #1494416
Related-Bug: #1496410
Change-Id: Id86a1f962aaa4f64011d57ae55d240f890cca4f7
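
The motivation given in the commit message above can be shown with a toy model: when state reports share a queue with a burst of other RPC requests, a report waits behind the whole burst, while a dedicated queue lets it be picked up immediately. This is only a sketch of the queueing argument; the function name and message labels are invented, and it does not reproduce how neutron-server or oslo.messaging is actually wired.

```python
from collections import deque


def heartbeat_wait(burst_size, separate_queue):
    """Number of messages queued ahead of a newly arrived state report."""
    # A burst of ordinary RPC requests already sits in the shared queue.
    shared = deque("sync_routers" for _ in range(burst_size))
    # Dedicated queue for state reports (empty before the report arrives).
    reports = deque()
    target = reports if separate_queue else shared
    target.append("report_state")
    return list(target).index("report_state")


print(heartbeat_wait(burst_size=500, separate_queue=False))
print(heartbeat_wait(burst_size=500, separate_queue=True))
```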

Reviewed: https://review.fuel-infra.org/13743
Submitter: Pkgs Jenkins <email address hidden>
Branch: openstack-ci/fuel-8.0/liberty

Commit: c61d1789fbe41229e3206f8824fa4643af3b6d13
Author: Eugene Nikanorov <email address hidden>
Date: Mon Nov 9 14:16:49 2015

Consume service plugins queues in RPC workers.

This patch adds all RPC workers to consumers of service
plugins queues such as metering and l3-plugin.
This is important for DVR-enabled deployments with hundreds
of agents.

Cherry-picked from: https://review.openstack.org/#/c/226686/
Change-Id: I6fea7f409c91b25d2c35b038d6100fdfa85d1905
Closes-Bug: #1498844
Closes-Bug: #1494416

tags: added: on-verification

Fix released

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "427"

Change abandoned by Oleg Bondarev <email address hidden> on branch: openstack-ci/fuel-8.0/liberty
Review: https://review.fuel-infra.org/13323
Reason: Upstream patch was abandoned
