Mirantis OpenStack

L3-agent queue is processed by single worker

Bug #1494416 reported by Dmitry Mescheryakov on 2015-09-10

This bug affects 2 people

Affects             Status        Importance  Assigned to       Milestone
Mirantis OpenStack  Fix Released  Critical    Eugene Nikanorov  Mirantis OpenStack 8.0
7.0.x               Fix Released  Critical    Eugene Nikanorov  Mirantis OpenStack 7.0
8.0.x               Fix Released  Critical    Eugene Nikanorov  Mirantis OpenStack 8.0

Bug Description

Steps to reproduce:
1. Deploy MOS with Neutron with DVR enabled
2. Restart all l3 agents at once. You can use the following one-liner on the master node for that:
fuel nodes | grep comp | awk '{ print $1; }' | xargs -I@ ssh node-@ initctl restart neutron-l3-agent

With some probability all agents will go down (as displayed by 'neutron agent-list | grep L3'). The issue does not heal by itself over time.

So far it has been reproduced only on a 200-node environment, with around 200 l3 agents living on compute nodes. It might not be reproducible on smaller environments.

======== Other symptoms

neutron-server logs on all three controllers are full of errors like this:
http://paste.openstack.org/show/455396/

Also if one executes
rabbitmqctl list_queues messages consumers name
it can be seen that queue 'q-l3-plugin' is full of messages.
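
To spot the backlog without reading the whole listing, the output of the command above can be filtered with a small script. This is only a sketch: it assumes the three columns come back whitespace-separated in the requested order (messages, consumers, name) and that banner lines can be skipped. The function names and the sample numbers are made up for illustration.

```python
def parse_list_queues(output):
    """Parse `rabbitmqctl list_queues messages consumers name` output.

    Returns {queue_name: (messages, consumers)}. Lines that do not look
    like "<int> <int> <name>" (banner and footer lines) are skipped.
    """
    queues = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        messages, consumers, name = fields
        if messages.isdigit() and consumers.isdigit():
            queues[name] = (int(messages), int(consumers))
    return queues


def backlogged(queues, threshold):
    """Names of queues holding more than `threshold` messages."""
    return sorted(n for n, (msgs, _) in queues.items() if msgs > threshold)


# Hypothetical sample output; the numbers are invented for illustration.
SAMPLE = """Listing queues ...
12045\t3\tq-l3-plugin
0\t48\tq-plugin
...done.
"""

queues = parse_list_queues(SAMPLE)
print(backlogged(queues, threshold=1000))  # ['q-l3-plugin']
```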

======== RCA
1. The l3 agent periodically makes RPC calls to neutron-server.
2. If an agent is restarted after it has sent an RPC request but before it has received the reply, neutron-server has to send the reply to a reply queue that no longer exists. The queue disappears because the restarted l3 agent creates a new reply queue with a different name, while the old queue is removed because it has the auto-delete flag.
3. oslo.messaging spends 60 seconds trying to send a message to a non-existent queue. After that the message is discarded.
4. After start, the l3 agent makes an initial RPC call to neutron-server. If the call gets no response 5 times in a row, then after 5 minutes the l3 agent dies with a critical error and is respawned by systemd.
5. Assume the following situation: the 'q-l3-plugin' queue is full of 'old' RPC requests from agents that have already been restarted. It takes one neutron-server thread at least 60 seconds to process one message (60 seconds are spent trying to send the reply to a non-existent queue). In that case new requests from an l3 agent are not processed within 5 minutes, so it dies and restarts, which means it has just contributed 5 more messages with an invalid reply queue to 'q-l3-plugin'. When the number of agents is greater than the number of Neutron threads processing requests, the issue never resolves by itself, because the l3 agents produce more messages than neutron-server can process. The 'q-l3-plugin' queue grows constantly.
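
The arithmetic behind step 5 can be written down as a rough worst-case model. It is a sketch, not a measurement: it assumes every queued request is stale, that each consumer handles one message at a time, and that every agent keeps dying and respawning on the 5-minute cycle described above. The function name and parameters are illustrative only.

```python
from fractions import Fraction


def net_queue_growth_per_minute(agents, workers,
                                reply_timeout_s=60,
                                calls_per_cycle=5,
                                cycle_s=300):
    """Worst-case growth of 'q-l3-plugin' in messages per minute.

    Assumes every queued request is stale, so each worker is blocked for
    reply_timeout_s per message, and every agent adds calls_per_cycle
    stale requests per cycle_s before it is respawned.
    """
    # Messages added per minute by all agents together.
    produced = Fraction(agents * calls_per_cycle * 60, cycle_s)
    # Messages drained per minute by all workers together.
    consumed = Fraction(workers * 60, reply_timeout_s)
    return produced - consumed


# 200 l3 agents, one consumer per controller on three controllers.
print(net_queue_growth_per_minute(agents=200, workers=3))  # 197
```

With the default numbers each agent produces one stale message per minute and each worker drains one per minute, so the result reduces to agents minus workers: the queue shrinks only when consumers outnumber agents.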

Dmitry Mescheryakov (dmitrymex) on 2015-09-10

Changed in mos:
assignee:	nobody → MOS Oslo (mos-oslo)

Eugene Nikanorov (enikanorov) wrote on 2015-09-10:

RCA shows that only 1 process out of all configured neutron-server processes was listening to q-l3-plugin.
That is not enough for DVR environments with lots of L3 agents.

Changed in mos:
assignee:	MOS Oslo (mos-oslo) → Eugene Nikanorov (enikanorov)
status:	New → Confirmed
importance:	Undecided → Critical

Alexander Ignatov (aignatov) on 2015-09-11

Changed in mos:
milestone:	none → 7.0

Alexander Ignatov (aignatov) on 2015-09-11

tags:

added: scale

Eugene Nikanorov (enikanorov) wrote on 2015-09-11:

https://review.fuel-infra.org/#/c/11431/

Ilya Shakhat (shakhat) wrote on 2015-09-11:

Bug is renamed to be more specific and to help QA verify it easily.

summary:

- Mass restarting of l3 agents brings down Neutron
+ L3-agent queue is processed by single worker

Alexander Ignatov (aignatov) on 2015-09-11

Changed in mos:
status:	Confirmed → In Progress

Dmitry Mescheryakov (dmitrymex) wrote on 2015-09-11:

As Eugene pointed out, the q-l3-plugin queue overflowed because only one neutron-server process per controller was consuming from it, while there are 16 worker processes per controller. The Neutron team fixed this: now all workers consume from q-l3-plugin. System CPU now becomes the limiting factor: if one restarts all l3 agents, it can be seen that CPU on the controllers is actively used by rabbitmq, neutron-server and mysql, and that the current capacity is still not enough to process all requests from the l3 agents in time.

Neutron team continues investigating the issue.

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-09-12: Related fix proposed to openstack/neutron (openstack-ci/fuel-7.0/2015.1.0)

Related fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Eugene Nikanorov <email address hidden>
Review: https://review.fuel-infra.org/11510

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-09-12:

Related fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Oleg Bondarev <email address hidden>
Review: https://review.fuel-infra.org/11513

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-09-12:

Related fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Oleg Bondarev <email address hidden>
Review: https://review.fuel-infra.org/11515

Jay Pipes (jaypipes) wrote on 2015-09-14:

Do we have an upstream bug logged somewhere for this?

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-09-17: Fix merged to openstack/neutron (openstack-ci/fuel-7.0/2015.1.0)

Reviewed: https://review.fuel-infra.org/11510
Submitter: mos-infra-ci <>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: 74106165210627a5aa5ba329c9a3e67360d92f32
Author: Eugene Nikanorov <email address hidden>
Date: Tue Sep 15 10:22:12 2015

Fix neutron-server scalability

1) separate workers and separate topic for state reports.
2) make all rpc workers to consume from q-l3-plugin topic.

Closes-Bug: #1494416
Change-Id: I21c3953380d28c1a252a92f334c7ee6460ed96af

Changed in mos:
status:	In Progress → Fix Committed

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-09-17: Related fix merged to openstack/neutron (openstack-ci/fuel-7.0/2015.1.0)

#10

Reviewed: https://review.fuel-infra.org/11515
Submitter: mos-infra-ci <>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: dc884791dba2717a8b02da97f4c380ef3ecbda6d
Author: Oleg Bondarev <email address hidden>
Date: Tue Sep 15 11:56:02 2015

Do not update ACTIVE ports back to BUILD status

Status update (ACTIVE-BUILD-ACTIVE) may trigger a bunch of
unneeded RPC communications between neutron server and l3
dvr agents which may overload server fatally.
Updated ports will be put in PENDING_BUILD status right after
db update to distinguish real port update and cases when agents
are just restarted and syncing with server.

Related-Bug: #1493732
Related-Bug: #1494416
Change-Id: Ia65b901cb4829d00e829d0b2afbb246860bf0fe5

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-09-17:

#11

Reviewed: https://review.fuel-infra.org/11513
Submitter: mos-infra-ci <>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: 46867deb289e88391f1fadeef010b69a535f444a
Author: Oleg Bondarev <email address hidden>
Date: Thu Sep 17 15:05:44 2015

L3 agent: skip routers notifications if fullsync is true

In case l3 agent is about to fullsync there is no point in processing
routers_updated notifications separately.
This should decrease the (unneeded) load on neutron server at high
scale.

Closes-Bug: #1493732
Related-Bug: #1494416
Change-Id: Ic20b767f34903e9bf14f4616632af3b8698dcebb

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-09-18: Change abandoned on openstack/neutron (openstack-ci/fuel-7.0/2015.1.0)

#12

Change abandoned by Eugene Nikanorov <email address hidden> on branch: openstack-ci/fuel-7.0/2015.1.0
Review: https://review.fuel-infra.org/11431

Ilya Shakhat (shakhat) wrote on 2015-09-18:

#13

Steps to verify:
1. Check the number of consumers for queue q-l3-plugin.
Before the fix the number of consumers was 3 (1 per controller); after the fix it should be the same as for q-plugin.
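
The comparison in this step could be scripted roughly as follows. This is a sketch under assumptions: it expects the text produced by `rabbitmqctl list_queues name consumers` (two whitespace-separated columns), the helper names are hypothetical, and the sample counts are invented (48 simply stands for 16 workers on each of 3 controllers).

```python
def consumer_counts(output):
    """Parse `rabbitmqctl list_queues name consumers` into {name: count}."""
    counts = {}
    for line in output.splitlines():
        fields = line.split()
        # Skip banner/footer lines that are not "<name> <int>".
        if len(fields) == 2 and fields[1].isdigit():
            counts[fields[0]] = int(fields[1])
    return counts


def consumers_match(counts, queue="q-l3-plugin", reference="q-plugin"):
    """True if `queue` has as many consumers as `reference`."""
    return counts[queue] == counts[reference]


# Hypothetical listings; the counts are invented for illustration.
BEFORE_FIX = "Listing queues ...\nq-l3-plugin\t3\nq-plugin\t48\n"
AFTER_FIX = "Listing queues ...\nq-l3-plugin\t48\nq-plugin\t48\n"

print(consumers_match(consumer_counts(BEFORE_FIX)))  # False
print(consumers_match(consumer_counts(AFTER_FIX)))   # True
```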

Ksenia Svechnikova (kdemina) on 2015-10-06

tags:

added: on-verification

Ksenia Svechnikova (kdemina) wrote on 2015-10-06:

#14

Verified on HW lab with MOS 7.0 release (#301)

Steps to reproduce:
1. Deploy MOS with Neutron with DVR enabled
2. Restart all l3 agents at once.
3. Check the agents with 'neutron agent-list | grep L3'

tags:

removed: on-verification

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-10-29: Fix proposed to openstack/neutron (openstack-ci/fuel-8.0/liberty)

#15

Fix proposed to branch: openstack-ci/fuel-8.0/liberty
Change author: Eugene Nikanorov <email address hidden>
Review: https://review.fuel-infra.org/13314

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-10-29: Related fix proposed to openstack/neutron (openstack-ci/fuel-8.0/liberty)

#16

Related fix proposed to branch: openstack-ci/fuel-8.0/liberty
Change author: Oleg Bondarev <email address hidden>
Review: https://review.fuel-infra.org/13320

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-10-29:

#17

Related fix proposed to branch: openstack-ci/fuel-8.0/liberty
Change author: Oleg Bondarev <email address hidden>
Review: https://review.fuel-infra.org/13323

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-11-06:

#18

Related fix proposed to branch: openstack-ci/fuel-8.0/liberty
Change author: Eugene Nikanorov <email address hidden>
Review: https://review.fuel-infra.org/13701

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-11-09: Change abandoned on openstack/neutron (openstack-ci/fuel-8.0/liberty)

#19

Change abandoned by Elena Ezhova <email address hidden> on branch: openstack-ci/fuel-8.0/liberty
Review: https://review.fuel-infra.org/13314
Reason: Replaced by https://review.fuel-infra.org/#/c/13743/ and https://review.fuel-infra.org/#/c/13701/

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-11-10: Related fix merged to openstack/neutron (openstack-ci/fuel-8.0/liberty)

#20

Reviewed: https://review.fuel-infra.org/13701
Submitter: Pkgs Jenkins <email address hidden>
Branch: openstack-ci/fuel-8.0/liberty

Commit: d4bea2476a44050da2662f1258fb5b0f82c3852b
Author: Eugene Nikanorov <email address hidden>
Date: Mon Nov 9 14:23:35 2015

Use separate queue for agent state reports.

This optimization is needed for big clusters with hundreds
of agents where the spike of activity may trigger a burst
of RPC requests that would prevent neutron-server from processing
agent heart beats in time, triggering resource rescheduling.

This will be further optimized by running dedicated RPC workers
for state reports processing.

Cherry-picked from https://review.openstack.org/#/c/226362/
Related-Bug: #1494416
Related-Bug: #1496410
Change-Id: Id86a1f962aaa4f64011d57ae55d240f890cca4f7

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-11-10: Fix merged to openstack/neutron (openstack-ci/fuel-8.0/liberty)

#21

Reviewed: https://review.fuel-infra.org/13743
Submitter: Pkgs Jenkins <email address hidden>
Branch: openstack-ci/fuel-8.0/liberty

Commit: c61d1789fbe41229e3206f8824fa4643af3b6d13
Author: Eugene Nikanorov <email address hidden>
Date: Mon Nov 9 14:16:49 2015

Consume service plugins queues in RPC workers.

This patch adds all RPC workers to consumers of service
plugins queues such as metering and l3-plugin.
This is important for DVR-enabled deployments with hundreds
of agents.

Cherry-picked from: https://review.openstack.org/#/c/226686/
Change-Id: I6fea7f409c91b25d2c35b038d6100fdfa85d1905
Closes-Bug: #1498844
Closes-Bug: #1494416

Alexander Zatserklyany (zatserklyany) on 2016-01-15

tags:

added: on-verification

Alexander Zatserklyany (zatserklyany) wrote on 2016-01-15:

#22

Fix released

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "427"

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2016-05-16: Change abandoned on openstack/neutron (openstack-ci/fuel-8.0/liberty)

#23

Change abandoned by Oleg Bondarev <email address hidden> on branch: openstack-ci/fuel-8.0/liberty
Review: https://review.fuel-infra.org/13323
Reason: Upstream patch was abandoned
