The machine running the DHCP agent has very high CPU load when the DHCP agent is started after being down for more than 150 seconds

Bug #1766812 reported by Jiaping LI
This bug affects 3 people
Affects: neutron
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

This issue can be reproduced with the following steps:

OpenStack Ocata, CentOS 7.2

1. Two DHCP agent nodes.
2. On the neutron-server side, set allow_automatic_dhcp_failover = True and dhcp_agents_per_network = 2.
3. Create many networks, each with one subnet; I created 200. The more networks there are, the higher the CPU load on the DHCP agent node and the longer it stays high.
4. Stop one DHCP agent and wait more than 150 s (agent_down_time * 2). It is best to check how the networks are distributed across the two DHCP agent nodes: neutron-server removes the networks from the dead agent after 150 s, so wait until all of its networks have been removed in the database. With 200 networks, do the next step after more than 5 minutes. (A short timing sketch follows these steps.)
5. Start the DHCP agent from step 4 again and watch CPU usage with top; after a while you will see very high CPU load.
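
The snippet below is only a rough sketch of the timing involved; the 75-second agent_down_time is an assumed default here, so check [DEFAULT] agent_down_time in your neutron.conf.

    # Rough sketch (Python): when neutron-server starts rescheduling networks
    # away from a dead DHCP agent. The 75 s value is an assumed default;
    # adjust it to your deployment.
    agent_down_time = 75                        # [DEFAULT] agent_down_time, seconds
    reschedule_threshold = agent_down_time * 2  # ~150 s before rescheduling starts
    # Removal from the dead agent is gradual, so with many networks the DB
    # cleanup can take several more minutes after the threshold is reached.
    print("wait at least %d seconds, then check the DB" % reschedule_threshold)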

If you have the RabbitMQ web UI, after step 5 you can see that the DHCP agent starts syncing its networks before its RPC consumer has been created. Neutron-server sees that the agent is active again and reschedules networks to it, so messages pile up on the agent's queue. Once the agent finishes syncing, the consumer is created and the messages are consumed but not yet processed. When the agent finally works through the piled-up messages, the CPU load of the DHCP agent node climbs higher and higher.

Revision history for this message
xiexianbin (xianbin) wrote :

Which version of OpenStack and which OS?

Jiaping LI (lijiaping)
description: updated
description: updated
Revision history for this message
Pawel Suder (pasuder) wrote :

Thank you for providing information about this issue. I would like to ask you a few more questions:

- what plugin do you use? what is the type of port?
- what is the configuration of Neutron DHCP agent? could you provide it, please?
- what is the configuration of Neutron server?
- what is the topology?
- do you have any instances in those networks?
- could you provide logs from both DHCP agents, please?
- could you provide logs from neutron server, please?

Thank you!

Revision history for this message
Jiaping LI (lijiaping) wrote :

- ml2, tenant_network_types = vxlan, mechanism_drivers = openvswitch,l2population
- see attachment
- see attachment
- two all-in-one nodes (neutron-server, DHCP agent, L3 agent)
- no VMs
- see above attachment
- see above attachment

Actually, our production environment (three controllers running neutron-server, two network nodes each running both the DHCP agent and the L3 agent, and many compute nodes) has a lot of networks, routers and VMs, and it hits the same issue even more severely. The high CPU load of the DHCP agent brings the network node down, so we have temporarily separated the DHCP agent and the L3 agent, but that consumes more equipment.

Revision history for this message
Jiaping LI (lijiaping) wrote :

I found some more information. When I perform the steps, the healthy DHCP agent also updates the ports, but it finishes them within two requests; the DHCP agent I restarted does not, it handles one port-update message per request, so it has many requests to deal with at once.
The attached log is from the other DHCP agent, on which I did not perform any operation.
The two requests: req-7357bbe2-fc84-4679-9ada-508a951bae7e
req-6b3bae67-4547-45e3-8dbc-e70d8aec6840

Revision history for this message
Jiaping LI (lijiaping) wrote :

After further testing: in an all-in-one environment (DHCP agent and controller services together), if I create 200 networks the CentOS 7.2 system has almost 1400 tasks, and the issue can be reproduced.

If I move the DHCP agent and its related services (OVS, neutron-openvswitch-agent) to a standalone machine, then with 200 networks the system only has about 600 tasks and I cannot reproduce the issue. If I keep creating networks until the system has almost 1000 tasks, the issue reproduces easily.

Revision history for this message
Pawel Suder (pasuder) wrote :

Thank you for your update.

Revision history for this message
Brian Haley (brian-haley) wrote :

Can you give the specific version of neutron you are using? I'm wondering if this is related to https://bugs.launchpad.net/bugs/1750777 - seems very similar.

Revision history for this message
Jiaping LI (lijiaping) wrote :

Hi, my branch is based on this commit:

commit b4ac177451275f1045d212df512d314f17a306f4
Merge: 8e42716 3f354c6
Author: Jenkins <email address hidden>
Date: Sun Apr 9 19:51:35 2017 +0000

    Merge "Correct the mistake in ../conf.py"

I tried the patch from https://bugs.launchpad.net/bugs/1750777, but it does not fix my issue.

When I reproduced the issue and the CPU load started to rise, I captured a top screenshot; see the attachment.

Revision history for this message
Brian Haley (brian-haley) wrote :

That change seems to imply Pike, and there are a lot of changes after that on the stable/pike branch so it's not the latest code.

Have you determined what the dhcp-agent is doing? Is it making calls to configure things and that's why the rootwrap daemon is overloaded?

I noticed in your logs there are some debug messages with your initials printing all the networks after an RPC call; are you looking at that as well?

Revision history for this message
Jiaping LI (lijiaping) wrote :

For example, I have two DHCP agents: dhcp-agent-1 and dhcp-agent-2.
When dhcp-agent-2 is stopped, after agent_down_time * 2 neutron-server slowly reschedules the networks off dhcp-agent-2, and after a while you can see in the neutron DB that dhcp-agent-2 no longer hosts any networks. During this process neutron-server sends messages (network operations and port updates) to the dhcp-agent-2 queue, and the messages pile up because dhcp-agent-2 is not running.

When dhcp-agent-2 starts again, it first does a full sync of all its networks, but its queue consumer is not created yet. At the same time neutron-server sees that dhcp-agent-2 is alive again, reschedules the networks back to it and sends more messages to the dhcp-agent-2 queue, so the messages keep piling up (in the RabbitMQ web UI, all messages in the dhcp-agent-2 queue are in the Ready state).

After the full sync finishes, the dhcp-agent-2 queue consumer is created and consumes all the Ready messages, which move to the Unacked state, because dhcp-agent-2 then performs a second sync and the agent does not process messages while syncing. Once the second sync finishes, the agent starts processing the messages, the Unacked count gradually drops, and the CPU load becomes very high.

While the DHCP agent is processing these messages, I can see that it is running the port_update_end operation. The code in neutron/agent/dhcp/agent.py:

@_wait_if_syncing
def port_update_end(self, context, payload):
    """Handle the port.update.end notification event."""

Yesterday I modified the code to process the messages through a queue, like the L3 agent does, and found that this fixes the issue!
The L3 agent uses RouterProcessingQueue to process messages. I tried using a similar queue to process messages in the DHCP agent, and it seems to work well. (A rough sketch of the pattern follows.)
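
The sketch below only illustrates the general pattern, assuming that notification handlers simply enqueue work and a single worker drains the queue; it is not the actual RouterProcessingQueue code nor the patch proposed for this bug, and all names in it are hypothetical.

    import queue
    import threading

    class UpdateQueue(object):
        """Toy processing queue: handlers enqueue updates and return at once;
        a single worker drains the queue so RPC callbacks do not pile up work."""

        def __init__(self):
            self._queue = queue.Queue()
            self._worker = threading.Thread(target=self._run, daemon=True)
            self._worker.start()

        def add(self, resource_id, payload):
            # Called from notification handlers such as port_update_end().
            self._queue.put((resource_id, payload))

        def _run(self):
            while True:
                resource_id, payload = self._queue.get()
                try:
                    self._process(resource_id, payload)
                finally:
                    self._queue.task_done()

        def _process(self, resource_id, payload):
            # Placeholder for the real work (reload allocations, restart dnsmasq, ...).
            print("processing update for %s" % resource_id)

    # Hypothetical usage from a handler:
    # updates = UpdateQueue()
    # def port_update_end(context, payload):
    #     updates.add(payload['port']['id'], payload)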

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/566817

Changed in neutron:
assignee: nobody → Jiaping LI (lijiaping)
status: New → In Progress
Revision history for this message
Slawek Kaplonski (slaweq) wrote : auto-abandon-script

This bug has had a related patch abandoned and has been automatically un-assigned due to inactivity. Please re-assign yourself if you are continuing work or adjust the state as appropriate if it is no longer valid.

Changed in neutron:
assignee: Jiaping LI (lijiaping) → nobody
status: In Progress → New
tags: added: timeout-abandon
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.openstack.org/566817
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
olmy0414 (oleksandr-mykhalskyi) wrote :

Do you have plans to finish this patch? Or can someone else pick up this problem?

We have the same issue on our OpenStack Pike cloud with CentOS 7.5.

Current count of networks with DHCP: 215.

Some time ago we had to restart a RabbitMQ node (part of the RabbitMQ cluster) on one of the cloud controllers.
After that, the neutron DHCP agent on the other controller showed its status as dead (xxx) for 10-15 minutes, then the agent started synchronizing state.
During this synchronization its child neutron-rootwrap-daemon processes hung the controller for 2-3 minutes with huge CPU sys% load, up to 90-95%.
As a result: partitions in the RabbitMQ cluster and a lot of related issues in the cloud...

Thanks

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

We see the same behavior, which makes the DHCP sync a very long process (more than 10 hours)!

Revision history for this message
Brian Haley (brian-haley) wrote :

The DHCP agent was changed to use a GreenPool of threads, and it dynamically increases their number as of commit 7369b69e2ef5b1b3c30b237885c2648c63f1dffb. This is a similar change to the one in the linked patch, so I will mark this bug fixed.
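
As a rough illustration of that approach (a sketch only, not the code from that commit; names and numbers here are assumptions): instead of handling each notification inline, the agent can spawn handlers on an eventlet GreenPool and grow the pool when a backlog builds up.

    import eventlet
    eventlet.monkey_patch()

    # Illustrative starting size, not the value neutron uses.
    _pool = eventlet.GreenPool(size=4)

    def _handle_port_update(payload):
        # Placeholder for the real per-port work.
        eventlet.sleep(0.1)

    def port_update_end(context, payload):
        # Return quickly; the work runs on a green thread from the pool.
        _pool.spawn_n(_handle_port_update, payload)

    def maybe_grow_pool(backlog):
        # Grow the pool when many notifications are waiting (illustrative policy).
        if backlog > _pool.size:
            _pool.resize(min(backlog, 64))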

Changed in neutron:
status: New → Fix Released