report_interval too frequent; Causing load on service, failing high CPU usage operations
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | Fix Released | Medium | Assaf Muller |
Havana | Fix Released | Undecided | Unassigned |
Bug Description
report_interval is how often an agent sends out a heartbeat to the service. The Neutron service responds to these 'report_state' RPC messages by updating the agent's heartbeat DB record. The last heartbeat is then compared to the configured agent_down_time to determine if the agent is up or down. The agent's status is used when scheduling networks on DHCP and L3 agents.
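The up/down decision described above comes down to a timestamp comparison. The following is a minimal sketch of that check, not Neutron's actual code: the function name `is_agent_down` and its parameters are hypothetical and chosen for illustration only.

```python
from datetime import datetime, timedelta


def is_agent_down(last_heartbeat, agent_down_time, now):
    """Return True if the last heartbeat is older than agent_down_time seconds.

    Hypothetical helper: last_heartbeat and now are datetime objects,
    agent_down_time is a number of seconds.
    """
    return now - last_heartbeat > timedelta(seconds=agent_down_time)


# Fixed reference time so the example is deterministic.
now = datetime(2014, 3, 1, 12, 0, 0)

# Heartbeat received 4 seconds ago, agent_down_time of 9: agent is up.
print(is_agent_down(now - timedelta(seconds=4), 9, now))   # False

# Heartbeat received 10 seconds ago, agent_down_time of 9: agent is down.
print(is_agent_down(now - timedelta(seconds=10), 9, now))  # True
```

Every heartbeat refreshes `last_heartbeat` with a DB write, so a shorter report_interval means more writes on the service side.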
The defaults are 4 seconds for report_interval and 9 seconds for agent_down_time.
On a setup with 18 agents (15 layer 2, L3, DHCP, metadata) on 16 nodes, and a Neutron service on a dedicated, powerful machine, the service sat at 20% CPU usage while idle. Changing report_interval to 28 seconds and agent_down_time to 60 seconds reduced CPU usage to 1% and allowed bulk operations on a larger scale (in this case, creating 30 instances at the same time with 60 ports). With the original values the operation failed (the instances did not get IP addresses), and with the new values we were able to boot 60 instances successfully. Side note: this flow will work better once the Nova-Neutron race is resolved, but that's orthogonal to this proposal.
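To get a rough sense of the heartbeat traffic in the setup above, the arithmetic can be sketched as follows. This is a back-of-the-envelope estimate only: the helper names are hypothetical, it counts report_state messages rather than measuring CPU cost, and it assumes every agent reports exactly once per report_interval.

```python
def reports_per_second(num_agents, report_interval):
    """Average number of report_state messages the service receives per second."""
    return num_agents / report_interval


def down_time_ratio(agent_down_time, report_interval):
    """How many report intervals fit into agent_down_time."""
    return agent_down_time / report_interval


NUM_AGENTS = 18  # agent count taken from the setup described in this report

# Original values: report_interval=4, agent_down_time=9
print(f"{reports_per_second(NUM_AGENTS, 4):.2f} reports/s")   # 4.50 reports/s
print(f"ratio {down_time_ratio(9, 4):.2f}")                   # ratio 2.25

# Tuned values: report_interval=28, agent_down_time=60
print(f"{reports_per_second(NUM_AGENTS, 28):.2f} reports/s")  # 0.64 reports/s
print(f"ratio {down_time_ratio(60, 28):.2f}")                 # ratio 2.14
```

Under these assumptions the tuned values cut the heartbeat rate by a factor of seven while keeping agent_down_time at slightly more than two report intervals, as with the original values.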
Changed in neutron:
assignee: nobody → Assaf Muller (amuller)
Changed in neutron:
importance: Undecided → Medium
milestone: none → icehouse-rc1
Changed in neutron:
status: Fix Committed → Fix Released
Changed in neutron:
milestone: icehouse-rc1 → 2014.1
Fix proposed to branch: master
Review: https://review.openstack.org/80829