neutron

report_interval too frequent; causing load on the service and failing CPU-intensive operations

Bug #1293083 reported by Assaf Muller on 2014-03-16

This bug affects 1 person

Affects   Status         Importance   Assigned to    Milestone
neutron   Fix Released   Medium       Assaf Muller   neutron 2014.1 "icehouse"
Havana    Fix Released   Undecided    Unassigned     neutron 2013.2.4

Bug Description

report_interval is how often an agent sends out a heartbeat to the service. The Neutron service responds to these 'report_state' RPC messages by updating the agent's heartbeat DB record. The last heartbeat is then compared to the configured agent_down_time to determine if the agent is up or down. The agent's status is used when scheduling networks on DHCP and L3 agents.

The defaults are 4 seconds for report_interval and 9 seconds for agent_down_time.
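A minimal Python sketch of the up/down check described above, assuming the service simply compares the time since an agent's last heartbeat against agent_down_time. The helper name is_agent_down and the standalone layout are illustrative only, not Neutron's actual code:

```python
from datetime import datetime, timedelta

# Pre-fix default from this report, in seconds.
AGENT_DOWN_TIME = 9


def is_agent_down(last_heartbeat, now, agent_down_time=AGENT_DOWN_TIME):
    """Hypothetical helper: an agent counts as down when its last
    heartbeat is older than agent_down_time seconds."""
    return now - last_heartbeat > timedelta(seconds=agent_down_time)


if __name__ == "__main__":
    now = datetime(2014, 3, 16, 13, 0, 0)
    # Heartbeat one report_interval (4 s) ago: still up.
    print(is_agent_down(now - timedelta(seconds=4), now))
    # Heartbeat 10 s ago: older than agent_down_time, so down.
    print(is_agent_down(now - timedelta(seconds=10), now))
```

Every report_state message refreshes the heartbeat record, so keeping this check passing costs the service one DB update per agent per report_interval.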

On a setup with 18 agents (15 layer 2 agents plus one L3, one DHCP and one metadata agent) running on 16 nodes, with the Neutron service on a dedicated, powerful machine, the idle service used 20% CPU. Changing report_interval to 28 seconds and agent_down_time to 60 seconds reduced CPU usage to 1% and allowed bulk operations on a larger scale (in this case, creating 30 instances at the same time with 60 ports). With the original values the operation failed (the instances did not get IP addresses); with the new values we were able to boot 60 instances successfully. Side note: this flow will work better once the Nova-Neutron race is resolved, but that is orthogonal to this proposal.
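To put these numbers in perspective, here is a rough count of heartbeat messages, assuming every agent sends exactly one report_state per report_interval. This counts messages only; the report does not give the per-message CPU cost, so it does not by itself predict the drop from 20% to 1%. The function name is hypothetical:

```python
def heartbeats_per_minute(num_agents, report_interval):
    """report_state messages per minute reaching the service, assuming
    each agent reports once every report_interval seconds."""
    return num_agents * 60.0 / report_interval


if __name__ == "__main__":
    agents = 18  # the setup described in this report
    for interval in (4, 28, 30):  # old default, tested value, new default
        rate = heartbeats_per_minute(agents, interval)
        print(f"report_interval={interval:>2}s -> {rate:.1f} heartbeats/min")
```

For the 18-agent setup this gives 270 heartbeats per minute at the old 4 second default, roughly 39 at 28 seconds (a 7x reduction) and 36 at the new 30 second default.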

Assaf Muller (amuller) on 2014-03-16

Changed in neutron:
assignee:	nobody → Assaf Muller (amuller)

OpenStack Infra (hudson-openstack) wrote on 2014-03-16: Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/80829

Changed in neutron:
status:	New → In Progress

Robert Kukura (rkukura) on 2014-03-16

Changed in neutron:
importance:	Undecided → Medium
milestone:	none → icehouse-rc1

OpenStack Infra (hudson-openstack) wrote on 2014-03-20: Fix merged to neutron (master)

Reviewed: https://review.openstack.org/80829
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e13d19cab384a9f5f8a00436ad39118f342af32c
Submitter: Jenkins
Branch: master

commit e13d19cab384a9f5f8a00436ad39118f342af32c
Author: Assaf Muller <email address hidden>
Date: Sun Mar 16 13:01:18 2014 +0200

Change report_interval from 4 to 30, agent_down_time from 9 to 75

    report_interval is how often an agent sends out a heartbeat to the
    service. The Neutron service responds to these 'report_state' RPC
    messages by updating the agent's heartbeat DB record.
    The last heartbeat is then compared to the configured
    agent_down_time to determine if the agent is up or down.
    The agent's status is used when scheduling networks on DHCP
    and L3 agents.

    In the spirit of sane defaults suited for production, these values
    should be bumped to reduce the load on the Neutron service
    dramatically, freeing up CPU time to perform intensive operations.

    DocImpact
    Closes-Bug: #1293083
    Change-Id: I77bcf8f66f74ba55513c989caead1f96c92b9832
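One way to sanity-check a report_interval/agent_down_time pair is to count how many consecutive heartbeats an agent may miss before it is marked down. A minimal sketch, assuming perfectly periodic reports with no delivery or processing delay (real deployments have both, which argues for extra margin); the helper name is hypothetical:

```python
def tolerated_missed_heartbeats(report_interval, agent_down_time):
    """Consecutive heartbeats an agent can miss and still be seen as up.

    After m misses, the gap between received heartbeats is
    (m + 1) * report_interval; the agent stays up while that gap
    does not exceed agent_down_time.
    """
    if report_interval <= 0:
        raise ValueError("report_interval must be positive")
    return max(agent_down_time // report_interval - 1, 0)


if __name__ == "__main__":
    for interval, down_time in ((4, 9), (28, 60), (30, 75)):
        misses = tolerated_missed_heartbeats(interval, down_time)
        print(f"report_interval={interval}, agent_down_time={down_time}: "
              f"{misses} missed heartbeat(s) tolerated")
```

Under these assumptions the old 4/9 defaults, the tested 28/60 values and the new 30/75 defaults each tolerate one missed heartbeat, so the change cuts heartbeat traffic without changing how many lost reports the check forgives. The trade-off is detection time: a dead agent can now take up to 75 seconds, rather than 9, to be marked down.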

Changed in neutron:
status:	In Progress → Fix Committed

Thierry Carrez (ttx) on 2014-04-01

Changed in neutron:
status:	Fix Committed → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2014-04-14: Fix proposed to neutron (stable/havana)

Fix proposed to branch: stable/havana
Review: https://review.openstack.org/87240

Openstack Gerrit (openstack-gerrit) wrote on 2014-04-16: Fix merged to neutron (stable/havana)

Reviewed: https://review.openstack.org/87240
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3569abac570f3176466b94b2c9ed9ff50d2f0b0d
Submitter: Jenkins
Branch: stable/havana

commit 3569abac570f3176466b94b2c9ed9ff50d2f0b0d
Author: Assaf Muller <email address hidden>
Date: Sun Mar 16 13:01:18 2014 +0200

Change report_interval from 4 to 30, agent_down_time from 9 to 75

    report_interval is how often an agent sends out a heartbeat to the
    service. The Neutron service responds to these 'report_state' RPC
    messages by updating the agent's heartbeat DB record.
    The last heartbeat is then compared to the configured
    agent_down_time to determine if the agent is up or down.
    The agent's status is used when scheduling networks on DHCP
    and L3 agents.

    DocImpact
    Closes-Bug: #1293083

    (cherry picked from commit e13d19cab384a9f5f8a00436ad39118f342af32c)
    Change-Id: I77bcf8f66f74ba55513c989caead1f96c92b9832
    Conflicts:
     neutron/agent/common/config.py

tags: added: in-stable-havana

Thierry Carrez (ttx) on 2014-04-17

Changed in neutron:
milestone:	icehouse-rc1 → 2014.1

Bug watches keep track of this bug in other bug trackers.