report_interval too frequent; Causing load on service, failing high CPU usage operations

Bug #1293083 reported by Assaf Muller
Affects   Status         Importance   Assigned to    Milestone
neutron   Fix Released   Medium       Assaf Muller
Havana    Fix Released   Undecided    Unassigned

Bug Description

report_interval is how often an agent sends out a heartbeat to the service. The Neutron service responds to these 'report_state' RPC messages by updating the agent's heartbeat DB record. The last heartbeat is then compared to the configured agent_down_time to determine if the agent is up or down. The agent's status is used when scheduling networks on DHCP and L3 agents.
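
The up/down decision described above can be sketched in a few lines. This is a minimal illustration, not Neutron's actual code: the function name `is_agent_down` and its arguments are hypothetical, and it assumes the service simply compares the age of the last heartbeat against agent_down_time.

```python
from datetime import datetime, timedelta


def is_agent_down(last_heartbeat, agent_down_time, now):
    """Return True when the last heartbeat is older than agent_down_time seconds.

    Hypothetical helper for illustration; not taken from the Neutron source.
    """
    return now - last_heartbeat > timedelta(seconds=agent_down_time)


# Fixed timestamps so the example is deterministic.
now = datetime(2014, 3, 16, 13, 0, 0)
recent = now - timedelta(seconds=4)   # one report_interval ago
stale = now - timedelta(seconds=10)   # older than agent_down_time = 9

print(is_agent_down(recent, 9, now))  # False
print(is_agent_down(stale, 9, now))   # True
```

Under this model, an agent that misses enough consecutive reports to exceed agent_down_time is treated as down and is skipped when networks are scheduled.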

The defaults are 4 seconds for report_interval and 9 for agent_down_time.

On a setup with 18 agents (15 layer 2, plus L3, DHCP, and metadata) on 16 nodes, and the Neutron service on a dedicated, powerful machine, the service used 20% CPU while otherwise idle. Changing report_interval to 28 seconds and agent_down_time to 60 seconds dropped CPU usage to 1% and allowed bulk operations on a larger scale (in this case, creating 30 instances at the same time with 60 ports). With the original values the operation failed (the instances did not get IP addresses); with the new values we were able to boot 60 instances successfully. Side note: this flow will work better once the Nova-Neutron race is resolved, but that is orthogonal to this proposal.
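
As a rough back-of-the-envelope check of the numbers above, the heartbeat rate seen by the service can be estimated as follows. This sketch assumes every agent sends exactly one report_state message per report_interval; the helper `heartbeats_per_second` is hypothetical, and the CPU cost of handling each message is not modeled.

```python
def heartbeats_per_second(agent_count, report_interval):
    """Average number of report_state messages the service receives per second."""
    return agent_count / report_interval


before = heartbeats_per_second(18, 4)    # original default interval
after = heartbeats_per_second(18, 28)    # interval used in the test setup

print(f"{before:.2f} -> {after:.2f} report_state messages per second")
print(f"reduction factor: {before / after:.1f}x")
```

With 18 agents this works out to 4.5 messages per second at the old interval versus roughly 0.64 at the new one, a sevenfold reduction in heartbeat traffic.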

Assaf Muller (amuller)
Changed in neutron:
assignee: nobody → Assaf Muller (amuller)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/80829

Changed in neutron:
status: New → In Progress
Robert Kukura (rkukura)
Changed in neutron:
importance: Undecided → Medium
milestone: none → icehouse-rc1
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/80829
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e13d19cab384a9f5f8a00436ad39118f342af32c
Submitter: Jenkins
Branch: master

commit e13d19cab384a9f5f8a00436ad39118f342af32c
Author: Assaf Muller <email address hidden>
Date: Sun Mar 16 13:01:18 2014 +0200

    Change report_interval from 4 to 30, agent_down_time from 9 to 75

    report_interval is how often an agent sends out a heartbeat to the
    service. The Neutron service responds to these 'report_state' RPC
    messages by updating the agent's heartbeat DB record.
    The last heartbeat is then compared to the configured
    agent_down_time to determine if the agent is up or down.
    The agent's status is used when scheduling networks on DHCP
    and L3 agents.

    In the spirit of sane defaults suited for production, these values
    should be bumped to reduce the load on the Neutron service
    dramatically, freeing up CPU time to perform intensive operations.

    DocImpact
    Closes-Bug: #1293083
    Change-Id: I77bcf8f66f74ba55513c989caead1f96c92b9832
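
For operators who set these options explicitly, the relationship between the two values can be checked with a small script. This is only a sketch: the configuration fragment is hypothetical, and placing agent_down_time under [DEFAULT] and report_interval under [agent] is an assumption about the neutron.conf layout that should be verified against the deployed release. The function `down_time_ratio` is likewise made up for illustration.

```python
import configparser

# Hypothetical neutron.conf fragment using the new defaults from the commit.
# Section placement is an assumption; check your release's sample config.
NEUTRON_CONF = """
[DEFAULT]
agent_down_time = 75

[agent]
report_interval = 30
"""


def down_time_ratio(conf_text):
    """Return agent_down_time / report_interval for the given config text."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    down_time = parser.getint("DEFAULT", "agent_down_time")
    interval = parser.getint("agent", "report_interval")
    return down_time / interval


ratio = down_time_ratio(NEUTRON_CONF)
print(ratio)  # 2.5
```

A ratio above 2 means an agent can miss two consecutive heartbeats before being marked down. The old defaults (9 and 4) give 2.25, so the new values keep a similar margin while sending far fewer reports.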

Changed in neutron:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/havana)

Fix proposed to branch: stable/havana
Review: https://review.openstack.org/87240

Openstack Gerrit (openstack-gerrit) wrote : Fix merged to neutron (stable/havana)

Reviewed: https://review.openstack.org/87240
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3569abac570f3176466b94b2c9ed9ff50d2f0b0d
Submitter: Jenkins
Branch: stable/havana

commit 3569abac570f3176466b94b2c9ed9ff50d2f0b0d
Author: Assaf Muller <email address hidden>
Date: Sun Mar 16 13:01:18 2014 +0200

    Change report_interval from 4 to 30, agent_down_time from 9 to 75

    report_interval is how often an agent sends out a heartbeat to the
    service. The Neutron service responds to these 'report_state' RPC
    messages by updating the agent's heartbeat DB record.
    The last heartbeat is then compared to the configured
    agent_down_time to determine if the agent is up or down.
    The agent's status is used when scheduling networks on DHCP
    and L3 agents.

    In the spirit of sane defaults suited for production, these values
    should be bumped to reduce the load on the Neutron service
    dramatically, freeing up CPU time to perform intensive operations.

    DocImpact
    Closes-Bug: #1293083

    (cherry picked from commit e13d19cab384a9f5f8a00436ad39118f342af32c)
    Change-Id: I77bcf8f66f74ba55513c989caead1f96c92b9832
    Conflicts:
     neutron/agent/common/config.py

tags: added: in-stable-havana
Thierry Carrez (ttx)
Changed in neutron:
milestone: icehouse-rc1 → 2014.1