dhcp agent reporting state as down during the initial sync

Bug #1650611 reported by Daniel Alvarez
This bug affects 1 person
Affects: neutron
Status: Fix Released
Importance: Low
Assigned to: Bertrand Lallau
Milestone: (none listed)

Bug Description

When the dhcp agent is started, neutron agent-list reports its state as dead until the initial sync is complete.

This can lead to unwanted alarms in monitoring systems, especially in large environments where the initial sync may take hours. During this time, systemctl shows that the agent is actually alive while neutron agent-list reports it as down.
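To illustrate why a long sync shows up as a dead agent, here is a minimal sketch of a heartbeat-age check. It assumes the server marks an agent as down once its last state report is older than some threshold; the function name, the 75-second value and the timestamps are illustrative assumptions, not taken from this report.

```python
from datetime import datetime, timedelta

# Assumed threshold, for illustration only; the value is not from this report.
AGENT_DOWN_TIME = timedelta(seconds=75)


def is_agent_down(last_heartbeat: datetime, now: datetime) -> bool:
    """Return True when the last state report is older than the threshold."""
    return now - last_heartbeat > AGENT_DOWN_TIME


# Arbitrary timestamp standing in for the single state report sent at startup.
start = datetime(2016, 12, 16, 12, 0, 0)

# Shortly after startup the agent still looks alive.
print(is_agent_down(start, start + timedelta(seconds=30)))  # False
# One hour into the initial sync, with no further reports, it looks dead.
print(is_agent_down(start, start + timedelta(hours=1)))     # True
```

Under this assumption, any initial sync that outlasts the threshold makes the agent appear down even though its process is running.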

Technical details:

If I'm right, this line [0] is the exact point where the initial sync takes place right after the first state report (with start_flag=True) is sent to the server. As it's being done in the same thread, it won't send a second state report until it's done with the sync.

Doing it in a separate thread would let the heartbeat task continue sending state reports to the server, but I don't know whether this would have any unwanted side effects.

[0] https://github.com/openstack/neutron/blob/master/neutron/agent/dhcp/agent.py#L751
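The behaviour described above can be sketched with a toy heartbeat loop. This is not the agent's code: it uses plain Python threading as a stand-in for whatever concurrency the agent actually uses, and `run_agent`, `initial_sync` and the timings are hypothetical names and values chosen for the illustration.

```python
import threading
import time


def run_agent(sync_in_thread: bool, sync_seconds: float = 0.3,
              interval: float = 0.05) -> int:
    """Count the state reports sent until a simulated initial sync is done."""
    reports = 0
    sync_done = threading.Event()

    def initial_sync() -> None:
        time.sleep(sync_seconds)  # stands in for the long-running sync
        sync_done.set()

    # First state report (the one sent with start_flag=True in the report).
    reports += 1

    if sync_in_thread:
        threading.Thread(target=initial_sync, daemon=True).start()
    else:
        initial_sync()  # blocks here, so no heartbeat can be sent meanwhile

    # Heartbeat loop: keeps reporting until the sync has completed.
    while not sync_done.is_set():
        time.sleep(interval)
        reports += 1
    return reports


blocking = run_agent(sync_in_thread=False)
threaded = run_agent(sync_in_thread=True)
print(blocking, threaded)
```

In the blocking variant only the initial report goes out before the sync ends; in the threaded variant several more are sent while the sync is still running. Whether moving the sync to another thread is safe in the real agent is exactly the open question raised above.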

Miguel Lavalle (minsel) wrote :

@Daniel,

This seems like it may be a nice enhancement. Do you have specific information from large deployers that you could share here, so we can help the drivers to prioritize this enhancement?

tags: added: l3-ipam-dhcp rfe
removed: l3-bgp
Miguel Lavalle (minsel)
Changed in neutron:
importance: Undecided → Wishlist
Daniel Alvarez (dalvarezs) wrote :

@Miguel,
With 1200 tenants and about 4400 VMs, the dhcp-agent takes ~2.5 hours to sync.

Armando Migliaccio (armando-migliaccio) wrote :

This seems like a bug to me, but we would need to triage this one. Can you share version details, please?

tags: removed: rfe
Changed in neutron:
importance: Wishlist → Low
status: New → Incomplete
Daniel Alvarez (dalvarezs) wrote :

@Armando, I agree with you. Also, it's present in master.
I was reviewing this patch [0] and I think it'll fix it (see my comments in gerrit)
Thanks, Daniel

[0] https://review.openstack.org/#/c/413010/

Miguel Angel Ajo (mangelajo) wrote :

Agreed, it's not an enhancement, but a bug fix. I've analyzed the change, talked to dalvarez, and +2'd it. I believe we should also backport this.

tags: added: mitaka-backport-potential newton-backport-potential
Changed in neutron:
assignee: nobody → Miguel Angel Ajo (mangelajo)
status: Incomplete → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/413010
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f15851b98974dc16606da195cf3ecee577cd0ef8
Submitter: Jenkins
Branch: master

commit f15851b98974dc16606da195cf3ecee577cd0ef8
Author: Bertrand Lallau <email address hidden>
Date: Tue Dec 20 10:53:41 2016 +0100

    DHCP: enhance DHCPAgent startup procedure

    During the DhcpAgent startup procedure, all of the following network
    initialization is actually performed twice:
     * killing old dnsmasq processes
     * setting up and configuring all TAP interfaces
     * building all Dnsmasq config files (lease and host files)
     * launching dnsmasq processes
    The second iteration just cleans up and redoes exactly the same work.
    This is really inefficient and dramatically increases DHCP startup
    time (to nearly twice what is needed).

    The initialization method 'sync_state' is called twice:
     * once during init_host()
     * a second time during _report_state()

    sync_state() call must stay in init_host() due to bug #1420042.

    sync_state() is always called during startup in init_host()
    and will be periodically called by periodic_resync()
    to do reconciliation.
    Hence it can safely be removed from the run() method.

    Change-Id: Id6433598d5c833d2e86be605089d42feee57c257
    Closes-bug: #1651368
    Closes-Bug: #1650611
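As a rough illustration of the change described in the commit message, the sketch below counts how often the expensive sync runs at startup with and without the call in run(). `ToyDhcpAgent` and its `sync_in_run` flag are hypothetical; only the method names sync_state(), init_host() and run() come from the commit message.

```python
class ToyDhcpAgent:
    """Hypothetical stand-in for the agent; only the call order matters."""

    def __init__(self, sync_in_run: bool) -> None:
        self.sync_in_run = sync_in_run
        self.sync_calls = 0

    def sync_state(self) -> None:
        # In the real agent this is the expensive step (dnsmasq processes,
        # TAP interfaces, config files); here it only counts invocations.
        self.sync_calls += 1

    def init_host(self) -> None:
        # The commit message states that this call must stay.
        self.sync_state()

    def run(self) -> None:
        if self.sync_in_run:
            self.sync_state()  # redundant second pass at startup

    def start(self) -> int:
        self.init_host()
        self.run()
        return self.sync_calls


print(ToyDhcpAgent(sync_in_run=True).start())   # 2
print(ToyDhcpAgent(sync_in_run=False).start())  # 1
```

With the second call removed, startup performs a single sync, which matches the commit's claim of roughly halving startup time; the periodic resync mentioned in the commit message is deliberately left out of this sketch.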

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/422519

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/newton)

Reviewed: https://review.openstack.org/422519
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ae0a31d8b0c449031432117ac33c1e5bdf9d5957
Submitter: Jenkins
Branch: stable/newton

commit ae0a31d8b0c449031432117ac33c1e5bdf9d5957
Author: Bertrand Lallau <email address hidden>
Date: Tue Dec 20 10:53:41 2016 +0100

    DHCP: enhance DHCPAgent startup procedure

    (cherry picked from commit f15851b98974dc16606da195cf3ecee577cd0ef8)

tags: added: in-stable-newton
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/423206

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/424049

Changed in neutron:
assignee: Miguel Angel Ajo (mangelajo) → Bertrand Lallau (bertrand-lallau)
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 10.0.0.0b3

This issue was fixed in the openstack/neutron 10.0.0.0b3 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 9.2.0

This issue was fixed in the openstack/neutron 9.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/mitaka)

Reviewed: https://review.openstack.org/423206
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=21963618ca7accfe0b5d3389ee15426d5032a2ac
Submitter: Jenkins
Branch: stable/mitaka

commit 21963618ca7accfe0b5d3389ee15426d5032a2ac
Author: Bertrand Lallau <email address hidden>
Date: Tue Dec 20 10:53:41 2016 +0100

    DHCP: enhance DHCPAgent startup procedure

    (cherry picked from commit f15851b98974dc16606da195cf3ecee577cd0ef8)
    (cherry picked from commit ae0a31d8b0c449031432117ac33c1e5bdf9d5957)

tags: added: in-stable-mitaka
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/liberty)

Change abandoned by Ihar Hrachyshka (<email address hidden>) on branch: stable/liberty
Review: https://review.openstack.org/424049
