neutron-dhcp-agent memory leak on network sync failure

Bug #1969270 reported by Xiaojun Lin
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Medium
Unassigned

Bug Description

neutron version: 15.0.2 (still presents in the latest release)

I've found a very interesting memory leak issue in neutron-dhcp-agent:

When dhcp-agent tries to sync network state, it makes an rpc call to neutron-server, if there's something wrong on neutron-server's side(database access failure, for example), an error will be returned to dhcp-agent and deserialized to an RemoteError object.

The RemoteError will be added to neutron.agent.dhcp.agent.DhcpAgent.needs_resync_reasons for periodic resync. The following code in methond neutron.agent.dhcp.agent.DhcpAgent._periodic_resync_helper() handles network resync:

            if self.needs_resync_reasons:
                # be careful to avoid a race with additions to list
                # from other threads
                reasons = self.needs_resync_reasons
                self.needs_resync_reasons = collections.defaultdict(list)
                for net, r in reasons.items():
                    if not net:
                        net = "*"
                    LOG.debug("resync (%(network)s): %(reason)s",
                              {"reason": r, "network": net})
                self.sync_state(reasons.keys())

There's a trap here: since "reasons" is a defaultdict object, "reasons.keys()" will hold a reference to "reasons", thus the self.sync_state method frame will hold an indirect reference to the previous RemoteError object.

When this self.sync_state is invoked, another RemoteError will be raised since neutron-server is still malfunctioning. The RemoteError object's tracebacks has a reference to sync_state frame which still holds a reference to the previous RemoteError. So the history RemoteError will never be garbage collected.

I've generated a reference graph using objgraph, which helps to understand the reference chain. Please see the attachment.

One proposed fix is to modify self.sync_state(reasons.keys()) to self.sync_state(list(reasons.keys())) in DhcpAgent._periodic_resync_helper()

Another way is adding str(reason) to self.needs_resync_reasons instead of reason object itself, in DhcpAgent.schedule_resync()

Both of them breaks the reference chain.

Revision history for this message
Xiaojun Lin (linxiaojun) wrote :
Xiaojun Lin (linxiaojun)
description: updated
description: updated
Changed in neutron:
importance: Undecided → Medium
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Xiaojun:

That was an impressive analysis and, IMO, correct.

I think we can implement both measures:
- Store a string in the "needs_resync_reasons" dict. There is no need to waist space storing the whole exception instance.
- In "_periodic_resync_helper" we are creating a new "needs_resync_reasons" dictionary and using the older one to pass the network IDs. Back in py27 that wasn't a problem but in py36 ".keys()" is a generator that implies passing a copy of the dictionary. In any case, it is safer just to pass a new generated list.

Thanks a lot for this detailed bug description.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/838492

Changed in neutron:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/838492
Committed: https://opendev.org/openstack/neutron/commit/e3b3ec930967305e5fce314c0a4cf74151ad711c
Submitter: "Zuul (22348)"
Branch: master

commit e3b3ec930967305e5fce314c0a4cf74151ad711c
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Apr 13 23:38:04 2022 +0000

    [DHCP] Break reference chain to any Exception object when resync

    In the DHCP agent, if an exception is raised during the driver call,
    "DhcpAgent.schedule_resync" is called. Before this patch, the
    exception instance was passed instead of a string. This instance
    reference was stored in the dictionary "needs_resync_reasons" and
    used in "_periodic_resync_helper" to resync the DHCP agent
    information.

    The call to "sync_state" passed the dictionary ".keys()" method. In
    python2.7 when that was implemented, this method was creating a list
    with the dictionary keys. In python3, this method is a generator
    that holds the dictionary content.

    This patch breaks this reference chain in two points (actually only
    one is needed):
    - "sync_state" now passes a list created from the mentioned generator.
    - The dictionary "needs_resync_reasons" now stores the exception
      strings only, instead of the exception instance.

    Closes-Bug: #1969270
    Change-Id: I07e9818021283d321fc32066be7e0f8e2b81e639

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/neutron/+/838895

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/neutron/+/838896

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/neutron/+/838897

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/neutron/+/838898

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/neutron/+/838899

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/neutron/+/838901

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/838897
Committed: https://opendev.org/openstack/neutron/commit/85055d8c65f70e390af54227ace60a788121683b
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 85055d8c65f70e390af54227ace60a788121683b
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Apr 13 23:38:04 2022 +0000

    [DHCP] Break reference chain to any Exception object when resync

    In the DHCP agent, if an exception is raised during the driver call,
    "DhcpAgent.schedule_resync" is called. Before this patch, the
    exception instance was passed instead of a string. This instance
    reference was stored in the dictionary "needs_resync_reasons" and
    used in "_periodic_resync_helper" to resync the DHCP agent
    information.

    The call to "sync_state" passed the dictionary ".keys()" method. In
    python2.7 when that was implemented, this method was creating a list
    with the dictionary keys. In python3, this method is a generator
    that holds the dictionary content.

    This patch breaks this reference chain in two points (actually only
    one is needed):
    - "sync_state" now passes a list created from the mentioned generator.
    - The dictionary "needs_resync_reasons" now stores the exception
      strings only, instead of the exception instance.

    Closes-Bug: #1969270
    Change-Id: I07e9818021283d321fc32066be7e0f8e2b81e639
    (cherry picked from commit e3b3ec930967305e5fce314c0a4cf74151ad711c)

tags: added: in-stable-xena
tags: added: in-stable-wallaby
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/838896
Committed: https://opendev.org/openstack/neutron/commit/3b509e11b7d5d3d8f60266d520fdda42e35e6f7e
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 3b509e11b7d5d3d8f60266d520fdda42e35e6f7e
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Apr 13 23:38:04 2022 +0000

    [DHCP] Break reference chain to any Exception object when resync

    In the DHCP agent, if an exception is raised during the driver call,
    "DhcpAgent.schedule_resync" is called. Before this patch, the
    exception instance was passed instead of a string. This instance
    reference was stored in the dictionary "needs_resync_reasons" and
    used in "_periodic_resync_helper" to resync the DHCP agent
    information.

    The call to "sync_state" passed the dictionary ".keys()" method. In
    python2.7 when that was implemented, this method was creating a list
    with the dictionary keys. In python3, this method is a generator
    that holds the dictionary content.

    This patch breaks this reference chain in two points (actually only
    one is needed):
    - "sync_state" now passes a list created from the mentioned generator.
    - The dictionary "needs_resync_reasons" now stores the exception
      strings only, instead of the exception instance.

    Closes-Bug: #1969270
    Change-Id: I07e9818021283d321fc32066be7e0f8e2b81e639
    (cherry picked from commit e3b3ec930967305e5fce314c0a4cf74151ad711c)

tags: added: in-stable-victoria
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/838898
Committed: https://opendev.org/openstack/neutron/commit/ff711bd809b8ce9aec42fcd8e8ca266c90b331d7
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit ff711bd809b8ce9aec42fcd8e8ca266c90b331d7
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Apr 13 23:38:04 2022 +0000

    [DHCP] Break reference chain to any Exception object when resync

    In the DHCP agent, if an exception is raised during the driver call,
    "DhcpAgent.schedule_resync" is called. Before this patch, the
    exception instance was passed instead of a string. This instance
    reference was stored in the dictionary "needs_resync_reasons" and
    used in "_periodic_resync_helper" to resync the DHCP agent
    information.

    The call to "sync_state" passed the dictionary ".keys()" method. In
    python2.7 when that was implemented, this method was creating a list
    with the dictionary keys. In python3, this method is a generator
    that holds the dictionary content.

    This patch breaks this reference chain in two points (actually only
    one is needed):
    - "sync_state" now passes a list created from the mentioned generator.
    - The dictionary "needs_resync_reasons" now stores the exception
      strings only, instead of the exception instance.

    Closes-Bug: #1969270
    Change-Id: I07e9818021283d321fc32066be7e0f8e2b81e639
    (cherry picked from commit e3b3ec930967305e5fce314c0a4cf74151ad711c)

tags: added: in-stable-ussuri
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/838899
Committed: https://opendev.org/openstack/neutron/commit/28cc137134d506a75c78b5d1cc32cec01f1af793
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit 28cc137134d506a75c78b5d1cc32cec01f1af793
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Apr 13 23:38:04 2022 +0000

    [DHCP] Break reference chain to any Exception object when resync

    In the DHCP agent, if an exception is raised during the driver call,
    "DhcpAgent.schedule_resync" is called. Before this patch, the
    exception instance was passed instead of a string. This instance
    reference was stored in the dictionary "needs_resync_reasons" and
    used in "_periodic_resync_helper" to resync the DHCP agent
    information.

    The call to "sync_state" passed the dictionary ".keys()" method. In
    python2.7 when that was implemented, this method was creating a list
    with the dictionary keys. In python3, this method is a generator
    that holds the dictionary content.

    This patch breaks this reference chain in two points (actually only
    one is needed):
    - "sync_state" now passes a list created from the mentioned generator.
    - The dictionary "needs_resync_reasons" now stores the exception
      strings only, instead of the exception instance.

    Closes-Bug: #1969270
    Change-Id: I07e9818021283d321fc32066be7e0f8e2b81e639
    (cherry picked from commit e3b3ec930967305e5fce314c0a4cf74151ad711c)

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/train)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/838901
Committed: https://opendev.org/openstack/neutron/commit/a5e2c1540b407685f51c15b494002b0991775616
Submitter: "Zuul (22348)"
Branch: stable/train

commit a5e2c1540b407685f51c15b494002b0991775616
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Apr 13 23:38:04 2022 +0000

    [DHCP] Break reference chain to any Exception object when resync

    In the DHCP agent, if an exception is raised during the driver call,
    "DhcpAgent.schedule_resync" is called. Before this patch, the
    exception instance was passed instead of a string. This instance
    reference was stored in the dictionary "needs_resync_reasons" and
    used in "_periodic_resync_helper" to resync the DHCP agent
    information.

    The call to "sync_state" passed the dictionary ".keys()" method. In
    python2.7 when that was implemented, this method was creating a list
    with the dictionary keys. In python3, this method is a generator
    that holds the dictionary content.

    This patch breaks this reference chain in two points (actually only
    one is needed):
    - "sync_state" now passes a list created from the mentioned generator.
    - The dictionary "needs_resync_reasons" now stores the exception
      strings only, instead of the exception instance.

    Closes-Bug: #1969270
    Change-Id: I07e9818021283d321fc32066be7e0f8e2b81e639
    (cherry picked from commit e3b3ec930967305e5fce314c0a4cf74151ad711c)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/838895
Committed: https://opendev.org/openstack/neutron/commit/17c99763626c0cd44fa22de7ed129d8c785152bf
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit 17c99763626c0cd44fa22de7ed129d8c785152bf
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Apr 13 23:38:04 2022 +0000

    [DHCP] Break reference chain to any Exception object when resync

    In the DHCP agent, if an exception is raised during the driver call,
    "DhcpAgent.schedule_resync" is called. Before this patch, the
    exception instance was passed instead of a string. This instance
    reference was stored in the dictionary "needs_resync_reasons" and
    used in "_periodic_resync_helper" to resync the DHCP agent
    information.

    The call to "sync_state" passed the dictionary ".keys()" method. In
    python2.7 when that was implemented, this method was creating a list
    with the dictionary keys. In python3, this method is a generator
    that holds the dictionary content.

    This patch breaks this reference chain in two points (actually only
    one is needed):
    - "sync_state" now passes a list created from the mentioned generator.
    - The dictionary "needs_resync_reasons" now stores the exception
      strings only, instead of the exception instance.

    Closes-Bug: #1969270
    Change-Id: I07e9818021283d321fc32066be7e0f8e2b81e639
    (cherry picked from commit e3b3ec930967305e5fce314c0a4cf74151ad711c)

tags: added: in-stable-yoga
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 18.4.0

This issue was fixed in the openstack/neutron 18.4.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 19.3.0

This issue was fixed in the openstack/neutron 19.3.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 20.1.0

This issue was fixed in the openstack/neutron 20.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 21.0.0.0rc1

This issue was fixed in the openstack/neutron 21.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron train-eol

This issue was fixed in the openstack/neutron train-eol release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron ussuri-eol

This issue was fixed in the openstack/neutron ussuri-eol release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron victoria-eom

This issue was fixed in the openstack/neutron victoria-eom release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.