Activity log for bug #1969270

Date Who What changed Old value New value Message
2022-04-16 09:57:25 Xiaojun Lin bug added bug
2022-04-16 09:58:39 Xiaojun Lin attachment added reference graph https://bugs.launchpad.net/neutron/+bug/1969270/+attachment/5581063/+files/backref.svg
2022-04-16 10:04:34 Xiaojun Lin description

Old value:

neutron version: 15.0.2 (still present in the latest release)

I've found a very interesting memory leak in neutron-dhcp-agent:

When dhcp-agent tries to sync network state, it makes an RPC call to neutron-server. If something goes wrong on neutron-server's side (a database access failure, for example), an error is returned to dhcp-agent and deserialized into a RemoteError object. The RemoteError is added to neutron.agent.dhcp.agent.DhcpAgent.needs_resync_reasons for periodic resync.

The following code in the method neutron.agent.dhcp.agent.DhcpAgent._periodic_resync_helper handles network resync:

            if self.needs_resync_reasons:
                # be careful to avoid a race with additions to list
                # from other threads
                reasons = self.needs_resync_reasons
                self.needs_resync_reasons = collections.defaultdict(list)
                for net, r in reasons.items():
                    if not net:
                        net = "*"
                    LOG.debug("resync (%(network)s): %(reason)s",
                              {"reason": r, "network": net})
                self.sync_state(reasons.keys())

There's a trap here: since "reasons" is a defaultdict object, "reasons.keys()" still holds a reference to "reasons", so the self.sync_state method frame holds an indirect reference to the previous RemoteError object. When self.sync_state is invoked, another RemoteError is raised because neutron-server is still malfunctioning. The RemoteError object has a reference to the sync_state frame, which still holds a reference to the previous RemoteError. So the historical RemoteError objects will never be garbage collected.

I've generated a reference graph using objgraph, which helps to understand the reference chain. Please see the attachment.

One proposed fix is to change self.sync_state(reasons.keys()) to self.sync_state(list(reasons.keys())) in DhcpAgent._periodic_resync_helper.
Another way is to add str(reason) to self.needs_resync_reasons instead of the reason object itself, in DhcpAgent.schedule_resync.
Both of them break the reference chain.

New value:

neutron version: 15.0.2 (still present in the latest release)

I've found a very interesting memory leak in neutron-dhcp-agent:

When dhcp-agent tries to sync network state, it makes an RPC call to neutron-server. If something goes wrong on neutron-server's side (a database access failure, for example), an error is returned to dhcp-agent and deserialized into a RemoteError object. The RemoteError is added to neutron.agent.dhcp.agent.DhcpAgent.needs_resync_reasons for periodic resync.

The following code in the method neutron.agent.dhcp.agent.DhcpAgent._periodic_resync_helper handles network resync:

            if self.needs_resync_reasons:
                # be careful to avoid a race with additions to list
                # from other threads
                reasons = self.needs_resync_reasons
                self.needs_resync_reasons = collections.defaultdict(list)
                for net, r in reasons.items():
                    if not net:
                        net = "*"
                    LOG.debug("resync (%(network)s): %(reason)s",
                              {"reason": r, "network": net})
                self.sync_state(reasons.keys())

There's a trap here: since "reasons" is a defaultdict object, "reasons.keys()" still holds a reference to "reasons", so the self.sync_state method frame holds an indirect reference to the previous RemoteError object. When self.sync_state is invoked, another RemoteError is raised because neutron-server is still malfunctioning. The RemoteError object's traceback has a reference to the sync_state frame, which still holds a reference to the previous RemoteError. So the historical RemoteError objects will never be garbage collected.

I've generated a reference graph using objgraph, which helps to understand the reference chain. Please see the attachment.

One proposed fix is to change self.sync_state(reasons.keys()) to self.sync_state(list(reasons.keys())) in DhcpAgent._periodic_resync_helper.
Another way is to add str(reason) to self.needs_resync_reasons instead of the reason object itself, in DhcpAgent.schedule_resync.
Both of them break the reference chain.
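The claim that "reasons.keys()" keeps "reasons" alive can be checked in isolation. The following is a minimal sketch, not neutron code: the Reason class and the keys_keep_reason_alive helper are hypothetical names introduced for illustration, and the check relies on CPython's reference counting.

```python
import collections
import gc
import weakref


class Reason:
    """Hypothetical stand-in for a RemoteError stored as a resync reason."""


def keys_keep_reason_alive(copy_keys):
    """Return True if the stored reason is still alive after the dict is dropped."""
    reasons = collections.defaultdict(list)
    reason = Reason()
    reasons["net-1"].append(reason)
    probe = weakref.ref(reason)  # observes the object without keeping it alive

    # Either an independent list of the keys or the live keys view.
    keys = list(reasons.keys()) if copy_keys else reasons.keys()

    # Drop every direct reference; only `keys` may still reach the dict.
    del reasons, reason
    gc.collect()
    alive = probe() is not None
    del keys
    return alive
```

With the live view the dict, its lists, and the stored reason all stay reachable; with list(...) only the key objects are retained, which is why the first proposed fix breaks the chain.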
2022-04-16 10:07:15 Xiaojun Lin description

Old value:

neutron version: 15.0.2 (still present in the latest release)

I've found a very interesting memory leak in neutron-dhcp-agent:

When dhcp-agent tries to sync network state, it makes an RPC call to neutron-server. If something goes wrong on neutron-server's side (a database access failure, for example), an error is returned to dhcp-agent and deserialized into a RemoteError object. The RemoteError is added to neutron.agent.dhcp.agent.DhcpAgent.needs_resync_reasons for periodic resync.

The following code in the method neutron.agent.dhcp.agent.DhcpAgent._periodic_resync_helper handles network resync:

            if self.needs_resync_reasons:
                # be careful to avoid a race with additions to list
                # from other threads
                reasons = self.needs_resync_reasons
                self.needs_resync_reasons = collections.defaultdict(list)
                for net, r in reasons.items():
                    if not net:
                        net = "*"
                    LOG.debug("resync (%(network)s): %(reason)s",
                              {"reason": r, "network": net})
                self.sync_state(reasons.keys())

There's a trap here: since "reasons" is a defaultdict object, "reasons.keys()" still holds a reference to "reasons", so the self.sync_state method frame holds an indirect reference to the previous RemoteError object. When self.sync_state is invoked, another RemoteError is raised because neutron-server is still malfunctioning. The RemoteError object's traceback has a reference to the sync_state frame, which still holds a reference to the previous RemoteError. So the historical RemoteError objects will never be garbage collected.

I've generated a reference graph using objgraph, which helps to understand the reference chain. Please see the attachment.

One proposed fix is to change self.sync_state(reasons.keys()) to self.sync_state(list(reasons.keys())) in DhcpAgent._periodic_resync_helper.
Another way is to add str(reason) to self.needs_resync_reasons instead of the reason object itself, in DhcpAgent.schedule_resync.
Both of them break the reference chain.

New value:

neutron version: 15.0.2 (still present in the latest release)

I've found a very interesting memory leak in neutron-dhcp-agent:

When dhcp-agent tries to sync network state, it makes an RPC call to neutron-server. If something goes wrong on neutron-server's side (a database access failure, for example), an error is returned to dhcp-agent and deserialized into a RemoteError object. The RemoteError is added to neutron.agent.dhcp.agent.DhcpAgent.needs_resync_reasons for periodic resync.

The following code in the method neutron.agent.dhcp.agent.DhcpAgent._periodic_resync_helper() handles network resync:

            if self.needs_resync_reasons:
                # be careful to avoid a race with additions to list
                # from other threads
                reasons = self.needs_resync_reasons
                self.needs_resync_reasons = collections.defaultdict(list)
                for net, r in reasons.items():
                    if not net:
                        net = "*"
                    LOG.debug("resync (%(network)s): %(reason)s",
                              {"reason": r, "network": net})
                self.sync_state(reasons.keys())

There's a trap here: since "reasons" is a defaultdict object, "reasons.keys()" will hold a reference to "reasons", so the self.sync_state method frame will hold an indirect reference to the previous RemoteError object. When self.sync_state is invoked, another RemoteError is raised because neutron-server is still malfunctioning. The RemoteError object's traceback has a reference to the sync_state frame, which still holds a reference to the previous RemoteError.

So the historical RemoteError objects will never be garbage collected.

I've generated a reference graph using objgraph, which helps to understand the reference chain. Please see the attachment.

One proposed fix is to change self.sync_state(reasons.keys()) to self.sync_state(list(reasons.keys())) in DhcpAgent._periodic_resync_helper().
Another way is to add str(reason) to self.needs_resync_reasons instead of the reason object itself, in DhcpAgent.schedule_resync().
Both of them break the reference chain.
2022-04-18 14:46:12 Jakub Libosvar neutron: importance (old value: Undecided; new value: Medium)
2022-04-19 11:30:13 OpenStack Infra neutron: status (old value: New; new value: In Progress)
2022-04-20 17:48:58 OpenStack Infra neutron: status (old value: In Progress; new value: Fix Released)
2022-04-26 20:37:05 OpenStack Infra tags (old value: empty; new value: in-stable-xena)
2022-04-26 20:37:11 OpenStack Infra tags (old value: in-stable-xena; new value: in-stable-wallaby in-stable-xena)
2022-04-26 20:37:17 OpenStack Infra tags (old value: in-stable-wallaby in-stable-xena; new value: in-stable-victoria in-stable-wallaby in-stable-xena)
2022-04-26 20:37:23 OpenStack Infra tags (old value: in-stable-victoria in-stable-wallaby in-stable-xena; new value: in-stable-ussuri in-stable-victoria in-stable-wallaby in-stable-xena)
2022-04-26 20:37:28 OpenStack Infra tags (old value: in-stable-ussuri in-stable-victoria in-stable-wallaby in-stable-xena; new value: in-stable-train in-stable-ussuri in-stable-victoria in-stable-wallaby in-stable-xena)
2022-04-27 09:49:40 OpenStack Infra tags (old value: in-stable-train in-stable-ussuri in-stable-victoria in-stable-wallaby in-stable-xena; new value: in-stable-train in-stable-ussuri in-stable-victoria in-stable-wallaby in-stable-xena in-stable-yoga)