neutron unittest require minimum 4gb memory

Bug #1234857 reported by Bhuvan Arumugam
32
This bug affects 7 people
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Medium
Maru Newby
Havana
Fix Released
Medium
Aaron Rosen
Icehouse
Fix Released
Medium
Unassigned

Bug Description

tox -e py26

The unittest hang forever. Each test seem to take around 25mins to complete. Each test report following error, though it is PASS. It sounds like a regression caused by fix for https://bugs.launchpad.net/neutron/+bug/1191768.

https://github.com/openstack/neutron/commit/06f679df5d025e657b2204151688ffa60c97a3d3

As per this fix, the default behavior for neutron.agent.rpc.report_state() is modified to use cast(), to report back the state in json format. The original behavior was to use call() method.

Using call() method by default might fix this problem.

ERROR:neutron.plugins.linuxbridge.agent.linuxbridge_neutron_agent:Failed reporting state!
Traceback (most recent call last):
  File "/home/jenkins/workspace/csi-neutron-upstream/neutron/plugins/linuxbridge/agent/linuxbridge_neutron_agent.py", line 759, in _report_state
    self.agent_state)
  File "/home/jenkins/workspace/csi-neutron-upstream/neutron/agent/rpc.py", line 74, in report_state
    return self.cast(context, msg, topic=self.topic)
  File "/home/jenkins/workspace/csi-neutron-upstream/neutron/openstack/common/rpc/proxy.py", line 171, in cast
    rpc.cast(context, self._get_topic(topic), msg)
  File "/home/jenkins/workspace/csi-neutron-upstream/neutron/openstack/common/rpc/__init__.py", line 158, in cast
    return _get_impl().cast(CONF, context, topic, msg)
  File "/home/jenkins/workspace/csi-neutron-upstream/neutron/openstack/common/rpc/impl_fake.py", line 166, in cast
    check_serialize(msg)
  File "/home/jenkins/workspace/csi-neutron-upstream/neutron/openstack/common/rpc/impl_fake.py", line 131, in check_serialize
    json.dumps(msg)
  File "/usr/lib64/python2.6/json/__init__.py", line 230, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib64/python2.6/json/encoder.py", line 367, in encode
    chunks = list(self.iterencode(o))
  File "/usr/lib64/python2.6/json/encoder.py", line 309, in _iterencode
    for chunk in self._iterencode_dict(o, markers):
  File "/usr/lib64/python2.6/json/encoder.py", line 275, in _iterencode_dict
    for chunk in self._iterencode(value, markers):
  File "/usr/lib64/python2.6/json/encoder.py", line 309, in _iterencode
    for chunk in self._iterencode_dict(o, markers):
  File "/usr/lib64/python2.6/json/encoder.py", line 275, in _iterencode_dict
    for chunk in self._iterencode(value, markers):
  File "/usr/lib64/python2.6/json/encoder.py", line 309, in _iterencode
    for chunk in self._iterencode_dict(o, markers):
  File "/usr/lib64/python2.6/json/encoder.py", line 275, in _iterencode_dict
    for chunk in self._iterencode(value, markers):
  File "/usr/lib64/python2.6/json/encoder.py", line 309, in _iterencode
    for chunk in self._iterencode_dict(o, markers):
  File "/usr/lib64/python2.6/json/encoder.py", line 275, in _iterencode_dict
    for chunk in self._iterencode(value, markers):
  File "/usr/lib64/python2.6/json/encoder.py", line 309, in _iterencode
    for chunk in self._iterencode_dict(o, markers):
  File "/usr/lib64/python2.6/json/encoder.py", line 275, in _iterencode_dict
    for chunk in self._iterencode(value, markers):
  File "/usr/lib64/python2.6/json/encoder.py", line 317, in _iterencode
    for chunk in self._iterencode_default(o, markers):
  File "/usr/lib64/python2.6/json/encoder.py", line 323, in _iterencode_default
    newobj = self.default(o)
  File "/usr/lib64/python2.6/json/encoder.py", line 344, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: <MagicMock name='LinuxBridgeManager().local_ip' id='666599248'> is not JSON serializable

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Armando, can you confirm this is the issue you saw and then went away by resetting the venv?

Changed in neutron:
assignee: nobody → Armando Migliaccio (armando-migliaccio)
Revision history for this message
Aaron Rosen (arosen) wrote :

I got a copy of OL6 downloading to try and reproduce as well.

Revision history for this message
Bhuvan Arumugam (bhuvan) wrote :

In the meantime, I'm testing report json using call() method. I'll keep you posted how it goes.

--- a/neutron/plugins/linuxbridge/agent/linuxbridge_neutron_agent.py
+++ b/neutron/plugins/linuxbridge/agent/linuxbridge_neutron_agent.py
@@ -756,7 +756,7 @@ class LinuxBridgeNeutronAgentRPC(sg_rpc.SecurityGroupAgentRpcMixin):
             devices = len(self.br_mgr.udev_get_tap_devices())
             self.agent_state.get('configurations')['devices'] = devices
             self.state_rpc.report_state(self.context,
- self.agent_state)
+ self.agent_state, use_call=True)
             self.agent_state.pop('start_flag', None)
         except Exception:
             LOG.exception(_("Failed reporting state!"))

Changed in neutron:
assignee: Armando Migliaccio (armando-migliaccio) → nobody
Revision history for this message
ZhiQiang Fan (aji-zqfan) wrote :

"""As per this fix, the default behavior for neutron.agent.rpc.report_state() is modified to use cast(), to report back the state in json format. The original behavior was to use call() method.

Using call() method by default might fix this problem."""

i'm confused with you comment, because what i know about call v.s. cast is: http://docs.openstack.org/developer/nova/devref/rpc.html, which means call need callback, not cast. (please correct me if i'm wrong)

Revision history for this message
Bhuvan Arumugam (bhuvan) wrote :

ZhiQiang, with the following fix we don't face JSON serializable error. But the tests still hang. Discard my original comment about cast vs call. With the patch in https://bugs.launchpad.net/neutron/+bug/1234857/comments/3, the problem still occur.

Aaron, can you keep us posted on how your testing with OL6 go?

--- a/neutron/tests/unit/linuxbridge/test_lb_neutron_agent.py
+++ b/neutron/tests/unit/linuxbridge/test_lb_neutron_agent.py
@@ -97,9 +97,11 @@ class TestLinuxBridgeAgent(base.BaseTestCase):
         super(TestLinuxBridgeAgent, self).setUp()
         cfg.CONF.set_override('rpc_backend',
                               'neutron.openstack.common.rpc.impl_fake')
+ mock_specific = mock.MagicMock(local_ip=LOCAL_IP)
         self.lbmgr_patcher = mock.patch('neutron.plugins.linuxbridge.agent.'
                                         'linuxbridge_neutron_agent.'
- 'LinuxBridgeManager')
+ 'LinuxBridgeManager',
+ return_value=mock_specific)
         self.lbmgr_mock = self.lbmgr_patcher.start()
         self.addCleanup(self.lbmgr_patcher.stop)
         self.execute_p = mock.patch.object(ip_lib.IPWrapper, '_execute')

Revision history for this message
ZhiQiang Fan (aji-zqfan) wrote :

i suggest you to try mocking the FixedIntervalLoopingCall, the hang may be caused by rpc (i'm not sure about it, i even cannot reproduce this issue in my env, [ubuntu 12.04 32bit, py2.7.5])

i read the test code in neutron/tests/unit/linuxbridge/test_lb_neutron_agent.py and neutron/plugins/linuxbridge/agent/linuxbridge_neutron_agent.py. as i said, the TestLinuxBridgeAgent.test_update_devices_failed() create an agent instantiates LinuxBridgeNeutronAgentRPC, eventhough it doesn't call daemon_loop but do setup_rpc, which is using FixedIntervalLoopingCall. that unit test doesn't clear the agent and let it run forever until sys.exit() (if i'm wrong, please correct me)

since i don't know how to clear this agent cleanly, the only way i can image is mock (and this is the usual way of other agents do, such as dhcp_agent, l3_agent, ovs agent)

(ps: the patch which set use_call=true will always use self.call for rpc, which i think can increase but not fix the hang problem)

Revision history for this message
Bhuvan Arumugam (bhuvan) wrote :

Thank you for the suggestion, ZhiQiang.

Mocking the FixedIntervalLoopingCall doesn't help either. We don't face JSON serializable error anymore, but the test suites are very slow. Only around 3200 unit tests were executed in last 24hrs.

FWIW, with or without patch, the first 2200 tests are executed quickly within 10mins. It takes double the amount of time for each consecutive tests. For example, after executing 2200 tests, if it takes 1min for a test, it takes 2min for next test, 4mins for next test, 8mins and so on for remaining tests. However, if we run those tests alone separately they are successful and completed within a minute.

I'm going to see if I could narrow-down the reason for delay with cProfile. Meanwhile let me know if you got any other clue.

Revision history for this message
Bhuvan Arumugam (bhuvan) wrote :

I'm yet to continue debugging with cProfile.

Meanwhile if anyone want to take a peek at testr output, here it is:
  http://livecipher.com/testr-1234857.log.gz

The pip list (pip freeze) used in our environment is as follows. FWIW we also use pypi.openstack.org/mirror to download packages. It is identical to pip list used by gate jobs, ex: https://jenkins01.openstack.org/job/gate-neutron-python26/2231/console

Babel==1.3
Jinja2==2.7.1
Mako==0.9.0
MarkupSafe==0.18
Paste==1.7.5.1
PasteDeploy==1.5.0
Pygments==1.6
Routes==1.13
SQLAlchemy==0.7.10
Sphinx==1.2b3
WebOb==1.2.3
WebTest==2.0.6
alembic==0.6.0
amqp==1.0.13
amqplib==1.0.2
anyjson==0.3.3
argparse==1.2.1
beautifulsoup4==4.3.2
cliff==1.4.5
cmd2==0.6.7
configobj==4.7.2
coverage==3.7
discover==0.4.0
docutils==0.11
eventlet==0.14.0
extras==0.0.3
fixtures==0.3.14
flake8==2.0
greenlet==0.4.1
hacking==0.7.2
httplib2==0.8
importlib==1.0.2
iso8601==0.1.4
jsonrpclib==0.1.3
kombu==2.5.15
mccabe==0.2.1
mock==1.0.1
mox==0.5.3
netaddr==0.7.10
neutron==2014.1.a61.gc7db907
ordereddict==1.1
oslo.config==1.2.1
pbr==0.5.21
pep8==1.4.5
prettytable==0.7.2
pyflakes==0.7.3
pyparsing==2.0.1
python-keystoneclient==0.3.2
python-mimeparse==0.1.4
python-neutronclient==2.3.1
python-novaclient==2.15.0
python-subunit==0.0.15
pytz==2013.7
pyudev==0.16.1
repoze.lru==0.6
requests==2.0.0
setuptools-git==1.0
simplejson==3.3.1
six==1.4.1
stevedore==0.12
testrepository==0.0.17
testtools==0.9.32
waitress==0.8.5

Aaron Rosen (arosen)
Changed in neutron:
assignee: nobody → Aaron Rosen (arosen)
importance: Undecided → Medium
Aaron Rosen (arosen)
Changed in neutron:
status: New → Confirmed
Revision history for this message
Bhuvan Arumugam (bhuvan) wrote :

FWIW, we could replicate this issue consistently in any of following VMs. We don't use physical servers in our environment, at least to build/test neutron.

2GB/Intel(R) Xeon(R) CPU X5650 @ 2.67GHz (1 processor)
1GB/Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz (2 processor)

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

I *think* we're at a stage where Neutron unit test require at least 3GB to run smoothly; I'll try to give you a confirmation as soon as possible.

Revision history for this message
Bhuvan Arumugam (bhuvan) wrote :

Salvatore, you are correct. The unittests PASS completely in following VMs.

8GB/Intel(R) Core(TM) i7-3770QM CPU @ 3.40GHz (2 processors). Time taken: 1hr
4GB/Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz (2 processors). Time taken: 1hr 45mins

I'll see if we could bump up the memory for VMs in our build infrastructure.
Meanwhile, is there any plan to fix unittests to run with 2GB memory, similar to other openstack projects?

summary: - neutron unittest do not complete in OL6
+ neutron unittest require minimum 4gb memory
Revision history for this message
Aaron Rosen (arosen) wrote :

Actually I wonder if it's more of a python2.6 issue. Looking at a few jenkin test runs it shows that tests running with python2.6 take twice as long as python2.7

gate-neutron-python26 SUCCESS in 55m 17s
gate-neutron-python27 SUCCESS in 33m 15s

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/53244
Committed: http://github.com/openstack/neutron/commit/d26dfed7d47926425bb55893da45f650d6549b9a
Submitter: Jenkins
Branch: master

commit d26dfed7d47926425bb55893da45f650d6549b9a
Author: Aaron Rosen <email address hidden>
Date: Tue Oct 22 15:33:45 2013 -0700

    Fix TypeError: <MagicMock name='LinuxBridgeManager().local_ip'

    This patch fixes an error that does not result in a test failure but
    causes the following error to be logged in the test logs:

    TypeError: <MagicMock name='LinuxBridgeManager().local_ip' id='666599248'>
    is not JSON serializable

    Related-Bug: #1234857

    Change-Id: I65cdd997745d8b760031f8f32e3c10682ad2288e

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/havana)

Related fix proposed to branch: stable/havana
Review: https://review.openstack.org/58286

Revision history for this message
Aaron Rosen (arosen) wrote :

Noticed this is causing failures in stable/havana from time to time. See <https://jenkins01.openstack.org/job/periodic-neutron-python26-stable-havana/4/>

tags: added: havana-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/havana)

Reviewed: https://review.openstack.org/58286
Committed: http://github.com/openstack/neutron/commit/decd4e7ad887a9d3e901ebb3b43011d05e2540ab
Submitter: Jenkins
Branch: stable/havana

commit decd4e7ad887a9d3e901ebb3b43011d05e2540ab
Author: Aaron Rosen <email address hidden>
Date: Tue Oct 22 15:33:45 2013 -0700

    Fix TypeError: <MagicMock name='LinuxBridgeManager().local_ip'

    This patch fixes an error that does not result in a test failure but
    causes the following error to be logged in the test logs:

    TypeError: <MagicMock name='LinuxBridgeManager().local_ip' id='666599248'>
    is not JSON serializable

    Related-Bug: #1234857

    Change-Id: I65cdd997745d8b760031f8f32e3c10682ad2288e
    (cherry picked from commit: d26dfed7d47926425bb55893da45f650d6549b9a)

tags: added: in-stable-havana
Alan Pevec (apevec)
tags: removed: in-stable-havana
Revision history for this message
Akihiro Motoki (amotoki) wrote :

It seems two topics are discussed in this bug report: the one is mocking error and the other is slow unittests in python 2.6.
It seems the title "neutron unittest require minimum 4gb memory" and the merged fix is not relevant directly.

It is confusing that this bug in stable/havana is marked as "Fix Committed".
The original bug "neutron unittest require minimum 4gb memory" is not addressed yet.
Is it better to split the bug?

Revision history for this message
Maru Newby (maru) wrote :

The largest contributors to memory usage appear to be leakage due to plugins not being deallocated and sqlalchemy. I'll be submitting a patch that ensures that the core plugin is always deallocated as a first step.

Changed in neutron:
assignee: Aaron Rosen (arosen) → Maru Newby (maru)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/92793

Changed in neutron:
status: Confirmed → In Progress
Changed in neutron:
milestone: none → juno-1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/92793
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=6db48dd688d45ba41863926fa1a91d2d3ea484ec
Submitter: Jenkins
Branch: master

commit 6db48dd688d45ba41863926fa1a91d2d3ea484ec
Author: Maru Newby <email address hidden>
Date: Wed May 7 22:41:40 2014 +0000

    Ensure core plugin deallocation after every test

    The unit tests were previously consuming an excessive amount of memory
    (4GB+) due to plugin instances persisting in memory. Deallocation was
    not possible where a combination of circular references and mocking
    was involved. This patch ensures that only NeutronManager holds a
    plugin reference and that all other references are instances of
    weakref.proxy. Residual memory footprint for tox executed on a
    12-core machine has been reduced to ~1.3GB. Plugin deallocation is
    validated at the end of each test to prevent regressions.

    This change also includes fixes to unit tests that depended on plugin
    instances persisting across tests.

    Partial-Bug: #1234857
    Change-Id: Ia1f868c2d206eb72ef77d290d054f3c48ab58c94

Revision history for this message
Maru Newby (maru) wrote :

I'm assuming this makes sense to backport?

tags: added: icehouse-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/93140

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/93140
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ee5c3b0bb60918a7e59e89ceffbafa2f683b74fd
Submitter: Jenkins
Branch: master

commit ee5c3b0bb60918a7e59e89ceffbafa2f683b74fd
Author: Maru Newby <email address hidden>
Date: Fri May 9 22:24:45 2014 +0000

    Make plugin deallocation check optional

    The fix for memory leakage in the related bug added a check for plugin
    deallocation that performed a call to gc.collect() after every test.
    This had the side-effect of increasing test execution time by ~50%, so
    this patch makes the check optional via an environment variable
    (OS_CHECK_PLUGIN_DEALLOCATION).

    It may make sense to create a periodic job that runs with the check
    enabled, but otherwise the check can be used by developers for
    debugging purposes.

    Change-Id: I9ebe663abbc4e4ff3a62d807d5a3230c2ecccd07
    Related-Bug: #1234857

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/94489

Changed in neutron:
assignee: Maru Newby (maru) → Armando Migliaccio (armando-migliaccio)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/94489
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=adb6674c798594fa7348e294e5c979085503ba96
Submitter: Jenkins
Branch: master

commit adb6674c798594fa7348e294e5c979085503ba96
Author: armando-migliaccio <email address hidden>
Date: Tue May 20 14:47:51 2014 -0700

    NSX: replace strong references to the plugin with weakref ones

    This is done in accordance to a refactoring done in change 92793.

    Partial-bug: #1234857

    Change-Id: Iaf2bc93182c01e62bb755b4e3f104f5a926b73fc

Revision history for this message
Kyle Mestery (mestery) wrote :

Maru, can you comment on if this bug is actually finished or still requires some work in Neutron? Thanks!

Changed in neutron:
milestone: juno-1 → none
assignee: Armando Migliaccio (armando-migliaccio) → Maru Newby (maru)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/icehouse)

Fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/102329

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/icehouse)

Related fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/105700

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/icehouse)

Reviewed: https://review.openstack.org/102329
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=478f4871fff9b2e3d5754ad43165b0356e07c58f
Submitter: Jenkins
Branch: stable/icehouse

commit 478f4871fff9b2e3d5754ad43165b0356e07c58f
Author: Maru Newby <email address hidden>
Date: Wed May 7 22:41:40 2014 +0000

    Ensure core plugin deallocation after every test

    The unit tests were previously consuming an excessive amount of memory
    (4GB+) due to plugin instances persisting in memory. Deallocation was
    not possible where a combination of circular references and mocking
    was involved. This patch ensures that only NeutronManager holds a
    plugin reference and that all other references are instances of
    weakref.proxy. Residual memory footprint for tox executed on a
    12-core machine has been reduced to ~1.3GB. Plugin deallocation is
    validated at the end of each test to prevent regressions.

    This change also includes fixes to unit tests that depended on plugin
    instances persisting across tests.

    Conflicts:

     neutron/tests/base.py
     neutron/tests/unit/ml2/test_mechanism_odl.py
     neutron/tests/unit/test_db_plugin.py

    Partial-Bug: #1234857
    Change-Id: Ia1f868c2d206eb72ef77d290d054f3c48ab58c94
    (cherry picked from commit 6db48dd688d45ba41863926fa1a91d2d3ea484ec)

tags: added: in-stable-icehouse
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/icehouse)

Reviewed: https://review.openstack.org/105700
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=0cb4aaaefaf780d733178d33bab9791ab6cbce3b
Submitter: Jenkins
Branch: stable/icehouse

commit 0cb4aaaefaf780d733178d33bab9791ab6cbce3b
Author: Maru Newby <email address hidden>
Date: Fri May 9 22:24:45 2014 +0000

    Make plugin deallocation check optional

    The fix for memory leakage in the related bug added a check for plugin
    deallocation that performed a call to gc.collect() after every test.
    This had the side-effect of increasing test execution time by ~50%, so
    this patch makes the check optional via an environment variable
    (OS_CHECK_PLUGIN_DEALLOCATION).

    It may make sense to create a periodic job that runs with the check
    enabled, but otherwise the check can be used by developers for
    debugging purposes.

    Change-Id: I9ebe663abbc4e4ff3a62d807d5a3230c2ecccd07
    Related-Bug: #1234857
    (cherry picked from commit ee5c3b0bb60918a7e59e89ceffbafa2f683b74fd)

Revision history for this message
Akihiro Motoki (amotoki) wrote :

I checked how much memory is consumed to run the whole unit tests.

Ubuntu 14.04 amd64 machine with 4 CPUs and 8 GB.
The unit tests are run using 4 python processes. It takes about 25 mins.
The maximum usage of the memory is as follows:

             total used free shared buffers cached
Mem: 8176832 7375460 801372 428 275764 2965736
-/+ buffers/cache: 4133960 4042872
Swap: 4194300 0 4194300

The memory consumption has been improved a lot compared to the situation six month ago.
I think we can close this bug once if Maru doesn't have any pending works on this bug.

Maru Newby (maru)
Changed in neutron:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.