Unit tests hang forever

Bug #1578986 reported by Peter Wang
28
This bug affects 4 people
Affects Status Importance Assigned to Milestone
Cinder
Fix Released
Medium
Gorka Eguileor

Bug Description

we have noticed several times of the zone manager hang:

http://logs.openstack.org/27/312427/6/check/gate-cinder-python27-db/1c92ea2/
http://logs.openstack.org/27/312427/6/check/gate-cinder-python34-db/3273385/

the test case below most likely causes the hang issue:
{3} cinder.tests.unit.zonemanager.test_fc_zone_manager.TestFCZoneManager.test_add_connection

can anyone help to investigate this?

Thanks
Peter

Revision history for this message
lg yue (yalogr) wrote :
Download full text (6.2 KiB)

{2} cinder.tests.unit.zonemanager.test_fc_zone_manager.TestFCZoneManager.test_delete_connection [0.032648s] ... ok

^CTraceback (most recent call last):
{3} cinder.tests.unit.volume.drivers.emc.vnx.test_client.TestClient.test_expand_lun_not_ops_ready [] ... inprogress
  File ".tox/py27/bin/ostestr", line 11, in <module>
Traceback (most recent call last):
  File "/root/openstack-github/cinder/.tox/py27/bin/subunit-trace", line 11, in <module>
        sys.exit(main())
sys.exit(main())
  File "/root/openstack-github/cinder/.tox/py27/lib/python2.7/site-packages/os_testr/ostestr.py", line 245, in main
  File "/root/openstack-github/cinder/.tox/py27/lib/python2.7/site-packages/os_testr/subunit_trace.py", line 399, in main
    exit(_select_and_call_runner(opts, exclude_regex, others))
  File "/root/openstack-github/cinder/.tox/py27/lib/python2.7/site-packages/os_testr/ostestr.py", line 205, in _select_and_call_runner
    args.no_summary))
  File "/root/openstack-github/cinder/.tox/py27/lib/python2.7/site-packages/os_testr/subunit_trace.py", line 373, in trace
    opts.until_failure, opts.color, others)
  File "/root/openstack-github/cinder/.tox/py27/lib/python2.7/site-packages/os_testr/ostestr.py", line 172, in call_testr
    stream.run(result)
  File "/root/openstack-github/cinder/.tox/py27/lib/python2.7/site-packages/subunit/v2.py", line 284, in run
    proc.communicate()
  File "/usr/lib64/python2.7/subprocess.py", line 797, in communicate
    content = self.source.read(1)
KeyboardInterrupt
    self.wait()
  File "/usr/lib64/python2.7/subprocess.py", line 1376, in wait
    pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
  File "/usr/lib64/python2.7/subprocess.py", line 478, in _eintr_retry_call
    return func(*args)
KeyboardInterrupt
ERROR: KEYBOARDINTERRUPT
ERROR: keyboardinterrupt
[root@computer1 cinder]#
[root@computer1 cinder]# tox -e py27Using NetApp controller: 127.0.0.1
Using NetApp controller: 127.0.0.1
Using NetApp controller: 127.0.0.1
Using NetApp controller: 127.0.0.1
Using NetApp controller: 127.0.0.1
Traceback (most recent call last):
  File "/root/openstack-github/cinder/.tox/py27/bin/testr", line 26, in <module>
    sys.exit(run_argv(sys.argv, sys.stdin, sys.stdout, sys.stderr))
  File "/root/openstack-github/cinder/.tox/py27/lib/python2.7/site-packages/testrepository/commands/__init__.py", line 188, in run_argv
    result = cmd.execute()
  File "/root/openstack-github/cinder/.tox/py27/lib/python2.7/site-packages/testrepository/commands/__init__.py", line 143, in execute
    result = self.run()
  File "/root/openstack-github/cinder/.tox/py27/lib/python2.7/site-packages/testrepository/commands/run.py", line 213, in run
    return self._run_tests(cmd)
  File "/root/openstack-github/cinder/.tox/py27/lib/python2.7/site-packages/testrepository/commands/run.py", line 358, in _run_tests
    return run_tests()
  File "/root/openstack-github/cinder/.tox/py27/lib/python2.7/site-packages/testrepository/commands/run.py", line 356, in run_tests
    return load_cmd.execute()
  File "/root/openstack-github/cinder/.tox/py27/lib/python2.7/site-packages/testrepository/commands/__init__.py", line 143, in execute
    result = self.run()
  Fil...

Read more...

summary: - zone manger unit tests hang forever
+ Unit tests hang forever
Changed in cinder:
status: New → Confirmed
Revision history for this message
Clinton Knight (clintonk) wrote :

In the output posted by yalogr above, note the line printed *after* Ctrl-C was pressed:

{3} cinder.tests.unit.volume.drivers.emc.vnx.test_client.TestClient.test_expand_lun_not_ops_ready [] ... inprogress

This shows that an EMC VNX test was running when the hang was observed, so it is most likely that the EMC test is at fault.

The EMC VNX driver is all-new code merged on 7 July 2016 at commit 572b84c073ef66d976d4530d9b7f07f518695ab0.

I confirmed this by running the unit tests on the previous commit (9da9ebb34581818c053e1e92bf6f552e32605479). They did not hang after several attempts. Moving to the subsequent VNX commit yielded hangs easily.

Revision history for this message
Xing Yang (xing-yang) wrote :
Peter Wang (peter.wang)
Changed in cinder:
assignee: nobody → Peter Wang (peter.wang)
Ivan Kolodyazhny (e0ne)
Changed in cinder:
importance: Undecided → Medium
Revision history for this message
Ivan Kolodyazhny (e0ne) wrote :

Please, be aware that both EMC VNX and EMAX tests could hang

Revision history for this message
Eric Harney (eharney) wrote :

Here is the trace I see from gdb when this hang occurs on my system. Something is still polling in our fake oslo_messaging driver.

http://paste.openstack.org/raw/539769/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/346157

Changed in cinder:
assignee: Peter Wang (peter.wang) → Xing Yang (xing-yang)
status: Confirmed → In Progress
Revision history for this message
Xing Yang (xing-yang) wrote :

This patch didn't fix it: https://review.openstack.org/346157

Continue investigation.

Changed in cinder:
assignee: Xing Yang (xing-yang) → Peter Wang (peter.wang)
Revision history for this message
Peter Wang (peter.wang) wrote :

Hi Xing and Eric

I also got some more logs from gdb: http://paste.openstack.org/show/540431/

from my observation: all threads were waiting for lock-release by *LWP 22400*, while 22400 was unable to release the lock, and failed into a infinite loop then.

I still have no idea what's root cause.

Thanks
Peter

Changed in cinder:
assignee: Peter Wang (peter.wang) → Xing Yang (xing-yang)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/347117

Changed in cinder:
assignee: Xing Yang (xing-yang) → Peter Wang (peter.wang)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to cinder (master)

Reviewed: https://review.openstack.org/320148
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=39e0c888ba97ce93199ac14d223008ce3c333992
Submitter: Jenkins
Branch: master

commit 39e0c888ba97ce93199ac14d223008ce3c333992
Author: Ivan Kolodyazhny <email address hidden>
Date: Mon May 23 22:13:10 2016 +0300

    Move drivers unit tests to unit.volume.drivers directory

    This change makes our tests directory cleaner a bit and
    helps to find tests faster.

    Some of these tests freeze for a long time, so they are skipped.

    Related-Bug: #1578986

    Change-Id: Ic292b9608cd3d174a2b9b30e7d941e728f32547c

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on cinder (master)

Change abandoned by xing-yang (<email address hidden>) on branch: master
Review: https://review.openstack.org/346157
Reason: Abandon this patch. Continue investigation here:

https://review.openstack.org/#/c/347117/

Revision history for this message
Peter Wang (peter.wang) wrote :

@Ivan

Xing also mentioned huawei driver'hang issue in https://review.openstack.org/#/c/347117/
Can you paste logs for VMAX, I noticed you said VMAX could get hang as well?

Xing and I are trying to figure out what's in common for tests from 3 vendors.

Thanks
peter

Changed in cinder:
assignee: Peter Wang (peter.wang) → Xing Yang (xing-yang)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.openstack.org/347117
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=e5681895ba0ea0e3884904583a825cd405253ceb
Submitter: Jenkins
Branch: master

commit e5681895ba0ea0e3884904583a825cd405253ceb
Author: Peter Wang <email address hidden>
Date: Tue Jul 26 12:54:59 2016 +0800

    Resolve unittest hang issue

    This patch removes suspect decorator patch_no_sleep from VNX unit
    test, and set the timeout = 0 when running tests to avoid potential
    long running.

    Also removed mock.patch of FixedIntervalLoopingCall from the setup
    of Huawei unit test to prevent test_replication_driver_enable_success
    in test_huawei_drivers.py from hanging.

    Change-Id: I5c8efbf9a62e0f5bc39c5341b53f7af4e47c7e81
    Partial-Bug: #1578986

Revision history for this message
Xing Yang (xing-yang) wrote :

Found two new tests that caused hang again. Both have “mock.patch('time.sleep’)” in it. I’ll skip these two tests.

http://logs.openstack.org/52/328052/61/check/gate-cinder-python34-db/d417592/console.html

cinder.tests.unit.volume.drivers.test_datera.DateraVolumeTestCase.test_ensure_export_success [] ... inprogress
2016-08-25 16:35:34.965056 | {2} cinder.tests.unit.volume.drivers.test_datera.DateraVolumeTestCase.test_create_export_target_does_not_exist_success [] ... inprogress
2016-08-25 16:35:35.043246 |

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/360807

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.openstack.org/360807
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=1ac4303043b756b8a6be2cf092b664fa57e037d4
Submitter: Jenkins
Branch: master

commit 1ac4303043b756b8a6be2cf092b664fa57e037d4
Author: xing-yang <email address hidden>
Date: Sun Jul 17 02:52:15 2016 -0400

    Set sleep time to 0 in Datera driver

    Found two new tests that caused hang. Both have “mock.patch(
    'time.sleep’)” in it. In this patch, I changed
    datera.DEFAULT_SI_SLEEP to 0 from 0.01 and removed
    “mock.patch('time.sleep’)”.

    Logs that showed hanging:
        http://logs.openstack.org/52/328052/61/check/
        gate-cinder-python34-db/d417592/console.html

    cinder.tests.unit.volume.drivers.test_datera.DateraVolumeTestCase.
    test_ensure_export_success [] ... inprogress
    2016-08-25 16:35:34.965056 |
    {2} cinder.tests.unit.volume.drivers.test_datera.DateraVolumeTestCase.
    test_create_export_target_does_not_exist_success [] ... inprogress
    2016-08-25 16:35:35.043246 |

    Change-Id: I3b8797b245616a85f927de15dca48e75831e035d
    Partial-Bug: #1578986

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/361835

Changed in cinder:
assignee: Xing Yang (xing-yang) → Peter Wang (peter.wang)
Revision history for this message
Xing Yang (xing-yang) wrote :
Download full text (4.1 KiB)

We saw the following VNX unit test error. Patch https://review.openstack.org/361835 was submitted to address the problem.

http://logs.openstack.org/52/315352/7/check/gate-cinder-python34-db/4456f4e/console.html

2016-08-28 00:41:19.705748 | ==============================
2016-08-28 00:41:19.705763 | Failed 1 tests - output below:
2016-08-28 00:41:19.705775 | ==============================
2016-08-28 00:41:19.705781 |
2016-08-28 00:41:19.705801 | cinder.tests.unit.volume.drivers.emc.vnx.test_utils.TestUtils.test_wait_until
2016-08-28 00:41:19.705821 | -----------------------------------------------------------------------------
2016-08-28 00:41:19.705827 |
2016-08-28 00:41:19.705836 | Captured traceback:
2016-08-28 00:41:19.705846 | ~~~~~~~~~~~~~~~~~~~
2016-08-28 00:41:19.705860 | b'Traceback (most recent call last):'
2016-08-28 00:41:19.705894 | b' File "/home/jenkins/workspace/gate-cinder-python34-db/.tox/py34/lib/python3.4/site-packages/mock/mock.py", line 1305, in patched'
2016-08-28 00:41:19.705915 | b' return func(*args, **keywargs)'
2016-08-28 00:41:19.705950 | b' File "/home/jenkins/workspace/gate-cinder-python34-db/cinder/tests/unit/volume/drivers/emc/vnx/test_utils.py", line 42, in test_wait_until'
2016-08-28 00:41:19.705966 | b' utils.wait_until(mock_testmethod)'
2016-08-28 00:41:19.705996 | b' File "/home/jenkins/workspace/gate-cinder-python34-db/cinder/volume/drivers/emc/vnx/utils.py", line 137, in wait_until'
2016-08-28 00:41:19.706015 | b' timer.start(interval=interval).wait()'
2016-08-28 00:41:19.706047 | b' File "/home/jenkins/workspace/gate-cinder-python34-db/.tox/py34/lib/python3.4/site-packages/eventlet/event.py", line 121, in wait'
2016-08-28 00:41:19.706062 | b' return hubs.get_hub().switch()'
2016-08-28 00:41:19.706094 | b' File "/home/jenkins/workspace/gate-cinder-python34-db/.tox/py34/lib/python3.4/site-packages/eventlet/hubs/hub.py", line 294, in switch'
2016-08-28 00:41:19.706108 | b' return self.greenlet.switch()'
2016-08-28 00:41:19.706143 | b' File "/home/jenkins/workspace/gate-cinder-python34-db/.tox/py34/lib/python3.4/site-packages/oslo_service/loopingcall.py", line 136, in _run_loop'
2016-08-28 00:41:19.706171 | b' result = func(*self.args, **self.kw)'
2016-08-28 00:41:19.706201 | b' File "/home/jenkins/workspace/gate-cinder-python34-db/cinder/volume/drivers/emc/vnx/utils.py", line 132, in _inner'
2016-08-28 00:41:19.706217 | b" % {'condition_name': condition.__name__})"
2016-08-28 00:41:19.706249 | b' File "/home/jenkins/workspace/gate-cinder-python34-db/.tox/py34/lib/python3.4/site-packages/mock/mock.py", line 700, in __getattr__'
2016-08-28 00:41:19.706262 | b' raise AttributeError(name)'
2016-08-28 00:41:19.706273 | b'AttributeError: __name__'
2016-08-28 00:41:19.706280 | b''
2016-08-28 00:41:19.706285 |
2016-08-28 00:41:19.706294 | Captured pythonlogging:
2016-08-28 00:41:19.706304 | ~~~~~~~~~~~~~~~~~~~~~~~
2016-08-28 00:41:19.706328 | b'2016-08-28 00:40:33,718 INFO [cinder.coordination] Coordination backend started successfully.'
2016-08-28 00:41:19.706366 | b"2016-08-28 00:40:34,237 ER...

Read more...

Revision history for this message
Eric Harney (eharney) wrote :

> We saw the following VNX unit test error. Patch https://review.openstack.org/361835 was submitted to address the problem.

> http://logs.openstack.org/52/315352/7/check/gate-cinder-python34-db/4456f4e/console.html

Why is this thought to be related to the "hang forever" problem?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.openstack.org/361835
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=08f1a2414e9fbacef2c24419309ebb03d601a29a
Submitter: Jenkins
Branch: master

commit 08f1a2414e9fbacef2c24419309ebb03d601a29a
Author: Peter Wang <email address hidden>
Date: Mon Aug 29 01:58:30 2016 -0400

    Avoid patch loopingcall in VNX

    This patch removes patch loopcall to avoid possible test case failure
    caused by eventlet.

    Change-Id: I17dc59594fbb296a678dac52ffc290e8e0bda303
    Partial-Bug: #1578986

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/367601

Changed in cinder:
assignee: Peter Wang (peter.wang) → Gorka Eguileor (gorka)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.openstack.org/367601
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=a6eca292fc93acaf5aa4dc5302154b96076195d9
Submitter: Jenkins
Branch: master

commit a6eca292fc93acaf5aa4dc5302154b96076195d9
Author: Gorka Eguileor <email address hidden>
Date: Mon Sep 5 21:27:15 2016 +0200

    Remove sleep mocks in RBD unit tests

    This patch removes all mocks of time.sleep in test_rbd and replaces
    passed function to retrying.Retrying to calculate retrying delays with a
    dummy that always return 0 so we don't do any delays.

    Partial-Bug: #1578986
    Change-Id: If11f5dfd596029fb933bb5dc30dcc2fb1a3c120f

Revision history for this message
Sean McGinnis (sean-mcginnis) wrote :

I haven't seen this for quite a while now, so I'm going to mark it as fixed.

Changed in cinder:
status: In Progress → Fix Released
Revision history for this message
Peter Wang (peter.wang) wrote :

Looks like the issue is still observed in unit test if mock.path('time.sleep') is mocked

http://logs.openstack.org/06/520006/2/check/openstack-tox-py27/c3b6afa/job-output.txt.gz

2017-11-15 14:41:18.632875 | ubuntu-xenial | {7} cinder.tests.unit.volume.drivers.dell_emc.vnx.test_client.TestClient.test_verify_migration_false [] ... inprogress

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.