intermittent failures in call to execute() with exception "ValueError: I/O operation on closed file"

Bug #1347337 reported by Nikhil Manchanda
This bug affects 2 people

Affects                  Status        Importance  Assigned to   Milestone
OpenStack DBaaS (Trove)  Fix Released  High        Amrith Kumar
oslo.concurrency         Fix Released  Low         Amrith Kumar

Bug Description

This happened just once, so I'm opening a tracking bug to see if we hit this again.
The details of the run:

2014-07-22 23:41:56.663 | py27 runtests: commands[1] | python setup.py testr --slowest
2014-07-22 23:41:57.027 | running testr
2014-07-22 23:41:59.231 | No handlers could be found for logger "trove.guestagent.backup.backupagent"
2014-07-22 23:41:59.397 | No handlers could be found for logger "trove.guestagent.backup.backupagent"
2014-07-22 23:41:59.906 | No handlers could be found for logger "trove.guestagent.backup.backupagent"
2014-07-22 23:42:00.257 | No handlers could be found for logger "trove.conductor.manager"
2014-07-22 23:42:01.015 | No handlers could be found for logger "trove.guestagent.api"
2014-07-22 23:42:01.695 | No handlers could be found for logger "trove.guestagent.backup.backupagent"
2014-07-22 23:42:02.514 | No handlers could be found for logger "trove.guestagent.backup.backupagent"
2014-07-22 23:42:03.094 | No handlers could be found for logger "trove.guestagent.strategies.storage.swift"
2014-07-22 23:42:13.834 | running=${PYTHON:-python} -m subunit.run discover ./trove/tests/unittests --list
2014-07-22 23:42:13.834 | running=${PYTHON:-python} -m subunit.run discover ./trove/tests/unittests --load-list /tmp/tmp.Ss0It4yOgb/tmpVCBlup
2014-07-22 23:42:13.834 | running=${PYTHON:-python} -m subunit.run discover ./trove/tests/unittests --load-list /tmp/tmp.Ss0It4yOgb/tmpeEa7fi
2014-07-22 23:42:13.834 | running=${PYTHON:-python} -m subunit.run discover ./trove/tests/unittests --load-list /tmp/tmp.Ss0It4yOgb/tmpEudlCu
2014-07-22 23:42:13.834 | running=${PYTHON:-python} -m subunit.run discover ./trove/tests/unittests --load-list /tmp/tmp.Ss0It4yOgb/tmpIVQ9rl
2014-07-22 23:42:13.834 | running=${PYTHON:-python} -m subunit.run discover ./trove/tests/unittests --load-list /tmp/tmp.Ss0It4yOgb/tmpROt36p
2014-07-22 23:42:13.834 | running=${PYTHON:-python} -m subunit.run discover ./trove/tests/unittests --load-list /tmp/tmp.Ss0It4yOgb/tmpmDS72J
2014-07-22 23:42:13.834 | running=${PYTHON:-python} -m subunit.run discover ./trove/tests/unittests --load-list /tmp/tmp.Ss0It4yOgb/tmpDaJo98
2014-07-22 23:42:13.834 | running=${PYTHON:-python} -m subunit.run discover ./trove/tests/unittests --load-list /tmp/tmp.Ss0It4yOgb/tmpMMzRbw
2014-07-22 23:42:13.834 | ======================================================================
2014-07-22 23:42:13.834 | FAIL: guestagent.test_mongodb_manager.GuestAgentMongoDBManagerTest.test_prepare_from_backup
2014-07-22 23:42:13.835 | tags: worker-5
2014-07-22 23:42:13.835 | ----------------------------------------------------------------------
2014-07-22 23:42:13.835 | Traceback (most recent call last):
2014-07-22 23:42:13.835 | File "/home/jenkins/workspace/gate-trove-python27/trove/tests/unittests/guestagent/test_mongodb_manager.py", line 58, in test_prepare_from_backup
2014-07-22 23:42:13.835 | self._prepare_dynamic(backup_id='backup_id_123abc')
2014-07-22 23:42:13.835 | File "/home/jenkins/workspace/gate-trove-python27/trove/tests/unittests/guestagent/test_mongodb_manager.py", line 91, in _prepare_dynamic
2014-07-22 23:42:13.835 | backup_info=backup_info)
2014-07-22 23:42:13.835 | File "trove/guestagent/datastore/mongodb/manager.py", line 66, in prepare
2014-07-22 23:42:13.835 | operating_system.update_owner('mongodb', 'mongodb', mount_point)
2014-07-22 23:42:13.835 | File "trove/guestagent/common/operating_system.py", line 109, in update_owner
2014-07-22 23:42:13.835 | run_as_root=True, root_helper="sudo")
2014-07-22 23:42:13.836 | File "trove/common/utils.py", line 278, in execute_with_timeout
2014-07-22 23:42:13.836 | return execute(*args, **kwargs)
2014-07-22 23:42:13.836 | File "trove/openstack/common/processutils.py", line 186, in execute
2014-07-22 23:42:13.836 | result = obj.communicate()
2014-07-22 23:42:13.836 | File "/usr/lib/python2.7/subprocess.py", line 799, in communicate
2014-07-22 23:42:13.836 | return self._communicate(input)
2014-07-22 23:42:13.836 | File "/usr/lib/python2.7/subprocess.py", line 1396, in _communicate
2014-07-22 23:42:13.836 | self.stdin.flush()
2014-07-22 23:42:13.836 | File "/home/jenkins/workspace/gate-trove-python27/.tox/py27/local/lib/python2.7/site-packages/eventlet/greenio.py", line 419, in _operationOnClosedFile
2014-07-22 23:42:13.836 | raise ValueError("I/O operation on closed file")
2014-07-22 23:42:13.836 | ValueError: I/O operation on closed file
2014-07-22 23:42:13.836 | ======================================================================
2014-07-22 23:42:13.837 | FAIL: process-returncode
2014-07-22 23:42:13.837 | tags: worker-5
2014-07-22 23:42:13.837 | ----------------------------------------------------------------------
2014-07-22 23:42:13.837 | Binary content:
2014-07-22 23:42:13.837 | traceback (test/plain; charset="utf8")
2014-07-22 23:42:13.837 | Ran 712 tests in 14.846s
2014-07-22 23:42:13.837 | FAILED (id=0, failures=2, skips=1)
2014-07-22 23:42:13.838 | error: testr failed (1)
2014-07-22 23:42:13.862 | ERROR: InvocationError: '/home/jenkins/workspace/gate-trove-python27/.tox/py27/bin/python setup.py testr --slowest'
2014-07-22 23:42:13.862 | ___________________________________ summary ____________________________________
2014-07-22 23:42:13.862 | ERROR: py27: commands failed
2014-07-22 23:42:13.870 | + result=1

Changed in trove:
importance: Undecided → Low
status: New → Triaged
Revision history for this message
Amrith Kumar (amrith) wrote :

I hit this too ...

http://logs.openstack.org/95/107795/6/check/gate-trove-python27/558a25e/console.html

It happens to be a failure on a line that was changed in

https://review.openstack.org/#/c/107542/2/trove/guestagent/datastore/mongodb/manager.py,cm

I am NOT claiming that this is the cause; just that I hit it right after merging that change into something I was working on.

And I'm hitting it reasonably regularly (read: annoyingly regularly).

Revision history for this message
Amrith Kumar (amrith) wrote :

I'm assigning this to myself after speaking with Nikhil.

Changed in trove:
assignee: nobody → Amrith (amrith)
Amrith Kumar (amrith)
affects: trove → oslo
Revision history for this message
Amrith Kumar (amrith) wrote :

Submitting a change shortly; the full explanation of this failure follows. The underlying issue is a problem in Python, for which an upstream bug has been reported (below). A partial workaround for the Python issue will be proposed for oslo.

http://bugs.python.org/issue22114

The failure happens because:

1. We create a stdin pipe (subprocess.PIPE) in the call to Popen even when there is no process_input.
2. The first call to communicate() fails with EAGAIN after code in _communicate() has already closed stdin (because there is no input to send).
3. A second call to communicate() goes down the same path and attempts to flush the now-closed stdin.

The Python bug is that after closing stdin, _communicate() doesn't mark it as closed (i.e. set it to None).

The consequence is that when there is no process_input, we can avoid the problem entirely by not creating a stdin pipe in the first place.

On the other hand, when there is process_input, a retry is not safe: the input may have been completely or partially delivered to the subprocess, so we have to assume failure.

The proposed code implements this workaround.
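
To make the workaround concrete, here is a minimal sketch of the idea (not the actual oslo patch; the function name and argument handling are assumed for illustration):

import subprocess


def execute_sketch(cmd, process_input=None):
    # cmd: list of argument strings, e.g. ['chown', '-R', 'mongodb', '/var/lib/mongodb']
    # Only create a stdin pipe when there is actually input to send; with
    # no pipe there is nothing for _communicate() to flush, so the
    # "I/O operation on closed file" path cannot be reached.
    stdin = subprocess.PIPE if process_input is not None else None
    obj = subprocess.Popen(cmd,
                           stdin=stdin,
                           stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE)
    # If process_input was supplied, retrying after a failure here is not
    # safe: some or all of the input may already have been delivered.
    out, err = obj.communicate(process_input)
    return out, err, obj.returncode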

Revision history for this message
Amrith Kumar (amrith) wrote :

As an aside, this problem was exposed by a prior fix:

https://bugs.launchpad.net/nova/+bug/1271331

While the intent of that fix was good, it ran into an underlying issue in Python.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo-incubator (master)

Fix proposed to branch: master
Review: https://review.openstack.org/110933

Changed in oslo:
status: Triaged → In Progress
Revision history for this message
Amrith Kumar (amrith) wrote :

The backtrace you are likely to see will include

File "<path to>/processutils.py", line 186, in execute
   result = obj.communicate()
File "/usr/lib/python2.7/subprocess.py", line 799, in communicate
   return self._communicate(input)
File "/usr/lib/python2.7/subprocess.py", line 1396, in _communicate
   self.stdin.flush()
File "/opt/stack/trove/.tox/py27/local/lib/python2.7/site-packages/eventlet/greenio.py", line 419, in _operationOnClosedFile
   raise ValueError("I/O operation on closed file")
ValueError: I/O operation on closed file

Changed in trove:
assignee: nobody → Amrith (amrith)
summary: - Intermittent failure in unittest:
- GuestAgentMongoDBManagerTest.test_prepare_from_backup
+ intermittent failures in call to execute() with exception "ValueError:
+ I/O operation on closed file"
Changed in trove:
status: New → Triaged
status: Triaged → New
Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

How is this affecting neutron?

no longer affects: neutron
Changed in trove:
importance: Undecided → High
milestone: none → juno-3
Revision history for this message
Zhi Yan Liu (lzy-dev) wrote :

How is this affecting glance?

Revision history for this message
Amrith Kumar (amrith) wrote :

Both neutron and glance appear to be using processutils.execute, and this is a fix to that function.

Amrith Kumar (amrith)
no longer affects: cinder
no longer affects: glance
no longer affects: nova
Revision history for this message
Amrith Kumar (amrith) wrote :

After speaking with Yuriy (YorikSar) and Ihar (ihrachyshka), we're leaning towards the following hypothesis:

something is screwed up in the monkey patching.

We shouldn't be getting an exception thrown from os.read(), because it should have been monkey-patched to the green (eventlet) equivalent.

Yet flush() does appear to be monkey-patched.

----

Sticking this little block of code just above where the execute() call is made:

import traceback

import eventlet.patcher

if not eventlet.patcher.is_monkey_patched(all):
    f = open('/tmp/trace.txt', 'a+')
    traceback.print_stack(None, None, f)
    f.close()

showed that when the failure occurs, a backtrace is also produced.

I can run the failing tests dozens of times successfully and get no backtrace, but when I do get a failure, I also get a backtrace.

There's more to this than meets the eye, but we're leaning towards an issue with the way the monkey patching is working.

----

I tried a simple hack: force

eventlet.monkey_patch()

everywhere that an import eventlet is done.

With this in place, I can't get the failure to occur.
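
Concretely, the hack amounts to something like this wherever eventlet is imported (a sketch of what is described above, not a final fix):

import eventlet

# Patch the standard library (os, socket, thread, time, ...) with
# eventlet's cooperative equivalents before anything else captures
# references to the originals.
eventlet.monkey_patch()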

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to trove (master)

Fix proposed to branch: master
Review: https://review.openstack.org/112911

Changed in trove:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to trove (master)

Reviewed: https://review.openstack.org/112911
Committed: https://git.openstack.org/cgit/openstack/trove/commit/?id=f2efdcc29b080f9859d041ee1e87399f04086471
Submitter: Jenkins
Branch: master

commit f2efdcc29b080f9859d041ee1e87399f04086471
Author: Amrith Kumar <email address hidden>
Date: Fri Aug 8 08:01:19 2014 -0400

    Correct monkey patching in GuestAgentBackupTest

    This change addresses some intermittent failures in tests that we've
    been seeing recently.

    The failure would typically manifest itself in the
    test_prepare_from_backup test in test_mongodb_manager.py, but in
    theory it could occur elsewhere. The good thing about this change is
    that it makes the intermittent failures persistent. But thanks to
    the fix for bug 1353570 (change I22d839f), the failures won't happen
    any longer.

    This change, without the earlier change for bug 1353570, did in fact
    make the intermittent failures permanent. The issue is that
    GuestAgentBackupTest didn't unpatch execute_with_timeout when it was
    done, which meant that tests running after it ran in an environment
    where execute_with_timeout was still patched. But a test that didn't
    itself patch execute_with_timeout, and happened to run on a thread
    that hadn't run GuestAgentBackupTest, would get a bad result.

    The fix is to include code to save and restore execute_with_timeout in
    setUp and tearDown.

    Change-Id: I60187145d25524b4e7614d5d8cce9027521db44c
    Closes-Bug: #1347337
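
As an illustration of the save-and-restore pattern the commit message describes, it amounts to something like the following (a sketch only, not the actual Trove patch; the module path trove.common.utils comes from the traceback above):

import unittest

from trove.common import utils


class GuestAgentBackupTest(unittest.TestCase):

    def setUp(self):
        super(GuestAgentBackupTest, self).setUp()
        # Remember the real function so a version patched by this test
        # cannot leak into tests that run later on the same worker.
        self.orig_execute_with_timeout = utils.execute_with_timeout

    def tearDown(self):
        # Put the real function back, no matter what the test did.
        utils.execute_with_timeout = self.orig_execute_with_timeout
        super(GuestAgentBackupTest, self).tearDown()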

Changed in trove:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.concurrency (master)

Fix proposed to branch: master
Review: https://review.openstack.org/114664

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo-incubator (master)

Change abandoned by amrith (<email address hidden>) on branch: master
Review: https://review.openstack.org/110933
Reason: Resubmitted change on oslo.concurrency https://review.openstack.org/#/c/114664/

Thierry Carrez (ttx)
Changed in trove:
status: Fix Committed → Fix Released
affects: oslo-incubator → oslo.concurrency
Changed in oslo.concurrency:
milestone: none → juno-rc1
Changed in oslo.concurrency:
status: In Progress → Fix Committed
Changed in oslo.concurrency:
milestone: next-juno → next-kilo
Thierry Carrez (ttx)
Changed in trove:
milestone: juno-3 → 2014.2
Changed in oslo.concurrency:
status: Fix Committed → Fix Released