intermittent failures in call to execute() with exception "ValueError: I/O operation on closed file"

Bug #1347337 reported by Nikhil Manchanda
This bug affects 2 people

Affects                  Status        Importance  Assigned to   Milestone
OpenStack DBaaS (Trove)  Fix Released  High        Amrith Kumar
oslo.concurrency         Fix Released  Low         Amrith Kumar

Bug Description

This happened just once, so I'm opening a tracking bug to see if we hit this again.
The details of the run:

2014-07-22 23:41:56.663 | py27 runtests: commands[1] | python setup.py testr --slowest
2014-07-22 23:41:57.027 | running testr
2014-07-22 23:41:59.231 | No handlers could be found for logger "trove.guestagent.backup.backupagent"
2014-07-22 23:41:59.397 | No handlers could be found for logger "trove.guestagent.backup.backupagent"
2014-07-22 23:41:59.906 | No handlers could be found for logger "trove.guestagent.backup.backupagent"
2014-07-22 23:42:00.257 | No handlers could be found for logger "trove.conductor.manager"
2014-07-22 23:42:01.015 | No handlers could be found for logger "trove.guestagent.api"
2014-07-22 23:42:01.695 | No handlers could be found for logger "trove.guestagent.backup.backupagent"
2014-07-22 23:42:02.514 | No handlers could be found for logger "trove.guestagent.backup.backupagent"
2014-07-22 23:42:03.094 | No handlers could be found for logger "trove.guestagent.strategies.storage.swift"
2014-07-22 23:42:13.834 | running=${PYTHON:-python} -m subunit.run discover ./trove/tests/unittests --list
2014-07-22 23:42:13.834 | running=${PYTHON:-python} -m subunit.run discover ./trove/tests/unittests --load-list /tmp/tmp.Ss0It4yOgb/tmpVCBlup
2014-07-22 23:42:13.834 | running=${PYTHON:-python} -m subunit.run discover ./trove/tests/unittests --load-list /tmp/tmp.Ss0It4yOgb/tmpeEa7fi
2014-07-22 23:42:13.834 | running=${PYTHON:-python} -m subunit.run discover ./trove/tests/unittests --load-list /tmp/tmp.Ss0It4yOgb/tmpEudlCu
2014-07-22 23:42:13.834 | running=${PYTHON:-python} -m subunit.run discover ./trove/tests/unittests --load-list /tmp/tmp.Ss0It4yOgb/tmpIVQ9rl
2014-07-22 23:42:13.834 | running=${PYTHON:-python} -m subunit.run discover ./trove/tests/unittests --load-list /tmp/tmp.Ss0It4yOgb/tmpROt36p
2014-07-22 23:42:13.834 | running=${PYTHON:-python} -m subunit.run discover ./trove/tests/unittests --load-list /tmp/tmp.Ss0It4yOgb/tmpmDS72J
2014-07-22 23:42:13.834 | running=${PYTHON:-python} -m subunit.run discover ./trove/tests/unittests --load-list /tmp/tmp.Ss0It4yOgb/tmpDaJo98
2014-07-22 23:42:13.834 | running=${PYTHON:-python} -m subunit.run discover ./trove/tests/unittests --load-list /tmp/tmp.Ss0It4yOgb/tmpMMzRbw
2014-07-22 23:42:13.834 | ======================================================================
2014-07-22 23:42:13.834 | FAIL: guestagent.test_mongodb_manager.GuestAgentMongoDBManagerTest.test_prepare_from_backup
2014-07-22 23:42:13.835 | tags: worker-5
2014-07-22 23:42:13.835 | ----------------------------------------------------------------------
2014-07-22 23:42:13.835 | Traceback (most recent call last):
2014-07-22 23:42:13.835 | File "/home/jenkins/workspace/gate-trove-python27/trove/tests/unittests/guestagent/test_mongodb_manager.py", line 58, in test_prepare_from_backup
2014-07-22 23:42:13.835 | self._prepare_dynamic(backup_id='backup_id_123abc')
2014-07-22 23:42:13.835 | File "/home/jenkins/workspace/gate-trove-python27/trove/tests/unittests/guestagent/test_mongodb_manager.py", line 91, in _prepare_dynamic
2014-07-22 23:42:13.835 | backup_info=backup_info)
2014-07-22 23:42:13.835 | File "trove/guestagent/datastore/mongodb/manager.py", line 66, in prepare
2014-07-22 23:42:13.835 | operating_system.update_owner('mongodb', 'mongodb', mount_point)
2014-07-22 23:42:13.835 | File "trove/guestagent/common/operating_system.py", line 109, in update_owner
2014-07-22 23:42:13.835 | run_as_root=True, root_helper="sudo")
2014-07-22 23:42:13.836 | File "trove/common/utils.py", line 278, in execute_with_timeout
2014-07-22 23:42:13.836 | return execute(*args, **kwargs)
2014-07-22 23:42:13.836 | File "trove/openstack/common/processutils.py", line 186, in execute
2014-07-22 23:42:13.836 | result = obj.communicate()
2014-07-22 23:42:13.836 | File "/usr/lib/python2.7/subprocess.py", line 799, in communicate
2014-07-22 23:42:13.836 | return self._communicate(input)
2014-07-22 23:42:13.836 | File "/usr/lib/python2.7/subprocess.py", line 1396, in _communicate
2014-07-22 23:42:13.836 | self.stdin.flush()
2014-07-22 23:42:13.836 | File "/home/jenkins/workspace/gate-trove-python27/.tox/py27/local/lib/python2.7/site-packages/eventlet/greenio.py", line 419, in _operationOnClosedFile
2014-07-22 23:42:13.836 | raise ValueError("I/O operation on closed file")
2014-07-22 23:42:13.836 | ValueError: I/O operation on closed file
2014-07-22 23:42:13.836 | ======================================================================
2014-07-22 23:42:13.837 | FAIL: process-returncode
2014-07-22 23:42:13.837 | tags: worker-5
2014-07-22 23:42:13.837 | ----------------------------------------------------------------------
2014-07-22 23:42:13.837 | Binary content:
2014-07-22 23:42:13.837 | traceback (test/plain; charset="utf8")
2014-07-22 23:42:13.837 | Ran 712 tests in 14.846s
2014-07-22 23:42:13.837 | FAILED (id=0, failures=2, skips=1)
2014-07-22 23:42:13.838 | error: testr failed (1)
2014-07-22 23:42:13.862 | ERROR: InvocationError: '/home/jenkins/workspace/gate-trove-python27/.tox/py27/bin/python setup.py testr --slowest'
2014-07-22 23:42:13.862 | ___________________________________ summary ____________________________________
2014-07-22 23:42:13.862 | ERROR: py27: commands failed
2014-07-22 23:42:13.870 | + result=1

Changed in trove:
importance: Undecided → Low
status: New → Triaged
Revision history for this message
Amrith Kumar (amrith) wrote :

I hit this too ...

http://logs.openstack.org/95/107795/6/check/gate-trove-python27/558a25e/console.html

It happens to be a failure on a line that was changed in

https://review.openstack.org/#/c/107542/2/trove/guestagent/datastore/mongodb/manager.py,cm

I am NOT claiming that this is the cause; just that I hit it right after merging that change into something I was working on.

And I'm hitting it reasonably regularly (read: annoyingly regularly).

Revision history for this message
Amrith Kumar (amrith) wrote :

I'm assigning this to myself after speaking with Nikhil.

Changed in trove:
assignee: nobody → Amrith (amrith)
Amrith Kumar (amrith)
affects: trove → oslo
Revision history for this message
Amrith Kumar (amrith) wrote :

Submitting a change shortly; the full explanation of this failure follows. The underlying issue is a problem in Python, for which an upstream bug has been reported (below). A partial workaround for the Python issue will be proposed for oslo.

http://bugs.python.org/issue22114

The failure happens because:

1. We create a stdin pipe (subprocess.PIPE) in the call to Popen even when there is no process_input.
2. The first call to communicate() fails with EAGAIN after code in _communicate() has already closed stdin (because there is no input to send).
3. A second call to communicate() goes down the same path and attempts to flush the now-closed stdin.

The Python bug is that after closing stdin, _communicate() doesn't mark it as closed (i.e. set it to None).

The consequence is that when there is no process_input, we can avoid the problem entirely by not creating a stdin pipe in the first place.

On the other hand, when there is process_input, a retry is not safe: the input may have been completely or partially delivered to the subprocess, so we have to assume failure.

The proposed code implements this workaround.
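
To make the workaround concrete, here is a minimal sketch of the idea (not the actual oslo patch; the function name and argument handling are assumed for illustration):

import subprocess


def execute_sketch(cmd, process_input=None):
    # cmd: list of argument strings, e.g. ['chown', '-R', 'mongodb', '/var/lib/mongodb']
    # Only create a stdin pipe when there is actually input to send; with
    # no pipe there is nothing for _communicate() to flush, so the
    # "I/O operation on closed file" path cannot be reached.
    stdin = subprocess.PIPE if process_input is not None else None
    obj = subprocess.Popen(cmd,
                           stdin=stdin,
                           stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE)
    # If process_input was supplied, retrying after a failure here is not
    # safe: some or all of the input may already have been delivered.
    out, err = obj.communicate(process_input)
    return out, err, obj.returncode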

Revision history for this message
Amrith Kumar (amrith) wrote :

As an aside, this problem was exposed by a prior fix:

https://bugs.launchpad.net/nova/+bug/1271331

While the intent of that fix was good, it ran into an underlying issue in Python.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo-incubator (master)

Fix proposed to branch: master
Review: https://review.openstack.org/110933

Changed in oslo:
status: Triaged → In Progress
Revision history for this message
Amrith Kumar (amrith) wrote :

The backtrace you are likely to see will include

File "<path to>/processutils.py", line 186, in execute
   result = obj.communicate()
File "/usr/lib/python2.7/subprocess.py", line 799, in communicate
   return self._communicate(input)
File "/usr/lib/python2.7/subprocess.py", line 1396, in _communicate
   self.stdin.flush()
File "/opt/stack/trove/.tox/py27/local/lib/python2.7/site-packages/eventlet/greenio.py", line 419, in _operationOnClosedFile
   raise ValueError("I/O operation on closed file")
ValueError: I/O operation on closed file

Changed in trove:
assignee: nobody → Amrith (amrith)
summary: - Intermittent failure in unittest:
- GuestAgentMongoDBManagerTest.test_prepare_from_backup
+ intermittent failures in call to execute() with exception "ValueError:
+ I/O operation on closed file"
Changed in trove:
status: New → Triaged
status: Triaged → New
Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

How is this affecting neutron?

no longer affects: neutron
Changed in trove:
importance: Undecided → High
milestone: none → juno-3
Revision history for this message
Zhi Yan Liu (lzy-dev) wrote :

How is this affecting glance?

Revision history for this message
Amrith Kumar (amrith) wrote :

Both neutron and glance appear to be using processutils.execute, and this is a fix to that function.

Amrith Kumar (amrith)
no longer affects: cinder
no longer affects: glance
no longer affects: nova
Revision history for this message
Amrith Kumar (amrith) wrote :

After speaking with Yuriy (YorikSar) and Ihar (ihrachyshka), we're leaning towards the following hypothesis:

something is screwed up in the monkey patching.

We shouldn't be getting an exception thrown from os.read(), because it should have been monkey-patched to the green (eventlet) equivalent.

Yet flush() does appear to be monkey-patched.

----

Sticking this little block of code just above where the execute() call is made:

import traceback

import eventlet.patcher

if not eventlet.patcher.is_monkey_patched(all):
    f = open('/tmp/trace.txt', 'a+')
    traceback.print_stack(None, None, f)
    f.close()

showed that when the failure occurs, a backtrace is also produced.

I can run the failing tests dozens of times successfully and get no backtrace, but when I do get a failure, I also get a backtrace.

There's more to this than meets the eye, but we're leaning towards an issue with the way the monkey patching is working.

----

I tried a simple hack: force

eventlet.monkey_patch()

everywhere that an import eventlet is done.

With this in place, I can't get the failure to occur.
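
Concretely, the hack amounts to something like this wherever eventlet is imported (a sketch of what is described above, not a final fix):

import eventlet

# Patch the standard library (os, socket, thread, time, ...) with
# eventlet's cooperative equivalents before anything else captures
# references to the originals.
eventlet.monkey_patch()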

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to trove (master)

Fix proposed to branch: master
Review: https://review.openstack.org/112911

Changed in trove:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to trove (master)

Reviewed: https://review.openstack.org/112911
Committed: https://git.openstack.org/cgit/openstack/trove/commit/?id=f2efdcc29b080f9859d041ee1e87399f04086471
Submitter: Jenkins
Branch: master

commit f2efdcc29b080f9859d041ee1e87399f04086471
Author: Amrith Kumar <email address hidden>
Date: Fri Aug 8 08:01:19 2014 -0400

    Correct monkey patching in GuestAgentBackupTest

    This change addresses some intermittent failures in tests that we've
    been seeing recently.

    The failure would typically manifest itself in the
    test_prepare_from_backup test in test_mongodb_manager.py, but in
    theory it could occur elsewhere. The good thing about this change is
    that it makes the intermittent failures persistent. But thanks to
    the fix for bug 1353570 (change I22d839f), the failures won't happen
    any longer.

    This change, without the earlier change for bug 1353570, did in fact
    make the intermittent failures permanent. The issue is that
    GuestAgentBackupTest didn't unpatch execute_with_timeout when it was
    done, which meant that tests running after it ran in an environment
    where execute_with_timeout was still patched. But a test that didn't
    itself patch execute_with_timeout, and happened to run on a thread
    that hadn't run GuestAgentBackupTest, would get a bad result.

    The fix is to include code to save and restore execute_with_timeout in
    setUp and tearDown.

    Change-Id: I60187145d25524b4e7614d5d8cce9027521db44c
    Closes-Bug: #1347337
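
As an illustration of the save-and-restore pattern the commit message describes, it amounts to something like the following (a sketch only, not the actual Trove patch; the module path trove.common.utils comes from the traceback above):

import unittest

from trove.common import utils


class GuestAgentBackupTest(unittest.TestCase):

    def setUp(self):
        super(GuestAgentBackupTest, self).setUp()
        # Remember the real function so a version patched by this test
        # cannot leak into tests that run later on the same worker.
        self.orig_execute_with_timeout = utils.execute_with_timeout

    def tearDown(self):
        # Put the real function back, no matter what the test did.
        utils.execute_with_timeout = self.orig_execute_with_timeout
        super(GuestAgentBackupTest, self).tearDown()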

Changed in trove:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.concurrency (master)

Fix proposed to branch: master
Review: https://review.openstack.org/114664

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo-incubator (master)

Change abandoned by amrith (<email address hidden>) on branch: master
Review: https://review.openstack.org/110933
Reason: Resubmitted change on oslo.concurrency https://review.openstack.org/#/c/114664/

Thierry Carrez (ttx)
Changed in trove:
status: Fix Committed → Fix Released
affects: oslo-incubator → oslo.concurrency
Changed in oslo.concurrency:
milestone: none → juno-rc1
Changed in oslo.concurrency:
status: In Progress → Fix Committed
Changed in oslo.concurrency:
milestone: next-juno → next-kilo
Thierry Carrez (ttx)
Changed in trove:
milestone: juno-3 → 2014.2
Changed in oslo.concurrency:
status: Fix Committed → Fix Released