FTBFS: upstart test_job_process fails in majority of cases / Kernel returning unexpected EIO at end of file
| Affects | Status | Importance | Assigned to | Milestone | |
|---|---|---|---|---|---|
| | linux (Ubuntu) |
High
|
Andy Whitcroft | ||
| | Vivid |
High
|
Andy Whitcroft | ||
| | upstart (Ubuntu) |
High
|
James Hunt | ||
| | Vivid |
High
|
James Hunt | ||
Bug Description
As you see in the history in https:/
ok 53 - with single line command writing fast and exiting
not ok 54 - with single line command writing lots of data fast and exiting
wrong value for bytes, expected 3145728 got 3018027
at tests/test_
1..153
FAIL: test_job_process
Sometimes they succeed on one architecture, and with lots of luck on both, but this is way too flaky to be an useful test. This keeps blocking the propagation of other packages from -proposed.
This was most likely triggered by the kernel update from 3.18 to 3.19.
===
break-fix: 52bce7f8d4fc633
| Changed in upstart (Ubuntu): | |
| importance: | Undecided → High |
| summary: |
- test_job_process fails in majority of cases + FTBFS: test_job_process fails in majority of cases |
| tags: | added: autopkgtset qa-daily-testing |
| tags: |
added: autopkgtest removed: autopkgtset |
| James Hunt (jamesodhunt) wrote : | #2 |
'mk-sbuild vivid' currently gives an error:
You might want to run 'apt-get -f install' to correct these:
The following packages have unmet dependencies:
systemd-sysv : Conflicts: upstart but 1.13.2-0ubuntu9 is to be installed
E: Unmet dependencies. Try 'apt-get -f install' with no packages (or specify a solution).
| Martin Pitt (pitti) wrote : | #3 |
@James: that was due to a set of priority mismatches, these got fixed now. Try again?
| James Hunt (jamesodhunt) wrote : | #4 |
mk-sbuild now fails with a different error:
Setting up udev (219-4ubuntu3) ...
A chroot environment has been detected, udev not started.
/var/lib/
dpkg: error processing package udev (--configure):
subprocess installed post-installation script returned error exit status 127
Errors were encountered while processing:
udev
E: Sub-process /usr/bin/dpkg returned an error code (1)
But back to the test failure... this test has been running successfully on all platforms for a long time (3+ years). It's possible that the test itself is racy and that a recent change in a different package has exposed that. Do we know when this test first started failing (and ideally what changes triggered it (kernel/libc))?
| Martin Pitt (pitti) wrote : | #5 |
It started failing around Feb 27. It roughly coincides with https:/
| description: | updated |
| James Hunt (jamesodhunt) wrote : | #6 |
It does indeed seem to be related to the kernel. I've just run that specific test 77 times on an amd64 system running a 3.18.0-11 kernel with no failures. Once the broken deps issue is resolved, I'll retest on the same system running the latest 3.19 kernel.
| James Hunt (jamesodhunt) wrote : | #7 |
Updating the amd64 system, I see the failure on the 2nd test run.
The Upstart logging code doesn't seem to be the culprit fwics meaning the most likely area is the pty-handling code (either in upstart or in the kernel).
| James Hunt (jamesodhunt) wrote : | #8 |
The "good" news is that this test failure is a real bug: running 1024 job instances of what is effectively the failing tests shows...
- 3.18 kernel: 0 job instances have incorrect size.
- 3.19 kernel: 848 job instances have the **incorrect** size (!!)
Figures are from a fully-updated amd64 vivid system booted to use Upstart and simply toggling the kernel being used.
I've tried to recreate the failure using script(1) and dd(1), but so far this has proven unsuccessful.
| James Hunt (jamesodhunt) wrote : | #9 |
Some quick tests suggest that this issue only seems to affect long runs of nul bytes (even if they are terminated by visible chars + a '\n').
| Changed in upstart (Ubuntu Vivid): | |
| assignee: | nobody → James Hunt (jamesodhunt) |
| Changed in linux (Ubuntu Vivid): | |
| importance: | Undecided → High |
| status: | New → Confirmed |
| assignee: | nobody → Andy Whitcroft (apw) |
| milestone: | none → ubuntu-15.03 |
| Changed in linux (Ubuntu Vivid): | |
| status: | Confirmed → In Progress |
| Andy Whitcroft (apw) wrote : | #10 |
Ok, after a lot of testing and reading of the traces we are seeing a spurious EIO reported to the read at the end of the file. This in combination with the way nih handles IO means we lose the last buffer of the file at times. From the strace log:
1396 read(17, 0x7f78214c3b30, 4096) = -1 EIO (Input/output error)
Looking through the v3.19 changelogs the following commit looks suspicious:
commit 52bce7f8d4fc633
Author: Peter Hurley <email address hidden>
Date: Wed Nov 5 12:13:05 2014 -0500
pty, n_tty: Simplify input processing on final close
When releasing one end of a pty pair, that end may just have written
to the other, which the input processing worker, flush_to_ldisc(), is
still working on but has not completed the copy to the other end's
read buffer. So input may not appear to be available to a waiting
reader but yet TTY_OTHER_CLOSED is now observed. The n_tty line
discipline has worked around this by waiting for input processing
to complete and then re-checking if input is available before
exiting with -EIO.
Since the tty/ldisc lock reordering, the wait for input processing
to complete can now occur during final close before setting
TTY_
see input available (if any) before observing TTY_OTHER_CLOSED.
Reviewed-by: Alan Cox <email address hidden>
Signed-off-by: Peter Hurley <email address hidden>
Signed-off-by: Greg Kroah-Hartman <email address hidden>
Indeed reverting this change seems to eliminate these symtoms.
| summary: |
- FTBFS: test_job_process fails in majority of cases + FTBFS: upstart test_job_process fails in majority of cases |
| summary: |
- FTBFS: upstart test_job_process fails in majority of cases + FTBFS: upstart test_job_process fails in majority of cases / Kernel + returning unexpected EIO at near of file |
| Andy Whitcroft (apw) wrote : Re: FTBFS: upstart test_job_process fails in majority of cases / Kernel returning unexpected EIO at near of file | #11 |
Ok here are some test kernels with this commit reverted (the was a small conflict). Could you test these and confirm they fix the issue for you as well:
| Andy Whitcroft (apw) wrote : | #12 |
For clarity this is triggered by the kernel sometimes exposing an EIO where we are expecting EOF. This tickles an issue with libnih which accumulates input in a read until EAGAIN loop, and dumps (arguably erroneously) its entire pending buffer when it sees an unexpected error.
| James Hunt (jamesodhunt) wrote : | #13 |
I've written a basic program that surfaces this bug. To recreate:
$ cd /tmp
$ tar xvf bug-1429756.tgz
$ cd bug-1429756
$ make
$ ./bug-1429756.sh
loop 1
loop 2
ERROR: got failure scenario after 2 loops (got=16, count=17, log=/tmp/
$
| James Hunt (jamesodhunt) wrote : | #14 |
Running with a 3.19.0-8-generic kernel I see:
- The Upstart test_job_process test fail within a few seconds when run in a loop looking for the errror.
- My test program in #13 also finds the bug within ~30 seconds on my test system.
Using the kernel in #11, I have run both tests simultaneously for ~1 hour with no errors observed.
| Changed in upstart (Ubuntu Vivid): | |
| status: | New → Invalid |
| summary: |
FTBFS: upstart test_job_process fails in majority of cases / Kernel - returning unexpected EIO at near of file + returning unexpected EIO at end of file |
| James Hunt (jamesodhunt) wrote : | #15 |
LKML thread: https:/
| James Hunt (jamesodhunt) wrote : | #16 |
Hi Andy - any progress on this? Although the buildd's are running older kernels, it seems that the DEP-8 Jenkins environment is running 3.19 and is thus causing upstart tests to fail. For example:
If it will take a while to resolve this, I guess we'll have to consider disabling the test for now. However, doing so is going to change the behaviour of the upstart logger.
| description: | updated |
| tags: | added: kernel-bug-break-fix |
| description: | updated |
| Changed in linux (Ubuntu Vivid): | |
| status: | In Progress → Confirmed |
| Changed in linux (Ubuntu): | |
| status: | In Progress → Confirmed |
| Launchpad Janitor (janitor) wrote : | #17 |
This bug was fixed in the package linux - 4.0.0-4.6
---------------
linux (4.0.0-4.6) wily; urgency=low
[ Andy Whitcroft ]
* Release Tracking Bug
- LP: #1470233
* rebase to mainline v4.0.7
[ Jay Vosburgh ]
* SAUCE: fan: Proof of concept implementation (v2)
- LP: #1439706
* SAUCE: fan: tunnel multiple mapping mode (v3)
- LP: #1470091
[ Upstream Kernel Changes ]
* rebase to v4.0.7
- LP: #1427680
- LP: #1462614
-- Andy Whitcroft <email address hidden> Tue, 30 Jun 2015 16:55:32 +0100
| Changed in linux (Ubuntu): | |
| status: | Confirmed → Fix Released |
| Brad Figg (brad-figg) wrote : | #18 |
This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-
If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.
See https:/
| tags: | added: verification-needed-vivid |
| tags: |
added: verification-done-vivid removed: verification-needed-vivid |
| Launchpad Janitor (janitor) wrote : | #19 |
This bug was fixed in the package linux - 3.19.0-23.24
---------------
linux (3.19.0-23.24) vivid; urgency=low
[ Luis Henriques ]
* Release Tracking Bug
- LP: #1472346
[ Chris J Arges ]
* SAUCE: Don't use atomic read in evlist.c
- LP: #1410673
linux (3.19.0-23.23) vivid; urgency=low
[ Brad Figg ]
* Release Tracking Bug
- LP: #1472048
[ Chris J Arges ]
* [Config] Add CRYPTO_DEV_NX_*, 842_* as modules
- LP: #1454687
[ Lu, Han ]
* SAUCE: i915_bpo: drm/i915/audio: add codec wakeup override
enabled/disable callback
- LP: #1460674
[ Timo Aaltonen ]
* SAUCE: Backport I915_OVERLAY_
- LP: #1460674
* SAUCE: i915_bpo: Rebase to drm-intel-
- LP: #1460674
* SAUCE: i915_bpo: Revert "drm/i915: Implement the intel_dp_
function for DP EDID complaince tests"
- LP: #1460674
* SAUCE: i915_bpo: Revert "drm/i915: Add debugfs test control files for
Displayport compliance testing"
- LP: #1460674
* SAUCE: Load i915_bpo from the hda driver on SKL/CHV
- LP: #1460674
* SAUCE: i915_bpo: Don't try to support BXT
- LP: #1460674
* SAUCE: i915_bpo: drm/i915/skl: Fix DMC API version.
[ Upstream Kernel Changes ]
* Revert "usb: dwc2: add bus suspend/resume for dwc2"
- LP: #1471252
* Revert "HID: logitech-hidpp: support combo keyboard touchpad TK820"
- LP: #1471252
* Revert "KVM: x86: drop fpu_activate hook"
- LP: #1471252
* Revert "libceph: clear r_req_lru_item in __unregister_
- LP: #1471252
* drm/i915: add component support
- LP: #1460661
* ALSA: hda: export struct hda_intel
- LP: #1460661
* ALSA: hda: pass intel_hda to all i915 interface functions
- LP: #1460661
* ALSA: hda: add component support
- LP: #1460661
* drm/atomic-helpers: Fix documentation typos and wrong copy&paste
- LP: #1460674
* drm/atomic: Rename drm_atomic_
- LP: #1460674
* drm/atomic-helper: Rename commmit_
- LP: #1460674
* drm/atomic-helpers: make mode_set hooks optional
- LP: #1460674
* drm/atomic-helper: Fix kerneldoc for prepare_planes
- LP: #1460674
* drm: Complete moving rotation property to core
- LP: #1460674
* drm: Share plane pixel format check code between legacy and atomic
- LP: #1460674
* drm/atomic: Constify a bunch of functions pointer structs
- LP: #1460674
* drm: Fix some typo mistake of the annotations
- LP: #1460674
* drm: change connector to tmp_connector
- LP: #1460674
* drm: atomic: Expose CRTC active property
- LP: #1460674
* drm: atomic: Allow setting CRTC active property
- LP: #1460674
* drm/atomic-helpers: Properly avoid full modeset dance
- LP: #1460674
* drm/atomic: Add helpers for state-subclassing drivers
- LP: #1460674
* drm: Fix some typos
- LP: #1460674
* drm/atomic: Add for_each_
- LP: #1460674
* drm/atomic-helper: Don't call atomic_update_plane when it stays off
- LP: #1460674
* drm/atomic-helper: Really recover pre-atomic plane/cursor behavior
...
| Changed in linux (Ubuntu Vivid): | |
| status: | Confirmed → Fix Released |
| tags: | removed: kernel-bug-break-fix |


This is also reproducible on a package build in sbuild locally.