FTBFS: upstart test_job_process fails in majority of cases / Kernel returning unexpected EIO at end of file

Bug #1429756 reported by Martin Pitt
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Fix Released
High
Andy Whitcroft
Vivid
Fix Released
High
Andy Whitcroft
upstart (Ubuntu)
Invalid
High
James Hunt
Vivid
Invalid
High
James Hunt

Bug Description

As you see in the history in https://jenkins.qa.ubuntu.com/job/vivid-adt-upstart/?, upstart's test fail almost all the time, on

ok 53 - with single line command writing fast and exiting
not ok 54 - with single line command writing lots of data fast and exiting
 wrong value for bytes, expected 3145728 got 3018027
 at tests/test_job_process.c:3886 (test_start).
1..153
FAIL: test_job_process

Sometimes they succeed on one architecture, and with lots of luck on both, but this is way too flaky to be an useful test. This keeps blocking the propagation of other packages from -proposed.

This was most likely triggered by the kernel update from 3.18 to 3.19.

===
break-fix: 52bce7f8d4fc633c9a9d0646eef58ba6ae9a3b73 1a48632ffed61352a7810ce089dc5a8bcd505a60

Martin Pitt (pitti)
Changed in upstart (Ubuntu):
importance: Undecided → High
Revision history for this message
Martin Pitt (pitti) wrote : Re: FTBFS: test_job_process fails in majority of cases

This is also reproducible on a package build in sbuild locally.

summary: - test_job_process fails in majority of cases
+ FTBFS: test_job_process fails in majority of cases
tags: added: autopkgtset qa-daily-testing
tags: added: autopkgtest
removed: autopkgtset
Revision history for this message
James Hunt (jamesodhunt) wrote :

'mk-sbuild vivid' currently gives an error:

You might want to run 'apt-get -f install' to correct these:
The following packages have unmet dependencies:
 systemd-sysv : Conflicts: upstart but 1.13.2-0ubuntu9 is to be installed
E: Unmet dependencies. Try 'apt-get -f install' with no packages (or specify a solution).

Revision history for this message
Martin Pitt (pitti) wrote :

@James: that was due to a set of priority mismatches, these got fixed now. Try again?

Revision history for this message
James Hunt (jamesodhunt) wrote :

mk-sbuild now fails with a different error:

Setting up udev (219-4ubuntu3) ...
A chroot environment has been detected, udev not started.
/var/lib/dpkg/info/udev.postinst: 35: /var/lib/dpkg/info/udev.postinst: update-initramfs: not found
dpkg: error processing package udev (--configure):
 subprocess installed post-installation script returned error exit status 127
Errors were encountered while processing:
 udev
E: Sub-process /usr/bin/dpkg returned an error code (1)

But back to the test failure... this test has been running successfully on all platforms for a long time (3+ years). It's possible that the test itself is racy and that a recent change in a different package has exposed that. Do we know when this test first started failing (and ideally what changes triggered it (kernel/libc))?

Revision history for this message
Martin Pitt (pitti) wrote :

It started failing around Feb 27. It roughly coincides with https://launchpad.net/ubuntu/+source/glibc/2.19-15ubuntu2 (but that looks unlikely), the most likely candidate is the upgrade of linux from 3.18 to 3.19: https://launchpad.net/ubuntu/+source/linux/3.19.0-7.7

description: updated
Revision history for this message
James Hunt (jamesodhunt) wrote :

It does indeed seem to be related to the kernel. I've just run that specific test 77 times on an amd64 system running a 3.18.0-11 kernel with no failures. Once the broken deps issue is resolved, I'll retest on the same system running the latest 3.19 kernel.

Revision history for this message
James Hunt (jamesodhunt) wrote :

Updating the amd64 system, I see the failure on the 2nd test run.

The Upstart logging code doesn't seem to be the culprit fwics meaning the most likely area is the pty-handling code (either in upstart or in the kernel).

Revision history for this message
James Hunt (jamesodhunt) wrote :

The "good" news is that this test failure is a real bug: running 1024 job instances of what is effectively the failing tests shows...

- 3.18 kernel: 0 job instances have incorrect size.
- 3.19 kernel: 848 job instances have the **incorrect** size (!!)

Figures are from a fully-updated amd64 vivid system booted to use Upstart and simply toggling the kernel being used.

I've tried to recreate the failure using script(1) and dd(1), but so far this has proven unsuccessful.

Revision history for this message
James Hunt (jamesodhunt) wrote :

Some quick tests suggest that this issue only seems to affect long runs of nul bytes (even if they are terminated by visible chars + a '\n').

James Hunt (jamesodhunt)
Changed in upstart (Ubuntu Vivid):
assignee: nobody → James Hunt (jamesodhunt)
Andy Whitcroft (apw)
Changed in linux (Ubuntu Vivid):
importance: Undecided → High
status: New → Confirmed
assignee: nobody → Andy Whitcroft (apw)
milestone: none → ubuntu-15.03
Andy Whitcroft (apw)
Changed in linux (Ubuntu Vivid):
status: Confirmed → In Progress
Revision history for this message
Andy Whitcroft (apw) wrote :

Ok, after a lot of testing and reading of the traces we are seeing a spurious EIO reported to the read at the end of the file. This in combination with the way nih handles IO means we lose the last buffer of the file at times. From the strace log:

    1396 read(17, 0x7f78214c3b30, 4096) = -1 EIO (Input/output error)

Looking through the v3.19 changelogs the following commit looks suspicious:

  commit 52bce7f8d4fc633c9a9d0646eef58ba6ae9a3b73
  Author: Peter Hurley <email address hidden>
  Date: Wed Nov 5 12:13:05 2014 -0500

    pty, n_tty: Simplify input processing on final close

    When releasing one end of a pty pair, that end may just have written
    to the other, which the input processing worker, flush_to_ldisc(), is
    still working on but has not completed the copy to the other end's
    read buffer. So input may not appear to be available to a waiting
    reader but yet TTY_OTHER_CLOSED is now observed. The n_tty line
    discipline has worked around this by waiting for input processing
    to complete and then re-checking if input is available before
    exiting with -EIO.

    Since the tty/ldisc lock reordering, the wait for input processing
    to complete can now occur during final close before setting
    TTY_OTHER_CLOSED. In this way, a waiting reader is guaranteed to
    see input available (if any) before observing TTY_OTHER_CLOSED.

    Reviewed-by: Alan Cox <email address hidden>
    Signed-off-by: Peter Hurley <email address hidden>
    Signed-off-by: Greg Kroah-Hartman <email address hidden>

Indeed reverting this change seems to eliminate these symtoms.

summary: - FTBFS: test_job_process fails in majority of cases
+ FTBFS: upstart test_job_process fails in majority of cases
summary: - FTBFS: upstart test_job_process fails in majority of cases
+ FTBFS: upstart test_job_process fails in majority of cases / Kernel
+ returning unexpected EIO at near of file
Revision history for this message
Andy Whitcroft (apw) wrote : Re: FTBFS: upstart test_job_process fails in majority of cases / Kernel returning unexpected EIO at near of file

Ok here are some test kernels with this commit reverted (the was a small conflict). Could you test these and confirm they fix the issue for you as well:

    http://people.canonical.com/~apw/lp1429756-vivid/

Revision history for this message
Andy Whitcroft (apw) wrote :

For clarity this is triggered by the kernel sometimes exposing an EIO where we are expecting EOF. This tickles an issue with libnih which accumulates input in a read until EAGAIN loop, and dumps (arguably erroneously) its entire pending buffer when it sees an unexpected error.

Revision history for this message
James Hunt (jamesodhunt) wrote :

I've written a basic program that surfaces this bug. To recreate:

$ cd /tmp
$ tar xvf bug-1429756.tgz
$ cd bug-1429756
$ make
$ ./bug-1429756.sh
loop 1
loop 2
ERROR: got failure scenario after 2 loops (got=16, count=17, log=/tmp/tmp.uwZHkHh0mc)
$

Revision history for this message
James Hunt (jamesodhunt) wrote :

Running with a 3.19.0-8-generic kernel I see:

- The Upstart test_job_process test fail within a few seconds when run in a loop looking for the errror.
- My test program in #13 also finds the bug within ~30 seconds on my test system.

Using the kernel in #11, I have run both tests simultaneously for ~1 hour with no errors observed.

Changed in upstart (Ubuntu Vivid):
status: New → Invalid
summary: FTBFS: upstart test_job_process fails in majority of cases / Kernel
- returning unexpected EIO at near of file
+ returning unexpected EIO at end of file
Revision history for this message
James Hunt (jamesodhunt) wrote :
Revision history for this message
James Hunt (jamesodhunt) wrote :

Hi Andy - any progress on this? Although the buildd's are running older kernels, it seems that the DEP-8 Jenkins environment is running 3.19 and is thus causing upstart tests to fail. For example:

https://jenkins.qa.ubuntu.com/view/Wily/view/AutoPkgTest/job/wily-adt-upstart/ARCH=amd64,label=adt/lastBuild/

If it will take a while to resolve this, I guess we'll have to consider disabling the test for now. However, doing so is going to change the behaviour of the upstart logger.

Andy Whitcroft (apw)
description: updated
tags: added: kernel-bug-break-fix
Andy Whitcroft (apw)
description: updated
Andy Whitcroft (apw)
Changed in linux (Ubuntu Vivid):
status: In Progress → Confirmed
Changed in linux (Ubuntu):
status: In Progress → Confirmed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.0.0-4.6

---------------
linux (4.0.0-4.6) wily; urgency=low

  [ Andy Whitcroft ]

  * Release Tracking Bug
    - LP: #1470233
  * rebase to mainline v4.0.7

  [ Jay Vosburgh ]

  * SAUCE: fan: Proof of concept implementation (v2)
    - LP: #1439706
  * SAUCE: fan: tunnel multiple mapping mode (v3)
    - LP: #1470091

  [ Upstream Kernel Changes ]

  * rebase to v4.0.7
    - LP: #1427680
    - LP: #1462614

 -- Andy Whitcroft <email address hidden> Tue, 30 Jun 2015 16:55:32 +0100

Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Revision history for this message
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-vivid' to 'verification-done-vivid'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-vivid
Brad Figg (brad-figg)
tags: added: verification-done-vivid
removed: verification-needed-vivid
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (17.6 KiB)

This bug was fixed in the package linux - 3.19.0-23.24

---------------
linux (3.19.0-23.24) vivid; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1472346

  [ Chris J Arges ]

  * SAUCE: Don't use atomic read in evlist.c
    - LP: #1410673

linux (3.19.0-23.23) vivid; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #1472048

  [ Chris J Arges ]

  * [Config] Add CRYPTO_DEV_NX_*, 842_* as modules
    - LP: #1454687

  [ Lu, Han ]

  * SAUCE: i915_bpo: drm/i915/audio: add codec wakeup override
    enabled/disable callback
    - LP: #1460674

  [ Timo Aaltonen ]

  * SAUCE: Backport I915_OVERLAY_DISABLE_DEST_COLORKEY
    - LP: #1460674
  * SAUCE: i915_bpo: Rebase to drm-intel-next-fixes-2015-05-29
    - LP: #1460674
  * SAUCE: i915_bpo: Revert "drm/i915: Implement the intel_dp_autotest_edid
    function for DP EDID complaince tests"
    - LP: #1460674
  * SAUCE: i915_bpo: Revert "drm/i915: Add debugfs test control files for
    Displayport compliance testing"
    - LP: #1460674
  * SAUCE: Load i915_bpo from the hda driver on SKL/CHV
    - LP: #1460674
  * SAUCE: i915_bpo: Don't try to support BXT
    - LP: #1460674
  * SAUCE: i915_bpo: drm/i915/skl: Fix DMC API version.

  [ Upstream Kernel Changes ]

  * Revert "usb: dwc2: add bus suspend/resume for dwc2"
    - LP: #1471252
  * Revert "HID: logitech-hidpp: support combo keyboard touchpad TK820"
    - LP: #1471252
  * Revert "KVM: x86: drop fpu_activate hook"
    - LP: #1471252
  * Revert "libceph: clear r_req_lru_item in __unregister_linger_request()"
    - LP: #1471252
  * drm/i915: add component support
    - LP: #1460661
  * ALSA: hda: export struct hda_intel
    - LP: #1460661
  * ALSA: hda: pass intel_hda to all i915 interface functions
    - LP: #1460661
  * ALSA: hda: add component support
    - LP: #1460661
  * drm/atomic-helpers: Fix documentation typos and wrong copy&paste
    - LP: #1460674
  * drm/atomic: Rename drm_atomic_helper_commit_pre_planes() state argument
    - LP: #1460674
  * drm/atomic-helper: Rename commmit_post/pre_planes
    - LP: #1460674
  * drm/atomic-helpers: make mode_set hooks optional
    - LP: #1460674
  * drm/atomic-helper: Fix kerneldoc for prepare_planes
    - LP: #1460674
  * drm: Complete moving rotation property to core
    - LP: #1460674
  * drm: Share plane pixel format check code between legacy and atomic
    - LP: #1460674
  * drm/atomic: Constify a bunch of functions pointer structs
    - LP: #1460674
  * drm: Fix some typo mistake of the annotations
    - LP: #1460674
  * drm: change connector to tmp_connector
    - LP: #1460674
  * drm: atomic: Expose CRTC active property
    - LP: #1460674
  * drm: atomic: Allow setting CRTC active property
    - LP: #1460674
  * drm/atomic-helpers: Properly avoid full modeset dance
    - LP: #1460674
  * drm/atomic: Add helpers for state-subclassing drivers
    - LP: #1460674
  * drm: Fix some typos
    - LP: #1460674
  * drm/atomic: Add for_each_{connector,crtc,plane}_in_state helper macros
    - LP: #1460674
  * drm/atomic-helper: Don't call atomic_update_plane when it stays off
    - LP: #1460674
  * drm/atomic-helper: Really recover pre-atomic plane/cursor behavior
 ...

Changed in linux (Ubuntu Vivid):
status: Confirmed → Fix Released
Andy Whitcroft (apw)
tags: removed: kernel-bug-break-fix
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.