[SRU] segfault in log.c code causes phone reboot loops
| Affects | Status | Importance | Assigned to | Milestone | |
|---|---|---|---|---|---|
| | Canonical System Image |
Critical
|
Ondrej Kubik | ||
| | upstart |
Undecided
|
James Hunt | ||
| | upstart (Ubuntu) |
Critical
|
James Hunt | ||
| | Utopic |
Critical
|
Unassigned | ||
| | Vivid |
Critical
|
James Hunt | ||
| | Wily |
Critical
|
James Hunt | ||
| | upstart (Ubuntu RTM) |
Critical
|
Unassigned | ||
Bug Description
= Summary =
The version of Upstart in vivid is affected by a coule of bugs relating
to the flushing data from early-boot jobs to disk which can both result
in a crash:
== Problem 1 ==
An internal list is mishandled meaning a crash could occur randomly.
== Problem 2 ==
Jobs which spawn processes in the background then themselves exit can
cause a crash due.
= Explanation of how Upstart flushes early job output =
If an Upstart job starts *and ends* early in the boot sequence (before
the log partition is mounted and writable) and produces output to its
stdout/stderr, Upstart will cache the output for later flushing by
adding the 'Log' object associated with the 'Job' to a list.
When the log partition is mounted writable, the
/etc/init/
notify-
of early-boot job output which takes the form of iterating the
'log_unflushed_
= Code Specifics =
There are 2 issues (note that the numbers used below match those used in
the Summary).
== Problem 1 detail ==
Due to a bug in the way the 'log_unflushed_
'Log' cannot be added to the list directly, so is added via an
intermediary ('NihListElem') node), a crash can result when iterating
the list since the 'Log' is freed, but NOT the intermediary node. The
implication is that it is possible for the intermediary node to be
attempt to dereference already-freed data, resulting in a crash.
== Problem 2 detail ==
If a job spawns a process in the background, then itself exits, that
jobs 'Log' entry will be added to the 'log_unflushed_
if the background process produces output and then exits before Upstart
attempts to flush the original jobs data to disk, the 'NihIo'
corresponding to the log will be serviced automatically and the data
flushed to disk. The problem comes when Upstart receives the
notification to flush the 'log_unflushed_
now contains an entry which has already been freed (since all its data
has already been flushed). The result is an assertion failure.
= Fix =
== Problem 1 fix ==
Correct the 'log_unflushed_
'NihListElem' (which will automatically free the 'Log' object), not by
simply freeing the 'Log' object itself.
* Branch: lp:~jamesodhunt/ubuntu/vivid/upstart/bug-1447756/
* New Upstart test added to avoid regression?: Yes.
== Problem 2 fix ==
Correct the assumption that the only entries in the
'log_unflushed_
there is in fact any data to flush; if not, remove the entry from the
'log_unflushed_
automatically by the 'NihIo'.
* Branch: lp:~jamesodhunt/upstart/bug-1447756-the-actual-fix
* New Upstart test added to avoid regression?: Yes.
= Workarounds =
If a system is affected by this bug, it will be manifested by a crash early in the boot sequence.
To overcome the issue, either:
a) Boot by adding "--no-log" to the kernel command-line.
b) Disable the flush-early-job-log job (assuming the machine is bootable) by running the following:
$ echo manual | sudo tee -a /etc/init/
= Impact =
The issue has been present in Upstart since logging was introduced but
no known instances of crashes relating to these problems have been
reported prior to this bug being reported (which relates the the issue
being seen on a very small subset of specific Ubuntu Touch phone
hardware where Upstart is used as the system init daemon).
Note that vivid still uses Upstart for managing the graphical session,
but now uses systemd by default for the system init daemon. Since the session (Upstart) init does not even require
a flush-early-
= Test Case =
This bug is extremely hard to surface so the approach is simply to check that the internal list can be iterated correctly by:
1) Booting the system with upstart
(select the Upstart option from the grub menu or add "init=/
2) Running the following on a system booted with Upstart:
$ for i in $(seq 17); do sudo start flush-early-
= Regression Potential =
None expected:
- As noted in Impact, the problems fixed by this version of Upstart have not been observed on server/desktop systems before.
- The fix is already in wily and no problems have been reported.
- See Impact.
= Original Description =
We recently started getting reprots from phone users that their devices go into a reboot loop after changing the language or getting an OTA upgrade (either of both end with a reboot of the phone)
after a bit of research we collected the log at http://
this shows a segfault of upstarts init binary in the log.c code:
[ 6.999083]init: log.c:819: Assertion failed in log_clear_
[ 7.000279]init: Caught abort, core dumped
[ 7.467176]Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000600
Related branches
- Upstart Reviewers: Pending requested 2015-05-13
-
Diff: 230 lines (+167/-3)3 files modifiedChangeLog (+18/-0)
init/log.c (+42/-0)
init/tests/test_log.c (+107/-3)
| Changed in upstart (Ubuntu): | |
| importance: | Undecided → Critical |
| status: | New → Confirmed |
| Changed in canonical-devices-system-image: | |
| importance: | Undecided → Critical |
| milestone: | none → ww17-2015 |
| status: | New → Confirmed |
| tags: | added: hotfix |
| Steve Langasek (vorlon) wrote : | #1 |
| Steve Langasek (vorlon) wrote : | #2 |
I think I see a potential problem if 'initctl notify-
We can't see anything in the official Ubuntu rootfs that would account for this function being called more than once. On the other hand, we also haven't seen this reproduced on any devices other than those installed in the factory, so it could be that an error has crept in there.
We can try to produce a patch to upstart to fix this bug, to see if it fixes the segfault for those who are seeing it.
| Changed in upstart (Ubuntu): | |
| assignee: | nobody → James Hunt (jamesodhunt) |
| James Hunt (jamesodhunt) wrote : | #3 |
I think I understand what's happening now - it's not the log that isn't being freed, it's the list entry the log is attached to that is not freed. The effect is the same though - calling log_clear_
| Changed in upstart: | |
| assignee: | nobody → James Hunt (jamesodhunt) |
| James Hunt (jamesodhunt) wrote : | #4 |
Test binaries are available here: http://
| James Hunt (jamesodhunt) wrote : | #5 |
i386+amd64 are packaged here: https:/
| James Hunt (jamesodhunt) wrote : | #6 |
I've tried the fix on my bq device and it appears to be in a reboot loop (like the one the fix was supposed to resolve).
As such, I'd recommend testing the binaries only for the session init initially (/usr/bin/
Also, for both system and session scenarios, I'd recommend adding '--debug' to get as much detail as possible.
| James Hunt (jamesodhunt) wrote : | #7 |
Something else to try - disable /etc/init/
$ sudo mount -oremount,rw /
$ echo manual | sudo tee /etc/init/
$ sudo reboot
... and post-boot do the following:
$ for i in $(seq 17); do sudo initctl notify-
As Steve has mentioned, we do need to trace down why 'initctl notify-
| Changed in canonical-devices-system-image: | |
| milestone: | ww17-2015 → ww19-ota |
| Changed in canonical-devices-system-image: | |
| assignee: | nobody → Ondrej Kubik (w-ondra) |
| James Hunt (jamesodhunt) wrote : | #8 |
Hi ondra/ogra - Can you comment on my suggestions in #6 and #7? My device is still bricked so if you have any suggestions on how to perform a full reset, that'd be great as udf is unable to recover it.
| Ondrej Kubik (ondrak) wrote : | #9 |
hi James
So one way is to disable upstart logging all together with "--no-log" kern command option. We are going with this option for next OTA, till we can crack actual root cause of this issue.
As for restore, best bet is with MTK flash_tool
| James Hunt (jamesodhunt) wrote : | #10 |
I've rebuild the fix in a clean environment and the init binary below now boots fine for me on a bq aquaris E4.5:
http://
| Changed in canonical-devices-system-image: | |
| status: | Confirmed → Fix Committed |
| Oliver Grawert (ogra) wrote : | #11 |
i dont think we should close this one, the bu still persists and need to urgently be fixed, the workaround we ship can not stay for log (since it removes all logging for system jobs)
| Ondrej Kubik (ondrak) wrote : | #12 |
Agreed, either we open new bug to track this, or we don't mark this as fixed. This is the time one could use state "workaround"
| John McAleely (john.mcaleely) wrote : | #13 |
DIsabling as a workaround is bug #1452663
| James Hunt (jamesodhunt) wrote : | #14 |
ondra and I have been hammering away at this, but progress is painfully slow given that:
a) the problem is not seen on every boot.
b) we can only view the end of kmsg log.
c) rebuild times are relatively slow.
From what ondra says he's seen today, it sounds as though we might be hitting a stack corruption issue - the debug I've given him is not being displayed as expected. Even with the initial fix I created based on code inspection alone, ondra is still seeing exactly the same assertion failure which we thought would have been fixed by the initial fix for this bug.
I've tried various ways to recreate the issue (both on a device and on a normal system, code review / code analysis tools, runtime checkers), but have so far been unsuccessful.
Current work-arounds:
1) Keep using '--no-log' in the kernel command-line.
Pros: reliable.
Cons: means that no system jobs get their output logged.
2) Disable the /etc/init/
Pros: seems to be reliable (? but needs further testing).
Cons: means early job output is not logged (however, on the device in question the only output seems to be from /etc/init/
| Ondrej Kubik (ondrak) wrote : | #15 |
I think I have nailed it down now, here is brief description what is happening (if I read code right)
There seems to be race, when we get new log data for one of the jobs after job has been terminated, and while processing it we call log_io_reader and eventually log_file_write which will try to flash unflashed buffer to drive.
This succeeds, mind this is before we got disk writable signal.
Since it succeeds, unflashed->len becomes 0, but we don't remove that log instance from list of logs which needs to be flashed (log_unflushed_
So next time when we get signal that disk is writable, we try again to flash that log and BOOM it panics on assert checking that log has something to be flashed, but it was already flashed.
Actual change of unfleshed log len changes on line 562, that's where we shrink unflashed buffer by amount we managed to write to disk, which in our case if full len, making buffer after shrinking zero length.
So I can see at least three fixes:
1) we should after calling nih_io_
2) we need to make log_clear_unflushed more tolerant to logs which has been already flushed successfully before reaching to this point.
3) we don't try to flash unfleshed buffers till we get disk writable signal
| Ondrej Kubik (ondrak) wrote : | #16 |
Sorry in previous comment, replace all "flash" with "flush"
Actually one more issue I can see there is this:
Job dies -> it's added to log_unflushed_files when it has unflushed data
but if job is restarted before we get disk writable signal, it will mess up the things, job will still remain in log_unflushed_files
and once we call log_file_write it will try to flush unflushed data, without removing job from log_unflushed_files list
| James Hunt (jamesodhunt) wrote : | #17 |
Hi Ondrej,
Regarding #15, I'm not sure this is correct. As you say, when the job process terminates, job_process_
743 log_handle_
744 {
745 NihListEntry *elem;
746
747 nih_assert (log);
748 nih_assert (log->detached == 0);
749
750 log_read_watch (log);
751
752 if (! log->unflushed-
753 return 1;
So, if the write is successful and log->unflushed->len becomes zero, the function returns 1 (meaning "log does not need to be added to the unflushed list") and crucially the log is not added to the unflushed list.
| James Hunt (jamesodhunt) wrote : | #18 |
Hi Ondrej,
Regarding #15, I'm not sure this is correct. As you say, when the job process terminates, job_process_
743 log_handle_
744 {
745 NihListEntry *elem;
746
747 nih_assert (log);
748 nih_assert (log->detached == 0);
749
750 log_read_watch (log);
751
752 if (! log->unflushed-
753 return 1;
So, if the write is successful and log->unflushed->len becomes zero, the function returns 1 (meaning "log does not need to be added to the unflushed list") and crucially the log is not added to the unflushed list.
Regarding #16, there shouldn't be a problem in that scenario since when the log gets added to the unflushed list, it is totally detached from its parent job. Hence, the job can be destroyed but the log lives on as an element of the unflushed list. If that job gets recreated, it will get a new set of log objects associated with it.
Did you manage to get the full log_clear_
| Ondrej Kubik (ondrak) wrote : | #19 |
So when job gets terminated we don't succeed to write to to the disk and it is added to the unflushed list.
Problem is another call to write function later on, but before we get writable disk signal:
[ 7.460627]init: log_handle_
[ 7.460682]init: unflushed-
[ 7.460736]init: job_process_
but then later one we get:
[ 7.968094]init: log_io_reader:368:
[ 7.968158]init: log_file_
[ 7.975087]init: log_file_
calling "nih_io_
[ 7.975167]init: log_file_
and eventually we get:
[ 8.369166]init: control_
[ 8.369239]init: log_clear_
[ 8.369280]init: OK checking assert:
[ 8.369435]init: log_clear_
[ 8.369492]init: log_file_
[ 8.370305]init: log_file_
[ 8.370359]init: log_file_
[ 8.370450]init: OK checking assert:
[ 8.370509]init: log.c:874: Assertion failed in log_clear_
As you can see call to write in the middle is one causing the issue, since it flushes log buffer, but does not remove it from unflushed list.
| Ondrej Kubik (ondrak) wrote : | #20 |
So one thing which I still cannot track down is how is that unidentified log_io_reader called. Or who is calling it.
I have put traces to log_read_watch under condition "if (io->recv_buf->len) {" where we call "log_io_reader" but call is not coming from there.
So what are other options for log_io_reader to be called?
| Ondrej Kubik (ondrak) wrote : | #21 |
Sorry for spamming, but I guess that log_io_reader is called by nih_io_watcher which has been initialised by "nih_io_reopen"
which I suppose is called when job starts?
So that would go back to my original finding, job dies and is restarted before we get signal about disk being writable, but at the time job has been restarted it succeeds in writing unflushed logs to the disk while calling log_io_reader
Makes sense?
| Oliver Grawert (ogra) wrote : | #22 |
the ureadahead-touch upstart job does definitely not need to log anything .... we should add "console none" to it so it does not attempt to. (logging there only makes sense when actively debugging ureadahead anyway)
| James Hunt (jamesodhunt) wrote : | #23 |
Ondrej - aha! with the debug, this is making more sense now. Yes, since the ureadahead-touch job spawns a process in the background (ureadahead) and then the job itself exits, the log associated with the main job process gets added to the unflushed list. ureadahead then writes output and the NihIo is automatically serviced to flush the log to disk and set log->unflushed->len to zero. Later, 'initctl notify-
Could you try the following which should be a minimal fix: http://
Another point - the fact that ureadahead-touch is triggering the problem does point to an issue with that job - as can be seen, even when the script section ends (and that associated shell dies), 'ureadahead --daemon' continues and if it produces output, that gets logged. A simple work-around is to remove --daemon so that the job process and the log never become detached.
| Ondrej Kubik (ondrak) wrote : | #24 |
I can confirm fix: http://
It will safely ignore entries of log_unflushed_files list which have log->unflushed->len set to zero. Since we know how this state is reached it seems like previous nih_assert (log->unflushed
| James Hunt (jamesodhunt) wrote : | #25 |
\o/. Yes, Upstart uses asserts extremely agressively. It's unfortunate that we've never hit this issue in testing but I'm currently working on new tests for this slightly unusual scenario.
| Changed in upstart (Ubuntu): | |
| status: | Confirmed → In Progress |
| Changed in upstart: | |
| status: | New → In Progress |
| James Hunt (jamesodhunt) wrote : | #26 |
lp:~jamesodhunt/upstart/bug-1447756-the-actual-fix contains the fix and a new test (which correctly fails with the current lp:upstart but passes with the fix in that branch).
The code has been tested on a failing device and a server system. I am currently testing on a non-failing krillin device. I understand Ondra is going to test on a couple of other devices that exhibit this bug to further sanity check the code.
Once all these tests pass I'll raise an MP for the branch above to get it merged to lp:upstart.
We'll then need to cherry pick the fix into both vivid and wily.
| Changed in canonical-devices-system-image: | |
| milestone: | ww19-ota → ww22-2015 |
| status: | Fix Committed → In Progress |
| James Hunt (jamesodhunt) wrote : | #27 |
Applying the top 2 commits (r1665 and r1666) from lp:~jamesodhunt/upstart/bug-1447756-the-actual-fix [1] to https:/
I've tested this by building on the device itself and also by building in a ARCH=armhf utopic chroot [2].
__________
[1] - Equivalent to the attached patch.
[2] - Interestingly, building in an vivid ARCH=armhf chroot generates a binary that fails at runtime.
| James Hunt (jamesodhunt) wrote : | #28 |
MP raised on lp:upstart to start the trickle-down to the rtm package.
| tags: | added: patch |
| Changed in upstart: | |
| status: | In Progress → Fix Committed |
| Ondrej Kubik (ondrak) wrote : | #29 |
Just to confirm. I just received another device exhibiting boot loop issue and after quick investigation it was same problem, race of chid process continuing logging after parent died but before writable disk signal.
Once I used patched "upstart" binary, device booted normally, so all good.
| James Hunt (jamesodhunt) wrote : | #30 |
Looks like kernel bug 1429756 could block 1.13.2-0ubuntu14 landing for wily (being a pre-req to getting this fix into vivid and utopic):
| Pat McGowan (pat-mcgowan) wrote : | #31 |
@james we can land in vivid first if we need to, and the fix does not need to be made to utopic
We do need the fix to land by May 21
| Steve Langasek (vorlon) wrote : | #32 |
Erroneous test failures can (and should) be overridden in proposed-migration. Also the requirement is not that the package *reach* the development release before being SRUed, only that it be *uploaded* to the development series.
| Launchpad Janitor (janitor) wrote : | #33 |
This bug was fixed in the package upstart - 1.13.2-0ubuntu14
---------------
upstart (1.13.2-0ubuntu14) wily; urgency=medium
* Cherry-pick upstream fix for LP: #1447756, fixing broken handling
when flushing logs to disk.
* Cherry-pick follow-on upstream fix for LP: #1447756.
-- James Hunt <email address hidden> Mon, 18 May 2015 11:09:10 +0100
| Changed in upstart (Ubuntu Wily): | |
| status: | In Progress → Fix Released |
| James Hunt (jamesodhunt) wrote : | #34 |
Thanks for nudging -proposed Steve.
I've silo 021 now includes upstart version 1.13.2-0ubuntu13.1 (which sil2100 synced from wily).
Basic test plan is here: https:/
I've tested this as follows:
$ wget http://
$ ubuntu-device-flash --verbose touch --bootstrap --channel=
$ citrain device-upgrade 021
I then reflashed the boot.img using one from ondra which drops the current '--no-log' option from /proc/cmdline and rebooted.
As far as I'm concerned, the fix is good. All that now remains is for ondra or one of the folk in London who have the "failing" krillin devices to test the above.
| James Hunt (jamesodhunt) wrote : | #35 |
Unfortunately, ondra no longer has the failing phones so we may need to take a decision to just land this if QA are happy the change has not regressed the behaviour for non-failing phones.
I've tried to raise QA but they are sprinting in the US this week so no direct response.
I believe that sil2100 is doing what he can atm, but it's unclear whether this fix will make it into the next OTA at this stage.
| Pat McGowan (pat-mcgowan) wrote : | #36 |
Per Ondrej on private bug https:/
"I can confirm that phone which we received from BQ had same issue and problem has been resolved once applied fix developed in bug #1447756 was applied.
We should now revert workaround committed as part of Bug #1452663"
| Alex Kaluzhny (alex-kaluzhny) wrote : | #37 |
Is the fix landing in vivid?
| Changed in upstart (Ubuntu Utopic): | |
| status: | New → Won't Fix |
| Steve Langasek (vorlon) wrote : Re: [Bug 1447756] Re: segfault in log.c code causes phone reboot loops | #38 |
On Thu, May 21, 2015 at 05:05:16PM -0000, Alex Kaluzhny wrote:
> Is the fix landing in vivid?
The fix has landed in the stable phone overlay ppa (upstart
1.13.2-
James, can you please follow through on SRUing this to vivid? I've copied
the package into the vivid-proposed queue from the stable-
--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://
<email address hidden> <email address hidden>
| Changed in upstart (Ubuntu RTM): | |
| status: | New → Fix Released |
| Changed in canonical-devices-system-image: | |
| status: | In Progress → Fix Released |
| Changed in upstart (Ubuntu Vivid): | |
| status: | New → In Progress |
| assignee: | nobody → James Hunt (jamesodhunt) |
| description: | updated |
| summary: |
- segfault in log.c code causes phone reboot loops + [SRU] segfault in log.c code causes phone reboot loops |
| James Hunt (jamesodhunt) wrote : | #39 |
Hi Steve - ah, sorry - hadn't registered that you'd already copied the package into the queue (we now have 2 :-) Could mine be deleted please?
| Steve Langasek (vorlon) wrote : Re: [Bug 1447756] Re: [SRU] segfault in log.c code causes phone reboot loops | #40 |
On Fri, May 22, 2015 at 01:34:52PM -0000, James Hunt wrote:
> Hi Steve - ah, sorry - hadn't registered that you'd already copied the
> package into the queue (we now have 2 :-) Could mine be deleted please?
Done.
| Alexanrd (w-a-2) wrote : | #41 |
What about the fix for Trusty? I have similar problems with upstart=
Workrounds with kernel option "--no-log" and disabling flush-early-job-log job works for me.
| Steve Langasek (vorlon) wrote : | #42 |
There was no intention of fixing this bug for trusty, because this bug was never reported with trusty. It has only been reported with the Ubuntu phone, which is utopic or later. If you are experiencing crashes with upstart on trusty, can you please attach a tarball of the /etc/init directory from the affected system?
Hello Oliver, or anyone else affected,
Accepted upstart into vivid-proposed. The package will build now and be available at https:/
Please help us by testing this new package. See https:/
If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-
Further information regarding the verification process can be found at https:/
| Changed in upstart (Ubuntu Vivid): | |
| status: | In Progress → Fix Committed |
| tags: | added: verification-needed |
As a part of the Stable Release Updates quality process a search for Launchpad bug reports using the version of upstart from vivid-proposed was performed and bug 1470745 was found. Please investigate this bug report to ensure that a regression will not be created by this SRU. In the event that this is not a regression remove the "verification-
| tags: | added: verification-failed |
| tags: | removed: verification-failed |
| Changed in upstart (Ubuntu Vivid): | |
| importance: | Undecided → Critical |
| Changed in upstart (Ubuntu RTM): | |
| importance: | Undecided → Critical |
| Changed in upstart (Ubuntu Utopic): | |
| importance: | Undecided → Critical |
| Changed in upstart (Ubuntu Vivid): | |
| status: | Fix Committed → Won't Fix |
The version of upstart in the proposed pocket of Vivid that was purported to fix this bug report has been removed because the bugs that were to be fixed by the upload were not verified in a timely (105 days) fashion.
| tags: | removed: verification-needed |
| Daniel Palffy (dpalffy) wrote : | #49 |
Please backport this to trusty, it hit me there, too. Standard amd64 desktop.
| Justin King-Lacroix (justinkl) wrote : | #50 |
Yep, this is definitely affecting Trusty as well. Is there any chance the fix will be backported?
| Justin King-Lacroix (justinkl) wrote : | #51 |
Ping?
| Steve Langasek (vorlon) wrote : | #52 |
Justin, please see comment #42. Can you provide a reproducer for this on trusty?
That this has not been widely reported on 14.04 means it has not been a priority for fixing; less so now that a newer LTS release is available.
| Justin King-Lacroix (justinkl) wrote : | #53 |
Not really. The bug is hugely intermittent as it stands.
However, I can confirm we're experiencing this on our fleet.
| Justin King-Lacroix (justinkl) wrote : | #54 |
Guys, this really is affecting us. Help?
| Steve Langasek (vorlon) wrote : | #55 |
This won't make progress as an SRU without a reproducer for the bug. An intermittent reproducer would be better than no reproducer.
| Justin King-Lacroix (justinkl) wrote : | #56 |
I understand. However, our reproducer is: boot.
Removing --daemon from the ureadahead initscripts does appear to reduce the frequency with which this occurs.
| Justin King-Lacroix (justinkl) wrote : | #57 |
Guys, please help us with this. This is a real problem, which is low-probability but high-frustration.


Note for the record that this bug has so far only been reported on the ubuntu-rtm branch, not the ubuntu branch, of the upstart package. However, the differences between these branches are negligible and include no changes to the upstream code.