Comment 15 for bug 1447756

Revision history for this message
Ondrej Kubik (ondrak) wrote : Re: segfault in log.c code causes phone reboot loops

I think I have nailed it down now, here is brief description what is happening (if I read code right)
There seems to be race, when we get new log data for one of the jobs after job has been terminated, and while processing it we call log_io_reader and eventually log_file_write which will try to flash unflashed buffer to drive.
This succeeds, mind this is before we got disk writable signal.
Since it succeeds, unflashed->len becomes 0, but we don't remove that log instance from list of logs which needs to be flashed (log_unflushed_files).
So next time when we get signal that disk is writable, we try again to flash that log and BOOM it panics on assert checking that log has something to be flashed, but it was already flashed.

Actual change of unfleshed log len changes on line 562, that's where we shrink unflashed buffer by amount we managed to write to disk, which in our case if full len, making buffer after shrinking zero length.
So I can see at least three fixes:
1) we should after calling nih_io_buffer_shrink (log->unflushed, (size_t)wlen); try to check and if log->unflashed->len is 0, and if so then remove it from the log_unflushed_files list.
2) we need to make log_clear_unflushed more tolerant to logs which has been already flushed successfully before reaching to this point.
3) we don't try to flash unfleshed buffers till we get disk writable signal