Comment 60 for bug 1328727

In , rjw (rjw-linux-kernel-bugs) wrote:

On Wednesday, April 3, 2019 11:34:32 AM CEST Jan Kara wrote:
> On Tue 02-04-19 16:25:00, Andrew Morton wrote:
> >
> > I cc'ed a bunch of people from bugzilla.
> >
> > Folks, please please please remember to reply via emailed
> > reply-to-all. Don't use the bugzilla interface!
> >
> > On Mon, 16 Jun 2014 18:29:26 +0200 "Rafael J. Wysocki" <email address hidden> wrote:
> >
> > > On 6/13/2014 6:55 AM, Johannes Weiner wrote:
> > > > On Fri, Jun 13, 2014 at 01:50:47AM +0200, Rafael J. Wysocki wrote:
> > > >> On 6/13/2014 12:02 AM, Johannes Weiner wrote:
> > > >>> On Tue, May 06, 2014 at 01:45:01AM +0200, Rafael J. Wysocki wrote:
> > > >>>> On 5/6/2014 1:33 AM, Johannes Weiner wrote:
> > > >>>>> Hi Oliver,
> > > >>>>>
> > > >>>>> On Mon, May 05, 2014 at 11:00:13PM +0200, Oliver Winker wrote:
> > > >>>>>> Hello,
> > > >>>>>>
> > > >>>>>> 1) Attached a full function-trace log + other SysRq outputs, see [1] attached.
> > > >>>>>>
> > > >>>>>> I saw bdi_...() calls in the s2disk paths, but didn't check in detail.
> > > >>>>>> Probably more efficient when one of you guys looks directly.
> > > >>>>> Thanks, this looks interesting. balance_dirty_pages() wakes up the
> > > >>>>> bdi_wq workqueue as it should:
> > > >>>>>
> > > >>>>> [ 249.148009] s2disk-3327 2.... 48550413us : global_dirty_limits <-balance_dirty_pages_ratelimited
> > > >>>>> [ 249.148009] s2disk-3327 2.... 48550414us : global_dirtyable_memory <-global_dirty_limits
> > > >>>>> [ 249.148009] s2disk-3327 2.... 48550414us : writeback_in_progress <-balance_dirty_pages_ratelimited
> > > >>>>> [ 249.148009] s2disk-3327 2.... 48550414us : bdi_start_background_writeback <-balance_dirty_pages_ratelimited
> > > >>>>> [ 249.148009] s2disk-3327 2.... 48550414us : mod_delayed_work_on <-balance_dirty_pages_ratelimited
> > > >>>>>
> > > >>>>> but the worker wakeup doesn't actually do anything:
> > > >>>>>
> > > >>>>> [ 249.148009] kworker/-3466 2d... 48550431us : finish_task_switch <-__schedule
> > > >>>>> [ 249.148009] kworker/-3466 2.... 48550431us : _raw_spin_lock_irq <-worker_thread
> > > >>>>> [ 249.148009] kworker/-3466 2d... 48550431us : need_to_create_worker <-worker_thread
> > > >>>>> [ 249.148009] kworker/-3466 2d... 48550432us : worker_enter_idle <-worker_thread
> > > >>>>> [ 249.148009] kworker/-3466 2d... 48550432us : too_many_workers <-worker_enter_idle
> > > >>>>> [ 249.148009] kworker/-3466 2.... 48550432us : schedule <-worker_thread
> > > >>>>> [ 249.148009] kworker/-3466 2.... 48550432us : __schedule <-worker_thread
> > > >>>>>
> > > >>>>> My suspicion is that this fails because the bdi_wq is frozen at this
> > > >>>>> point and so the flush work never runs until resume, whereas before my
> > > >>>>> patch the effective dirty limit was high enough so that the image could
> > > >>>>> be written in one go without being throttled; followed by an fsync() that
> > > >>>>> then writes the pages in the context of the unfrozen s2disk.
> > > >>>>>
> > > >>>>> Does this make sense? Rafael? Tejun?
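
For context, here is a minimal, heavily simplified C sketch of the ordering being
described above -- freeze everything first, then push the image out through the
page cache. The /dev/snapshot device and the SNAPSHOT_FREEZE/SNAPSHOT_UNFREEZE
ioctls are the real uswsusp interface, but the output path ("image_file"), the
write loop, and the error handling are purely illustrative; this is NOT the
actual suspend-utils code.

/* Editorial illustration only -- not suspend-utils. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/suspend_ioctls.h>   /* SNAPSHOT_FREEZE, SNAPSHOT_UNFREEZE */

int main(void)
{
	char page[4096];
	int snap = open("/dev/snapshot", O_RDONLY);
	/* "image_file" stands in for wherever the image really goes (swap). */
	int out = open("image_file", O_WRONLY | O_CREAT | O_TRUNC, 0600);

	if (snap < 0 || out < 0) {
		perror("open");
		return 1;
	}

	/* Step 1: freeze user space and freezable kernel threads. */
	if (ioctl(snap, SNAPSHOT_FREEZE) < 0) {
		perror("SNAPSHOT_FREEZE");
		return 1;
	}

	/*
	 * Step 2: buffered writes of the image.  Each write() dirties page
	 * cache; once the dirty limit is hit, the writer is throttled and
	 * waits for flusher work that the frozen bdi_wq never runs.
	 */
	memset(page, 0, sizeof(page));
	for (int i = 0; i < 1024; i++) {
		if (write(out, page, sizeof(page)) != (ssize_t)sizeof(page)) {
			perror("write");
			break;
		}
	}

	/* Step 3: only here is the dirty data actually pushed out. */
	fsync(out);

	ioctl(snap, SNAPSHOT_UNFREEZE);
	close(out);
	close(snap);
	return 0;
}
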
> > > >>>> Well, it does seem to make sense to me.
> > > >>> From what I see, this is a deadlock in the userspace suspend model and
> > > >>> just happened to work by chance in the past.
> > > >> Well, it had been working for quite a while, so it was a rather large
> > > >> opportunity window it seems. :-)
> > > > No doubt about that, and I feel bad that it broke. But it's still a
> > > > deadlock that can't reasonably be accommodated from dirty throttling.
> > > >
> > > > It can't just put the flushers to sleep and then issue a large amount
> > > > of buffered IO, hoping it doesn't hit the dirty limits. Don't shoot
> > > > the messenger; this bug needs to be addressed, not papered over.
> > > >
> > > >>> Can we patch suspend-utils as follows?
> > > >> Perhaps we can. Let's ask the new maintainer.
> > > >>
> > > >> Rodolfo, do you think you can apply the patch below to suspend-utils?
> > > >>
> > > >>> Alternatively, suspend-utils
> > > >>> could clear the dirty limits before it starts writing and restore them
> > > >>> post-resume.
> > > >> That (and the patch too) doesn't seem to address the problem with existing
> > > >> suspend-utils binaries, however.
> > > > It's userspace that freezes the system before issuing buffered IO, so
> > > > my conclusion was that the bug is in there. This is arguable. I also
> > > > wouldn't be opposed to a patch that sets the dirty limits to infinity
> > > > from the ioctl that freezes the system or creates the image.
> > >
> > > OK, that sounds like a workable plan.
> > >
> > > How do I set those limits to infinity?
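
For what it's worth, the following is a minimal sketch of one way "setting the
limits to infinity" could be done from user space before writing the image, and
undone afterwards. It is an editorial illustration of the idea discussed above,
not the workaround referenced from the bugzilla comment below; it assumes the
standard /proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio
tunables, that vm.dirty_bytes is left at its default of 0 (so the ratios apply),
and that it runs as root.

#include <stdio.h>
#include <stdlib.h>

static long read_sysctl(const char *path)
{
	long val = -1;
	FILE *f = fopen(path, "r");

	if (f) {
		if (fscanf(f, "%ld", &val) != 1)
			val = -1;
		fclose(f);
	}
	return val;
}

static int write_sysctl(const char *path, long val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%ld\n", val);
	return fclose(f);
}

int main(void)
{
	const char *ratio = "/proc/sys/vm/dirty_ratio";
	const char *bg_ratio = "/proc/sys/vm/dirty_background_ratio";

	/* Save the current limits so they can be restored post-resume. */
	long old_ratio = read_sysctl(ratio);
	long old_bg_ratio = read_sysctl(bg_ratio);

	/* Effectively disable dirty throttling: allow 100% dirty memory. */
	write_sysctl(ratio, 100);
	write_sysctl(bg_ratio, 100);

	/* Write the image here, e.g. by invoking s2disk. */
	system("s2disk");

	/* Restore the original limits after resume. */
	if (old_ratio >= 0)
		write_sysctl(ratio, old_ratio);
	if (old_bg_ratio >= 0)
		write_sysctl(bg_ratio, old_bg_ratio);

	return 0;
}
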
> >
> > Five years have passed and people are still hitting this.
> >
> > Killian described the workaround in comment 14 at
> > https://bugzilla.kernel.org/show_bug.cgi?id=75101.
> >
> > People can use this workaround by hand or in scripts. But we
> > really should find a proper solution. Maybe special-case the freezing
> > of the flusher threads until all the writeout has completed. Or
> > something else.
>
> I've refreshed my memory wrt this bug and I believe the bug is really on
> the side of suspend-utils (uswsusp, or however it is called). They are
> low-level system tools: they ask the kernel to freeze all processes
> (the SNAPSHOT_FREEZE ioctl) and then rely on buffered writeback (which is
> relatively heavyweight infrastructure) to work. That is wrong in my
> opinion.
>
> I can see Johannes was suggesting in comment 11 to use O_SYNC in
> suspend-utils, which worked but was too slow. Indeed, O_SYNC is a rather big
> hammer, but O_DIRECT should give them what they need with better
> performance - no additional buffering in the kernel, no dirty throttling,
> etc. They only need their buffers & device offsets sector-aligned - they
> seem to be even page-aligned in suspend-utils, so they should be fine. And
> if the performance still sucks (currently they appear to do mostly random
> 4k writes, so it probably would for rotating disks), they could use AIO DIO
> to get multiple pages in flight (as many as they dare to allocate buffers
> for), and then the IO scheduler will reorder things as well as it can, and
> they should get reasonable performance.
>
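
As a concrete illustration of the O_DIRECT suggestion above -- again an
editorial sketch, not suspend-utils code -- the core of it could look like the
following, with "/dev/sdXN" standing in for the resume/swap device and the
offset chosen arbitrarily:

#define _GNU_SOURCE          /* for O_DIRECT on glibc */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SIZE 4096

int main(void)
{
	void *buf;
	int fd = open("/dev/sdXN", O_WRONLY | O_DIRECT);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * O_DIRECT requires the buffer, offset, and length to be suitably
	 * aligned; page alignment is sufficient.
	 */
	if (posix_memalign(&buf, PAGE_SIZE, PAGE_SIZE)) {
		fprintf(stderr, "posix_memalign failed\n");
		return 1;
	}
	memset(buf, 0, PAGE_SIZE);

	/*
	 * Write one image page at a page-aligned device offset, bypassing
	 * the page cache and hence dirty throttling entirely.
	 */
	off_t offset = 16 * PAGE_SIZE;      /* placeholder offset */
	if (pwrite(fd, buf, PAGE_SIZE, offset) != PAGE_SIZE)
		perror("pwrite");

	free(buf);
	close(fd);
	return 0;
}

To keep more than one page in flight, as suggested, the same page-aligned
buffers could be queued through Linux AIO (io_submit() on the O_DIRECT file
descriptor), letting the IO scheduler merge and reorder the requests.
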
> Is there someone who works on suspend-utils these days?

Not that I know of.

> Because the repo I've found on kernel.org seems to be long dead
> (last commit in 2012).

And that's where things stand as of today, AFAICS.

Cheers,
Rafael