crash> bt 14658
PID: 14658 TASK: ffff9ffd7a9f0000 CPU: 0 COMMAND: "python3"
#0 [ffffab958380bb48] __schedule at ffffffffa89ae441
#1 [ffffab958380bbe8] schedule at ffffffffa89aea7c
#2 [ffffab958380bbf8] bch_bucket_alloc at ffffffffc091c370 [bcache]
#3 [ffffab958380bc68] __bch_bucket_alloc_set at ffffffffc091c5ce [bcache]
#4 [ffffab958380bcb8] bch_bucket_alloc_set at ffffffffc091c66e [bcache]
#5 [ffffab958380bcf8] __uuid_write at ffffffffc0931b69 [bcache]
#6 [ffffab958380bda0] bch_uuid_write at ffffffffc0931f76 [bcache]
#7 [ffffab958380bdc0] __cached_dev_store at ffffffffc0937c08 [bcache]
#8 [ffffab958380be20] bch_cached_dev_store at ffffffffc0938309 [bcache]
#9 [ffffab958380be50] sysfs_kf_write at ffffffffa830c97c
#10 [ffffab958380be60] kernfs_fop_write at ffffffffa830c3e5
#11 [ffffab958380bea0] __vfs_write at ffffffffa827e5bb
#12 [ffffab958380beb0] vfs_write at ffffffffa827e781
#13 [ffffab958380bee8] sys_write at ffffffffa827e9fc
#14 [ffffab958380bf30] do_syscall_64 at ffffffffa8003b03
#15 [ffffab958380bf50] entry_SYSCALL_64_after_hwframe at ffffffffa8a00081
RIP: 00007faffc7bd154 RSP: 00007ffe307cbc88 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007faffc7bd154
RDX: 0000000000000008 RSI: 00000000011ce7f0 RDI: 0000000000000003
RBP: 00007faffccb86c0 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000100 R11: 0000000000000246 R12: 0000000000000003
R13: 0000000000000000 R14: 00000000011ce7f0 R15: 0000000000f33e60
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b
In this case the task "python3" (pid 14658) gets stuck in a wait that never completes in bch_bucket_alloc(). The task that should resume "python3" from this wait is "bcache_allocator" (pid 16588), but the wakeup never happens, because bcache_allocator is stuck in this "retry_invalidate" busy loop:
static int bch_allocator_thread(void *arg)
{
	...
retry_invalidate:
	allocator_wait(ca, ca->set->gc_mark_valid &&
		       !ca->invalidate_needs_gc);
	invalidate_buckets(ca);

	/*
	 * Now, we write their new gens to disk so we can start writing
	 * new stuff to them:
	 */
	allocator_wait(ca, !atomic_read(&ca->set->prio_blocked));
	if (CACHE_SYNC(&ca->set->sb)) {
		/*
		 * This could deadlock if an allocation with a btree
		 * node locked ever blocked - having the btree node
		 * locked would block garbage collection, but here we're
		 * waiting on garbage collection before we invalidate
		 * and free anything.
		 *
		 * But this should be safe since the btree code always
		 * uses btree_check_reserve() before allocating now, and
		 * if it fails it blocks without btree nodes locked.
		 */
		if (!fifo_full(&ca->free_inc))
			goto retry_invalidate;
	}
	...
}
The exact code path is this: bch_prio_write() fails because it calls bch_bucket_alloc(), which fails (out of free buckets); it wakes up the garbage collector (trying to free up some buckets) and goes back to retry_invalidate, but apparently that's not enough: bch_prio_write() is going to fail over and over again (due to no buckets available), unable to break out of the busy loop => deadlock.
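To make the failure mode concrete, here is a deliberately oversimplified toy model (all names, counters and numbers are invented; this is not the bcache code): spinning on retry_invalidate makes no progress when the prio write keeps failing, while falling back to the main loop lets the allocator move already-invalidated buckets to the free list:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy state: free_buckets is what the prio write allocates from,
 * free_inc holds buckets that only the allocator main loop can
 * move to the free list. */
static int free_buckets;
static int free_inc;

/* Stand-in for bch_prio_write(): fails while no buckets are free. */
static bool toy_prio_write(void)
{
	if (free_buckets == 0)
		return false;	/* bch_bucket_alloc() would fail here */
	free_buckets--;
	return true;
}

/* The allocator main loop is what moves free_inc buckets to free. */
static void toy_main_loop(void)
{
	while (free_inc > 0) {
		free_inc--;
		free_buckets++;
	}
}

/* Current behavior: goto retry_invalidate on failure; nothing in
 * the retry path changes the state, so it spins forever (bounded
 * here only so the demo terminates). */
static bool toy_old_allocator(int max_retries)
{
	for (int i = 0; i < max_retries; i++) {
		if (toy_prio_write())
			return true;
		/* goto retry_invalidate: state unchanged, spin again */
	}
	return false;	/* deadlock: never succeeded */
}

/* Proposed behavior: on failure, resume the main loop first. */
static bool toy_new_allocator(int max_retries)
{
	for (int i = 0; i < max_retries; i++) {
		if (toy_prio_write())
			return true;
		toy_main_loop();
	}
	return false;
}
```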
Looking more closely at the code, it seems safe to resume the bcache_allocator main loop when bch_prio_write() fails (still keeping the wake_up event to the garbage collector), instead of going back to retry_invalidate. This should give the allocator a better chance to free up some buckets, possibly preventing the "out of buckets" deadlock condition.
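Roughly, the change I'm testing looks like this (a sketch only: it assumes bch_prio_write() is changed to return an error instead of being void, and the exact gc-wakeup plumbing may differ in the final patch):

```c
	if (CACHE_SYNC(&ca->set->sb)) {
		/* ... */
		if (!fifo_full(&ca->free_inc))
			goto retry_invalidate;

		/*
		 * If bch_prio_write() fails (no free buckets), keep
		 * the wakeup to the garbage collector, but fall back
		 * to the allocator main loop instead of jumping to
		 * retry_invalidate, so the allocator has a chance to
		 * make progress and free up some buckets.
		 */
		if (bch_prio_write(ca) < 0) {
			ca->invalidate_needs_gc = 1;
			wake_up_gc(ca->set);
		}
	}
```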
I'm currently testing a kernel with this change; if I can't trigger any deadlock in the next hour or so, I'll upload a new test kernel.
Some additional info about the deadlock:
crash> bt 16588
PID: 16588 TASK: ffff9ffd7f332b00 CPU: 1 COMMAND: "bcache_allocato"
[exception RIP: bch_crc64+57]
RIP: ffffffffc093b2c9 RSP: ffffab9585767e28 RFLAGS: 00000286
RAX: f1f51403756de2bd RBX: 0000000000000000 RCX: 0000000000000065
RDX: 0000000000000065 RSI: ffff9ffd63980000 RDI: ffff9ffd63925346
RBP: ffffab9585767e28 R8: ffffffffc093db60 R9: ffffab9585739000
R10: 000000000000007f R11: 000000001ffef001 R12: 0000000000000000
R13: 0000000000000008 R14: ffff9ffd63900000 R15: ffff9ffd683d0000
CS: 0010 SS: 0018
#0 [ffffab9585767e30] bch_prio_write at ffffffffc09325c0 [bcache]
#1 [ffffab9585767eb0] bch_allocator_thread at ffffffffc091bdc5 [bcache]
#2 [ffffab9585767f08] kthread at ffffffffa80b2481
#3 [ffffab9585767f50] ret_from_fork at ffffffffa8a00205