Comment 65 for bug 1796292

Andrea Righi (arighi) wrote:

Some additional info about the deadlock:

crash> bt 16588
PID: 16588 TASK: ffff9ffd7f332b00 CPU: 1 COMMAND: "bcache_allocato"
    [exception RIP: bch_crc64+57]
    RIP: ffffffffc093b2c9 RSP: ffffab9585767e28 RFLAGS: 00000286
    RAX: f1f51403756de2bd RBX: 0000000000000000 RCX: 0000000000000065
    RDX: 0000000000000065 RSI: ffff9ffd63980000 RDI: ffff9ffd63925346
    RBP: ffffab9585767e28 R8: ffffffffc093db60 R9: ffffab9585739000
    R10: 000000000000007f R11: 000000001ffef001 R12: 0000000000000000
    R13: 0000000000000008 R14: ffff9ffd63900000 R15: ffff9ffd683d0000
    CS: 0010 SS: 0018
 #0 [ffffab9585767e30] bch_prio_write at ffffffffc09325c0 [bcache]
 #1 [ffffab9585767eb0] bch_allocator_thread at ffffffffc091bdc5 [bcache]
 #2 [ffffab9585767f08] kthread at ffffffffa80b2481
 #3 [ffffab9585767f50] ret_from_fork at ffffffffa8a00205

crash> bt 14658
PID: 14658 TASK: ffff9ffd7a9f0000 CPU: 0 COMMAND: "python3"
 #0 [ffffab958380bb48] __schedule at ffffffffa89ae441
 #1 [ffffab958380bbe8] schedule at ffffffffa89aea7c
 #2 [ffffab958380bbf8] bch_bucket_alloc at ffffffffc091c370 [bcache]
 #3 [ffffab958380bc68] __bch_bucket_alloc_set at ffffffffc091c5ce [bcache]
 #4 [ffffab958380bcb8] bch_bucket_alloc_set at ffffffffc091c66e [bcache]
 #5 [ffffab958380bcf8] __uuid_write at ffffffffc0931b69 [bcache]
 #6 [ffffab958380bda0] bch_uuid_write at ffffffffc0931f76 [bcache]
 #7 [ffffab958380bdc0] __cached_dev_store at ffffffffc0937c08 [bcache]
 #8 [ffffab958380be20] bch_cached_dev_store at ffffffffc0938309 [bcache]
 #9 [ffffab958380be50] sysfs_kf_write at ffffffffa830c97c
#10 [ffffab958380be60] kernfs_fop_write at ffffffffa830c3e5
#11 [ffffab958380bea0] __vfs_write at ffffffffa827e5bb
#12 [ffffab958380beb0] vfs_write at ffffffffa827e781
#13 [ffffab958380bee8] sys_write at ffffffffa827e9fc
#14 [ffffab958380bf30] do_syscall_64 at ffffffffa8003b03
#15 [ffffab958380bf50] entry_SYSCALL_64_after_hwframe at ffffffffa8a00081
    RIP: 00007faffc7bd154 RSP: 00007ffe307cbc88 RFLAGS: 00000246
    RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007faffc7bd154
    RDX: 0000000000000008 RSI: 00000000011ce7f0 RDI: 0000000000000003
    RBP: 00007faffccb86c0 R8: 0000000000000000 R9: 0000000000000000
    R10: 0000000000000100 R11: 0000000000000246 R12: 0000000000000003
    R13: 0000000000000000 R14: 00000000011ce7f0 R15: 0000000000f33e60
    ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b

In this case the task "python3" (pid 14658) gets stuck in bch_bucket_alloc(), waiting for a free bucket that never arrives. The task that should wake "python3" from this wait is "bcache_allocator" (pid 16588), but the wakeup never happens, because bcache_allocator is stuck in this "retry_invalidate" busy loop:

static int bch_allocator_thread(void *arg)
{
...
retry_invalidate:
                allocator_wait(ca, ca->set->gc_mark_valid &&
                               !ca->invalidate_needs_gc);
                invalidate_buckets(ca);

                /*
                 * Now, we write their new gens to disk so we can start writing
                 * new stuff to them:
                 */
                allocator_wait(ca, !atomic_read(&ca->set->prio_blocked));
                if (CACHE_SYNC(&ca->set->sb)) {
                        /*
                         * This could deadlock if an allocation with a btree
                         * node locked ever blocked - having the btree node
                         * locked would block garbage collection, but here we're
                         * waiting on garbage collection before we invalidate
                         * and free anything.
                         *
                         * But this should be safe since the btree code always
                         * uses btree_check_reserve() before allocating now, and
                         * if it fails it blocks without btree nodes locked.
                         */
                        if (!fifo_full(&ca->free_inc))
                                goto retry_invalidate;

                        if (bch_prio_write(ca, false) < 0) {
                                ca->invalidate_needs_gc = 1;
                                wake_up_gc(ca->set);
                                goto retry_invalidate;
                        }
                }
        }
...
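
For reference, the wait that "python3" never returns from is roughly the following loop in bch_bucket_alloc() (simplified excerpt, details may vary between kernel versions): when both free lists are empty and wait == true, the caller sleeps on ca->set->bucket_wait until the allocator thread refills the free lists, which in this scenario never happens.

long bch_bucket_alloc(struct cache *ca, unsigned reserve, bool wait)
{
        DEFINE_WAIT(w);
        long r;
...
        do {
                /*
                 * Sleep until the allocator thread refills the free
                 * lists and wakes up ca->set->bucket_wait.
                 */
                prepare_to_wait(&ca->set->bucket_wait, &w,
                                TASK_UNINTERRUPTIBLE);

                mutex_unlock(&ca->set->bucket_lock);
                schedule();
                mutex_lock(&ca->set->bucket_lock);
        } while (!fifo_pop(&ca->free[RESERVE_NONE], r) &&
                 !fifo_pop(&ca->free[reserve], r));

        finish_wait(&ca->set->bucket_wait, &w);
...
}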

The exact code path is this: bch_prio_write() fails because it calls bch_bucket_alloc(), which fails (out of free buckets); the allocator then wakes up the garbage collector (trying to free up some buckets) and jumps back to retry_invalidate, but apparently that is not enough: bch_prio_write() keeps failing over and over again (no buckets ever become available), unable to break out of the busy loop => deadlock.
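
To make the failing step explicit: in the kernel under test bch_prio_write() takes a "wait" flag and reports the allocation failure to the caller, roughly like this (simplified sketch of the allocation step, not the literal source):

static int bch_prio_write(struct cache *ca, bool wait)
{
        int i;
...
        for (i = prio_buckets(ca) - 1; i >= 0; --i) {
                /*
                 * Allocate a bucket to hold the prio/gen data.  With
                 * wait == false this cannot sleep, so when the cache
                 * is out of free buckets it fails immediately.
                 */
                long bucket = bch_bucket_alloc(ca, RESERVE_PRIO, wait);

                if (bucket == -1)
                        return -1;
...
        }
...
        return 0;
}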

Looking more closely at the code, it seems safe to let bcache_allocator resume its main loop when bch_prio_write() fails (still waking up the garbage collector), instead of jumping back to retry_invalidate. This should give the allocator a better chance to free up some buckets, possibly preventing the "out of buckets" deadlock condition.
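
Concretely, the idea is roughly the following change to the error path above (a sketch of the idea, not the final patch): keep flagging invalidate_needs_gc and waking the GC, but drop the goto so the thread falls back into its main loop, where it can keep pushing whatever is already in free_inc to the free lists and wake up waiters stuck in bch_bucket_alloc():

                if (CACHE_SYNC(&ca->set->sb)) {
...
                        if (bch_prio_write(ca, false) < 0) {
                                /*
                                 * Out of buckets: still kick the garbage
                                 * collector, but resume the main allocator
                                 * loop instead of spinning on
                                 * retry_invalidate.
                                 */
                                ca->invalidate_needs_gc = 1;
                                wake_up_gc(ca->set);
                        }
                }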

I'm currently testing a kernel with this change; if I can't trigger any deadlock in the next hour or so, I'll upload a new test kernel.