Comment 24 for bug 1822096

There was another suggestion [1], but with it applied the case still hangs (after 12 and 1 iterations respectively, so not much later than usual).

But the threads looked slightly different this time:
  Id Target Id Frame
* 1 Thread 0x7f1eab8efb40 (LWP 13686) "libvirtd" __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
  2 Thread 0x7f1eab434700 (LWP 13688) "libvirtd" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x557ce654a534) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
[...]

I see only one thread blocked directly in lowlevellock.S:

(gdb) bt
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
#1 0x00007f1eaf378945 in __GI___pthread_mutex_lock (mutex=0x7f1e8c0016d0) at ../nptl/pthread_mutex_lock.c:80
#2 0x00007f1eaf4ef095 in virMutexLock (m=<optimized out>) at ../../../src/util/virthread.c:89
#3 0x00007f1eaf580fbc in virChrdevFDStreamCloseCb (st=st@entry=0x7f1e9c0128f0, opaque=opaque@entry=0x7f1e9c031090) at ../../../src/conf/virchrdev.c:252
#4 0x00007f1eaf48f180 in virFDStreamCloseInt (st=0x7f1e9c0128f0, streamAbort=<optimized out>) at ../../../src/util/virfdstream.c:742
#5 0x00007f1eaf6bbec9 in virStreamAbort (stream=0x7f1e9c0128f0) at ../../../src/libvirt-stream.c:1244
#6 0x0000557ce5bd83aa in daemonStreamHandleAbort (client=client@entry=0x557ce65cc650, stream=stream@entry=0x7f1e9c0315b0, msg=msg@entry=0x557ce65d1e20) at ../../../src/remote/remote_daemon_stream.c:636
#7 0x0000557ce5bd8ee3 in daemonStreamHandleWrite (stream=0x7f1e9c0315b0, client=0x557ce65cc650) at ../../../src/remote/remote_daemon_stream.c:749
[...]

On a retry this was the same again, so the suggested patch did change something, but not enough yet.

I need to find out which lock that actually is and, if possible, who holds it at the moment.
The lock is on
  virMutexLock(&priv->devs->lock);
in virChrdevFDStreamCloseCb

(gdb) p priv->devs->lock
$1 = {lock = pthread_mutex_t = {Type = Normal, Status = Not acquired, Robust = Yes, Shared = No, Protocol = Priority protect, Priority ceiling = 0}}
(gdb) p &priv->devs->lock
$2 = (virMutex *) 0x7f4554020b20
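
As a side note, the mutex address from $2 can be cross-checked against the other threads: it shows up as the mutex= argument of the __GI___pthread_mutex_lock frames, so dumping all stacks and searching for it should list every contender:

(gdb) set logging on
(gdb) thread apply all bt

and then grepping the logged output for 0x7f4554020b20.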

Interesting that it lists the Status as "Not acquired".

I wanted to check which path that would be, but the value for the hash seems wrong:
p priv->devs->hash
$6 = (virHashTablePtr) 0x25

The code would normally access the hash next, and 0x25 is not a valid address.
The code after the lock would have failed in
  virHashRemoveEntry(priv->devs->hash, priv->path);
which dereferences the 0x25 pointer (as table) here:
  nextptr = table->table + virHashComputeKey(table, name);
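
Putting the quoted lines together, the path looks roughly like this (a sketch assembled from the frames and lines above, not the verbatim source; anything beyond the quoted lines is an assumption):

  /* virChrdevFDStreamCloseCb (sketch) */
  virMutexLock(&priv->devs->lock);                   /* thread 1 blocks here */
  virHashRemoveEntry(priv->devs->hash, priv->path);  /* would run next */
  /* presumably unlock and cleanup afterwards */

  /* inside virHashRemoveEntry (sketch), with table == priv->devs->hash == 0x25 */
  nextptr = table->table + virHashComputeKey(table, name);  /* faults on 0x25 */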

So we are looking at a structure that was not fully cleaned up here.
If this were not hanging on the lock, it would most likely crash instead.

OTOH that might be due to the unlocking added by the most recent patch [1], which allows the struct to be partially torn down. I dropped the patch and debugged again to see whether it would be more useful to check for the actual lock and path in there.

I was back at my two backtraces fighting over the lock.
The lock was now in a "better" state:
(gdb) p priv->devs->lock
$9 = {lock = pthread_mutex_t = {Type = Normal, Status = Acquired, possibly with waiters, Owner ID = 23102, Robust = No, Shared = No, Protocol = None}}
(gdb) p priv->devs->hash
$10 = (virHashTablePtr) 0x7f2928000c00
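
With an Owner ID present this time, the holder can be mapped to a thread directly; something like the following should work (assuming glibc's internal __data layout behind the virMutex, which the pretty-printer output above already suggests):

(gdb) p priv->devs->lock.lock.__data.__owner
(gdb) info threads
(gdb) thread <N>
(gdb) bt

where <N> is a placeholder for the thread whose LWP matches the owner (23102).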

It is a one-entry list:
(gdb) p priv->devs->hash->table.next
Cannot access memory at address 0x0
(gdb) p (virHashEntry)priv->devs->hash->table
$13 = {next = 0x7f2928000fe0, name = 0xa4b28ee3, payload = 0x3}
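
To follow the chain one step further, the next pointer could be inspected the same way (purely illustrative, reusing the virHashEntry cast from above; whether 0x7f2928000fe0 really points at an entry depends on that cast being right):

(gdb) p *(virHashEntry *)0x7f2928000fe0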

Letting the FDST unlock in between did not help (if anything it made things worse by leaving a stale, partially freed struct that would crash).

[1]: https://www.redhat.com/archives/libvir-list/2019-April/msg00207.html