glusterfs-server times out starting when rdma-core is installed

Bug #1771908 reported by jwiegley
This bug affects 2 people
Affects: glusterfs (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

I am trying to get a GlusterFS file share running over InfiniBand.

If the rdma_ucm kernel module is loaded (as it is when rdma-core is installed), the glusterd service cannot start. It hangs during startup and the kernel logs a hung-task warning:

May 18 01:29:42 gfsa kernel: [ 605.323874] INFO: task glusterd:2765 blocked for more than 120 seconds.
May 18 01:29:42 gfsa kernel: [ 605.324099] Tainted: G I 4.15.0-20-generic #21-Ubuntu
May 18 01:29:42 gfsa kernel: [ 605.324294] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 18 01:29:42 gfsa kernel: [ 605.324519] glusterd D 0 2765 2763 0x00000000
May 18 01:29:42 gfsa kernel: [ 605.324522] Call Trace:
May 18 01:29:42 gfsa kernel: [ 605.324532] __schedule+0x297/0x8b0
May 18 01:29:42 gfsa kernel: [ 605.324536] schedule+0x2c/0x80
May 18 01:29:42 gfsa kernel: [ 605.324538] schedule_timeout+0x1cf/0x350
May 18 01:29:42 gfsa kernel: [ 605.324542] ? flush_workqueue+0x198/0x3c0
May 18 01:29:42 gfsa kernel: [ 605.324546] wait_for_completion+0xba/0x140
May 18 01:29:42 gfsa kernel: [ 605.324550] ? wake_up_q+0x80/0x80
May 18 01:29:42 gfsa kernel: [ 605.324556] ucma_destroy_id+0x106/0x1a0 [rdma_ucm]
May 18 01:29:42 gfsa kernel: [ 605.324560] ? common_file_perm+0x58/0x160
May 18 01:29:42 gfsa kernel: [ 605.324563] ucma_write+0xd4/0x150 [rdma_ucm]
May 18 01:29:42 gfsa kernel: [ 605.324567] __vfs_write+0x1b/0x40
May 18 01:29:42 gfsa kernel: [ 605.324570] vfs_write+0xb1/0x1a0
May 18 01:29:42 gfsa kernel: [ 605.324573] SyS_write+0x55/0xc0
May 18 01:29:42 gfsa kernel: [ 605.324577] do_syscall_64+0x73/0x130
May 18 01:29:42 gfsa kernel: [ 605.324580] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
May 18 01:29:42 gfsa kernel: [ 605.324582] RIP: 0033:0x7faf58c882b7
May 18 01:29:42 gfsa kernel: [ 605.324584] RSP: 002b:00007ffeecbceae0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
May 18 01:29:42 gfsa kernel: [ 605.324587] RAX: ffffffffffffffda RBX: 000000000000000b RCX: 00007faf58c882b7
May 18 01:29:42 gfsa kernel: [ 605.324588] RDX: 0000000000000018 RSI: 00007ffeecbceb20 RDI: 000000000000000b
May 18 01:29:42 gfsa kernel: [ 605.324590] RBP: 00007ffeecbceb20 R08: 0000000000000000 R09: 00007faf599d9540
May 18 01:29:42 gfsa kernel: [ 605.324591] R10: 00000000ffffff78 R11: 0000000000000293 R12: 0000000000000018
May 18 01:29:42 gfsa kernel: [ 605.324593] R13: 00007ffeecbcebd0 R14: 00007ffeecbcec70 R15: 00007ffeecbcec50
May 18 01:31:11 gfsa systemd[1]: glusterd.service: Start operation timed out. Terminating.
May 18 01:31:11 gfsa systemd[1]: glusterd.service: Failed with result 'timeout'.
May 18 01:31:11 gfsa systemd[1]: Failed to start LSB: Gluster File System service for volume management.
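
For anyone triaging a similar hang, the trace above can be summarized mechanically: which task was reported as blocked, and which loadable kernel modules appear in its call trace. The following is a minimal sketch, assuming the log lines have the same layout as the excerpt in this report. The function name `summarize_hung_task` and both regular expressions are hypothetical helpers written for illustration, not part of any GlusterFS or kernel tooling.

```python
import re

# Hung-task header, e.g.
#   "INFO: task glusterd:2765 blocked for more than 120 seconds."
HUNG_RE = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")

# Call-trace frame that belongs to a loadable module, e.g.
#   "ucma_write+0xd4/0x150 [rdma_ucm]"
# Frames from the core kernel have no trailing "[module]" and are skipped.
FRAME_RE = re.compile(r"(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)\]\s*$")


def summarize_hung_task(log_text):
    """Return (tasks, modules) found in kernel log text.

    tasks   -- list of (name, pid, seconds) tuples from hung-task headers
    modules -- module names seen in call-trace frames, in first-seen order
    """
    tasks = []
    modules = []
    for line in log_text.splitlines():
        header = HUNG_RE.search(line)
        if header:
            tasks.append((header.group(1), int(header.group(2)), int(header.group(3))))
            continue
        frame = FRAME_RE.search(line)
        if frame and frame.group(2) not in modules:
            modules.append(frame.group(2))
    return tasks, modules


if __name__ == "__main__":
    # Three lines copied from the trace in this report.
    sample = "\n".join([
        "May 18 01:29:42 gfsa kernel: [ 605.323874] INFO: task glusterd:2765 blocked for more than 120 seconds.",
        "May 18 01:29:42 gfsa kernel: [ 605.324546] wait_for_completion+0xba/0x140",
        "May 18 01:29:42 gfsa kernel: [ 605.324556] ucma_destroy_id+0x106/0x1a0 [rdma_ucm]",
    ])
    print(summarize_hung_task(sample))
```

Run against the trace in this report, the only module that shows up in the blocked task's stack is rdma_ucm, which is consistent with the description above.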

If I remove rdma-core then glusterd will start but I cannot use transport tcp,rdma for volumes.

As it stands, Ubuntu Server 18.04 doesn't provide a glusterfs-server that is compatible with RDMA over InfiniBand. I'd be happy to find out that I'm simply missing a module or package, rather than this being an actual bug.

ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: glusterfs-server 3.13.2-1build1
ProcVersionSignature: Ubuntu 4.15.0-20.21-generic 4.15.17
Uname: Linux 4.15.0-20-generic x86_64
ApportVersion: 2.20.9-0ubuntu7
Architecture: amd64
Date: Fri May 18 01:37:45 2018
ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: glusterfs
UpgradeStatus: No upgrade log present (probably fresh install)

jwiegley (jeffw) wrote :
Nikolas Britton (nbritton) wrote :

I also encountered this problem on Ubuntu MATE 19.10: glusterd would refuse to start with RDMA. There was no useful diagnostic information; the glusterd program would just crash during startup. I found this bug report, tried unloading the rdma_ucm kernel module, and that worked.

However, the rdma-core package is not actually installed on my systems. I'm using Mellanox's OFED v4.7 distribution, specifically: MLNX_OFED_LINUX-4.7-3.2.9.0-ubuntu19.10-x86_64.tgz
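
Since the trigger seems to be the loaded module rather than the rdma-core package itself, a check keyed on the module name is more useful than a package query. Below is a minimal sketch of the workaround described in this comment, assuming the standard `/proc/modules` layout (module name as the first whitespace-separated field) and the standard `modprobe -r` command. The helper names (`loaded_modules`, `is_loaded`, `unload_if_loaded`) are hypothetical, and the real unload needs root and will fail if something still holds the module.

```python
import subprocess


def loaded_modules(proc_modules_text):
    """Return the set of module names in /proc/modules-formatted text."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def is_loaded(name, proc_modules_text):
    """True if the named module appears in the given /proc/modules text."""
    return name in loaded_modules(proc_modules_text)


def unload_if_loaded(name="rdma_ucm", path="/proc/modules", dry_run=True):
    """Unload the module if it is currently loaded.

    Returns True if the module was found, False otherwise. With
    dry_run=True (the default) the command is only printed.
    """
    with open(path) as fh:
        text = fh.read()
    if not is_loaded(name, text):
        return False
    cmd = ["modprobe", "-r", name]
    if dry_run:
        print("would run:", " ".join(cmd))
    else:
        # Requires root; raises CalledProcessError if the module is in use.
        subprocess.run(cmd, check=True)
    return True
```

Calling `unload_if_loaded()` before starting glusterd only prints the command; pass `dry_run=False` to actually unload. Note that this gives up RDMA connection management from user space, so it is a way to get glusterd started, not a fix for RDMA transport.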

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in glusterfs (Ubuntu):
status: New → Confirmed