If I wipe the log on the XFS filesystem (xfs_repair -L), I can repair the filesystem and mount it. On every drive I've repaired so far, xfs_repair reports "flfirst 118 in agf 0 too large (max = 118)". It seems likely that this filesystem corruption is a factor in the Oops.
Having said that, I don't think that a kernel Oops is ever a good response to filesystem corruption.
For the record, this is the result of one of the repairs:
root@ceph-store5:~# xfs_repair -L /dev/disk/by-uuid/7fdfdbfc-0781-43d2-b8e2-58aa913fd823
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
- scan filesystem freespace and inode maps...
Metadata corruption detected at block 0x1/0x200
flfirst 118 in agf 0 too large (max = 118)
sb_ifree 1088217, counted 1085130
sb_fdblocks 427029509, counted 423420685
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
Phase 5 - rebuild AG headers and trees...
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done
root@ceph-store5:~# mount /var/lib/ceph/osd/ceph-33
root@ceph-store5:~# start ceph-osd id=33
ceph-osd (ceph/33) start/running, process 384759
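To compare repairs across drives, the superblock counter mismatches in the output above can be summarised with a small filter. This is only a sketch: the function name sb_delta is made up for illustration, and it assumes the "sb_<name> <recorded>, counted <counted>" line format shown in the transcript.

```shell
#!/bin/sh
# Hypothetical helper (not part of xfsprogs): read xfs_repair output on stdin
# and print the recorded value, counted value and difference for each
# superblock counter line of the form "sb_<name> <recorded>, counted <counted>".
sb_delta() {
    awk '$1 ~ /^sb_/ && $3 == "counted" {
        recorded = $2
        sub(/,/, "", recorded)            # drop the trailing comma
        printf "%s recorded=%d counted=%d delta=%d\n", $1, recorded, $4, recorded - $4
    }'
}

# Example using the two counter lines from the repair above.
printf '%s\n' \
    'sb_ifree 1088217, counted 1085130' \
    'sb_fdblocks 427029509, counted 423420685' | sb_delta
```

For this drive it reports a delta of 3087 for sb_ifree and 3608824 for sb_fdblocks. Saved repair output from other drives could be piped through the same function to see whether the mismatches are similar.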