Hmm, we've got some counter indicators here.
lvs claims that the volumes are active, but the probe itself
is showing problems reading the volumes.
XFS is telling us that it cannot write its journal to disk:
[1039812.311433] Filesystem "dm-4": Log I/O Error Detected. Shutting down filesystem: dm-4
"fs/xfs/xfs_rw.c"

    /*
     * Force a shutdown of the filesystem instantly while keeping
     * the filesystem consistent. We don't do an unmount here; just shutdown
     * the shop, make sure that absolutely nothing persistent happens to
     * this filesystem after this point.
     */
    void
    xfs_do_force_shutdown(
    ...
            if (flags & SHUTDOWN_CORRUPT_INCORE) {
                    xfs_cmn_err(XFS_PTAG_SHUTDOWN_CORRUPT, CE_ALERT, mp,
            "Corruption of in-memory data detected. Shutting down filesystem: %s",
                            mp->m_fsname);
                    if (XFS_ERRLEVEL_HIGH <= xfs_error_level) {
                            xfs_stack_trace();
                    }
            } else if (!(flags & SHUTDOWN_FORCE_UMOUNT)) {
                    if (logerror) {
                            xfs_cmn_err(XFS_PTAG_SHUTDOWN_LOGERROR, CE_ALERT, mp,
                    "Log I/O Error Detected. Shutting down filesystem: %s",
                                    mp->m_fsname);
The code conveniently tells us where it was called from, too:

    void
    xlog_iodone(xfs_buf_t *bp)
    {
    ...
            /*
             * Race to shutdown the filesystem if we see an error.
             */
            if (XFS_TEST_ERROR((XFS_BUF_GETERROR(bp)), l->l_mp,
                            XFS_ERRTAG_IODONE_IOERR, XFS_RANDOM_IODONE_IOERR)) {
                    xfs_ioerror_alert("xlog_iodone", l->l_mp, bp, XFS_BUF_ADDR(bp));
                    XFS_BUF_STALE(bp);
                    xfs_force_shutdown(l->l_mp, SHUTDOWN_LOG_IO_ERROR);
                    /*
                     * This flag will be propagated to the trans-committed
                     * callback routines to let them know that the log-commit
                     * didn't succeed.
                     */
                    aborted = XFS_LI_ABORTED;
I assume dm-4 is the LV that XFS is mounted on; did you run the dd test on that?
I'm starting to wonder if the LVM device filter is lying to us: after failover,
something changes that misrepresents the LV, and then XFS bails out.
If you can run that dd successfully against every PV that backs dm-4, then there's
something wrong with the DM map for those LVs after failover occurs.
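To be explicit about what I mean by the dd test, here is a sketch of the read probe; the device paths in the comment are examples only, not your actual layout:

```shell
#!/bin/sh
# Read probe for a block device: pull the first 64 MiB and report only
# success/failure. Pass a non-empty second argument to enable O_DIRECT
# and bypass the page cache (preferred when probing real devices).
probe() {
    if dd if="$1" of=/dev/null bs=1M count=64 ${2:+iflag=direct} 2>/dev/null; then
        echo "$1: OK"
    else
        echo "$1: READ FAILED"
    fi
}

# Example (hypothetical device names): probe the LV, the multipath map,
# and each underlying sd device.
# for dev in /dev/vg0/lv_data /dev/mapper/mpath0 /dev/sdb /dev/sdc; do
#     probe "$dev" direct
# done
```

Running it against every layer (LV, multipath map, sd devices) tells us at which layer the reads start failing.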
OK, what I need from you now is a before and after (same fault injection method) of:
0) ls -lR /dev/ > dev_major_minor.log
1) lvs -o lv_attr
2) pvdisplay -vvv
3) lvdisplay -vvv
4) dmsetup table -v
5) "dd test" on all block devices: LVs, multipath maps, and the underlying sd devices
6) dmesg output
Please attach this as a single tarball that has a timestamp in the filename
and has a directory structure of:
foo.tgz
before/
after/
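The collection steps above can be wrapped in a small script; the tarball prefix and log file names here are just an example layout, and the dd test (step 5) still needs to be run separately against your actual devices:

```shell
#!/bin/sh
# Collect the requested state into a directory; each tool's output goes
# to its own log file. Tools that aren't installed are skipped.
collect() {
    out="$1"
    mkdir -p "$out"
    ls -lR /dev/ > "$out/dev_major_minor.log" 2>&1
    for cmd in "lvs -o lv_attr" "pvdisplay -vvv" "lvdisplay -vvv" \
               "dmsetup table -v" "dmesg"; do
        tool="${cmd%% *}"
        command -v "$tool" >/dev/null 2>&1 || continue
        $cmd > "$out/$tool.log" 2>&1
    done
}

ts=$(date +%Y%m%d-%H%M%S)
top="failover-logs-$ts"
collect "$top/before"
# ... inject the fault here, then:
collect "$top/after"
tar czf "$top.tgz" "$top"
```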
If this all checks out, then what's probably happening is that when
multipath begins the failover process, there's enough of a delay that
XFS simply bails out early before I/O is ready to be sent down the
remaining paths. The group_by_prio path grouping policy may perform
better here and is something you can test.
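For reference, a minimal multipath.conf fragment along those lines (the values are illustrative, not tuned for your array, and your device section may override them):

```
defaults {
    path_grouping_policy    group_by_prio
    failback                immediate
    # Queue I/O instead of erroring it out while no paths are up; this
    # is the usual knob for surviving a slow failover window.
    no_path_retry           30
}
```

`no_path_retry` (or the older `features "1 queue_if_no_path"`) is worth a look for the same reason: it controls whether I/O fails immediately or queues while paths are being restored, which is exactly the window where XFS appears to be giving up.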
I looked at the XFS mount arguments and didn't find anything that would
make it more lenient in these situations.
If you can manage it, a LUN formatted with ext3 under these circumstances
would help in ruling out whether the filesystem is part of the problem.
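On a spare LUN you'd run mkfs.ext3 directly against the multipath device; as a syntax check only (a loopback image obviously doesn't exercise multipath), the same commands look like this, with arbitrary filenames:

```shell
#!/bin/sh
# Stand-in for formatting a spare LUN with ext3. Against real hardware,
# replace ext3test.img with the multipath device, e.g. /dev/mapper/mpathN
# (hypothetical name) -- and note mkfs is destructive, so double-check it.
dd if=/dev/zero of=ext3test.img bs=1M count=64 2>/dev/null
if command -v mkfs.ext3 >/dev/null 2>&1; then
    mkfs.ext3 -F -q ext3test.img
fi
# mount -o loop ext3test.img /mnt/ext3test   # then repeat the fault injection
```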
Thanks.