ISST-SAN: Filesystem converted into read only after interface failover
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
linux (Ubuntu) |
Expired
|
Medium
|
Unassigned |
Bug Description
== Comment: #0 ==
I was running interface failover tests on the Storage Texan2 ( TMS9840). zop03-01 has disks coming from Texan2 via NIPV.
Interface failover basically brings down a interface, so that other inetrfaces take over, waits for 10 min and then brings it back up. There are 4 interfaces on texan2.
While the third one was brought down, the IO on zop03-01 was stopped. I was running tests to create directories and files on the FS which were created on the multipath disks. The tests suddenly started to fail when the interface failover happened on the 3rd interface.
Now i see that when i login to the system everything is read only.
root@zop03-01:~# touch abc
touch: cannot touch ?abc?: Read-only file system
root@zop03-01:~#
root@zop03-01:~# lsb_release -sc; uname -m; uname -r
vivid
ppc64le
3.19.0-9-generic
root@zop03-01:~# less /etc/fstab
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point> <type> <options> <dump> <pass>
/dev/mapper/
/dev/mapper/
kte:/kte /kte nfs soft,rw,
kte:/data /data nfs soft,rw,
kte:/distros /distros nfs soft,rw,
kte:/images /images nfs soft,rw,
root@zop03-01:~#
root@zop03-01:~# df -lh
Filesystem Size Used Avail Use% Mounted on
udev 7.5G 0 7.5G 0% /dev
tmpfs 1.6G 38M 1.5G 3% /run
/dev/sdah2 48G 29G 16G 65% /
tmpfs 7.6G 0 7.6G 0% /dev/shm
tmpfs 5.0M 128K 4.9M 3% /run/lock
tmpfs 7.6G 0 7.6G 0% /sys/fs/cgroup
tmpfs 1.6G 0 1.6G 0% /run/user/0
root@zop03-01:~#
root@zop03-01:~# fsck.ext4 /dev/sdah2
e2fsck 1.42.12 (29-Aug-2014)
/dev/sdah2: recovering journal
/dev/sdah2 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (4957824, counted=5069276).
Fix<y>? yes
Free inodes count wrong (2952662, counted=2953167).
Fix<y>? yes
/dev/sdah2: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sdah2: ***** REBOOT LINUX *****
/dev/sdah2: 186417/3139584 files (1.2% non-contiguous), 7488804/12558080 blocks
root@zop03-01:~#
=================
== Comment: #6 ==
There is no issue now. System is all fine. But i can recreate it easily. If i start IO tests and parallelly start interface failover tests, it can be recreated.
But i just dont want it to get into the same state. and you do an fsck, reboot solves the problem.
We need to get into the root cause for why the system is going to bad state.
== Comment: #8 ==
Re-creating the issue with more details:
The IO tests were running ( IO tests basically create directories and files on the FS).
I started the interface failover tests. It basically fails the interface on the SAN subsystem and failsover to the next available interface.There are such 4 interfaces available.The SAN is texan2.
WHen the interface was failed, The IO halted for a moment, by that i mean:
Created Directory /FS3-part1/test1 rc = 0 at 04/17/2015 12:24:05
Created Directory /FS4-part1/test1 rc = 0 at 04/17/2015 12:24:05
Created Directory /FS3-part2/test1 rc = 0 at 04/17/2015 12:24:05
Created Directory /FS1-part3/test1 rc = 0 at 04/17/2015 12:24:05
Created Directory /FS1-part1/test1 rc = 0 at 04/17/2015 12:24:06
Created Directory /FS7-part1/test1 rc = 0 at 04/17/2015 12:24:07
Created Directory /FS0-part2/test1 rc = 0 at 04/17/2015 12:24:08
Created Directory /FS0-part1/test1 rc = 0 at 04/17/2015 12:24:08
Created Directory /FS0-part3/test1 rc = 0 at 04/17/2015 12:24:09
=======> Here it halted
On the console, i was seeing messages like:
root@zop03-01:~# [35488.779299] sd 4:0:0:5: [sdab] Command (2A) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35488.779344] sd 4:0:0:5: [sdab] Command (2A) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35489.803377] sd 2:0:1:0: [sdf] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35489.803646] sd 2:0:1:0: [sdf] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.529306] sd 3:0:0:0: [sdk] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.538252] sd 3:0:0:1: [sdl] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.538306] sd 5:0:0:0: [sdai] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.548175] sd 5:0:0:1: [sdaj] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.548179] sd 3:0:0:2: [sdm] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.548216] sd 3:0:0:2: [sdm] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.549267] sd 5:0:0:2: [sdak] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.549340] sd 3:0:0:3: [sdn] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.549361] sd 3:0:0:3: [sdn] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.558355] sd 5:0:0:3: [sdal] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.558360] sd 3:0:0:4: [sdo] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.558367] sd 3:0:0:4: [sdo] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.568328] sd 5:0:0:4: [sdam] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35524.841752] sd 5:0:0:1: [sdaj] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35524.842736] sd 5:0:0:2: [sdak] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35524.843690] sd 5:0:0:0: [sdai] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35524.844659] sd 5:0:0:3: [sdal] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35524.845630] sd 5:0:0:4: [sdam] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35525.481757] sd 3:0:0:0: [sdk] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35525.482930] sd 3:0:0:3: [sdn] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35525.482944] sd 3:0:0:3: [sdn] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35525.483939] sd 3:0:0:4: [sdo] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35525.483954] sd 3:0:0:4: [sdo] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status:
Till here the machine was all fine, in rw state.
Now, after 4 minutes of time:
I suddenly saw the errors of IO failing:
Write BLAST Failed on /blast/
Error accessing file /blast/
Error Code 30
---> Replacement print file will be created ...Error accessing file /blast/
Error Code -1
*******
On Console:
2100:100) unknown fc type (0) no additional explanation (0) rc=0xF7
[35618.852794] ibmvfc 30000003: 298400: Port Login failed: unable to establish (2100:100) unknown fc type (0) no additional explanation (0) rc=0xF7
[35640.184808] INFO: task jbd2/sdab2-8:13145 blocked for more than 120 seconds.
[35640.184829] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.184833] "echo 0 > /proc/sys/
[35640.184904] INFO: task kworker/
[35640.184910] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.184913] "echo 0 > /proc/sys/
[35640.185000] INFO: task latest_blast:18543 blocked for more than 120 seconds.
[35640.185005] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185008] "echo 0 > /proc/sys/
[35640.185057] INFO: task latest_blast:18544 blocked for more than 120 seconds.
[35640.185062] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185065] "echo 0 > /proc/sys/
[35640.185123] INFO: task latest_blast:18545 blocked for more than 120 seconds.
[35640.185128] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185131] "echo 0 > /proc/sys/
[35640.185187] INFO: task latest_blast:18546 blocked for more than 120 seconds.
[35640.185192] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185195] "echo 0 > /proc/sys/
[35640.185242] INFO: task latest_blast:18547 blocked for more than 120 seconds.
[35640.185247] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185250] "echo 0 > /proc/sys/
[35640.185297] INFO: task latest_blast:18548 blocked for more than 120 seconds.
[35640.185302] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185305] "echo 0 > /proc/sys/
[35640.185350] INFO: task latest_blast:18549 blocked for more than 120 seconds.
[35640.185355] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185358] "echo 0 > /proc/sys/
[35640.185405] INFO: task latest_blast:18550 blocked for more than 120 seconds.
[35640.185410] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185413] "echo 0 > /proc/sys/
[35676.081382] blk_update_request: I/O error, dev sdf, sector 0
[35676.264929] blk_update_request: I/O error, dev sdj, sector 0
[35676.264951] blk_update_request: I/O error, dev sdg, sector 0
[35676.264992] blk_update_request: I/O error, dev sdh, sector 0
[35676.265026] blk_update_request: I/O error, dev sdi, sector 0
[35677.253544] blk_update_request: I/O error, dev sdab, sector 46601528
[35677.253550] blk_update_request: I/O error, dev sdaa, sector 0
[35677.253609] blk_update_request: I/O error, dev sdy, sector 0
[35677.253658] blk_update_request: I/O error, dev sdz, sector 0
[35677.253690] blk_update_request: I/O error, dev sdw, sector 0
[35677.254020] Aborting journal on device sdab2-8.
[35677.254037] EXT4-fs error (device sdab2) in add_dirent_
[35677.254138] EXT4-fs error (device sdab2) in ext4_dx_
[35677.254151] EXT4-fs error (device sdab2) in __ext4_
[35677.254204] EXT4-fs error (device sdab2) in __ext4_
[35677.255088] EXT4-fs (sdab2): Remounting filesystem read-only
[35677.255096] EXT4-fs error (device sdab2): ext4_journal_
[35677.255100] EXT4-fs error (device sdab2): ext4_journal_
[35677.255101] EXT4-fs error (device sdab2): ext4_journal_
[35677.255102] EXT4-fs (sdab2): Remounting filesystem read-only
[35677.255103] EXT4-fs (sdab2): Remounting filesystem read-only
[35677.255123] EXT4-fs error (device sdab2): ext4_journal_
[35677.255130] EXT4-fs (sdab2): ext4_writepages: jbd2_start: 3648 pages, ino 1835459; err -30
[35677.255179] EXT4-fs error (device sdab2) in ext4_evict_
[35677.255557] EXT4-fs error (device sdab2): ext4_journal_
[35677.255581] EXT4-fs error (device sdab2): ext4_journal_
[35677.266659] EXT4-fs error (device sdab2): ext4_journal_
[35677.266680] EXT4-fs error (device sdab2): ext4_journal_
[35677.266759] EXT4-fs error (device sdab2): ext4_journal_
[35677.266777] EXT4-fs error (device sdab2): ext4_journal_
[35677.266789] EXT4-fs error (device sdab2): ext4_journal_
[35677.266829] EXT4-fs error (device sdab2): ext4_journal_
[35677.266857] EXT4-fs error (device sdab2): ext4_journal_
[35677.266883] EXT4-fs error (device sdab2): ext4_journal_
[35677.266907] EXT4-fs error (device sdab2): ext4_journal_
[35677.266981] EXT4-fs error (device sdab2): ext4_journal_
[35690.568831] ibmvfc 30000004: 298400: Port Login failed: unable to establish (2100:100) unknown fc type (0) no additional explanation (0) rc=0xF7
[35691.175548] ibmvfc 30000006: 298400: Port Login failed: unable to establish (2100:100) unknown fc type (0) no additional explanation (0) rc=0xF7
and then i see that
root@zop03-01:~# touch abc
touch: cannot touch ?abc?: Read-only file system
root@zop03-01:~#
It has gone into RO mode again. I will atatch the dmesg and syslog in separate attachments.
Changed in linux (Ubuntu): | |
status: | New → Confirmed |
importance: | Undecided → Medium |
Default Comment by Bridge