ISST-SAN: Filesystem converted into read only after interface failover

Bug #1446064 reported by bugproxy
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Expired
Medium
Unassigned

Bug Description

== Comment: #0 ==
I was running interface failover tests on the Storage Texan2 ( TMS9840). zop03-01 has disks coming from Texan2 via NIPV.

Interface failover basically brings down a interface, so that other inetrfaces take over, waits for 10 min and then brings it back up. There are 4 interfaces on texan2.

While the third one was brought down, the IO on zop03-01 was stopped. I was running tests to create directories and files on the FS which were created on the multipath disks. The tests suddenly started to fail when the interface failover happened on the 3rd interface.

Now i see that when i login to the system everything is read only.

root@zop03-01:~# touch abc
touch: cannot touch ?abc?: Read-only file system
root@zop03-01:~#

root@zop03-01:~# lsb_release -sc; uname -m; uname -r
vivid
ppc64le
3.19.0-9-generic

root@zop03-01:~# less /etc/fstab
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point> <type> <options> <dump> <pass>
/dev/mapper/mpath10-part2 / ext4 errors=remount-ro 0 1
/dev/mapper/mpath10-part3 none swap sw 0 0
kte:/kte /kte nfs soft,rw,nolock,auto,exec 0 0
kte:/data /data nfs soft,rw,nolock,auto,exec 0 0
kte:/distros /distros nfs soft,rw,nolock,auto,exec 0 0
kte:/images /images nfs soft,rw,nolock,auto,exec 0 0
root@zop03-01:~#

root@zop03-01:~# df -lh
Filesystem Size Used Avail Use% Mounted on
udev 7.5G 0 7.5G 0% /dev
tmpfs 1.6G 38M 1.5G 3% /run
/dev/sdah2 48G 29G 16G 65% /
tmpfs 7.6G 0 7.6G 0% /dev/shm
tmpfs 5.0M 128K 4.9M 3% /run/lock
tmpfs 7.6G 0 7.6G 0% /sys/fs/cgroup
tmpfs 1.6G 0 1.6G 0% /run/user/0
root@zop03-01:~#

root@zop03-01:~# fsck.ext4 /dev/sdah2
e2fsck 1.42.12 (29-Aug-2014)
/dev/sdah2: recovering journal
/dev/sdah2 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (4957824, counted=5069276).
Fix<y>? yes
Free inodes count wrong (2952662, counted=2953167).
Fix<y>? yes

/dev/sdah2: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sdah2: ***** REBOOT LINUX *****
/dev/sdah2: 186417/3139584 files (1.2% non-contiguous), 7488804/12558080 blocks
root@zop03-01:~#
=================

== Comment: #6 ==

There is no issue now. System is all fine. But i can recreate it easily. If i start IO tests and parallelly start interface failover tests, it can be recreated.

But i just dont want it to get into the same state. and you do an fsck, reboot solves the problem.

We need to get into the root cause for why the system is going to bad state.

== Comment: #8 ==

Re-creating the issue with more details:

The IO tests were running ( IO tests basically create directories and files on the FS).
I started the interface failover tests. It basically fails the interface on the SAN subsystem and failsover to the next available interface.There are such 4 interfaces available.The SAN is texan2.

WHen the interface was failed, The IO halted for a moment, by that i mean:

Created Directory /FS3-part1/test1 rc = 0 at 04/17/2015 12:24:05
Created Directory /FS4-part1/test1 rc = 0 at 04/17/2015 12:24:05
Created Directory /FS3-part2/test1 rc = 0 at 04/17/2015 12:24:05
Created Directory /FS1-part3/test1 rc = 0 at 04/17/2015 12:24:05
Created Directory /FS1-part1/test1 rc = 0 at 04/17/2015 12:24:06
Created Directory /FS7-part1/test1 rc = 0 at 04/17/2015 12:24:07
Created Directory /FS0-part2/test1 rc = 0 at 04/17/2015 12:24:08
Created Directory /FS0-part1/test1 rc = 0 at 04/17/2015 12:24:08
Created Directory /FS0-part3/test1 rc = 0 at 04/17/2015 12:24:09

=======> Here it halted

On the console, i was seeing messages like:

root@zop03-01:~# [35488.779299] sd 4:0:0:5: [sdab] Command (2A) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35488.779344] sd 4:0:0:5: [sdab] Command (2A) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35489.803377] sd 2:0:1:0: [sdf] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35489.803646] sd 2:0:1:0: [sdf] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.529306] sd 3:0:0:0: [sdk] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.538252] sd 3:0:0:1: [sdl] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.538306] sd 5:0:0:0: [sdai] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.548175] sd 5:0:0:1: [sdaj] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.548179] sd 3:0:0:2: [sdm] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.548216] sd 3:0:0:2: [sdm] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.549267] sd 5:0:0:2: [sdak] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.549340] sd 3:0:0:3: [sdn] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.549361] sd 3:0:0:3: [sdn] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.558355] sd 5:0:0:3: [sdal] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.558360] sd 3:0:0:4: [sdo] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.558367] sd 3:0:0:4: [sdo] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.568328] sd 5:0:0:4: [sdam] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35524.841752] sd 5:0:0:1: [sdaj] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35524.842736] sd 5:0:0:2: [sdak] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35524.843690] sd 5:0:0:0: [sdai] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35524.844659] sd 5:0:0:3: [sdal] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35524.845630] sd 5:0:0:4: [sdam] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35525.481757] sd 3:0:0:0: [sdk] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35525.482930] sd 3:0:0:3: [sdn] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35525.482944] sd 3:0:0:3: [sdn] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35525.483939] sd 3:0:0:4: [sdo] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35525.483954] sd 3:0:0:4: [sdo] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status:

Till here the machine was all fine, in rw state.

Now, after 4 minutes of time:

I suddenly saw the errors of IO failing:
Write BLAST Failed on /blast/log/BLAST_zop03-01_root_04_17_2015_12_20_31.prt prt file - error = 30 EROFS Read-only file system
Error accessing file /blast/log/BLAST_zop03-01_root_04_17_2015_12_20_31.prt
Error Code 30
---> Replacement print file will be created ...Error accessing file /blast/log/BLAST_zop03-01_root_04_17_2015_12_28_17.prt
Error Code -1

***********************

On Console:
2100:100) unknown fc type (0) no additional explanation (0) rc=0xF7
[35618.852794] ibmvfc 30000003: 298400: Port Login failed: unable to establish (2100:100) unknown fc type (0) no additional explanation (0) rc=0xF7
[35640.184808] INFO: task jbd2/sdab2-8:13145 blocked for more than 120 seconds.
[35640.184829] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.184833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35640.184904] INFO: task kworker/u320:1:18262 blocked for more than 120 seconds.
[35640.184910] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.184913] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35640.185000] INFO: task latest_blast:18543 blocked for more than 120 seconds.
[35640.185005] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185008] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35640.185057] INFO: task latest_blast:18544 blocked for more than 120 seconds.
[35640.185062] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185065] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35640.185123] INFO: task latest_blast:18545 blocked for more than 120 seconds.
[35640.185128] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185131] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35640.185187] INFO: task latest_blast:18546 blocked for more than 120 seconds.
[35640.185192] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185195] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35640.185242] INFO: task latest_blast:18547 blocked for more than 120 seconds.
[35640.185247] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185250] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35640.185297] INFO: task latest_blast:18548 blocked for more than 120 seconds.
[35640.185302] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185305] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35640.185350] INFO: task latest_blast:18549 blocked for more than 120 seconds.
[35640.185355] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185358] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35640.185405] INFO: task latest_blast:18550 blocked for more than 120 seconds.
[35640.185410] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185413] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35676.081382] blk_update_request: I/O error, dev sdf, sector 0
[35676.264929] blk_update_request: I/O error, dev sdj, sector 0
[35676.264951] blk_update_request: I/O error, dev sdg, sector 0
[35676.264992] blk_update_request: I/O error, dev sdh, sector 0
[35676.265026] blk_update_request: I/O error, dev sdi, sector 0
[35677.253544] blk_update_request: I/O error, dev sdab, sector 46601528
[35677.253550] blk_update_request: I/O error, dev sdaa, sector 0
[35677.253609] blk_update_request: I/O error, dev sdy, sector 0
[35677.253658] blk_update_request: I/O error, dev sdz, sector 0
[35677.253690] blk_update_request: I/O error, dev sdw, sector 0
[35677.254020] Aborting journal on device sdab2-8.
[35677.254037] EXT4-fs error (device sdab2) in add_dirent_to_buf:1699: Journal has aborted
[35677.254138] EXT4-fs error (device sdab2) in ext4_dx_add_entry:2077: Journal has aborted
[35677.254151] EXT4-fs error (device sdab2) in __ext4_new_inode:850: Journal has aborted
[35677.254204] EXT4-fs error (device sdab2) in __ext4_new_inode:850: Journal has aborted
[35677.255088] EXT4-fs (sdab2): Remounting filesystem read-only
[35677.255096] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.255100] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.255101] EXT4-fs error (device sdab2): ext4_journal_check_start:56:
[35677.255102] EXT4-fs (sdab2): Remounting filesystem read-only
[35677.255103] EXT4-fs (sdab2): Remounting filesystem read-only
[35677.255123] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.255130] EXT4-fs (sdab2): ext4_writepages: jbd2_start: 3648 pages, ino 1835459; err -30
[35677.255179] EXT4-fs error (device sdab2) in ext4_evict_inode:243: Journal has aborted
[35677.255557] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal[35677.255570] EXT4-fs error (device sdab2) in ext4_evict_inode:243: Journal has aborted
[35677.255581] EXT4-fs error (device sdab2): ext4_journal_check_start:56: EXT4-fs error (device sdab2): ext4_journal_check_start:56:
[35677.266659] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.266680] EXT4-fs error (device sdab2): ext4_journal_check_start:56: [35677.266697] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal[35677.266712] EXT4-fs error (device sdab2): ext4_journal_check_start:56: [35677.266725] EXT4-fs error (device sdab2): ext4_journal_check_start:56: [35677.266737] EXT4-fs error (device sdab2): ext4_journal_check_start:56:
[35677.266759] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.266777] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.266789] EXT4-fs error (device sdab2): ext4_journal_check_start:56: [35677.266801] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.266829] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.266857] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.266883] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.266907] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.266981] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35690.568831] ibmvfc 30000004: 298400: Port Login failed: unable to establish (2100:100) unknown fc type (0) no additional explanation (0) rc=0xF7
[35691.175548] ibmvfc 30000006: 298400: Port Login failed: unable to establish (2100:100) unknown fc type (0) no additional explanation (0) rc=0xF7

and then i see that

root@zop03-01:~# touch abc
touch: cannot touch ?abc?: Read-only file system
root@zop03-01:~#

It has gone into RO mode again. I will atatch the dmesg and syslog in separate attachments.

Revision history for this message
bugproxy (bugproxy) wrote : dmesg after recreate

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-124029 severity-critical targetmilestone-inin---
Revision history for this message
bugproxy (bugproxy) wrote : syslog of recreate

Default Comment by Bridge

Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1446064/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
Revision history for this message
Mathieu Trudel-Lapierre (cyphermox) wrote :

Reassigning to linux; this seems most likely to be a filesystem/ibmvfc driver issue.

affects: ubuntu → linux (Ubuntu)
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2015-05-19 20:54 EDT-------
(In reply to comment #22)
> Reassigning to linux; this seems most likely to be a filesystem/ibmvfc
> driver issue.

Fine, but we are still waiting for answers to the above questions, see comment 19. Like I said: there may in fact be a bug here, but we need to understand if the root volume was intended to be multipathed (i.e. configured in the OS as a multipath group), and if that failed to happen during install (or a subsequent reboot), and if the root volume was part of the failover tests also. Because if the above are true, of course you'll get the result here. Thx.

------- Comment From <email address hidden> 2015-05-19 20:58 EDT-------
Oops, I see the above comment was from Ubuntu, not IBM!. Sorry about that!

But we're waiting on our IBM submitters to answer these questions, since it looks like what I described above is possiblly what happened here. Thanks for you patience on the Ubuntu side.

Chris J Arges (arges)
Changed in linux (Ubuntu):
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
Chris J Arges (arges) wrote :

Since there have been some updates to multipath, can this be retested with the latest linux/multipath-tools etc?

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.