Activity log for bug #1709889

Date Who What changed Old value New value Message
2017-08-10 14:29:26 bugproxy bug added bug
2017-08-10 14:29:29 bugproxy tags architecture-ppc64le bugnameltc-152603 severity-critical targetmilestone-inin1704
2017-08-10 14:29:31 bugproxy attachment added sorted output from htxerr https://bugs.launchpad.net/bugs/1709889/+attachment/4930092/+files/htxerr-19.lba
2017-08-10 14:29:32 bugproxy ubuntu: assignee Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
2017-08-10 14:29:34 bugproxy affects ubuntu linux (Ubuntu)
2017-08-10 14:44:35 Andrew Cloke bug task added ubuntu-power-systems
2017-08-10 14:44:51 Andrew Cloke ubuntu-power-systems: importance Undecided Critical
2017-08-10 14:45:00 Andrew Cloke ubuntu-power-systems: assignee Canonical Kernel Team (canonical-kernel-team)
2017-08-11 18:21:58 Joseph Salisbury linux (Ubuntu): status New In Progress
2017-08-11 18:22:01 Joseph Salisbury linux (Ubuntu): importance Undecided Critical
2017-08-11 18:22:04 Joseph Salisbury linux (Ubuntu): assignee Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) Joseph Salisbury (jsalisbury)
2017-08-11 18:22:12 Joseph Salisbury nominated for series Ubuntu Zesty
2017-08-11 18:22:12 Joseph Salisbury bug task added linux (Ubuntu Zesty)
2017-08-11 18:22:17 Joseph Salisbury linux (Ubuntu Zesty): status New In Progress
2017-08-11 18:22:20 Joseph Salisbury linux (Ubuntu Zesty): importance Undecided Critical
2017-08-11 18:22:23 Joseph Salisbury linux (Ubuntu Zesty): assignee Joseph Salisbury (jsalisbury)
2017-08-11 19:55:38 Frank Heimes ubuntu-power-systems: status New In Progress
2017-08-15 14:04:18 Joseph Salisbury linux (Ubuntu): assignee Joseph Salisbury (jsalisbury) Colin Ian King (colin-king)
2017-08-15 14:04:31 Joseph Salisbury linux (Ubuntu Zesty): assignee Joseph Salisbury (jsalisbury) Colin Ian King (colin-king)
2017-08-17 12:50:05 bugproxy attachment added Patch with 3 patches plus debugging https://bugs.launchpad.net/bugs/1709889/+attachment/4934188/+files/fix-patch-3
2017-08-23 13:30:11 bugproxy bug watch added https://bugzilla.kernel.org/show_bug.cgi?id=196737
2017-08-23 14:03:48 Colin Ian King bug task added linux
2017-08-28 19:05:36 Manoj Iyer tags architecture-ppc64le bugnameltc-152603 severity-critical targetmilestone-inin1704 architecture-ppc64le bugnameltc-152603 severity-critical targetmilestone-inin1704 triage-r
2017-08-28 19:16:24 Manoj Iyer tags architecture-ppc64le bugnameltc-152603 severity-critical targetmilestone-inin1704 triage-r architecture-ppc64le bugnameltc-152603 severity-critical targetmilestone-inin1704 triage-g
2017-09-18 13:50:18 Andrew Cloke ubuntu-power-systems: status In Progress Incomplete
2018-01-18 05:17:32 Steve Langasek linux (Ubuntu Zesty): status In Progress Won't Fix
2018-01-22 14:47:32 Manoj Iyer linux (Ubuntu): status In Progress Invalid
2018-01-22 14:47:45 Manoj Iyer ubuntu-power-systems: status Incomplete Invalid
2018-05-14 17:50:42 bugproxy tags architecture-ppc64le bugnameltc-152603 severity-critical targetmilestone-inin1704 triage-g architecture-ppc64le bugnameltc-152603 severity-critical targetmilestone-inin1804 triage-g
2018-07-09 17:29:44 bugproxy tags architecture-ppc64le bugnameltc-152603 severity-critical targetmilestone-inin1804 triage-g architecture-ppc64le bugnameltc-152603 severity-high targetmilestone-inin1804 triage-g
2018-07-31 18:00:52 Dimitri John Ledkov linux (Ubuntu): status Invalid New
2018-07-31 18:00:56 Dimitri John Ledkov linux (Ubuntu): importance Critical Undecided
2018-07-31 18:00:59 Dimitri John Ledkov linux (Ubuntu): assignee Colin Ian King (colin-king)
2018-07-31 18:01:02 Dimitri John Ledkov linux (Ubuntu Zesty): assignee Colin Ian King (colin-king)
2018-07-31 18:01:06 Dimitri John Ledkov ubuntu-power-systems: status Invalid New
2018-07-31 18:01:08 Dimitri John Ledkov ubuntu-power-systems: importance Critical Undecided
2018-08-03 15:23:32 Joseph Salisbury nominated for series Ubuntu Bionic
2018-08-03 15:23:32 Joseph Salisbury bug task added linux (Ubuntu Bionic)
2018-08-03 15:23:40 Joseph Salisbury linux (Ubuntu Bionic): status New Triaged
2018-08-03 15:23:43 Joseph Salisbury linux (Ubuntu Bionic): importance Undecided Critical
2018-08-03 15:23:53 Joseph Salisbury tags architecture-ppc64le bugnameltc-152603 severity-high targetmilestone-inin1804 triage-g architecture-ppc64le bugnameltc-152603 kernel-da-key severity-high targetmilestone-inin1804 triage-g
2018-08-03 15:25:03 Joseph Salisbury summary Ubuntu 17.04: Bug in cfq scheduler, I/Os do not get submitted to adapter for a very long time. Ubuntu 18.04: Bug in cfq scheduler, I/Os do not get submitted to adapter for a very long time.
2018-08-06 13:09:27 Manoj Iyer ubuntu-power-systems: importance Undecided Critical
2018-08-06 13:09:29 Manoj Iyer linux (Ubuntu): importance Undecided Critical
2018-08-06 13:09:47 Manoj Iyer ubuntu-power-systems: status New Triaged
2018-08-06 13:10:08 Manoj Iyer linux (Ubuntu): assignee Canonical Kernel Team (canonical-kernel-team)
2018-09-04 11:34:05 lasantha bug added subscriber lasantha
2018-09-13 13:16:16 Manoj Iyer ubuntu-power-systems: importance Critical Medium
2018-09-13 13:16:19 Manoj Iyer linux (Ubuntu): importance Critical Medium
2018-09-13 13:16:20 Manoj Iyer linux (Ubuntu Zesty): importance Critical Medium
2018-09-13 13:16:22 Manoj Iyer linux (Ubuntu Bionic): importance Critical Medium
2018-10-19 08:51:27 Christian Ehrhardt description
---Problem Description---
When running a stress test, we sometimes see hung I/O reported in dmesg or a "Host adapter abort request" error.

---Steps to Reproduce---
There are two ways to re-create the issue:
(1) Run HTX; an I/O timeout backtrace appears in dmesg within several hours.
(2) Run some I/O test, then reboot the system, and repeat these two steps; it takes a long time to re-create the issue this way.

---uname output---
4.10.0-11-generic

The bulk of the effort for this issue is currently being worked in MicroSemi's JIRA https://jira.pmcs.com/browse/ESDIBMOP-133.

Ran an interesting test: Ran HTX until I started getting the "stall" messages on the console, then shut down HTX and examined the I/O counters for the tested disks in sysfs:

root@bostonp15:~# for i in /sys/devices/pci0003:00/0003:00:00.0/0003:01:00.0/host0/target0:2:[2345]/0:2:[2345]:0; do echo ${i##*/} $(<${i}/iorequest_cnt) $(<${i}/iodone_cnt); done
0:2:2:0 0x5eba3d 0x5eba3d
0:2:3:0 0x773cc9 0x773cc9
0:2:4:0 0x782c61 0x782c61
0:2:5:0 0x5ca134 0x5ca134
root@bostonp15:~#

So, none of the disks showed any evidence of having lost an I/O. I then restarted HTX and, aside from having to manually restart one of the disks, saw no problems with the testing. It appears that what was "hung" was purely in userland. This does not absolve the kernel or aacraid driver from blame, but it shows that the OS "believes" that it completed the I/O and thus removed it from the queue. What we don't know is whether the OS truly notified HTX about the completion, or if HTX (or userland libraries) just failed to process the notification. Tests are running again, will see what happens next.

Update from JIRA: I have run some more experiments. Not sure what it tells us, but here's what I've seen. First test, ran until I got kernel messages about stalled tasks, then shut down HTX. After HTX was down, I checked the above-mentioned counters and found that on each disk iorequest_cnt matched iodone_cnt.
The disks were usable and I could restart HTX. This suggests that the problem is not in the PM8069 firmware, and makes the case for the aacraid driver having a bug somewhat weaker. However, this merely says that the driver "completed" the I/O as far as the kernel is concerned, not that a completion rippled back to the application.

I restarted HTX and have run until errors. This time, I am leaving HTX running and observing. Two of the disks reached the HTX error threshold and the testers stopped (those 2 disks are now idle). Another disk saw errors, but they then stopped and it appears to be running fine now. The last disk has not seen any errors (yet). On the two idle (errored-out) disks I see iorequest_cnt matches iodone_cnt. I am able to "terminate and restart" the two idle disks and HTX appears to be testing them again "normally". Note that no reboot was required, further supporting the evidence that, as far as the kernel is concerned, there is nothing wrong with the disks and their I/O paths.

So, I don't believe this completely eliminates aacraid from the picture, especially given we don't see this behavior on other systems/drivers. But it probably moves the focus of the investigation away from the adapter firmware.

Tried building the upstream 4.11 kernel on Ubuntu. This still gets the hangs. Both Ubuntu 4.10 and upstream 4.11 have aacraid driver 1.2.1[50792]-custom.

Good news/bad news... While doing an initial evaluation of the LSI-3008 SAS HBA on Boston and Ubuntu 17.04, I am hitting this same problem. So, it appears to have nothing specific to do with the PM8069 or aacraid driver.

Some notes on reproducing this. I have been using the github release of HTX, built using the following steps:
1. apt install make gcc g++ git libncurses5-dev libcxl-dev libdapl-dev (others may be required)
2. git clone https://github.com/open-power/HTX
3. cd HTX
4. make
5. make deb
Then install the resulting "htxubuntu.deb" package.
Note, HTX will not test disks that have a filesystem or OS installed, so there must be at least two disks made available to HTX by clearing any previous data. A partition table is optional; in my testing I have none. Also, it may be desirable to run HTX somewhere other than the console, leaving the console free to watch for messages.

To run:
A. su - htx (this may take some time)
B. htx
C. Select the test file "mdt.io"
D. Hit ENTER for default log file option
E. Once the menu is displayed, select item 2 (Enable/disable hardware to test)
    E1. Enter "h" to disable (halt) all devices testing
    E2. Select at least two disks for testing (enter their line numbers)
    E3. Enter "q" to return to main menu
F. Select item "4" (Continue On Error flags)
    F1. Enter line numbers for each disk previously selected to test.
    F2. Enter "q" to return to main menu.
G. Select item "1" to begin the test exercisers.
H. Optionally, select item "5" to display status of testing.

After about 10-12 hours, there should be a few "INFO: task hxestorage:XXXXX blocked for more than 120 seconds." messages with stack traces. The typical stack trace is:
 sysctl_sched_migration_cost+0x0/0x4 (unreliable)
 __switch_to+0x2c0/0x450
 __schedule+0x2f8/0x990
 schedule+0x48/0xc0
 schedule_timeout+0x274/0x470
 io_schedule_timeout+0xd0/0x160
 debug_schedule+0x318/0x3c0
 __blkdev_direct_IO_simple+0x258/0x440
 blkdev_direct_IO+0x4e0/0x520
 generic_file_read_iter+0x2c8/0xaa0
 blkdev_read_iter+0x50/0x80
 new_sync_read+0xec/0x140
 vfs_read+0xbc/0x1b0
 SyS_read+0x68/0x110
 system_call+0x38/0xe0

About 8 minutes after the "blocked" messages, you should start to see HTX reporting errors in "/tmp/htxerr" (HTX reports errors for I/Os that do not complete in 10 minutes, but continues to run). With added debugging, it was seen that the I/Os do eventually complete, but in some cases it can take over an hour.
It is also observed that I/O traffic continues through these periods of stalls, so only a portion of the total I/O traffic actually gets stalled. The system does not hang, and if HTX is shut down (stopped), any stalled I/Os will complete immediately.

Referencing LP1469829, it seems that it was requested that the "cfq" scheduler not be used by default as it has this exact sort of bug, and that "deadline" should be used instead. Somewhere, the default got reverted back to "cfq", which exposes this bug again. It appears that the bug in "cfq" was never fixed, either.
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1469829

A couple of upstream commits of interest, ordered by perceived relevance:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5be6b75610cefd1e21b98a218211922c2feb6e08
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=142bbdfccc8b3e9f7342f2ce8422e76a3b45beae

---Problem Description---
When running a stress test, we sometimes see hung I/O reported in dmesg or a "Host adapter abort request" error.

---Steps to Reproduce---
There are two ways to re-create the issue:
(1) Run HTX; an I/O timeout backtrace appears in dmesg within several hours.
(2) Run some I/O test, then reboot the system, and repeat these two steps; it takes a long time to re-create the issue this way.

---uname output---
4.10.0-11-generic - still valid up to latest kernel in Bionic

The bulk of the effort for this issue is currently being worked in MicroSemi's JIRA https://jira.pmcs.com/browse/ESDIBMOP-133.
Ran an interesting test: Ran HTX until I started getting the "stall" messages on the console, then shut down HTX and examined the I/O counters for the tested disks in sysfs:

root@bostonp15:~# for i in /sys/devices/pci0003:00/0003:00:00.0/0003:01:00.0/host0/target0:2:[2345]/0:2:[2345]:0; do echo ${i##*/} $(<${i}/iorequest_cnt) $(<${i}/iodone_cnt); done
0:2:2:0 0x5eba3d 0x5eba3d
0:2:3:0 0x773cc9 0x773cc9
0:2:4:0 0x782c61 0x782c61
0:2:5:0 0x5ca134 0x5ca134
root@bostonp15:~#

So, none of the disks showed any evidence of having lost an I/O. I then restarted HTX and, aside from having to manually restart one of the disks, saw no problems with the testing. It appears that what was "hung" was purely in userland. This does not absolve the kernel or aacraid driver from blame, but it shows that the OS "believes" that it completed the I/O and thus removed it from the queue. What we don't know is whether the OS truly notified HTX about the completion, or if HTX (or userland libraries) just failed to process the notification. Tests are running again, will see what happens next.

Update from JIRA: I have run some more experiments. Not sure what it tells us, but here's what I've seen. First test, ran until I got kernel messages about stalled tasks, then shut down HTX. After HTX was down, I checked the above-mentioned counters and found that on each disk iorequest_cnt matched iodone_cnt. The disks were usable and I could restart HTX. This suggests that the problem is not in the PM8069 firmware, and makes the case for the aacraid driver having a bug somewhat weaker. However, this merely says that the driver "completed" the I/O as far as the kernel is concerned, not that a completion rippled back to the application.

I restarted HTX and have run until errors. This time, I am leaving HTX running and observing. Two of the disks reached the HTX error threshold and the testers stopped (those 2 disks are now idle). Another disk saw errors, but they then stopped and it appears to be running fine now.
The last disk has not seen any errors (yet). On the two idle (errored-out) disks I see iorequest_cnt matches iodone_cnt. I am able to "terminate and restart" the two idle disks and HTX appears to be testing them again "normally". Note that no reboot was required, further supporting the evidence that, as far as the kernel is concerned, there is nothing wrong with the disks and their I/O paths.

So, I don't believe this completely eliminates aacraid from the picture, especially given we don't see this behavior on other systems/drivers. But it probably moves the focus of the investigation away from the adapter firmware.

Tried building the upstream 4.11 kernel on Ubuntu. This still gets the hangs. Both Ubuntu 4.10 and upstream 4.11 have aacraid driver 1.2.1[50792]-custom.

Good news/bad news... While doing an initial evaluation of the LSI-3008 SAS HBA on Boston and Ubuntu 17.04, I am hitting this same problem. So, it appears to have nothing specific to do with the PM8069 or aacraid driver.

Some notes on reproducing this. I have been using the github release of HTX, built using the following steps:
1. apt install make gcc g++ git libncurses5-dev libcxl-dev libdapl-dev (others may be required)
2. git clone https://github.com/open-power/HTX
3. cd HTX
4. make
5. make deb
Then install the resulting "htxubuntu.deb" package.

Note, HTX will not test disks that have a filesystem or OS installed, so there must be at least two disks made available to HTX by clearing any previous data. A partition table is optional; in my testing I have none. Also, it may be desirable to run HTX somewhere other than the console, leaving the console free to watch for messages.

To run:
A. su - htx (this may take some time)
B. htx
C. Select the test file "mdt.io"
D. Hit ENTER for default log file option
E. Once the menu is displayed, select item 2 (Enable/disable hardware to test)
    E1. Enter "h" to disable (halt) all devices testing
    E2. Select at least two disks for testing (enter their line numbers)
    E3. Enter "q" to return to main menu
F. Select item "4" (Continue On Error flags)
    F1. Enter line numbers for each disk previously selected to test.
    F2. Enter "q" to return to main menu.
G. Select item "1" to begin the test exercisers.
H. Optionally, select item "5" to display status of testing.

After about 10-12 hours, there should be a few "INFO: task hxestorage:XXXXX blocked for more than 120 seconds." messages with stack traces. The typical stack trace is:
 sysctl_sched_migration_cost+0x0/0x4 (unreliable)
 __switch_to+0x2c0/0x450
 __schedule+0x2f8/0x990
 schedule+0x48/0xc0
 schedule_timeout+0x274/0x470
 io_schedule_timeout+0xd0/0x160
 debug_schedule+0x318/0x3c0
 __blkdev_direct_IO_simple+0x258/0x440
 blkdev_direct_IO+0x4e0/0x520
 generic_file_read_iter+0x2c8/0xaa0
 blkdev_read_iter+0x50/0x80
 new_sync_read+0xec/0x140
 vfs_read+0xbc/0x1b0
 SyS_read+0x68/0x110
 system_call+0x38/0xe0

About 8 minutes after the "blocked" messages, you should start to see HTX reporting errors in "/tmp/htxerr" (HTX reports errors for I/Os that do not complete in 10 minutes, but continues to run). With added debugging, it was seen that the I/Os do eventually complete, but in some cases it can take over an hour. It is also observed that I/O traffic continues through these periods of stalls, so only a portion of the total I/O traffic actually gets stalled. The system does not hang, and if HTX is shut down (stopped), any stalled I/Os will complete immediately.

Referencing LP1469829, it seems that it was requested that the "cfq" scheduler not be used by default as it has this exact sort of bug, and that "deadline" should be used instead. Somewhere, the default got reverted back to "cfq", which exposes this bug again. It appears that the bug in "cfq" was never fixed, either.
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1469829

A couple of upstream commits of interest, ordered by perceived relevance:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5be6b75610cefd1e21b98a218211922c2feb6e08
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=142bbdfccc8b3e9f7342f2ce8422e76a3b45beae
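The description above attributes the stalls to the "cfq" scheduler being the default again and suggests "deadline" instead. A quick way to see which scheduler each disk is using might look like the following. This is a minimal sketch, not something taken from the bug report: it assumes the usual sysfs layout in which /sys/block/<dev>/queue/scheduler lists the available schedulers with the active one in square brackets, and the function names active_scheduler and report_schedulers are hypothetical helpers, not part of HTX or any tool named here.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative assumptions.

# Print the active scheduler given the contents of a queue/scheduler file.
# The kernel marks the active entry with square brackets,
# for example "noop deadline [cfq]". Prints nothing if no entry is bracketed.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Walk every block device that exposes a scheduler file and report
# which scheduler is active, flagging devices that are on cfq.
report_schedulers() {
    for f in /sys/block/*/queue/scheduler; do
        [ -r "$f" ] || continue          # skip if the glob matched nothing
        dev=${f#/sys/block/}
        dev=${dev%%/*}
        sched=$(active_scheduler "$(cat "$f")")
        if [ "$sched" = "cfq" ]; then
            echo "$dev: cfq is active"
        else
            echo "$dev: ${sched:-unknown}"
        fi
    done
}

report_schedulers
```

For a test run, a device reported as cfq can be switched at runtime as root with `echo deadline > /sys/block/<dev>/queue/scheduler`, provided "deadline" appears in that file; the setting does not persist across reboots.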
2019-02-18 10:42:12 Manoj Iyer ubuntu-power-systems: status Triaged Incomplete
2019-04-08 14:04:42 Manoj Iyer linux (Ubuntu): status New Fix Released
2019-04-08 14:05:55 Manoj Iyer linux (Ubuntu Bionic): status Triaged Fix Released
2019-04-08 14:06:02 Manoj Iyer ubuntu-power-systems: status Incomplete Fix Released
2019-07-24 21:06:52 Brad Figg tags architecture-ppc64le bugnameltc-152603 kernel-da-key severity-high targetmilestone-inin1804 triage-g architecture-ppc64le bugnameltc-152603 cscc kernel-da-key severity-high targetmilestone-inin1804 triage-g