scsi device that can't be deleted

Bug #1680774 reported by wondra
This bug affects 1 person
Affects         Status   Importance  Assigned to  Milestone
linux (Ubuntu)  Expired  Medium      Unassigned

Bug Description

I'm using OpenStack on Ubuntu 14.04 with the 4.4.0-47 kernel. Volumes are attached over Fibre Channel (qla2xxx adapter) with multipath. About once a month, a compute node fails to detach a volume, and the SCSI device is left in a semi-deleted state:

Normal device:
root@cmp02:/sys/dev/block/65:96/device# ls /sys/dev/block/65:80/device
block                               generic               scsi_device
bsg                                 inquiry               scsi_disk
delete                              iocounterbits         scsi_generic
device_blocked                      iodone_cnt            scsi_level
device_busy                         ioerr_cnt             state
dh_state                            iorequest_cnt         subsystem
driver                              modalias              timeout
eh_timeout                          model                 type
evt_capacity_change_reported        power                 uevent
evt_inquiry_change_reported         queue_depth           vendor
evt_lun_change_reported             queue_ramp_up_period  vpd_pg80
evt_media_change                    queue_type            vpd_pg83
evt_mode_parameter_change_reported  rescan
evt_soft_threshold_reached          rev

Half-deleted device:
root@cmp02:/sys/dev/block/65:96/device# ls /sys/dev/block/65:96/device
block scsi_disk
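
The difference between the two listings suggests a simple way to look for other devices in the same state. The following is a minimal sketch, not a confirmed diagnostic. It assumes that a half-deleted device is one whose device directory still contains scsi_disk but has lost both the delete and state attributes, as in the listing above. The function name find_half_deleted and its optional root argument are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Print the major:minor number of every block device whose SCSI "device"
# directory looks half-deleted (assumption: scsi_disk is still present,
# but the "delete" and "state" attributes are gone).
# The optional first argument is a hypothetical override of the sysfs
# root, so the function can be tried against a copy of the tree.
find_half_deleted() {
    root="${1:-/sys/dev/block}"
    for dev in "$root"/*; do
        d="$dev/device"
        # Skip partitions and anything that is not a SCSI disk.
        [ -d "$d/scsi_disk" ] || continue
        if [ ! -e "$d/delete" ] && [ ! -e "$d/state" ]; then
            echo "${dev##*/}"
        fi
    done
}
```

Called without arguments, find_half_deleted walks /sys/dev/block. With the two devices shown above it would be expected to print only 65:96, although the heuristic may not match every way a device can get stuck.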

There is nothing interesting in /var/log/messages or /var/log/kern.log. multipath -l gives empty output, which pretty much blocks all operation of OpenStack on that node and forces me to reboot it.

I tried issuing a rescan on the FC adapters, without success. The SCSI device appears to be detached from the adapter.
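
For reference, a rescan of the FC adapters is commonly triggered through sysfs by writing the wildcard "- - -" (channel, target, LUN) to each host's scan file. The sketch below illustrates that general technique. It is an assumption that this matches what was tried here, since the report does not say how the rescan was issued. The function name rescan_fc_hosts and its optional root argument are hypothetical.

```shell
#!/bin/sh
# Ask every Fibre Channel SCSI host to rescan all channels, targets and LUNs.
# Needs root when run against the real /sys/class.
# The optional first argument is a hypothetical override of the sysfs class
# root, so the function can be tried against a scratch directory.
rescan_fc_hosts() {
    class_root="${1:-/sys/class}"
    for fc in "$class_root"/fc_host/host*; do
        # The glob stays literal when there are no FC hosts; skip it.
        [ -e "$fc" ] || continue
        host="${fc##*/}"
        scan="$class_root/scsi_host/$host/scan"
        if [ -w "$scan" ]; then
            echo "rescanning $host"
            # "- - -" is the wildcard for channel, target and LUN.
            echo "- - -" > "$scan"
        fi
    done
}
```

A rescan only looks for devices to add, so it is not expected to clean up a device that the SCSI layer has already partly torn down, which is consistent with the result described above.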

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1680774

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.11 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.11-rc5

Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: kernel-da-key xenial
wondra (wondra) wrote :

Previously, I was using kernel 3.19 and I don't remember having the issue.
I don't really want to test an unstable kernel on production infrastructure. I was hoping for some concrete tips to troubleshoot scsi layer problems.
Anyway, if I deployed it now, it would take at least half a year before I could confirm that the problem does NOT happen. The machine that failed this time had an uptime of 119 days; another one with the same software is at 135 days and counting. Taken across the whole cloud cluster, though, I'm hitting the same problem about once a month.

Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired