iscsiadm hangs in recvmsg()

Bug #1053306 reported by dann frazier on 2012-09-20
16
This bug affects 1 person
Affects Status Importance Assigned to Milestone
open-iscsi (Fedora)
Fix Released
High
open-iscsi (Ubuntu)
High
Adam Stokes
Precise
High
Adam Stokes
Quantal
High
Adam Stokes

Bug Description

[Impact]
iscsid can hang or crash

[Test Case]
1) This issue is exacerbated by increased logging output. To configure open-iscsi to emit the maximum amount of messages, edit /etc/init.d/open-iscsi and change the line:
  start-stop-daemon --start --quiet --exec $DAEMON
to:
  start-stop-daemon --start --quiet --exec $DAEMON -- -d 8

On your storage server, create multiple LUN targets and expose them to your Ubuntu system. I recommend 3 or more.

Edit the attached repro.sh script to set the TARGET variable to point to the correct IP of your storage server. Execute this script on your Ubuntu system. When an error occurs, this script should hang or begin emitting errors.

[Regression Potential]
I would expect regressions here to have the same symptoms of the original bug - e.g. hangs, crashes.

Original report follows.

We occasionally see iscsiadm hang indefinitely on a system. It is difficult to reproduce, there maybe months between occurrences. This happened again today, and we collected the following information.

We captured an strace of a failing iscsiadm discover process, I'll attach it here. All iscsiadm processes that are hung are also in recvmsg() when we attach strace.

It was observed that there was one iscsid process - where normally there is 2. The pids are normally consecutive, so I presume one forks the other before daemonizing.

syslog contains the following messages from iscsid:

Sep 20 15:55:57 cn5 iscsid: semop up failed 4
Sep 20 15:56:05 cn5 iscsid: semop down failed
Sep 20 15:56:23 cn5 iscsid: semop up failed 4
Sep 20 16:02:23 cn5 iscsid: semop up failed 4
Sep 20 16:02:26 cn5 iscsid: semop up failed 4
Sep 20 16:02:26 cn5 iscsid: semop up failed 4
Sep 20 16:02:26 cn5 iscsid: semop up failed 4
Sep 20 16:02:32 cn5 iscsid: semop up failed 4
Sep 20 18:01:49 cn5 iscsid: semop up failed 4
Sep 20 18:03:56 cn5 iscsid: semop up failed 4
Sep 20 18:05:03 cn5 iscsid: semop up failed 4

This sounds similar to the issue described in https://bugzilla.redhat.com/show_bug.cgi?id=676804, but note the different errno.

Related branches

Description of problem:

Each process must have its own semarg structure - or they step on each
others' toes - which could cause either deadlocks or smearing of the shared
memory protected by the semaphore.

Version-Release number of selected component (if applicable):

iscsi-initiator-utils-6.2.0.871-0.18.fc12.src.rpm (Anything based on 871 series).

How reproducible:

Unknown.

Steps to Reproduce:
1. Configure the iscsi initator util with at least two target devices.
2. Configure it to the maximum amount of logging
3. ?????

Actual results:

Logging can possibly cause a deadlock in waiting for semaphore access.

Expected results:

No deadlock to occur.

Additional info:

This is a backport of the patch from : http://git.kernel.org/?p=linux/kernel/git/mnc/open-iscsi.git;a=commitdiff;h=b95a7ca80b255f0c5b34b2669367b629c8489552;hp=fb4f2d3072bee96606d01e3535c100dc99b8d331 to the 871 series.

Mike Christie posted Guy Kerens patch (I think). This patch is already in the EL6 build.

This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated in the
current release, Red Hat is unfortunately unable to address this
request at this time. Red Hat invites you to ask your support
representative to propose this request, if appropriate and relevant,
in the next release of Red Hat Enterprise Linux.

I am going to resync the iscsi package to the upstream/rhel6 version, so this will get brought in.

Thanks for the patch and report.

Event posted on 2010-08-17 09:26 JST by fuchi

G'day Wade,

A bugzilla has been opened, thank you so much!!
And here is a HotFix request. please approve this request.

http://seg.rdu.redhat.com/scripts/hotfix/edit.pl?id=5453

Kind regards,
Fuchi

This event sent from IssueTracker by fuchi
 issue 1256153

Fixed in iscsi-initiator-utils-6.2.0.871-0.19:
http://people.redhat.com/mchristi/iscsi/rhel5.5.z/iscsi-initiator-utils/
This is what I am making for 5.5.z.

Patch is included as iscsi-initiator-utils-fix-semop-log.patch in iscsi-initiator-utils-6.2.0.871-0.20.el5 and compiles correctly.

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.

    New Contents:
* Previously, the open-iscsi administration utility (iscsiadm) sent incorrect initiator task tags (ITT) when encountering a multi-PDU This caused the discovery to fail. With this update, the ITT initialization and allocation sends the correct tags and iscsiadm works as expected. (BZ#633122)

    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.

    Diffed Contents:
@@ -1 +1 @@
-* Previously, the open-iscsi administration utility (iscsiadm) sent incorrect initiator task tags (ITT) when encountering a multi-PDU This caused the discovery to fail. With this update, the ITT initialization and allocation sends the correct tags and iscsiadm works as expected. (BZ#633122)+Previously, iscsid shared the 'semarg' structure between multiple devices and could cause a deadlock when waiting for the semaphore access. With this update, each process is assigned its own structure and deadlocks no longer occur.

    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.

    Diffed Contents:
@@ -1 +1 @@
-Previously, iscsid shared the 'semarg' structure between multiple devices and could cause a deadlock when waiting for the semaphore access. With this update, each process is assigned its own structure and deadlocks no longer occur.+Previously, iscsid shared the 'semarg' structure between multiple devices and thus caused a deadlock to occur when logging was initiated. With this update, each process is assigned its own structure and deadlocks no longer occurs during the logging process.

    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.

    Diffed Contents:
@@ -1 +1 @@
-Previously, iscsid shared the 'semarg' structure between multiple devices and thus caused a deadlock to occur when logging was initiated. With this update, each process is assigned its own structure and deadlocks no longer occurs during the logging process.+Due to iscsid sharing the "semarg" structure between multiple devices, a deadlock may have occurred when logging was initiated. With this update, each process is assigned its own structure, so that such deadlocks no longer occur.

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0072.html

dann frazier (dannf) wrote :
dann frazier (dannf) wrote :

Oops, obvious missing information - this is precise's version, 2.0.871-0ubuntu9.

Also note that in this case we do have multiple target devices, which is part of the reproduction criteria for https://bugzilla.redhat.com/show_bug.cgi?id=624344. This was fixed upstream w/ the following changeset that does not appear to be in precise:

https://github.com/mikechristie/open-iscsi/commit/b95a7ca80b255f0c5b34b2669367b629c8489552

julian wang (zeratul-j) wrote :

After this issue happened, 12 sessions which already logged in still can be seen with iscsiadm -m session.

Ante Karamatić (ivoks) wrote :

Broken:

stat("/sys/bus/iscsi_transport/devices/tcp", 0x7fff6decee10) = -1 ENOENT (No such file or directory)
stat("/sys/class/iscsi_transport/tcp", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
socket(PF_FILE, SOCK_STREAM, 0) = 3
connect(3, {sa_family=AF_FILE, path=@"ISCSIADM_ABSTRACT_NAMESPACE"}, 110) = 0
write(3, "\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 5280) = 5280
recvfrom(3, 0x7fff6dece3e0, 2464, 256, 0, 0) = ? ERESTARTSYS (To be restarted)

Normal:

stat("/sys/bus/iscsi_transport/devices/tcp", 0x7fffc7c20500) = -1 ENOENT (No such file or directory)
stat("/sys/class/iscsi_transport/tcp", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
socket(PF_FILE, SOCK_STREAM, 0) = 3
connect(3, {sa_family=AF_FILE, path=@"ISCSIADM_ABSTRACT_NAMESPACE"}, 110) = 0
write(3, "\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 5280) = 5280
recvfrom(3, "\10\0\0\0\0\0\0\0iqn.1993-08.org.debian:0"..., 2464, MSG_WAITALL, NULL, NULL) = 2464

'iqn.1993-08.org.debian...' is defined in /etc/iscsi/initiatorname.iscsi.

So far, I'm unable to reproduce the issue.

tags: added: precise
Changed in open-iscsi (Ubuntu):
importance: Undecided → High
dann frazier (dannf) wrote :

By increasing the iscsid debug level to 8 (to increase logging), we are able to reproduce failures by running discover/loginall/logoutall in a loop against a SAN exposing 3 targets within seconds. Symptoms vary - sometimes iscsid crashes, sometimes we see semop errors in syslog, sometimes just hangs - but I suppose that's expected w/ data corruption issues. After applying https://github.com/mikechristie/open-iscsi/commit/b95a7ca80b255f0c5b34b2669367b629c8489552, we've been able to survive 12 hours and counting.

dann frazier (dannf) wrote :
description: updated
Narinder Gupta (narindergupta) wrote :

I tried the same with p4500 SAN box and I am able to reproduce the issue in our lab with the script provided by Dann. I have downloaded, build and install open-iscsi-2.0-873.tar.gz version of paxkage and and running then now without any issues so far. Will keep an eye on it and let you guys knows in case find different option.

Ante Karamatić (ivoks) wrote :

Since open-iscsi-2.0-873 is already in Quantal, I'm suggesting fixing this in 12.04 (and maybe older).

Adam Stokes (adam-stokes) wrote :

Do we want to convert open-iscsi in Precise to using quilt like in Quantal? Also there is no merge proposal or debdiff attached to the case in order to have SRU team review it.

Thanks
Adam

Changed in open-iscsi (Ubuntu):
milestone: none → ubuntu-12.04.2

On Mon, Oct 01, 2012 at 05:57:46PM -0000, Adam Stokes wrote:
> Do we want to convert open-iscsi in Precise to using quilt like in
> Quantal? Also there is no merge proposal or debdiff attached to the case
> in order to have SRU team review it.

I have associated a branch w/ this bug but, for whatever reason, LP
doesn't give me the option to MP it:

  https://code.launchpad.net/~dannf/open-iscsi/lp1053306

I didn't convert to 3.0 quilt in my branch as it seemed superfluous
for an SRU, and increases the diff w/ debian/squeeze, but I've no
real objection to it.

Adam Stokes (adam-stokes) wrote :

Ah i see what happened, you need to setup your directory to mimic Ubuntu's for example:

lp:~adam-stokes/ubuntu/precise/open-iscsi/fix-lp-1053306

Gives me the option to propose for merging. Dann, could you delete that current branch and re-push to

lp:~dannf/ubuntu/precise/open-iscsi/lp1053306

Then you can create your merge proposal and we can get this moving for the Sponsors team :)

Thanks!

dann frazier (dannf) wrote :
dann frazier (dannf) wrote :

On Mon, Oct 01, 2012 at 06:54:17PM -0000, Adam Stokes wrote:
> Ah i see what happened, you need to setup your directory to mimic
> Ubuntu's for example:
>
> lp:~adam-stokes/ubuntu/precise/open-iscsi/fix-lp-1053306
>
> Gives me the option to propose for merging. Dann, could you delete that
> current branch and re-push to
>
> lp:~dannf/ubuntu/precise/open-iscsi/lp1053306

Ah, ok. Thanks - done.

Adam Stokes (adam-stokes) wrote :

Dann,

Awesome thanks, I'm assigning to me so I can keep track of the bug and make sure it gets reviewed/sponsored.

Thanks,
Adam

Changed in open-iscsi (Ubuntu):
assignee: nobody → Adam Stokes (adam-stokes)
status: New → In Progress
Changed in open-iscsi (Ubuntu Quantal):
status: In Progress → Fix Released
Changed in open-iscsi (Ubuntu Precise):
status: New → Confirmed
importance: Undecided → High
Changed in open-iscsi (Ubuntu Quantal):
milestone: ubuntu-12.04.2 → none
Changed in open-iscsi (Ubuntu Precise):
milestone: none → ubuntu-12.04.2
assignee: nobody → Adam Stokes (adam-stokes)
Marc Deslauriers (mdeslaur) wrote :

This is already fixed in the version in Quantal.

ACK on the Precise merge request. I've uploaded it to precise-proposed with a slight version number change for processing by the Ubuntu SRU team. Thanks!

Daniel Holbach (dholbach) wrote :
Changed in open-iscsi (Ubuntu Precise):
status: Confirmed → In Progress
Daniel Holbach (dholbach) wrote :

As there's nothing to review and sponsor, I'll unsubscribe the sponsor team for now.

Hello dann, or anyone else affected,

Accepted open-iscsi into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/open-iscsi/2.0.871-0ubuntu9.12.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please change the bug tag from verification-needed to verification-done. If it does not, change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in open-iscsi (Ubuntu Precise):
status: In Progress → Fix Committed
tags: added: verification-needed
tags: added: verification-done
removed: verification-needed
Scott Kitterman (kitterman) wrote :

Narinder: What did you do to verify this is fixed?

dann frazier (dannf) wrote :

@Scott: fyi, I asked for Narinder's help to verify this as I no longer have access to the system where I saw the problem. We were in IRC communication during this, so I can speak for what was done. Narinder was able to reproduce the problem on his system with my script. I asked him to try the package in proposed on the same system- he ran my script for a few hours and observed no hang, nor any "semop" errors in the syslog.

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regresssions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package open-iscsi - 2.0.871-0ubuntu9.12.04.1

---------------
open-iscsi (2.0.871-0ubuntu9.12.04.1) precise-proposed; urgency=low

  * usr/log.c: Do not use semarg in shared-mem for semop calls,
    backport of upstream commit b95a7ca8 (LP: #1053306)
 -- dann frazier <email address hidden> Mon, 24 Sep 2012 11:04:52 +0900

Changed in open-iscsi (Ubuntu Precise):
status: Fix Committed → Fix Released
Changed in open-iscsi (Fedora):
importance: Unknown → High
status: Unknown → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.