test_delete_server_while_in_attached_volume failed with "Volume None failed to reach in-use status within the required time"

Bug #1398078 reported by Matt Riedemann
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Cinder
Fix Released
High
John Griffith

Bug Description

http://logs.openstack.org/17/117417/11/gate/gate-tempest-dsvm-neutron-pg-full/c3fe760/console.html.gz#_2014-11-26_14_47_59_184

There is obviously a tempest bug here too in that it's not showing the volume uuid in the error message, unless the bug IS in tempest and it's not tracking the UUID correctly.

Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
Matt Riedemann (mriedem) wrote :
Download full text (4.1 KiB)

I see this in the cinder volume logs:

http://logs.openstack.org/17/117417/11/gate/gate-tempest-dsvm-neutron-pg-full/c3fe760/logs/screen-c-vol.txt.gz#_2014-11-26_14_16_03_152

2014-11-26 14:16:03.152 1390 WARNING cinder.brick.iscsi.iscsi [req-1c2046a2-0e5a-4a9c-9d11-b1f8f4a62e3d 233bb4a2bf5a463eb92255fda6d45011 92d6d94d6d384ea984ec2f838413a396 - - -] Failed to create iscsi target for volume id:volume-c1dfc318-a139-45a3-b235-f35d1b1fb9d0: Unexpected error while running command.
Command: None
Exit code: -
Stdout: u"Unexpected error while running command.\nCommand: sudo cinder-rootwrap /etc/cinder/rootwrap.conf tgt-admin --update iqn.2010-10.org.openstack:volume-c1dfc318-a139-45a3-b235-f35d1b1fb9d0\nExit code: 22\nStdout: u'Command:\\n\\ttgtadm -C 0 --lld iscsi --op new --mode target --tid 1 -T iqn.2010-10.org.openstack:volume-c1dfc318-a139-45a3-b235-f35d1b1fb9d0\\nexited with code: 22.\\n'\nStderr: u'tgtadm: this target already exists\\n'"
Stderr: None
2014-11-26 14:16:03.153 1390 ERROR oslo.messaging.rpc.dispatcher [req-1c2046a2-0e5a-4a9c-9d11-b1f8f4a62e3d 233bb4a2bf5a463eb92255fda6d45011 92d6d94d6d384ea984ec2f838413a396 - - -] Exception during message handling: Failed to create iscsi target for volume volume-c1dfc318-a139-45a3-b235-f35d1b1fb9d0.
2014-11-26 14:16:03.153 1390 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
2014-11-26 14:16:03.153 1390 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply
2014-11-26 14:16:03.153 1390 TRACE oslo.messaging.rpc.dispatcher incoming.message))
2014-11-26 14:16:03.153 1390 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch
2014-11-26 14:16:03.153 1390 TRACE oslo.messaging.rpc.dispatcher return self._do_dispatch(endpoint, method, ctxt, args)
2014-11-26 14:16:03.153 1390 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch
2014-11-26 14:16:03.153 1390 TRACE oslo.messaging.rpc.dispatcher result = getattr(endpoint, method)(ctxt, **new_args)
2014-11-26 14:16:03.153 1390 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/osprofiler/profiler.py", line 105, in wrapper
2014-11-26 14:16:03.153 1390 TRACE oslo.messaging.rpc.dispatcher return f(*args, **kwargs)
2014-11-26 14:16:03.153 1390 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/cinder/cinder/volume/manager.py", line 898, in initialize_connection
2014-11-26 14:16:03.153 1390 TRACE oslo.messaging.rpc.dispatcher volume)
2014-11-26 14:16:03.153 1390 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/osprofiler/profiler.py", line 105, in wrapper
2014-11-26 14:16:03.153 1390 TRACE oslo.messaging.rpc.dispatcher return f(*args, **kwargs)
2014-11-26 14:16:03.153 1390 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/cinder/cinder/volume/drivers/lvm.py", line 549, in create_export
2014-11-26 14:16:03.153 1390 TRACE oslo.messaging.rpc.dispatcher return self._cr...

Read more...

Revision history for this message
Matt Riedemann (mriedem) wrote :

So the volume uuid is c1dfc318-a139-45a3-b235-f35d1b1fb9d0, and it already exists when cinder tried to create it.

Revision history for this message
Matt Riedemann (mriedem) wrote :

This seems to drop off of logstash after 11/27 but I don't see anything in cinder commits since then that look related. Maybe it was fixed in Tempest with this?

https://github.com/openstack/tempest/commit/f6bf3d5a33398bf7ea5c94173ce66530823a5009

Revision history for this message
Matt Riedemann (mriedem) wrote :

Here is the problem in Tempest:

http://git.openstack.org/cgit/openstack/tempest/tree/tempest/services/compute/json/volumes_extensions_client.py?id=2#n96

That error message should be using the volume id (uuid) rather than the volume name, which is a different response between cinder v1 and v2.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Related Tempest timeout exception message fix: https://review.openstack.org/#/c/138126/

Revision history for this message
Matt Riedemann (mriedem) wrote :

jgriffith pointed this out:

http://logs.openstack.org/17/117417/11/gate/gate-tempest-dsvm-neutron-pg-full/c3fe760/logs/screen-c-vol.txt.gz#_2014-11-26_14_16_02_494

Sounds like that shouldn't happen, bug since it's logged at debug level logstash won't pick it up so we don't know the correlation to gate failures. He's going to push a change to make that WARNING level so we can see if it's related.

The weird thing is that when looking at the command failure here:

http://logs.openstack.org/17/117417/11/gate/gate-tempest-dsvm-neutron-pg-full/c3fe760/logs/screen-c-vol.txt.gz#_2014-11-26_14_16_03_152

That's doing an op new but the code at:

File "/opt/stack/new/cinder/cinder/brick/iscsi/iscsi.py", line 252, in create_iscsi_target

is doing an op show. It seems like there is some processutils/oslo.concurrency weirdness in here.

Revision history for this message
Matt Riedemann (mriedem) wrote :

There is another volume being successfully created at the same time as the one that fails, I'm wondering if there is shared global variables in this code or something that is contributing to the weird things we're seeing in the logs, where it looks like maybe something is getting mixed up in the flow.

Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
Matt Riedemann (mriedem) wrote :
Changed in cinder:
importance: Undecided → High
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/138166

Changed in cinder:
assignee: nobody → John Griffith (john-griffith)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on cinder (master)

Change abandoned by John Griffith (<email address hidden>) on branch: master
Review: https://review.openstack.org/138166
Reason: splitting up into two patches

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/138173

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.openstack.org/138173
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=d7bd65e8e9ff800d370effc1a8d24ea900a5d435
Submitter: Jenkins
Branch: master

commit d7bd65e8e9ff800d370effc1a8d24ea900a5d435
Author: John Griffith <email address hidden>
Date: Mon Dec 1 13:29:02 2014 -0700

    Deal with tgt already exists errors

    This patch intends to provide a clear marker for
    ER in the case that create export fails due to the
    target entry already existing.

    Also this patch will enable us to go ahead and just use
    the existing target rather than bomb out and fail everything.

    Root cause of why we're getting a second create is still
    unknown and needs addressed, but this might help in getting
    more info as well as keeping things stable until we address
    the root issue.

    Change-Id: If96f4fab7d9709dbb0726b1dca6237d7bc66ecbe
    Partial-Bug: #1398078

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to cinder (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/141886

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to cinder (master)

Reviewed: https://review.openstack.org/141886
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=eb20e9e5009541ea0037588a43443c74e8968c9c
Submitter: Jenkins
Branch: master

commit eb20e9e5009541ea0037588a43443c74e8968c9c
Author: Matt Riedemann <email address hidden>
Date: Mon Dec 15 11:23:42 2014 -0800

    Add unit test for commit 22abe9081

    Commit 22abe9081 fixed a bug in commit d7bd65e8e which we could have
    found the first time with a simple unit test.

    This change adds the unit test to cover that new error handling block.

    Related-Bug: #1398078

    Change-Id: I86d2fd477f3ae5590c5f079a4c806d1f50eb96d0

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/145608

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/145630

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.openstack.org/145608
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=941e5f761be165accaabbdeec1608189b04ab67e
Submitter: Jenkins
Branch: master

commit 941e5f761be165accaabbdeec1608189b04ab67e
Author: John Griffith <email address hidden>
Date: Wed Jan 7 15:04:07 2015 -0700

    Deal with tgt already exists errors

    So there's a major problem in the iscsi code (and has been
    for quite some time). The tgt show command sometimes seems
    to be corrupt and the issued command is actually a tgt create.
    The result is that we already have a tgt and the call raises
    which then causes the operation to fail.

    An example of the issue:
      Stdout: Unexpected error while running command.Command:
      sudo cinder-rootwrap /etc/cinder/rootwrap.conf tgt-admin
      --update iqn.2010-10.org.openstack:volume-f055d3c5-db7a-
      484e-9d0d-b98495439413
      Exit code: 22
      Stdout:
      Command:tgtadm -C 0 --lld iscsi --op new --mode target
      --tid 1 -T iqn.2010-10.org.openstack:volume-f055d3c5-db7a-
      484e-9d0d-b98495439413
      exited with code: 22.
      Stderr: u'tgtadm: this target already exists

    What's disturbing however is that in that section of code
    we're sending a --op show!! Could be something we're
    doing with our member executor? Or maybe something to
    do with the new oslo concurrency code?

    Regardless, his patch intends to provide a clear marker for
    ER in the case that create export fails due to the
    target entry already existing.

    Also this patch will enable us to go ahead and just use
    the existing target rather than bomb out and fail everything.
    Root cause of why we're getting a second create is still
    unknown and needs addressed, but this might help in getting
    more info as well as keeping things stable until we address
    the root issue.

    Change-Id: I0faf4d49b2d3e631b08ec1dff4361ff2376e3308
    Partial-Bug: #1398078

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/145630
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=80f47322982355ff6ec6eac87e09926432bf1f3e
Submitter: Jenkins
Branch: master

commit 80f47322982355ff6ec6eac87e09926432bf1f3e
Author: John Griffith <email address hidden>
Date: Wed Jan 7 16:25:48 2015 -0700

    Use cinder.utils.execute directly

    When we were in the process of working on brick libs
    we did quite a bit of funky stuff with setting a member
    variable for each class to point to an executor that was
    passed in during init.

    There's no longer any reason to do this with the target drivers,
    so we should simplify our lives a bit and go back to using the
    good old cinder.utils wrapper.

    It's also believed that the use of the member variable is
    susceptible to some concurrency issue that was causing the
    wrong cmd string to be executed. We'll mark this as closing
    that bug and reopen if we still see the signature in Kibana.

    Change-Id: I7a5648d496d15eec5d7ac3643411198226f1ffdf
    Closes-Bug: #1398078

Changed in cinder:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to cinder (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/145985

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to cinder (master)

Reviewed: https://review.openstack.org/145985
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=1e49234db32d2b4ba000e676ab36602a313b6784
Submitter: Jenkins
Branch: master

commit 1e49234db32d2b4ba000e676ab36602a313b6784
Author: John Griffith <email address hidden>
Date: Thu Jan 8 17:42:05 2015 -0700

    Add debug messaging for tgt already exists

    We have a race condition where the call to create an
    iscsi target seems to be issued twice. This patch
    adds some debug logging to try and get some more info
    around where things are going wrong.

    First, we add a log message to the intialize_connection call
    in the cinder.volume.api and include the connector info.

    Second, we add a target show output prior to the update/create
    command in the target driver.

    In addition, we also had the update and show commands in the same
    try/except block which led us to believe the show was failing,
    but that's NOT the case at all. So this patch also moves that
    existing show (debug purposes as well) out of the main try block.

    Note that there are some old bugs logged against TgtAdm for issues
    with it "loosing count" of target id's it's issued, and in turn
    reissuing the same ID multiple times. This patch is intended to
    give us enough info to determine if that's what's going on or not.

    Other possibility is that the initialize call is being issued more
    than once.

    Change-Id: If17acff6c0b697261ae0e7a40fc26962cbee6c7b
    Related-Bug: #1398078

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.openstack.org/147566
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=4dc367f7d736ae506db9532270e92977c8b3954d
Submitter: Jenkins
Branch: master

commit 4dc367f7d736ae506db9532270e92977c8b3954d
Author: John Griffith <email address hidden>
Date: Thu Jan 15 08:56:28 2015 -0700

    Add retry for tgtadm update when tgt exists

    For target creation using tgtadm driver, we create a persistence
    file for the target, then send a tgt-admin --update 'name' where
    name is the specific persistence file we want to read in and update
    from.

    It turns out that we can hit race conditions where the persistence
    file is written, and an update is requested but target has already
    done work to make it believe that the target has already been created.

    One thought was to just use "update ALL" but this still seems to have
    issues, and changes the error to an account exists failure.

    This patch takes the brute force approach and adds the cinder.utils
    retry decorator to the tgt-admin --update command. To do this
    we just break out the tgt-admin --update call into it's own method
    and add the decorator to it.

    Closes-Bug: #1398078

    Change-Id: Ic72fb5f4fca70a7e7d1c306efe19674fd7e42cd6

Thierry Carrez (ttx)
Changed in cinder:
milestone: none → kilo-2
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in cinder:
milestone: kilo-2 → 2015.1.0
Revision history for this message
Darren Carpenter (wdarrenc) wrote :

Any chance for a backport of this to Juno?

Revision history for this message
John Griffith (john-griffith) wrote :

Backport was proposed for Juno but rejected because it brought in an external lib making things difficult for package maintainers. Proposals here:
https://review.openstack.org/#/c/168424/
https://review.openstack.org/#/c/168419/

IIRC Mike said he was going to work this up, I'll take a look and submit something. Not sure it will get approval at this point or not, but I'll propose it anyway. Should be easy, Vishy already had a version I can just pull that over and use it.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/213304

Revision history for this message
John Griffith (john-griffith) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on cinder (stable/juno)

Change abandoned by Alan Pevec (<email address hidden>) on branch: stable/juno
Review: https://review.openstack.org/213304

Revision history for this message
zhangsong (zhangsong) wrote :

The current code will retry if it encounters the error, but this is a work-around, not the root cause.

Current code of tgt.py use command 'tgt-admin --update name' to create the iscsi target of tgt:
https://github.com/openstack/cinder/blob/master/cinder/volume/targets/tgt.py#L119

The workflow of tgt-admin is:
1. get the available tid(usually it is the current max tid plus 1),I named it next_tid ;
2. use the next_tid to create a target:
tgtadm -C 0 --lld iscsi --op new --mode target --tid next_tid -T target_name

Here is the source code of tgt-admin:
https://github.com/fujita/tgt/blob/master/scripts/tgt-admin#L278

But there is a race condition between step 1 and step 2. If multiple tgt-admin commands executed simultaneous in the same node, some of them have the chance to get the same next_tid, and then use the same tid to create target, this will cause the error of 'tgtadm: this target already exists'.

I think the best way to solve this problem is to add a lock when create a tgt target.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/323452

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on cinder (master)

Change abandoned by zhangsong (<email address hidden>) on branch: master
Review: https://review.openstack.org/323452
Reason: Retry not cause a problem, just keep
 it. Thank you.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.