test_extend_attached_volume fails with Unexpected compute_extend_volume result 'Error'

Bug #1732199 reported by Boden R
This bug affects 1 person
Affects                    Status        Importance  Assigned to    Milestone
OpenStack Compute (nova)   Fix Released  Medium      melanie witt
os-brick                   Fix Released  Undecided   Unassigned

Bug Description

tempest.api.volume.test_volumes_extend.VolumesExtendAttachedTest.test_extend_attached_volume[id-301f5a30-1c6f-4ea0-be1a-91fd28d44354] appears to have intermittent failures that result in the following stack trace:

2017-11-13 20:01:08.062361 | primary | ==============================
2017-11-13 20:01:08.062378 | primary |
2017-11-13 20:01:08.062433 | primary | tempest.api.volume.test_volumes_extend.VolumesExtendAttachedTest.test_extend_attached_volume[id-301f5a30-1c6f-4ea0-be1a-91fd28d44354]
2017-11-13 20:01:08.062488 | primary | -------------------------------------------------------------------------------------------------------------------------------------
2017-11-13 20:01:08.062505 | primary |
2017-11-13 20:01:08.062527 | primary | Captured traceback:
2017-11-13 20:01:08.062550 | primary | ~~~~~~~~~~~~~~~~~~~
2017-11-13 20:01:08.062578 | primary | Traceback (most recent call last):
2017-11-13 20:01:08.062621 | primary | File "tempest/api/volume/test_volumes_extend.py", line 184, in test_extend_attached_volume
2017-11-13 20:01:08.062650 | primary | 'request_id': action['request_id']})
2017-11-13 20:01:08.062701 | primary | File "/opt/stack/new/tempest/.tox/tempest/local/lib/python2.7/site-packages/unittest2/case.py", line 702, in assertTrue
2017-11-13 20:01:08.062730 | primary | raise self.failureException(msg)
2017-11-13 20:01:08.062785 | primary | AssertionError: False is not true : Unexpected compute_extend_volume result 'Error' for request req-78ea7bec-834d-4534-8779-1e97f90a54e4.

For example [1]. There appear to have been 12 such failures in the past 7 days [2].

[1] http://logs.openstack.org/67/518967/2/check/legacy-tempest-dsvm-neutron-dvr/1188c6c/job-output.txt.gz#_2017-11-13_20_01_08_062361
[2] http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Unexpected%20compute_extend_volume%20result%20'Error'%5C%22

Revision history for this message
Matthew Treinish (treinish) wrote :

This doesn't look like a tempest bug to me; it looks like nova is failing during the volume extend operation, and that failure is what causes the test to fail. For example, see:

http://logs.openstack.org/67/518967/2/check/legacy-tempest-dsvm-neutron-dvr/1188c6c/logs/screen-n-cpu.txt.gz?level=TRACE#_Nov_13_19_52_00_831317

Do you have any reason or indication that tempest is doing something incorrectly that causes this error state?

Changed in tempest:
status: New → Incomplete
Matt Riedemann (mriedem)
Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
no longer affects: tempest
Revision history for this message
Sean McGinnis (sean-mcginnis) wrote :

Marking os-brick as Incomplete for now until we know if this is on the Nova side or if there are actually issues in os-brick.

Changed in os-brick:
status: New → Incomplete
Revision history for this message
Matt Riedemann (mriedem) wrote :

With more recent logs, here is where c-vol POSTs the volume-extended event to nova:

http://logs.openstack.org/69/595069/2/gate/tempest-full/169db43/controller/logs/screen-c-vol.txt.gz#_Sep_24_10_55_25_282139

Sep 24 10:55:25.282139 ubuntu-xenial-rax-ord-0002235107 cinder-volume[17004]: DEBUG novaclient.v2.client [None req-eb1112fc-2010-4878-b6ed-7bd6d5c65b82 tempest-VolumesExtendAttachedTest-1848768 None] REQ: curl -g -i --cacert "/opt/stack/data/ca-bundle.pem" -X POST https://10.210.68.187/compute/v2.1/os-server-external-events -H "Accept: application/json" -H "Content-Type: application/json" -H "OpenStack-API-Version: compute 2.51" -H "User-Agent: python-novaclient" -H "X-Auth-Token: {SHA256}923f610d5ad7db48532294c1bb9afa3f9696c80c054fdaf0e20fea7e6f695ef0" -H "X-OpenStack-Nova-API-Version: 2.51" -H "X-OpenStack-Request-ID: req-eb1112fc-2010-4878-b6ed-7bd6d5c65b82" -d '{"events": [{"tag": "26727af1-a60f-4753-baf5-7e7012361915", "name": "volume-extended", "server_uuid": "6360ef8e-bda1-4775-86db-cfc19255d4f2"}]}' {{(pid=17123) _http_log_request /usr/local/lib/python2.7/dist-packages/keystoneauth1/session.py:448}}

And then we fail on the nova-compute side:

http://logs.openstack.org/69/595069/2/gate/tempest-full/169db43/controller/logs/screen-n-cpu.txt.gz#_Sep_24_10_55_26_759691

Sep 24 10:55:26.759691 ubuntu-xenial-rax-ord-0002235107 nova-compute[15157]: WARNING os_brick.initiator.connectors.iscsi [req-eb1112fc-2010-4878-b6ed-7bd6d5c65b82 req-118f3f55-90c0-47b4-98c9-9b8069625dce service nova] Couldn't find iscsi sessions because iscsiadm err: iscsiadm: could not read session targetname: 5
Sep 24 10:55:26.760040 ubuntu-xenial-rax-ord-0002235107 nova-compute[15157]: iscsiadm: could not find session info for session22
Sep 24 10:55:26.760399 ubuntu-xenial-rax-ord-0002235107 nova-compute[15157]:
Sep 24 10:55:26.760774 ubuntu-xenial-rax-ord-0002235107 nova-compute[15157]: INFO os_brick.initiator.connectors.iscsi [req-eb1112fc-2010-4878-b6ed-7bd6d5c65b82 req-118f3f55-90c0-47b4-98c9-9b8069625dce service nova] Found paths for volume []

Is this transient? Should nova-compute or os-brick add a retry loop if the volume paths aren't found during extend?
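
For reference, a minimal sketch (not cinder's actual code) of the os-server-external-events call shown in the c-vol log above; the endpoint path, microversion header, and body shape come from that curl line, while the URL, token, and UUIDs below are placeholders:

    # Hypothetical illustration of the notification cinder-volume sends to
    # nova when an attached volume is extended. URL, token and UUIDs are
    # placeholders, not real values.
    import requests

    NOVA_ENDPOINT = "https://nova.example.com/compute/v2.1"  # placeholder
    TOKEN = "<keystone-token>"                               # placeholder

    def notify_volume_extended(server_uuid, volume_id):
        body = {"events": [{"name": "volume-extended",
                            "server_uuid": server_uuid,
                            "tag": volume_id}]}
        resp = requests.post(
            NOVA_ENDPOINT + "/os-server-external-events",
            json=body,
            headers={"X-Auth-Token": TOKEN,
                     # volume-extended requires compute microversion >= 2.51
                     "X-OpenStack-Nova-API-Version": "2.51"})
        resp.raise_for_status()
        return resp.json()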

Revision history for this message
Matt Riedemann (mriedem) wrote :

Does the volume extend operation in cinder create a new export? Because if it does, nova-compute is probably sending an old connection_info with old target_portal* information.

Revision history for this message
Matt Riedemann (mriedem) wrote :

I guess we don't even get far enough to check the connection_info. Around the time of the failure, I see this in the syslog:

http://logs.openstack.org/69/595069/2/gate/tempest-full/169db43/controller/logs/syslog.txt.gz#_Sep_24_10_55_24

Sep 24 10:55:24 ubuntu-xenial-rax-ord-0002235107 kernel: connection21:0: detected conn error (1020)

Revision history for this message
Gorka Eguileor (gorka) wrote :

It looks like when the extend happens, the target has "problems", which results in an iSCSI connection failure [1]:

  Sep 24 10:55:24 ubuntu-xenial-rax-ord-0002235107 kernel: connection21:0: detected conn error (1020)

Which in turn results in the iscsiadm tool returning an error when being called [2]:

  Sep 24 10:55:26.759691 ubuntu-xenial-rax-ord-0002235107 nova-compute[15157]: WARNING os_brick.initiator.connectors.iscsi [req-eb1112fc-2010-4878-b6ed-7bd6d5c65b82 req-118f3f55-90c0-47b4-98c9-9b8069625dce service nova] Couldn't find iscsi sessions because iscsiadm err: iscsiadm: could not read session targetname: 5
  Sep 24 10:55:26.760040 ubuntu-xenial-rax-ord-0002235107 nova-compute[15157]: iscsiadm: could not find session info for session22

But just a couple of seconds later the "iscsiadm -m session" command is run successfully [3] for another test.

And the connection for the first device seems to recover, and the size change is detected [4]:

  Sep 24 10:55:28 ubuntu-xenial-rax-ord-0002235107 kernel: sd 22:0:0:1: [sda] 4194304 512-byte logical blocks: (2.15 GB/2.00 GiB)
  Sep 24 10:55:28 ubuntu-xenial-rax-ord-0002235107 kernel: sda: detected capacity change from 1073741824 to 2147483648
  Sep 24 10:55:28 ubuntu-xenial-rax-ord-0002235107 kernel: sd 22:0:0:1: [sda] Synchronizing SCSI cache

In summary, it looks like the extend is the cause of the issue, as it creates connection errors that then make the iscsiadm command fail.

This can probably be fixed in os-brick by adding a simple retry decorator with exponential backoff on the extend_volume method.

But if extending a volume breaks "iscsiadm -m session", that means that the extend probably needs to be a mutually exclusive operation, or we need a new way of retrieving that information without relying on iscsiadm.

[1]: http://logs.openstack.org/69/595069/2/gate/tempest-full/169db43/controller/logs/syslog.txt.gz#_Sep_24_10_55_24
[2]: http://logs.openstack.org/69/595069/2/gate/tempest-full/169db43/controller/logs/screen-n-cpu.txt.gz#_Sep_24_10_55_26_756780
[3]: http://logs.openstack.org/69/595069/2/gate/tempest-full/169db43/controller/logs/screen-n-cpu.txt.gz#_Sep_24_10_55_27_881039
[4]: http://logs.openstack.org/69/595069/2/gate/tempest-full/169db43/controller/logs/syslog.txt.gz#_Sep_24_10_55_28
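
For illustration, a minimal sketch of the retry-with-exponential-backoff idea suggested above. This is not os-brick's actual implementation; the decorator name, exception type and delays are assumptions:

    import functools
    import time

    def retry_backoff(exc_type, retries=3, base_delay=1, backoff=2):
        """Retry the decorated call on exc_type with exponential backoff."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                delay = base_delay
                for attempt in range(retries + 1):
                    try:
                        return func(*args, **kwargs)
                    except exc_type:
                        if attempt == retries:
                            raise
                        time.sleep(delay)
                        delay *= backoff
            return wrapper
        return decorator

    # Hypothetical usage on the connector's extend_volume method:
    # @retry_backoff(exception.BrickException, retries=3)
    # def extend_volume(self, connection_properties):
    #     ...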

Revision history for this message
iain MacDonnell (imacdonn) wrote :

I can consistently reproduce the connection closure ("connectionXXX:0: detected conn error (1020)") by invoking:

tgt-admin --update iqn.2010-10.org.openstack:volume-XXX -f

which is what happens at:

https://github.com/openstack/cinder/blob/master/cinder/volume/targets/iscsi.py#L338-L339

when an attached volume gets extended.

I've not yet been able to reproduce the "iscsiadm -m session" failure, though, even when continually running it in a loop whilst reproducing the connection closure as above... so I'm not sure if these two symptoms are actually connected.

Revision history for this message
iain MacDonnell (imacdonn) wrote :

On studying this quite a bit more ... I was confused about what caused SCSI host 23 to appear (syslog) at 10:55:26. It wasn't anything to do with the volume extend test - that was using host 22, which was created by the login at 10:55:19. It wasn't anything that c-vol was doing at the time ... so what was it? Then I realised that c-bak could have been doing something as well. Sure enough:

Sep 24 10:55:26.003013 ubuntu-xenial-rax-ord-0002235107 cinder-backup[16673]: DEBUG oslo.privsep.daemon [-] privsep: request[140306203894800]: (3, 'os_brick.privileged.rootwrap.execute_root', ('iscsiadm', '-m', 'node', '-T', u'iqn.2010-10.org.openstack:volume-b8059959-d35e-4567-86cd-a2094e8a5fff', '-p', u'10.210.68.187:3260', '--login'), {'attempts': 1, 'check_exit_code': (0, 15, 255), 'delay_on_retry': True}) {{(pid=11439) loop /usr/local/lib/python2.7/dist-packages/oslo_privsep/daemon.py:443}}

So I tried adding a target manually, discovered it on the initiator, and then did "iscsiadm -m node -l", whilst continually running 'iscsiadm -m session'. That reproduced this:

iscsiadm: could not read session targetname: 5
iscsiadm: could not find session info for session519

for a few seconds, then went back to normal.

So, it appears that logging into a new target causes a temporary failure to list sessions. Seems like this may be a bug in the iscsi initiator implementation.

Revision history for this message
iain MacDonnell (imacdonn) wrote :

On further inspection, iscsiadm is actually returning 0 (success) - it's just the stderr output that's tripping up os-brick, at:

https://github.com/openstack/os-brick/blob/master/os_brick/initiator/connectors/iscsi.py#L120-L123

which produces:

Sep 24 10:55:26.759691 ubuntu-xenial-rax-ord-0002235107 nova-compute[15157]: WARNING os_brick.initiator.connectors.iscsi [req-eb1112fc-2010-4878-b6ed-7bd6d5c65b82 req-118f3f55-90c0-47b4-98c9-9b8069625dce service nova] Couldn't find iscsi sessions because iscsiadm err: iscsiadm: could not read session targetname: 5

Maybe it should be looking at the exit code instead of the stderr text (maybe just log it as a warning and move on?)...
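
For illustration, a standalone sketch of that approach (not os-brick's actual code): run 'iscsiadm -m session', log anything on stderr as a warning, and decide success or failure from the exit status alone. Treating exit code 21 as "no sessions found" is an assumption here:

    import logging
    import subprocess

    LOG = logging.getLogger(__name__)

    def get_iscsi_sessions_output():
        """Return 'iscsiadm -m session' stdout, tolerating stderr noise."""
        proc = subprocess.run(['iscsiadm', '-m', 'session'],
                              capture_output=True, text=True)
        if proc.stderr:
            # Transient warnings like "could not read session targetname: 5"
            # can appear here even when the command succeeds; don't treat
            # them as a failure.
            LOG.warning("iscsiadm stderr output: %s", proc.stderr.strip())
        if proc.returncode not in (0, 21):  # 21: no objects/sessions found
            raise RuntimeError("iscsiadm -m session failed with rc=%d"
                               % proc.returncode)
        return proc.stdout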

Revision history for this message
iain MacDonnell (imacdonn) wrote :

For discussion, at least: https://review.openstack.org/604961

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-brick (master)

Reviewed: https://review.openstack.org/604961
Committed: https://git.openstack.org/cgit/openstack/os-brick/commit/?id=d398fa82334bbabbc4c7d5b8f1fe5a7a4465200e
Submitter: Zuul
Branch: master

commit d398fa82334bbabbc4c7d5b8f1fe5a7a4465200e
Author: imacdonn <email address hidden>
Date: Tue Sep 25 03:40:31 2018 +0000

    'iscsiadm -m session' failure handling

    Sometimes 'iscsiadm -m session' outputs warnings on stderr, but
    still serves its purpose (usable stdout). We should not give up on
    it when stderr is non-empty - instead, rely on the exit status, log
    the stderr as a warning, and move on.

    This change also removes 1 (ISCSI_ERR) from the list of acceptable
    exit codes for the iscsiadm command, to ensure that failures get
    caught.

    Change-Id: Id8183cf3d8baf2f8ba6a00a47fa2ad7cc2a96aa5
    Closes-Bug: #1732199

Changed in os-brick:
status: Incomplete → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/os-brick 2.6.0

This issue was fixed in the openstack/os-brick 2.6.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/611109

Changed in nova:
assignee: nobody → melanie witt (melwitt)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/611109
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=37d046c511b418835cbd26c71cff7026c29f8dd3
Submitter: Zuul
Branch: master

commit 37d046c511b418835cbd26c71cff7026c29f8dd3
Author: melanie witt <email address hidden>
Date: Tue Oct 16 17:30:20 2018 +0000

    Bump os-brick version to 2.6.1

    This will get us bug fixes for gate bug:

      https://bugs.launchpad.net/nova/+bug/1732199

    and another bug that looks like it could affect the gate:

      https://bugs.launchpad.net/os-brick/+bug/1794829

    Closes-Bug: #1732199

    Change-Id: I709b4a1438aedcfc95d191462b4ec3f1f8c4355a

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-brick (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/643039

Revision history for this message
Matt Riedemann (mriedem) wrote :

Looks like we still have this issue going back to stable/queens in the gate:

http://logs.openstack.org/39/634439/1/gate/tempest-full/74a55a4/controller/logs/screen-n-cpu.txt.gz?level=TRACE#_Mar_13_00_06_37_698020

So we should work on backporting the os-brick fix.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-brick (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/643400

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-brick (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/643401

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-brick (stable/rocky)

Reviewed: https://review.openstack.org/643039
Committed: https://git.openstack.org/cgit/openstack/os-brick/commit/?id=80ee8f846c85a48f12828fab0de5c4c866349283
Submitter: Zuul
Branch: stable/rocky

commit 80ee8f846c85a48f12828fab0de5c4c866349283
Author: imacdonn <email address hidden>
Date: Tue Sep 25 03:40:31 2018 +0000

    'iscsiadm -m session' failure handling

    Sometimes 'iscsiadm -m session' outputs warnings on stderr, but
    still serves its purpose (usable stdout). We should not give up on
    it when stderr is non-empty - instead, rely on the exit status, log
    the stderr as a warning, and move on.

    This change also removes 1 (ISCSI_ERR) from the list of acceptable
    exit codes for the iscsiadm command, to ensure that failures get
    caught.

    Change-Id: Id8183cf3d8baf2f8ba6a00a47fa2ad7cc2a96aa5
    Closes-Bug: #1732199
    (cherry picked from commit d398fa82334bbabbc4c7d5b8f1fe5a7a4465200e)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-brick (stable/queens)

Reviewed: https://review.openstack.org/643401
Committed: https://git.openstack.org/cgit/openstack/os-brick/commit/?id=44a0486d053b946ae0123e6ebc22b3da68dd7510
Submitter: Zuul
Branch: stable/queens

commit 44a0486d053b946ae0123e6ebc22b3da68dd7510
Author: imacdonn <email address hidden>
Date: Tue Sep 25 03:40:31 2018 +0000

    'iscsiadm -m session' failure handling

    Sometimes 'iscsiadm -m session' outputs warnings on stderr, but
    still serves its purpose (usable stdout). We should not give up on
    it when stderr is non-empty - instead, rely on the exit status, log
    the stderr as a warning, and move on.

    This change also removes 1 (ISCSI_ERR) from the list of acceptable
    exit codes for the iscsiadm command, to ensure that failures get
    caught.

    Change-Id: Id8183cf3d8baf2f8ba6a00a47fa2ad7cc2a96aa5
    Closes-Bug: #1732199
    (cherry picked from commit d398fa82334bbabbc4c7d5b8f1fe5a7a4465200e)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-brick (stable/pike)

Reviewed: https://review.openstack.org/643400
Committed: https://git.openstack.org/cgit/openstack/os-brick/commit/?id=c183991a5ce3c9af8c3d5960b2fef2e45738cfdd
Submitter: Zuul
Branch: stable/pike

commit c183991a5ce3c9af8c3d5960b2fef2e45738cfdd
Author: imacdonn <email address hidden>
Date: Tue Sep 25 03:40:31 2018 +0000

    'iscsiadm -m session' failure handling

    Sometimes 'iscsiadm -m session' outputs warnings on stderr, but
    still serves its purpose (usable stdout). We should not give up on
    it when stderr is non-empty - instead, rely on the exit status, log
    the stderr as a warning, and move on.

    This change also removes 1 (ISCSI_ERR) from the list of acceptable
    exit codes for the iscsiadm command, to ensure that failures get
    caught.

    Change-Id: Id8183cf3d8baf2f8ba6a00a47fa2ad7cc2a96aa5
    Closes-Bug: #1732199
    (cherry picked from commit d398fa82334bbabbc4c7d5b8f1fe5a7a4465200e)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.0.0rc1

This issue was fixed in the openstack/nova 19.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/os-brick 1.15.8

This issue was fixed in the openstack/os-brick 1.15.8 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/os-brick 2.5.6

This issue was fixed in the openstack/os-brick 2.5.6 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/os-brick 2.3.6

This issue was fixed in the openstack/os-brick 2.3.6 release.
