tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern 'qemu-nbd: Failed to bdrv_open'

Bug #1226337 reported by Christopher Yeoh on 2013-09-17
This bug affects 7 people
Affects | Importance | Assigned to
Cinder | Critical | John Griffith
OpenStack Compute (nova) | Undecided | Unassigned
tempest | High | Unassigned

Bug Description

When running tempest, tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern fails with the server going into an ERROR state.

From the console log:

Traceback (most recent call last):
  File "tempest/scenario/test_volume_boot_pattern.py", line 154, in test_volume_boot_pattern
keypair)
  File "tempest/scenario/test_volume_boot_pattern.py", line 53, in _boot_instance_from_volume
create_kwargs=create_kwargs)
  File "tempest/scenario/manager.py", line 390, in create_server
self.status_timeout(client.servers, server.id, 'ACTIVE')
  File "tempest/scenario/manager.py", line 290, in status_timeout
self._status_timeout(things, thing_id, expected_status=expected_status)
  File "tempest/scenario/manager.py", line 338, in _status_timeout
self.config.compute.build_interval):
  File "tempest/test.py", line 237, in call_until_true
if func():
  File "tempest/scenario/manager.py", line 329, in check_status
raise exceptions.BuildErrorException(message)
  BuildErrorException: Server %(server_id)s failed to build and is in ERROR status
Details: <Server: scenario-server-89179012> failed to get to expected status. In ERROR state.
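The failing wait in the traceback (status_timeout driving call_until_true) boils down to a simple polling loop. A rough sketch, not tempest's actual code; names and defaults are illustrative:

```python
import time

def call_until_true(func, timeout, interval):
    """Poll func every `interval` seconds until it returns True
    or `timeout` seconds elapse. Returns False on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if func():
            return True
        time.sleep(interval)
    return False

def wait_for_status(get_status, expected='ACTIVE', timeout=10, interval=1):
    """Wait for a server to reach `expected`; bail out immediately if
    it lands in ERROR, which is where the BuildErrorException above
    comes from in tempest."""
    def check_status():
        status = get_status()
        if status == 'ERROR':
            raise RuntimeError('Server failed to build and is in ERROR status')
        return status == expected
    return call_until_true(check_status, timeout, interval)
```

The key point is that an ERROR state short-circuits the wait rather than letting the poll run out its full timeout.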

The exception:
http://logs.openstack.org/64/47264/2/gate/gate-tempest-devstack-vm-full/dced339/logs/screen-n-cpu.txt.gz#_2013-09-24_04_44_31_806

Logs are located here:
http://logs.openstack.org/64/47264/2/gate/gate-tempest-devstack-vm-full/dced339

-----------------

Originally the failure was (before some changes to timeouts in tempest):

t178.1: tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern[compute,image,volume]_StringException: Empty attachments:
  stderr
  stdout

pythonlogging:'': {{{
2013-09-16 15:59:44,214 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 15:59:44,417 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 15:59:45,348 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 15:59:45,495 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 15:59:47,644 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 15:59:48,762 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 15:59:49,879 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 15:59:50,980 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 16:00:52,581 Connected (version 2.0, client dropbear_2012.55)
2013-09-16 16:00:52,897 Authentication (publickey) successful!
2013-09-16 16:00:53,105 Connected (version 2.0, client dropbear_2012.55)
2013-09-16 16:00:53,428 Authentication (publickey) successful!
2013-09-16 16:00:53,431 Secsh channel 1 opened.
2013-09-16 16:00:53,607 Connected (version 2.0, client dropbear_2012.55)
2013-09-16 16:00:53,875 Authentication (publickey) successful!
2013-09-16 16:00:53,880 Secsh channel 1 opened.
2013-09-16 16:01:58,999 Connected (version 2.0, client dropbear_2012.55)
2013-09-16 16:01:59,288 Authentication (publickey) successful!
2013-09-16 16:01:59,457 Connected (version 2.0, client dropbear_2012.55)
2013-09-16 16:01:59,784 Authentication (publickey) successful!
2013-09-16 16:01:59,801 Secsh channel 1 opened.
2013-09-16 16:02:00,005 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 16:02:00,080 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 16:02:01,127 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 16:02:01,192 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 16:02:01,414 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 16:02:02,494 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 16:02:03,615 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 16:02:04,724 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 16:02:05,825 Starting new HTTP connection (1): 127.0.0.1
}}}

Traceback (most recent call last):
  File "tempest/scenario/test_volume_boot_pattern.py", line 157, in test_volume_boot_pattern
    ssh_client = self._ssh_to_server(instance_from_snapshot, keypair)
  File "tempest/scenario/test_volume_boot_pattern.py", line 101, in _ssh_to_server
    private_key=keypair.private_key)
  File "tempest/scenario/manager.py", line 453, in get_remote_client
    return RemoteClient(ip, username, pkey=private_key)
  File "tempest/common/utils/linux/remote_client.py", line 47, in __init__
    if not self.ssh_client.test_connection_auth():
  File "tempest/common/ssh.py", line 148, in test_connection_auth
    connection = self._get_ssh_connection()
  File "tempest/common/ssh.py", line 70, in _get_ssh_connection
    time.sleep(bsleep)
  File "/usr/local/lib/python2.7/dist-packages/fixtures/_fixtures/timeout.py", line 52, in signal_handler
    raise TimeoutException()
TimeoutException

Full logs here:

http://logs.openstack.org/80/43280/9/gate/gate-tempest-devstack-vm-full/cba22ae/testr_results.html.gz

Looks a bit like the instance failed to boot.

Matthew Treinish (treinish) wrote :

Logstash query: @message:"qemu-nbd: Failed to bdrv_open" AND @fields.build_status:"FAILURE" AND @fields.filename:"logs/screen-n-cpu.txt"

Matthew Treinish (treinish) wrote :

Looking at the logs for a test run, there is a repeating INFO message, which becomes a warning (after enough retries) when nova compute tries to create the server:

http://logs.openstack.org/83/47283/1/gate/gate-tempest-devstack-vm-full/cdc49f9/logs/screen-n-cpu.txt.gz#_2013-09-23_14_16_22_131

2013-09-23 14:16:20.126 INFO nova.virt.disk.mount.nbd [req-2e800637-20ac-486c-b3dd-cff14742e2a0 tempest.scenario.manager348477648-user tempest.scenario.manager1015858706-tenant] NBD mount error: qemu-nbd error: Unexpected error while running command.
Command: sudo nova-rootwrap /etc/nova/rootwrap.conf qemu-nbd -c /dev/nbd7 /opt/stack/data/nova/instances/af550eec-7246-4500-898d-b22eb45109a8/disk
Exit code: 1
Stdout: ''
Stderr: "qemu-nbd: Failed to bdrv_open '/opt/stack/data/nova/instances/af550eec-7246-4500-898d-b22eb45109a8/disk': No such file or directory\n"

tags: added: testing
Matthew Treinish (treinish) wrote :

I should scroll down more in the future; here is the actual stack trace that is causing the failure (it may be related to the above messages):

http://logs.openstack.org/83/47283/1/gate/gate-tempest-devstack-vm-full/cdc49f9/logs/screen-n-cpu.txt.gz#_2013-09-23_14_18_41_867

2013-09-23 14:18:41.867 ERROR nova.compute.manager [req-25a33c89-f4ba-4250-96e0-a122bb16b3c2 tempest.scenario.manager348477648-user tempest.scenario.manager1015858706-tenant] [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] Instance failed to spawn
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] Traceback (most recent call last):
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] File "/opt/stack/new/nova/nova/compute/manager.py", line 1408, in _spawn
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] block_device_info)
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 2068, in spawn
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] write_to_disk=True)
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 3026, in to_xml
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] disk_info, rescue, block_device_info)
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 2912, in get_guest_config
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] inst_type):
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 2738, in get_guest_storage_config
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] info)
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 1030, in volume_driver_method
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] return method(connection_info, *args, **kwargs)
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] File "/opt/stack/new/nova/nova/openstack/common/lockutils.py", line 246, in inner
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] return f(*args, **kwargs)
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] File "/opt/stack/new/nova/nova/virt/libvirt/volume.py", line 277, in connect_volume
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance...


Matthew Treinish (treinish) wrote :

A better logstash query:
@message:"NovaException: iSCSI device not found at" AND @fields.build_status:"FAILURE" AND @fields.filename:"logs/screen-n-cpu.txt"

Changed in tempest:
importance: Undecided → High
status: New → Triaged
Matthew Treinish (treinish) wrote :

Another example of this bug is:

The exception:
http://logs.openstack.org/64/47264/2/gate/gate-tempest-devstack-vm-full/dced339/logs/screen-n-cpu.txt.gz#_2013-09-24_04_44_31_806

2013-09-24 04:16:36.948 | Traceback (most recent call last):
2013-09-24 04:16:36.948 | File "tempest/scenario/test_volume_boot_pattern.py", line 154, in test_volume_boot_pattern
2013-09-24 04:16:36.948 | keypair)
2013-09-24 04:16:36.948 | File "tempest/scenario/test_volume_boot_pattern.py", line 53, in _boot_instance_from_volume
2013-09-24 04:16:36.948 | create_kwargs=create_kwargs)
2013-09-24 04:16:36.948 | File "tempest/scenario/manager.py", line 390, in create_server
2013-09-24 04:16:36.949 | self.status_timeout(client.servers, server.id, 'ACTIVE')
2013-09-24 04:16:36.949 | File "tempest/scenario/manager.py", line 290, in status_timeout
2013-09-24 04:16:36.949 | self._status_timeout(things, thing_id, expected_status=expected_status)
2013-09-24 04:16:36.949 | File "tempest/scenario/manager.py", line 338, in _status_timeout
2013-09-24 04:16:36.949 | self.config.compute.build_interval):
2013-09-24 04:16:36.949 | File "tempest/test.py", line 237, in call_until_true
2013-09-24 04:16:36.950 | if func():
2013-09-24 04:16:36.950 | File "tempest/scenario/manager.py", line 329, in check_status
2013-09-24 04:16:36.950 | raise exceptions.BuildErrorException(message)
2013-09-24 04:16:36.950 | BuildErrorException: Server %(server_id)s failed to build and is in ERROR status
2013-09-24 04:16:36.950 | Details: <Server: scenario-server-89179012> failed to get to expected status. In ERROR state.

description: updated
Attila Fazekas (afazekas) wrote :

Not a tempest bug. I would like to see this treated as a critical bug in another project (nova, cinder, libvirt, or tgtd, ...) before skipping the test.

Nikola Đipanov (ndipanov) wrote :

This does not seem like a nova bug. As per the error that was extracted in comment #3 (thanks Matthew) - this error occurs when nova cannot find the iscsi device after CONF.num_iscsi_scan_tries (by default 3), sleeping try_no ** 2 secs in between - so 14 secs by default. A simple solution might be to increase this limit for tempest runs.
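The retry behaviour Nikola describes can be sketched as follows. This is a simplified illustration with injectable hooks, not nova's actual code; the 1 + 4 + 9 = 14 second total assumes the default of 3 tries:

```python
import os
import time

NUM_ISCSI_SCAN_TRIES = 3  # nova's default, per the comment above

def wait_for_iscsi_device(device_path, tries=NUM_ISCSI_SCAN_TRIES,
                          exists=os.path.exists, rescan=lambda: None,
                          sleep=time.sleep):
    """Retry until the iSCSI device node appears, sleeping try_no ** 2
    seconds between attempts (1 + 4 + 9 = 14 s with the default of 3)."""
    for try_no in range(1, tries + 1):
        if exists(device_path):
            return device_path
        rescan()  # nova re-triggers an iscsiadm rescan at this point
        sleep(try_no ** 2)
    raise RuntimeError('iSCSI device not found at %s' % device_path)
```

With a slow target, 14 seconds is evidently not always enough, which is what the "NovaException: iSCSI device not found at" logstash query in the next comment is matching on.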

Joe Gordon (jogo) wrote :

Adding cinder.

Joe Gordon (jogo) on 2013-09-26
tags: added: havana-rc-potential
John Griffith (john-griffith) wrote :

Nikola, Joe, et al.:

Am I missing something here in terms of Cinder being the cause of the server not coming up? Any insight?

John Griffith (john-griffith) wrote :

Looking at this, here's my take:

* We create the volume and target in Cinder
* Nova grabs the volume info and starts building the instance
* Nova issues the discovery, and actually successfully logs in to the target
* Nova attempts to actually make the attach and map to /dev/disk-by-path

The last step is the one that never seems to complete within the allowed time and the given number of retries.

I don't have a good theory here, as I haven't been able to reproduce the error. The only thing I can suggest at this point is that we either increase the retry exponent from 2 to 3, which gives us almost 30 seconds before the last retry, or accomplish the same thing by raising the default for CONF.num_iscsi_scan_tries.
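A quick back-of-envelope check of those numbers, assuming 3 tries and a sleep of try_no ** exponent seconds between attempts:

```python
# Cumulative sleep before giving up, assuming `tries` attempts and a
# sleep of try_no ** exponent seconds between attempts.
def total_wait(exponent, tries=3):
    return sum(try_no ** exponent for try_no in range(1, tries + 1))

print(total_wait(2))  # 14: 1 + 4 + 9 (current behaviour)
print(total_wait(3))  # 36: 1 + 8 + 27 (with an exponent of 3)
```

So bumping the exponent from 2 to 3 roughly two-and-a-half-times the total wait under these assumptions.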


Additional logging: it appears the target is no good. From a small sample, it appears LUN 1 is missing. Looking into how we might address this.

Brian D. Burns (iosctr) wrote :

I received this as well. Logs: http://logs.openstack.org/63/48863/1/check/check-tempest-devstack-vm-full/318679b/

However, I wanted to point out that I also received a failure in "process-returncode".

ft1.1: process-returncode_StringException: Binary content:
  traceback (test/plain;charset=utf8)

Shouldn't that be "text/plain"?

See also the end of http://logs.openstack.org/63/48863/1/check/check-tempest-devstack-vm-full/318679b/subunit_log.txt.gz

tags: -worker-0
time: 2013-09-29 20:22:01.851443Z
tags: worker-0
test: process-returncode
time: 2013-09-29 20:22:01.851443Z
failure: process-returncode [ multipart
Content-Type: test/plain;charset=utf8
traceback
C
returncode 10
]

Attila Fazekas (afazekas) wrote :

In the syslog, the initiator-related messages are shorter than usual:

The message on failure:
Sep 24 04:44:17 devstack-precise-hpcloud-az3-378359 kernel: [ 4997.993316] scsi7 : iSCSI Initiator over TCP/IP
Sep 24 04:44:17 devstack-precise-hpcloud-az3-378359 kernel: [ 4998.499703] scsi 7:0:0:0: RAID IET Controller 0001 PQ: 0 ANSI: 5
Sep 24 04:44:17 devstack-precise-hpcloud-az3-378359 kernel: [ 4998.500015] scsi 7:0:0:0: Attached scsi generic sg4 type 12
Sep 24 04:44:17 devstack-precise-hpcloud-az3-378359 iscsid: connection6:0 is operational now

The usual message looks like:
Sep 24 04:43:13 devstack-precise-hpcloud-az3-378359 kernel: [ 4934.758730] scsi6 : iSCSI Initiator over TCP/IP
Sep 24 04:43:14 devstack-precise-hpcloud-az3-378359 kernel: [ 4935.266718] scsi 6:0:0:0: RAID IET Controller 0001 PQ: 0 ANSI: 5
Sep 24 04:43:14 devstack-precise-hpcloud-az3-378359 kernel: [ 4935.267148] scsi 6:0:0:0: Attached scsi generic sg2 type 12
Sep 24 04:43:14 devstack-precise-hpcloud-az3-378359 kernel: [ 4935.268207] scsi 6:0:0:1: Direct-Access IET VIRTUAL-DISK 0001 PQ: 0 ANSI: 5
Sep 24 04:43:14 devstack-precise-hpcloud-az3-378359 kernel: [ 4935.269186] sd 6:0:0:1: Attached scsi generic sg3 type 0
Sep 24 04:43:14 devstack-precise-hpcloud-az3-378359 kernel: [ 4935.272887] sd 6:0:0:1: [sdb] 2097152 512-byte logical blocks: (1.07 GB/1.00 GiB)
Sep 24 04:43:14 devstack-precise-hpcloud-az3-378359 kernel: [ 4935.273303] sd 6:0:0:1: [sdb] Write Protect is off
Sep 24 04:43:14 devstack-precise-hpcloud-az3-378359 kernel: [ 4935.273309] sd 6:0:0:1: [sdb] Mode Sense: 49 00 00 08
Sep 24 04:43:14 devstack-precise-hpcloud-az3-378359 kernel: [ 4935.273617] sd 6:0:0:1: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Sep 24 04:43:14 devstack-precise-hpcloud-az3-378359 kernel: [ 4935.275856] sdb: unknown partition table
Sep 24 04:43:14 devstack-precise-hpcloud-az3-378359 kernel: [ 4935.277354] sd 6:0:0:1: [sdb] Attached SCSI disk
Sep 24 04:43:14 devstack-precise-hpcloud-az3-378359 iscsid: connection5:0 is operational now

After nova/libvirt gives up (the usual epilogue):
Sep 24 04:44:32 devstack-precise-hpcloud-az3-378359 kernel: [ 5013.039383] sd 4:0:0:1: [sda] Synchronizing SCSI cache
Sep 24 04:44:32 devstack-precise-hpcloud-az3-378359 kernel: [ 5013.291661] connection3:0: detected conn error (1020)
Sep 24 04:44:32 devstack-precise-hpcloud-az3-378359 tgtd: conn_close(101) connection closed, 0x22f8f28 1
Sep 24 04:44:32 devstack-precise-hpcloud-az3-378359 tgtd: conn_close(107) sesson 0x22fa100 1
Sep 24 04:44:33 devstack-precise-hpcloud-az3-378359 kernel: [ 5014.128859] connection6:0: detected conn error (1020)
Sep 24 04:44:33 devstack-precise-hpcloud-az3-378359 tgtd: conn_close(101) connection closed, 0x230c418 1
Sep 24 04:44:33 devstack-precise-hpcloud-az3-378359 tgtd: conn_close(107) sesson 0x230c6e0 1

But the prior [sda]-related message is missing!

Fix proposed to branch: master
Review: https://review.openstack.org/48970

Changed in cinder:
assignee: nobody → John Griffith (john-griffith)
status: New → In Progress

Reviewed: https://review.openstack.org/48970
Committed: http://github.com/openstack/cinder/commit/d9514d796185220715b09c8c17e71eeb1337ba2f
Submitter: Jenkins
Branch: master

commit d9514d796185220715b09c8c17e71eeb1337ba2f
Author: John Griffith <email address hidden>
Date: Mon Sep 30 08:45:52 2013 -0600

    Dont retry if target creation succeeds

    The target creation retry loop was not breaking when
    the target was succesfully created. This is "sometimes" ok,
    however in other cases this will actually cause a failure in
    the second create and the backing Lun will be deleted and not
    created again succesfully due most likely to a busy status on the
    target.

    Add a break in the try block.

    Change-Id: I875f6302868651b5b42d4796cd2714fba35e525e
    Closes-Bug: #1226337
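The missing break the commit describes is the classic retry-loop bug: without it, the loop re-runs the creation after a successful attempt. A schematic illustration, not the actual cinder code; create() stands in for the tgtadm target-creation call:

```python
def create_target_with_retry(create, tries=3):
    """Retry target creation, stopping as soon as it succeeds. Without
    the `break`, a second create() after success could delete the
    backing LUN and fail to recreate it (e.g. target busy)."""
    last_exc = None
    for attempt in range(tries):
        try:
            tid = create()
            break  # <-- the fix: stop once creation succeeds
        except Exception as exc:
            last_exc = exc
    else:
        # only reached if every attempt raised
        raise last_exc
    return tid
```

The for/else here raises only when all attempts failed; any success exits the loop immediately.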

Changed in cinder:
status: In Progress → Fix Committed
Alan Pevec (apevec) on 2013-10-01
tags: added: grizzly-backport-potential
Alan Pevec (apevec) on 2013-10-01
tags: removed: grizzly-backport-potential

I believe this can be marked invalid for Nova.

Changed in nova:
status: New → Invalid
Dolph Mathews (dolph) on 2013-10-01
description: updated
Joe Gordon (jogo) wrote :

Looks like this is still happening: http://logs.openstack.org/07/49107/1/check/check-tempest-devstack-vm-postgres-full/6e09abc/logs/screen-n-cpu.txt.gz#_2013-10-01_10_58_48_112

But with a much much lower frequency. logstash reports 4 hits since the fix (https://review.openstack.org/48970) merged.

I think this is worth re-opening, but not as critical.

Reopened and left as critical, although the frequency is in fact significantly lower. I believe the follow-up patch in 49229 will cover the rest of it for us once we get it to land.

I don't want to downgrade from critical until it lands and I'm feeling more confident about this particular issue.

Changed in cinder:
status: Fix Committed → In Progress
importance: Undecided → Critical
milestone: none → havana-rc1
Joe Gordon (jogo) on 2013-10-02
summary: - tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern flake
- failure
+ tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern 'qemu-
+ nbd: Failed to bdrv_open'

Reviewed: https://review.openstack.org/49229
Committed: http://github.com/openstack/cinder/commit/1aef7b6304af6c19b8ceb81cbd9a3c83bb4880de
Submitter: Jenkins
Branch: master

commit 1aef7b6304af6c19b8ceb81cbd9a3c83bb4880de
Author: John Griffith <email address hidden>
Date: Tue Oct 1 12:05:14 2013 -0600

    Check for backing lun on iscsi target create

    Check to verify the backing lun was actually created and not just
    the controller lun. If it was NOT created, attempt to issue
    tgtadm --op new to see if we can recover.

    If this fails, then we want to actually fail in Cinder rather than
    pretending that everything went well, so we'll log the error and raise.

    Change-Id: I3cabab0d214c051267638a627664df2b673236e3
    Closes-Bug: #1226337
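The verify-and-recover pattern this commit describes can be sketched schematically; show_target and create_lun are hypothetical stand-ins for the actual tgtadm invocations:

```python
def ensure_backing_lun(show_target, create_lun, target_id):
    """After target creation, verify the backing LUN (LUN 1) exists and
    not just the controller LUN (LUN 0). Attempt one recovery, then
    fail loudly in Cinder instead of pretending everything went well."""
    if 'LUN: 1' in show_target(target_id):
        return True
    create_lun(target_id)  # recovery attempt: re-issue `tgtadm --op new`
    if 'LUN: 1' in show_target(target_id):
        return True
    raise RuntimeError('Failed to create backing LUN for target %s' % target_id)
```

This matches Attila's syslog observation above: the failing attaches show only the "RAID IET Controller" (controller LUN) line and never the "Direct-Access IET VIRTUAL-DISK" backing LUN.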

Changed in cinder:
status: In Progress → Fix Committed
Thierry Carrez (ttx) on 2013-10-04
Changed in cinder:
status: Fix Committed → Fix Released
Changed in tempest:
status: Triaged → Invalid
Thierry Carrez (ttx) on 2013-10-17
Changed in cinder:
milestone: havana-rc1 → 2013.2
Attila Fazekas (afazekas) wrote :

Elastic recheck can suggest this bug number even if the test case did not fail.

Attila Fazekas (afazekas) wrote :

https://review.openstack.org/#/c/52740/ shows that the test_stamp_pattern test case also triggers this bug, and it is not fixed there.
