tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern 'qemu-nbd: Failed to bdrv_open'

Bug #1226337 reported by Christopher Yeoh on 2013-09-17
This bug affects 7 people
Affects | Importance | Assigned to
Cinder | Critical | John Griffith
OpenStack Compute (nova) | Undecided | Unassigned
tempest | High | Unassigned

Bug Description

When running tempest, tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern fails with the server going into an ERROR state.

From the console log:

Traceback (most recent call last):
  File "tempest/scenario/test_volume_boot_pattern.py", line 154, in test_volume_boot_pattern
keypair)
  File "tempest/scenario/test_volume_boot_pattern.py", line 53, in _boot_instance_from_volume
create_kwargs=create_kwargs)
  File "tempest/scenario/manager.py", line 390, in create_server
self.status_timeout(client.servers, server.id, 'ACTIVE')
  File "tempest/scenario/manager.py", line 290, in status_timeout
self._status_timeout(things, thing_id, expected_status=expected_status)
  File "tempest/scenario/manager.py", line 338, in _status_timeout
self.config.compute.build_interval):
  File "tempest/test.py", line 237, in call_until_true
if func():
  File "tempest/scenario/manager.py", line 329, in check_status
raise exceptions.BuildErrorException(message)
  BuildErrorException: Server %(server_id)s failed to build and is in ERROR status
Details: <Server: scenario-server-89179012> failed to get to expected status. In ERROR state.
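The failing wait in the traceback (status_timeout driving call_until_true) boils down to a simple polling loop. A rough sketch, not tempest's actual code; names and defaults are illustrative:

```python
import time

def call_until_true(func, timeout, interval):
    """Poll func every `interval` seconds until it returns True
    or `timeout` seconds elapse. Returns False on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if func():
            return True
        time.sleep(interval)
    return False

def wait_for_status(get_status, expected='ACTIVE', timeout=10, interval=1):
    """Wait for a server to reach `expected`; bail out immediately if
    it lands in ERROR, which is where the BuildErrorException above
    comes from in tempest."""
    def check_status():
        status = get_status()
        if status == 'ERROR':
            raise RuntimeError('Server failed to build and is in ERROR status')
        return status == expected
    return call_until_true(check_status, timeout, interval)
```

The key point is that an ERROR state short-circuits the wait rather than letting the poll run out its full timeout.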

The exception:
http://logs.openstack.org/64/47264/2/gate/gate-tempest-devstack-vm-full/dced339/logs/screen-n-cpu.txt.gz#_2013-09-24_04_44_31_806

Logs are located here:
http://logs.openstack.org/64/47264/2/gate/gate-tempest-devstack-vm-full/dced339

-----------------

Originally the failure was (before some changes to timeouts in tempest):

t178.1: tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern[compute,image,volume]_StringException: Empty attachments:
  stderr
  stdout

pythonlogging:'': {{{
2013-09-16 15:59:44,214 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 15:59:44,417 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 15:59:45,348 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 15:59:45,495 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 15:59:47,644 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 15:59:48,762 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 15:59:49,879 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 15:59:50,980 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 16:00:52,581 Connected (version 2.0, client dropbear_2012.55)
2013-09-16 16:00:52,897 Authentication (publickey) successful!
2013-09-16 16:00:53,105 Connected (version 2.0, client dropbear_2012.55)
2013-09-16 16:00:53,428 Authentication (publickey) successful!
2013-09-16 16:00:53,431 Secsh channel 1 opened.
2013-09-16 16:00:53,607 Connected (version 2.0, client dropbear_2012.55)
2013-09-16 16:00:53,875 Authentication (publickey) successful!
2013-09-16 16:00:53,880 Secsh channel 1 opened.
2013-09-16 16:01:58,999 Connected (version 2.0, client dropbear_2012.55)
2013-09-16 16:01:59,288 Authentication (publickey) successful!
2013-09-16 16:01:59,457 Connected (version 2.0, client dropbear_2012.55)
2013-09-16 16:01:59,784 Authentication (publickey) successful!
2013-09-16 16:01:59,801 Secsh channel 1 opened.
2013-09-16 16:02:00,005 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 16:02:00,080 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 16:02:01,127 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 16:02:01,192 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 16:02:01,414 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 16:02:02,494 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 16:02:03,615 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 16:02:04,724 Starting new HTTP connection (1): 127.0.0.1
2013-09-16 16:02:05,825 Starting new HTTP connection (1): 127.0.0.1
}}}

Traceback (most recent call last):
  File "tempest/scenario/test_volume_boot_pattern.py", line 157, in test_volume_boot_pattern
    ssh_client = self._ssh_to_server(instance_from_snapshot, keypair)
  File "tempest/scenario/test_volume_boot_pattern.py", line 101, in _ssh_to_server
    private_key=keypair.private_key)
  File "tempest/scenario/manager.py", line 453, in get_remote_client
    return RemoteClient(ip, username, pkey=private_key)
  File "tempest/common/utils/linux/remote_client.py", line 47, in __init__
    if not self.ssh_client.test_connection_auth():
  File "tempest/common/ssh.py", line 148, in test_connection_auth
    connection = self._get_ssh_connection()
  File "tempest/common/ssh.py", line 70, in _get_ssh_connection
    time.sleep(bsleep)
  File "/usr/local/lib/python2.7/dist-packages/fixtures/_fixtures/timeout.py", line 52, in signal_handler
    raise TimeoutException()
TimeoutException

Full logs here:

http://logs.openstack.org/80/43280/9/gate/gate-tempest-devstack-vm-full/cba22ae/testr_results.html.gz

Looks a bit like the instance failed to boot.

Matthew Treinish (treinish) wrote :

Logstash query: @message:"qemu-nbd: Failed to bdrv_open" AND @fields.build_status:"FAILURE" AND @fields.filename:"logs/screen-n-cpu.txt"

Matthew Treinish (treinish) wrote :

Looking at the logs for a test run, there is a repeating INFO message, which becomes a warning (after enough retries) when nova compute tries to create the server:

http://logs.openstack.org/83/47283/1/gate/gate-tempest-devstack-vm-full/cdc49f9/logs/screen-n-cpu.txt.gz#_2013-09-23_14_16_22_131

2013-09-23 14:16:20.126 INFO nova.virt.disk.mount.nbd [req-2e800637-20ac-486c-b3dd-cff14742e2a0 tempest.scenario.manager348477648-user tempest.scenario.manager1015858706-tenant] NBD mount error: qemu-nbd error: Unexpected error while running command.
Command: sudo nova-rootwrap /etc/nova/rootwrap.conf qemu-nbd -c /dev/nbd7 /opt/stack/data/nova/instances/af550eec-7246-4500-898d-b22eb45109a8/disk
Exit code: 1
Stdout: ''
Stderr: "qemu-nbd: Failed to bdrv_open '/opt/stack/data/nova/instances/af550eec-7246-4500-898d-b22eb45109a8/disk': No such file or directory\n"

tags: added: testing
Matthew Treinish (treinish) wrote :

I should scroll down more in the future; here is the actual stack trace that is causing the failure (it may be related to the above messages):

http://logs.openstack.org/83/47283/1/gate/gate-tempest-devstack-vm-full/cdc49f9/logs/screen-n-cpu.txt.gz#_2013-09-23_14_18_41_867

2013-09-23 14:18:41.867 ERROR nova.compute.manager [req-25a33c89-f4ba-4250-96e0-a122bb16b3c2 tempest.scenario.manager348477648-user tempest.scenario.manager1015858706-tenant] [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] Instance failed to spawn
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] Traceback (most recent call last):
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] File "/opt/stack/new/nova/nova/compute/manager.py", line 1408, in _spawn
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] block_device_info)
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 2068, in spawn
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] write_to_disk=True)
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 3026, in to_xml
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] disk_info, rescue, block_device_info)
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 2912, in get_guest_config
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] inst_type):
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 2738, in get_guest_storage_config
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] info)
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 1030, in volume_driver_method
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] return method(connection_info, *args, **kwargs)
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] File "/opt/stack/new/nova/nova/openstack/common/lockutils.py", line 246, in inner
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] return f(*args, **kwargs)
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance: 196feb6a-fdd7-4ff6-9fb9-863b92c834a7] File "/opt/stack/new/nova/nova/virt/libvirt/volume.py", line 277, in connect_volume
2013-09-23 14:18:41.867 23218 TRACE nova.compute.manager [instance...


Matthew Treinish (treinish) wrote :

A better logstash query:
@message:"NovaException: iSCSI device not found at" AND @fields.build_status:"FAILURE" AND @fields.filename:"logs/screen-n-cpu.txt"

Changed in tempest:
importance: Undecided → High
status: New → Triaged
Matthew Treinish (treinish) wrote :

Another example of this bug is:

The exception:
http://logs.openstack.org/64/47264/2/gate/gate-tempest-devstack-vm-full/dced339/logs/screen-n-cpu.txt.gz#_2013-09-24_04_44_31_806

2013-09-24 04:16:36.948 | Traceback (most recent call last):
2013-09-24 04:16:36.948 | File "tempest/scenario/test_volume_boot_pattern.py", line 154, in test_volume_boot_pattern
2013-09-24 04:16:36.948 | keypair)
2013-09-24 04:16:36.948 | File "tempest/scenario/test_volume_boot_pattern.py", line 53, in _boot_instance_from_volume
2013-09-24 04:16:36.948 | create_kwargs=create_kwargs)
2013-09-24 04:16:36.948 | File "tempest/scenario/manager.py", line 390, in create_server
2013-09-24 04:16:36.949 | self.status_timeout(client.servers, server.id, 'ACTIVE')
2013-09-24 04:16:36.949 | File "tempest/scenario/manager.py", line 290, in status_timeout
2013-09-24 04:16:36.949 | self._status_timeout(things, thing_id, expected_status=expected_status)
2013-09-24 04:16:36.949 | File "tempest/scenario/manager.py", line 338, in _status_timeout
2013-09-24 04:16:36.949 | self.config.compute.build_interval):
2013-09-24 04:16:36.949 | File "tempest/test.py", line 237, in call_until_true
2013-09-24 04:16:36.950 | if func():
2013-09-24 04:16:36.950 | File "tempest/scenario/manager.py", line 329, in check_status
2013-09-24 04:16:36.950 | raise exceptions.BuildErrorException(message)
2013-09-24 04:16:36.950 | BuildErrorException: Server %(server_id)s failed to build and is in ERROR status
2013-09-24 04:16:36.950 | Details: <Server: scenario-server-89179012> failed to get to expected status. In ERROR state.

description: updated
Attila Fazekas (afazekas) wrote :

Not a tempest bug. I would like to see this treated as a critical bug in another project (nova, cinder, libvirt, or tgtd, ...) before skipping the test.

Nikola Đipanov (ndipanov) wrote :

This does not seem like a nova bug. As per the error that was extracted in comment #3 (thanks Matthew) - this error occurs when nova cannot find the iscsi device after CONF.num_iscsi_scan_tries (by default 3), sleeping try_no ** 2 secs in between - so 14 secs by default. A simple solution might be to increase this limit for tempest runs.
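The retry behaviour Nikola describes can be sketched as follows. This is a simplified illustration with injectable hooks, not nova's actual code; the 1 + 4 + 9 = 14 second total assumes the default of 3 tries:

```python
import os
import time

NUM_ISCSI_SCAN_TRIES = 3  # nova's default, per the comment above

def wait_for_iscsi_device(device_path, tries=NUM_ISCSI_SCAN_TRIES,
                          exists=os.path.exists, rescan=lambda: None,
                          sleep=time.sleep):
    """Retry until the iSCSI device node appears, sleeping try_no ** 2
    seconds between attempts (1 + 4 + 9 = 14 s with the default of 3)."""
    for try_no in range(1, tries + 1):
        if exists(device_path):
            return device_path
        rescan()  # nova re-triggers an iscsiadm rescan at this point
        sleep(try_no ** 2)
    raise RuntimeError('iSCSI device not found at %s' % device_path)
```

With a slow target, 14 seconds is evidently not always enough, which is what the "NovaException: iSCSI device not found at" logstash query in the next comment is matching on.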

Joe Gordon (jogo) wrote :

Adding cinder.

Joe Gordon (jogo) on 2013-09-26
tags: added: havana-rc-potential
John Griffith (john-griffith) wrote :

Nikola, Joe, et al.:

Am I missing something here in terms of Cinder being the cause of the server not coming up? Any insight?

John Griffith (john-griffith) wrote :

Looking at this, here's my take:

* We create the volume and target in Cinder
* Nova grabs the volume info and starts building the instance
* Nova issues the discovery, and actually successfully logs in to the target
* Nova attempts to actually make the attach and map to /dev/disk-by-path

The last step is the one that never seems to complete within the allowed time and the given number of retries.

I don't have a good theory here, as I haven't been able to reproduce the error. The only thing I can suggest at this point is that we either increase the retry exponent from 2 to 3, which gives us almost 30 seconds before the last retry, or accomplish the same thing by raising the default for CONF.num_iscsi_scan_tries.
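A quick back-of-envelope check of those numbers, assuming 3 tries and a sleep of try_no ** exponent seconds between attempts:

```python
# Cumulative sleep before giving up, assuming `tries` attempts and a
# sleep of try_no ** exponent seconds between attempts.
def total_wait(exponent, tries=3):
    return sum(try_no ** exponent for try_no in range(1, tries + 1))

print(total_wait(2))  # 14: 1 + 4 + 9 (current behaviour)
print(total_wait(3))  # 36: 1 + 8 + 27 (with an exponent of 3)
```

So bumping the exponent from 2 to 3 roughly two-and-a-half-times the total wait under these assumptions.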


Additional logging: it appears the target is no good. From a small sample, it appears LUN 1 is missing. Looking into how we might address this.

Brian D. Burns (iosctr) wrote :

I received this as well. Logs: http://logs.openstack.org/63/48863/1/check/check-tempest-devstack-vm-full/318679b/

However, I wanted to point out that I also received a failure in "process-returncode".

ft1.1: process-returncode_StringException: Binary content:
  traceback (test/plain;charset=utf8)

Shouldn't that be "text/plain"?

See also the end of http://logs.openstack.org/63/48863/1/check/check-tempest-devstack-vm-full/318679b/subunit_log.txt.gz

tags: -worker-0
time: 2013-09-29 20:22:01.851443Z
tags: worker-0
test: process-returncode
time: 2013-09-29 20:22:01.851443Z
failure: process-returncode [ multipart
Content-Type: test/plain;charset=utf8
traceback
C
returncode 10
]

Attila Fazekas (afazekas) wrote :

In the syslog, the initiator-related messages are shorter than usual:

The message on failure:
Sep 24 04:44:17 devstack-precise-hpcloud-az3-378359 kernel: [ 4997.993316] scsi7 : iSCSI Initiator over TCP/IP
Sep 24 04:44:17 devstack-precise-hpcloud-az3-378359 kernel: [ 4998.499703] scsi 7:0:0:0: RAID IET Controller 0001 PQ: 0 ANSI: 5
Sep 24 04:44:17 devstack-precise-hpcloud-az3-378359 kernel: [ 4998.500015] scsi 7:0:0:0: Attached scsi generic sg4 type 12
Sep 24 04:44:17 devstack-precise-hpcloud-az3-378359 iscsid: connection6:0 is operational now

The usual message looks like:
Sep 24 04:43:13 devstack-precise-hpcloud-az3-378359 kernel: [ 4934.758730] scsi6 : iSCSI Initiator over TCP/IP
Sep 24 04:43:14 devstack-precise-hpcloud-az3-378359 kernel: [ 4935.266718] scsi 6:0:0:0: RAID IET Controller 0001 PQ: 0 ANSI: 5
Sep 24 04:43:14 devstack-precise-hpcloud-az3-378359 kernel: [ 4935.267148] scsi 6:0:0:0: Attached scsi generic sg2 type 12
Sep 24 04:43:14 devstack-precise-hpcloud-az3-378359 kernel: [ 4935.268207] scsi 6:0:0:1: Direct-Access IET VIRTUAL-DISK 0001 PQ: 0 ANSI: 5
Sep 24 04:43:14 devstack-precise-hpcloud-az3-378359 kernel: [ 4935.269186] sd 6:0:0:1: Attached scsi generic sg3 type 0
Sep 24 04:43:14 devstack-precise-hpcloud-az3-378359 kernel: [ 4935.272887] sd 6:0:0:1: [sdb] 2097152 512-byte logical blocks: (1.07 GB/1.00 GiB)
Sep 24 04:43:14 devstack-precise-hpcloud-az3-378359 kernel: [ 4935.273303] sd 6:0:0:1: [sdb] Write Protect is off
Sep 24 04:43:14 devstack-precise-hpcloud-az3-378359 kernel: [ 4935.273309] sd 6:0:0:1: [sdb] Mode Sense: 49 00 00 08
Sep 24 04:43:14 devstack-precise-hpcloud-az3-378359 kernel: [ 4935.273617] sd 6:0:0:1: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Sep 24 04:43:14 devstack-precise-hpcloud-az3-378359 kernel: [ 4935.275856] sdb: unknown partition table
Sep 24 04:43:14 devstack-precise-hpcloud-az3-378359 kernel: [ 4935.277354] sd 6:0:0:1: [sdb] Attached SCSI disk
Sep 24 04:43:14 devstack-precise-hpcloud-az3-378359 iscsid: connection5:0 is operational now

After nova/libvirt gives up (the usual epilogue):
Sep 24 04:44:32 devstack-precise-hpcloud-az3-378359 kernel: [ 5013.039383] sd 4:0:0:1: [sda] Synchronizing SCSI cache
Sep 24 04:44:32 devstack-precise-hpcloud-az3-378359 kernel: [ 5013.291661] connection3:0: detected conn error (1020)
Sep 24 04:44:32 devstack-precise-hpcloud-az3-378359 tgtd: conn_close(101) connection closed, 0x22f8f28 1
Sep 24 04:44:32 devstack-precise-hpcloud-az3-378359 tgtd: conn_close(107) sesson 0x22fa100 1
Sep 24 04:44:33 devstack-precise-hpcloud-az3-378359 kernel: [ 5014.128859] connection6:0: detected conn error (1020)
Sep 24 04:44:33 devstack-precise-hpcloud-az3-378359 tgtd: conn_close(101) connection closed, 0x230c418 1
Sep 24 04:44:33 devstack-precise-hpcloud-az3-378359 tgtd: conn_close(107) sesson 0x230c6e0 1

But the prior [sda]-related message is missing!

Fix proposed to branch: master
Review: https://review.openstack.org/48970

Changed in cinder:
assignee: nobody → John Griffith (john-griffith)
status: New → In Progress

Reviewed: https://review.openstack.org/48970
Committed: http://github.com/openstack/cinder/commit/d9514d796185220715b09c8c17e71eeb1337ba2f
Submitter: Jenkins
Branch: master

commit d9514d796185220715b09c8c17e71eeb1337ba2f
Author: John Griffith <email address hidden>
Date: Mon Sep 30 08:45:52 2013 -0600

    Dont retry if target creation succeeds

    The target creation retry loop was not breaking when
    the target was succesfully created. This is "sometimes" ok,
    however in other cases this will actually cause a failure in
    the second create and the backing Lun will be deleted and not
    created again succesfully due most likely to a busy status on the
    target.

    Add a break in the try block.

    Change-Id: I875f6302868651b5b42d4796cd2714fba35e525e
    Closes-Bug: #1226337
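The missing break the commit describes is the classic retry-loop bug: without it, the loop re-runs the creation after a successful attempt. A schematic illustration, not the actual cinder code; create() stands in for the tgtadm target-creation call:

```python
def create_target_with_retry(create, tries=3):
    """Retry target creation, stopping as soon as it succeeds. Without
    the `break`, a second create() after success could delete the
    backing LUN and fail to recreate it (e.g. target busy)."""
    last_exc = None
    for attempt in range(tries):
        try:
            tid = create()
            break  # <-- the fix: stop once creation succeeds
        except Exception as exc:
            last_exc = exc
    else:
        # only reached if every attempt raised
        raise last_exc
    return tid
```

The for/else here raises only when all attempts failed; any success exits the loop immediately.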

Changed in cinder:
status: In Progress → Fix Committed
Alan Pevec (apevec) on 2013-10-01
tags: added: grizzly-backport-potential
Alan Pevec (apevec) on 2013-10-01
tags: removed: grizzly-backport-potential

I believe this can be marked invalid for Nova.

Changed in nova:
status: New → Invalid
Dolph Mathews (dolph) on 2013-10-01
description: updated
Joe Gordon (jogo) wrote :

Looks like this is still happening: http://logs.openstack.org/07/49107/1/check/check-tempest-devstack-vm-postgres-full/6e09abc/logs/screen-n-cpu.txt.gz#_2013-10-01_10_58_48_112

But with a much much lower frequency. logstash reports 4 hits since the fix (https://review.openstack.org/48970) merged.

I think this is worth re-opening, but not as critical.

Reopened and left as critical, although the frequency is in fact significantly lower. I believe the follow-up patch in 49229 will cover the rest of it for us once we get it to land.

I don't want to downgrade from critical until it lands and I'm feeling more confident about this particular issue.

Changed in cinder:
status: Fix Committed → In Progress
importance: Undecided → Critical
milestone: none → havana-rc1
Joe Gordon (jogo) on 2013-10-02
summary: - tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern flake
- failure
+ tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern 'qemu-
+ nbd: Failed to bdrv_open'

Reviewed: https://review.openstack.org/49229
Committed: http://github.com/openstack/cinder/commit/1aef7b6304af6c19b8ceb81cbd9a3c83bb4880de
Submitter: Jenkins
Branch: master

commit 1aef7b6304af6c19b8ceb81cbd9a3c83bb4880de
Author: John Griffith <email address hidden>
Date: Tue Oct 1 12:05:14 2013 -0600

    Check for backing lun on iscsi target create

    Check to verify the backing lun was actually created and not just
    the controller lun. If it was NOT created, attempt to issue
    tgtadm --op new to see if we can recover.

    If this fails, then we want to actually fail in Cinder rather than
    pretending that everything went well, so we'll log the error and raise.

    Change-Id: I3cabab0d214c051267638a627664df2b673236e3
    Closes-Bug: #1226337
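The verify-and-recover pattern this commit describes can be sketched schematically; show_target and create_lun are hypothetical stand-ins for the actual tgtadm invocations:

```python
def ensure_backing_lun(show_target, create_lun, target_id):
    """After target creation, verify the backing LUN (LUN 1) exists and
    not just the controller LUN (LUN 0). Attempt one recovery, then
    fail loudly in Cinder instead of pretending everything went well."""
    if 'LUN: 1' in show_target(target_id):
        return True
    create_lun(target_id)  # recovery attempt: re-issue `tgtadm --op new`
    if 'LUN: 1' in show_target(target_id):
        return True
    raise RuntimeError('Failed to create backing LUN for target %s' % target_id)
```

This matches Attila's syslog observation above: the failing attaches show only the "RAID IET Controller" (controller LUN) line and never the "Direct-Access IET VIRTUAL-DISK" backing LUN.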

Changed in cinder:
status: In Progress → Fix Committed
Thierry Carrez (ttx) on 2013-10-04
Changed in cinder:
status: Fix Committed → Fix Released
Changed in tempest:
status: Triaged → Invalid
Thierry Carrez (ttx) on 2013-10-17
Changed in cinder:
milestone: havana-rc1 → 2013.2
Attila Fazekas (afazekas) wrote :

Elastic recheck can suggest this bug number even if the test case did not fail.

Attila Fazekas (afazekas) wrote :

https://review.openstack.org/#/c/52740/ shows that the test_stamp_pattern test case also triggers this bug, and it is not fixed there.
