Volume creation fails randomly - Failed to create iscsi target for volume id:volume-<UUID>. Please ensure your tgtd config file contains 'include /var/lib/cinder/volumes/*'

Bug #1191429 reported by Hrushikesh
This bug affects 1 person
Affects: Cinder | Status: Invalid | Importance: Undecided | Assigned to: Unassigned | Milestone: (none)

Bug Description

ACTUAL BEHAVIOR: With 300 volumes already in play, provisioning the next set of 50 instances + volumes fails on volume creation. While creating a batch of 50 volumes in a row, one volume's creation fails, but the batch continues creating the rest. The error message is puzzling, as it reports a missing configuration entry. Notice in this log that volumes were being created successfully both before and after the error, so the failure looks timing-related:

2013-06-14 15:04:58 INFO [cinder.volume.manager] volume volume-25ce444b-ff09-48f5-9baf-a527a02ebf6d: created successfully
2013-06-14 15:04:58 INFO [cinder.volume.manager] Clear capabilities
2013-06-14 15:04:58 INFO [cinder.volume.manager] volume volume-573adb49-a088-4121-b584-e3041c7f3f04: creating
2013-06-14 15:04:58 INFO [cinder.volume.iscsi] Creating iscsi_target for: volume-24966369-95f7-4750-9389-886de3c8ced1
2013-06-14 15:04:58 ERROR [cinder.volume.iscsi] Failed to create iscsi target for volume id:volume-078cd44b-7b39-4867-a1e9-78bb758ae0a7. Please ensure your tgtd config file contains 'include /var/lib/cinder/volumes/*'
2013-06-14 15:04:58 ERROR [cinder.volume.manager] volume volume-078cd44b-7b39-4867-a1e9-78bb758ae0a7: create failed
2013-06-14 15:04:58 ERROR [cinder.openstack.common.rpc.amqp] Exception during message handling
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cinder/openstack/common/rpc/amqp.py", line 430, in _process_data
    rval = self.proxy.dispatch(ctxt, version, method, **args)
  File "/usr/lib/python2.7/dist-packages/cinder/openstack/common/rpc/dispatcher.py", line 133, in dispatch
    return getattr(proxyobj, method)(ctxt, **kwargs)
  File "/usr/lib/python2.7/dist-packages/cinder/volume/manager.py", line 282, in create_volume
    LOG.error(_("volume %s: create failed"), volume_ref['name'])
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/lib/python2.7/dist-packages/cinder/volume/manager.py", line 274, in create_volume
    model_update = self.driver.create_export(context, volume_ref)
  File "/usr/lib/python2.7/dist-packages/cinder/volume/drivers/lvm.py", line 484, in create_export
    chap_auth)
  File "/usr/lib/python2.7/dist-packages/cinder/volume/iscsi.py", line 176, in create_iscsi_target
    raise exception.NotFound()
NotFound: Resource could not be found.
2013-06-14 15:04:59 INFO [cinder.volume.iscsi] Creating iscsi_target for: volume-3ce24c2b-8d2d-4eac-81ec-68a713a593d3
2013-06-14 15:04:59 INFO [cinder.volume.manager] volume volume-1a35e9be-3db6-4ef8-9402-617950716b06: creating
2013-06-14 15:04:59 INFO [cinder.volume.iscsi] Creating iscsi_target for: volume-0e435664-0ed8-4a37-8852-79a90c82ceab
2013-06-14 15:04:59 INFO [cinder.volume.manager] volume volume-37f93f31-4005-4977-b0e5-e7655a56df1c: creating
2013-06-14 15:04:59 INFO [cinder.volume.manager] volume volume-a222d818-bf75-4530-b565-61cad943ea3a: created successfully
2013-06-14 15:04:59 INFO [cinder.volume.manager] Clear capabilities
2013-06-14 15:04:59 INFO [cinder.volume.manager] volume volume-f171c6c0-f97a-4343-b098-6eb31def5b01: creating
2013-06-14 15:04:59 INFO [cinder.volume.manager] volume volume-357603e5-2c4c-4acb-bf43-6a9df2387f6b: created successfully

The error still surfaces even after specifying the absolute path in /etc/tgt/targets.conf: change its content from "include /etc/tgt/conf.d/*.conf" to "include /etc/tgt/conf.d/cinder_tgt.conf", then restart the tgt and cinder-* services so they pick up the new configuration.
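For reference, a minimal sketch of that workaround (the service names below are assumptions for an Ubuntu-based Grizzly install, not taken from this report):

# /etc/tgt/targets.conf
# before: include /etc/tgt/conf.d/*.conf
# after (explicit include for the cinder-generated config):
include /etc/tgt/conf.d/cinder_tgt.conf

$ sudo service tgt restart            # assumed service name
$ sudo service cinder-volume restart  # assumed service name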

EXPECTED BEHAVIOR: Volume creation should either fail for all volumes or succeed for all, since the relevant configuration file exists and is set correctly.

HOW-TO-REPRODUCE:
Set up an OpenStack Grizzly environment following https://github.com/mseknibilel/OpenStack-Grizzly-Install-Guide/blob/OVS_MultiNode/OpenStack_Grizzly_Install_Guide.rst
Change the metadata size of the physical disk backing the cinder volume group to 1020 to accommodate 300+ cinder volumes.
Launch a provisioning job that creates 50 instances and 50 volumes (1 GB each) and attaches them, all in one batch (a rough sketch follows).
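
For concreteness, a Python sketch of the volume-creation half of such a batch (endpoint, credentials, and names are placeholders; the real job also boots 50 instances and attaches the volumes via nova):

from cinderclient import client
import threading

# assumed credentials/endpoint -- substitute your own
cc = client.Client('1', 'demo', 'password', 'demo',
                   'http://controller:5000/v2.0')

def create_one(i):
    # one 1 GB volume per request, matching the report
    cc.volumes.create(1, display_name='batch-vol-%d' % i)

threads = [threading.Thread(target=create_one, args=(i,)) for i in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()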

ENVIRONMENT: (Hardware, OS, OS version, browser, etc.)
Cinder is deployed in the simplest mode: one cinder volume group, with no scheduler and no volume-type feature in use.

Revision history for this message
Hrushikesh (hrushikesh-gangur) wrote :
description: updated
Revision history for this message
John Griffith (john-griffith) wrote :

I'm not sure why you would expect "if one fails, they all should fail"?

Revision history for this message
Hrushikesh (hrushikesh-gangur) wrote :

Because the error message indicates that something is missing from the configuration file:

"Please ensure your tgtd config file contains 'include /var/lib/cinder/volumes/*'"

Had something actually been missing, the volumes that were created successfully before and after this error should also have failed. Hence the expectation: if something is missing from the configuration file, all of the volume creations must fail.

Regardless, I am sure there is a timing issue that caused one volume's creation to fail with this error.

Revision history for this message
John Griffith (john-griffith) wrote :

I see... that error message exists because of what was once a common failure case; it is only a suggestion and is meant as such. I'd rather have a recommendation like that than nothing at all.

You're correct: your case is most likely a timing-related issue and probably has nothing to do with the error message provided. I suppose we could be more explicit and change the message to "Possible causes include: missing %s entry in tgtd config file."

Revision history for this message
Hrushikesh (hrushikesh-gangur) wrote :

Apart from changing the error message, I would be interested in understanding the cause of this issue and a possible fix. Could it be fixed by throttling the requests coming into Cinder?

Revision history for this message
Vincent Hou (houshengbo) wrote :

Hrushikesh, are you expecting a set of requests to behave as an atomic transaction or something similar? If so, I think that currently it should be done by the application built on top of Cinder.

Revision history for this message
Hrushikesh (hrushikesh-gangur) wrote :

Firstly, I would like to establish whether the issue reported here is caused by overloading Cinder with requests. My test results indicate that the issue is influenced by two factors:
1. Number of concurrent requests reaching Cinder: in my test, 50 volume-creation requests are submitted per batch. For the first 6 batches I saw no errors, but subsequent batches started falling apart, sometimes working and sometimes not. This seems influenced by factor #2.

2. Number of active volumes: the cinder list API response time degrades badly as the number of volumes already in play grows. The responses from the Cinder APIs may be timing out and leading to this issue.

Secondly, to your response, there are two aspects:
1. Throttling at the application built over Cinder: what should the limit be, and how many requests per second or minute? (See the sketch below.)
2. Tunable parameters in Cinder that can mitigate overload, e.g. AMQP pool size.
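
On aspect 1, a minimal sketch of client-side throttling (the limit of 10 concurrent requests is an arbitrary placeholder; finding the right value is exactly the open question above):

import threading

MAX_IN_FLIGHT = 10  # placeholder; would need to be tuned empirically
_slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def throttled_create(cc, size_gb, name):
    # blocks while MAX_IN_FLIGHT create calls are already outstanding
    with _slots:
        return cc.volumes.create(size_gb, display_name=name)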

Revision history for this message
John Griffith (john-griffith) wrote :

>>> from cinderclient import exceptions
>>> from cinderclient import client
>>> cc = client.Client('1', 'demo', 'secrete', 'demo', 'http://192.168.135.17:5000/v2.0')
>>> ref_list = []
>>> for i in xrange(0,200):
... ref_list.append(cc.volumes.create(1))
...

After that I ran a cinder list and all the volumes were available. I realize this is not up to the scale of your scenario; I'll bump it up by adding another VG tomorrow and see what happens when I get closer to where you're at.
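
To double-check a run like that, one could poll each volume to a terminal state rather than relying on a single cinder list; a small sketch, assuming the standard status values:

import time

def wait_for(cc, vol, timeout=300, poll=2):
    # poll until the volume leaves 'creating'; 'error' is the failure
    # mode described in this report
    deadline = time.time() + timeout
    while time.time() < deadline:
        v = cc.volumes.get(vol.id)
        if v.status in ('available', 'error'):
            return v.status
        time.sleep(poll)
    return 'timeout'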

This is a simple devstack single-node setup with a 270 GB volume group. Your notes state your setup follows https://github.com/mseknibilel/OpenStack-Grizzly-Install-Guide/blob/OVS_MultiNode/OpenStack_Grizzly_Install_Guide.rst; however, I'm assuming that's with the exception of using tgt for iSCSI, a real VG (not a loopback), and of course modified quotas.

I'll continue looking at this tomorrow.

One thing, in order to isolate RabbitMQ versus tgt/iSCSI: we could try a fake driver that does the create/DB and RabbitMQ calls without actually performing any LVM or iSCSI operations. That's something I started this evening, and I didn't run into any issues (I created 1K volumes using the same method above); I am running a single-node config, however, so that may alleviate some traffic issues.
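
A sketch of that isolation step: point cinder-volume at a fake driver in cinder.conf, so the create/DB/RPC path is exercised without LVM or tgt. (The class path below is believed to exist in the Grizzly tree, but verify it against your checkout.)

# cinder.conf
[DEFAULT]
volume_driver = cinder.volume.driver.FakeISCSIDriver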

Revision history for this message
Hrushikesh (hrushikesh-gangur) wrote :

It is a PostgreSQL-based install with tgt and a loopback disk. The loopback disk is backed by 3 TB of underlying SAN storage.

Changed in cinder:
status: New → Invalid
Revision history for this message
Hrushikesh (hrushikesh-gangur) wrote :

Can I get some explanation of why the status of this defect is Invalid? I will be trying some tests with the H3 release and will update my findings in 2 weeks.

Revision history for this message
John Griffith (john-griffith) wrote :

It was marked Invalid at the time due to the inability to reproduce it.

Regardless, I believe I found the issue: it's a duplicate of Bug #1223469, which has merged. It won't be in H2, but it's in trunk right now and will be in RC1.

Thanks,
John
