Volume hangs in "creating" status even though scheduler raises "No valid host" exception

Bug #1053931 reported by Dmitry Galkin
This bug affects 2 people
Affects                   Status        Importance  Assigned to     Milestone
Cinder                    Fix Released  Undecided   Unassigned
OpenStack Compute (nova)  Fix Released  Medium      Russell Bryant

Bug Description

When the volume creation process fails during scheduling (i.e. there is no appropriate host), the status in the DB (and, as a result, in the nova volume-list output) stays stuck at "creating".

In that case, the only way to figure out that volume creation failed is to go and read /var/log/nova/nova-scheduler.log (which is not an obvious thing to do). Moreover, a volume stuck in the "creating" status cannot be deleted with nova volume-delete; to delete it, one has to change its status to "error" directly in the DB (a sketch of that workaround follows below).
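
For anyone else hitting this, here is a minimal sketch of that DB workaround in Python, assuming a MySQL nova database and the standard volumes table; the connection URL, credentials and volume id are placeholders, not values taken from this report:

    # Workaround sketch only: flip a stuck volume from 'creating' to 'error'
    # so that nova volume-delete will accept it. Adjust the connection URL
    # and the volume id to your own deployment.
    from sqlalchemy import create_engine, text

    engine = create_engine("mysql+pymysql://nova:NOVA_DBPASS@localhost/nova")
    with engine.begin() as conn:  # begin() opens a transaction and commits on success
        conn.execute(
            text("UPDATE volumes SET status = 'error' "
                 "WHERE id = :vid AND status = 'creating'"),
            {"vid": 15},
        )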

The simple scheduler is being used (nova.conf):

--scheduler_driver=nova.scheduler.simple.SimpleScheduler

Here is sample output from the DB:

*************************** 3. row ***************************
         created_at: 2012-09-21 09:55:42
         updated_at: NULL
         deleted_at: NULL
            deleted: 0
                 id: 15
             ec2_id: NULL
            user_id: b0aadfc80b094d94b78d68dcdc7e8757
         project_id: 3b892f660ea2458aa9aa9c9a21352632
               host: NULL
               size: 1
  availability_zone: nova
        instance_id: NULL
         mountpoint: NULL
        attach_time: NULL
             status: creating
      attach_status: detached
       scheduled_at: NULL
        launched_at: NULL
      terminated_at: NULL
       display_name: NULL
display_description: NULL
  provider_location: NULL
      provider_auth: NULL
        snapshot_id: NULL
     volume_type_id: NULL

Here is the part of interest in nova-scheduler.log:

pic': u'volume', u'filter_properties': {u'scheduler_hints': {}}, u'snapshot_id': None, u'volume_id': 16}, u'_context_auth_token': '<SANITIZED>', u'_context_is_admin': True, u'_context_project_id': u'3b892f660ea2458aa9aa9c9a21352632', u'_context_timestamp': u'2012-09-21T10:15:47.091307', u'_context_user_id': u'b0aadfc80b094d94b78d68dcdc7e8757', u'method': u'create_volume', u'_context_remote_address': u'172.18.67.146'} from (pid=11609) _safe_log /usr/lib/python2.7/dist-packages/nova/rpc/common.py:160
2012-09-21 10:15:47 DEBUG nova.rpc.amqp [req-01f7dd30-0421-4ef3-a675-16b0cf1362eb b0aadfc80b094d94b78d68dcdc7e8757 3b892f660ea2458aa9aa9c9a21352632] unpacked context: {'user_id': u'b0aadfc80b094d94b78d68dcdc7e8757', 'roles': [u'admin'], 'timestamp': '2012-09-21T10:15:47.091307', 'auth_token': '<SANITIZED>', 'remote_address': u'172.18.67.146', 'is_admin': True, 'request_id': u'req-01f7dd30-0421-4ef3-a675-16b0cf1362eb', 'project_id': u'3b892f660ea2458aa9aa9c9a21352632', 'read_deleted': u'no'} from (pid=11609) _safe_log /usr/lib/python2.7/dist-packages/nova/rpc/common.py:160
2012-09-21 10:15:47 WARNING nova.scheduler.manager [req-01f7dd30-0421-4ef3-a675-16b0cf1362eb b0aadfc80b094d94b78d68dcdc7e8757 3b892f660ea2458aa9aa9c9a21352632] Failed to schedule_create_volume: No valid host was found. Is the appropriate service running?
2012-09-21 10:15:47 ERROR nova.rpc.amqp [req-01f7dd30-0421-4ef3-a675-16b0cf1362eb b0aadfc80b094d94b78d68dcdc7e8757 3b892f660ea2458aa9aa9c9a21352632] Exception during message handling
2012-09-21 10:15:47 TRACE nova.rpc.amqp Traceback (most recent call last):
2012-09-21 10:15:47 TRACE nova.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/rpc/amqp.py", line 253, in _process_data
2012-09-21 10:15:47 TRACE nova.rpc.amqp rval = node_func(context=ctxt, **node_args)
2012-09-21 10:15:47 TRACE nova.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/scheduler/manager.py", line 97, in _schedule
2012-09-21 10:15:47 TRACE nova.rpc.amqp context, ex, *args, **kwargs)
2012-09-21 10:15:47 TRACE nova.rpc.amqp File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
2012-09-21 10:15:47 TRACE nova.rpc.amqp self.gen.next()
2012-09-21 10:15:47 TRACE nova.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/scheduler/manager.py", line 92, in _schedule
2012-09-21 10:15:47 TRACE nova.rpc.amqp return driver_method(*args, **kwargs)
2012-09-21 10:15:47 TRACE nova.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/scheduler/simple.py", line 227, in schedule_create_volume
2012-09-21 10:15:47 TRACE nova.rpc.amqp raise exception.NoValidHost(reason=msg)
2012-09-21 10:15:47 TRACE nova.rpc.amqp NoValidHost: No valid host was found. Is the appropriate service running?
2012-09-21 10:15:47 TRACE nova.rpc.amqp

Revision history for this message
Russell Bryant (russellb) wrote :

What version of nova?

Changed in nova:
status: New → Incomplete
Revision history for this message
Russell Bryant (russellb) wrote :

I read through current nova master and this still appears to be a problem. It looks like it applies to Cinder as well.

Changed in nova:
status: Incomplete → Confirmed
importance: Undecided → Medium
Revision history for this message
Russell Bryant (russellb) wrote :

Actually, this does not appear to affect Cinder as of this commit:

commit f758bde47439be52a743b2b4181d4900f2c1bc8a
Author: ZhuRongze <email address hidden>
Date: Sat Jul 28 08:14:03 2012 +0000

    Scheduler-clean-up

    Some instance related functions and opts were still present in Cinder
    Scheduler.

    Remove the functions and opts and modify some computer related comments.

    Change-Id: Ia01099e66bcbb129493870dfbf212d5e6669ebe1

Changed in cinder:
status: New → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/13479

Changed in nova:
assignee: nobody → Russell Bryant (russellb)
status: Confirmed → In Progress
tags: added: folsom-rc-potential
Changed in nova:
milestone: none → folsom-rc2
tags: removed: folsom-rc-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (milestone-proposed)

Fix proposed to branch: milestone-proposed
Review: https://review.openstack.org/13496

Changed in cinder:
milestone: none → folsom-rc2
milestone: folsom-rc2 → none
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/13479
Committed: http://github.com/openstack/nova/commit/75fa03557fd6f1e7c62079e9e89556f1af139202
Submitter: Jenkins
Branch: master

commit 75fa03557fd6f1e7c62079e9e89556f1af139202
Author: Russell Bryant <email address hidden>
Date: Fri Sep 21 13:51:15 2012 -0400

    Set volume status to error if scheduling fails.

    Fix bug 1053931.

    When scheduling volume creation fails, the volume was left with a status
    of 'creating'. This patch changes the scheduler manager to set the
    status to 'error' if scheduling fails. This matches the behavior of the
    cinder scheduler manager in this case.

    This particular issue was addressed in Cinder as a part of commit
    f758bde47439be52a743b2b4181d4900f2c1bc8a.

    Change-Id: Ieb453ab05b3b84de53f72323c536a9567555df1e
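
The shape of the fix described above is small: catch the scheduling failure in the scheduler manager and flip the volume's status to 'error' instead of leaving the row in 'creating'. A simplified, self-contained sketch of that idea (this is not the actual nova/scheduler/manager.py code; the driver object and the db_volume_update callable are illustrative stand-ins):

    # Sketch of the approach, not the real patch: on a scheduling failure,
    # record the failure in the DB before propagating the exception.

    class NoValidHost(Exception):
        pass

    def schedule_create_volume(context, volume_id, driver, db_volume_update):
        try:
            driver.schedule_create_volume(context, volume_id)
        except NoValidHost:
            # Previously the exception propagated and the volume stayed in
            # 'creating' forever; now the failed volume is marked 'error'.
            db_volume_update(context, volume_id, {'status': 'error'})
            raise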

Changed in nova:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (milestone-proposed)

Reviewed: https://review.openstack.org/13496
Committed: http://github.com/openstack/nova/commit/fb1f3502da1839d7bb700af16cc3a107da5d8624
Submitter: Jenkins
Branch: milestone-proposed

commit fb1f3502da1839d7bb700af16cc3a107da5d8624
Author: Russell Bryant <email address hidden>
Date: Fri Sep 21 13:51:15 2012 -0400

    Set volume status to error if scheduling fails.

    Fix bug 1053931.

    When scheduling volume creation fails, the volume was left with a status
    of 'creating'. This patch changes the scheduler manager to set the
    status to 'error' if scheduling fails. This matches the behavior of the
    cinder scheduler manager in this case.

    This particular issue was addressed in Cinder as a part of commit
    f758bde47439be52a743b2b4181d4900f2c1bc8a.

    Change-Id: Ieb453ab05b3b84de53f72323c536a9567555df1e
    (cherry picked from commit 75fa03557fd6f1e7c62079e9e89556f1af139202)

Changed in nova:
status: Fix Committed → Fix Released
Revision history for this message
Dmitry Galkin (galkindmitrii) wrote :

I faced this with stable/essex nova (no particular milestone; however, I was unable to find such a bug in milestones 2012.1.1 or 2012.1.2).

Thanks a lot for the fix!

Thierry Carrez (ttx)
Changed in nova:
milestone: folsom-rc2 → 2012.2
Revision history for this message
Dafna Ron (dron-3) wrote :

This bug is happening again in the Havana release.
We had a power outage, and I sent a create command while the storage was unavailable; some other volume-related commands were also running.

2013-10-07 18:41:06.136 8288 WARNING cinder.scheduler.host_manager [req-462656bc-4bb2-478a-8fa4-90ac89e1c39e c02995f25ba44cfab1a3cbd419f045a1 c77235c29fd0431a8e6628ef6d18e07f] volume service is down or disabled. (host: cougar06.scl.lab.tlv.redhat.com)
2013-10-07 18:41:06.152 8288 ERROR cinder.volume.flows.create_volume [req-462656bc-4bb2-478a-8fa4-90ac89e1c39e c02995f25ba44cfab1a3cbd419f045a1 c77235c29fd0431a8e6628ef6d18e07f] Failed to schedule_create_volume: No valid host was found.
2013-10-07 18:44:31.280 8288 WARNING cinder.scheduler.host_manager [req-65c2f4e1-71da-4340-b9f0-afdd05ccdaa9 c02995f25ba44cfab1a3cbd419f045a1 c77235c29fd0431a8e6628ef6d18e07f] volume service is down or disabled. (host: cougar06.scl.lab.tlv.redhat.com)
2013-10-07 18:44:31.281 8288 ERROR cinder.volume.flows.create_volume [req-65c2f4e1-71da-4340-b9f0-afdd05ccdaa9 c02995f25ba44cfab1a3cbd419f045a1 c77235c29fd0431a8e6628ef6d18e07f] Failed to schedule_create_volume: No valid host was found.
2013-10-07 18:44:50.730 8288 WARNING cinder.scheduler.host_manager [req-1c132eb5-ca74-4ab5-91dc-73c25b305165 c02995f25ba44cfab1a3cbd419f045a1 c77235c29fd0431a8e6628ef6d18e07f] volume service is down or disabled. (host: cougar06.scl.lab.tlv.redhat.com)
2013-10-07 18:44:50.731 8288 ERROR cinder.volume.flows.create_volume [req-1c132eb5-ca74-4ab5-91dc-73c25b305165 c02995f25ba44cfab1a3cbd419f045a1 c77235c29fd0431a8e6628ef6d18e07f] Failed to schedule_create_volume: No valid host was found.
2013-10-07 18:47:01.577 8288 WARNING cinder.scheduler.host_manager [req-538ad552-0e19-4307-bea8-10e0a35a8a36 c02995f25ba44cfab1a3cbd419f045a1 c77235c29fd0431a8e6628ef6d18e07f] volume service is down or disabled. (host: cougar06.scl.lab.tlv.redhat.com)
2013-10-07 18:47:01.578 8288 ERROR cinder.volume.flows.create_volume [req-538ad552-0e19-4307-bea8-10e0a35a8a36 c02995f25ba44cfab1a3cbd419f045a1 c77235c29fd0431a8e6628ef6d18e07f] Failed to schedule_create_volume: No valid host was found.
2013-10-07 18:47:18.421 8288 WARNING cinder.scheduler.host_manager [req-3a788eb8-56f5-45f6-b4fd-ade01a05cf9d c02995f25ba44cfab1a3cbd419f045a1 c77235c29fd0431a8e6628ef6d18e07f] volume service is down or disabled. (host: cougar06.scl.lab.tlv.redhat.com)
2013-10-07 18:47:18.422 8288 ERROR cinder.volume.flows.create_volume [req-3a788eb8-56f5-45f6-b4fd-ade01a05cf9d c02995f25ba44cfab1a3cbd419f045a1 c77235c29fd0431a8e6628ef6d18e07f] Failed to schedule_create_volume: No valid host was found.
2013-10-07 18:48:27.732 8288 WARNING cinder.scheduler.host_manager [req-1ef1b47a-27b8-4667-9823-2f91dcc0f29e c02995f25ba44cfab1a3cbd419f045a1 c77235c29fd0431a8e6628ef6d18e07f] volume service is down or disabled. (host: cougar06.scl.lab.tlv.redhat.com)
2013-10-07 18:48:27.733 8288 ERROR cinder.volume.flows.create_volume [req-1ef1b47a-27b8-4667-9823-2f91dcc0f29e c02995f25ba44cfab1a3cbd419f045a1 c77235c29fd0431a8e6628ef6d18e07f] Failed to schedule_create_volume: No valid host was found.
2013-10-07 18:48:51.125 8288 WARNING cinder.scheduler.host_manager [re...


Changed in cinder:
status: Invalid → New
Revision history for this message
John Griffith (john-griffith) wrote :

Dafna,
Could you confirm whether the volume was stuck in "creating" or whether it went to "error"? The bug was actually that the scheduler failed but never marked the volume as such, leaving it sitting in the "creating" state. Let me know if this is what you're still seeing; based on the code and some quick tests, I believe this is still fixed.

Revision history for this message
gabriel staicu (gabriel-staicu) wrote :

Hi,

I can confirm that this is happening again in Havana. I tried to spawn a machine from a volume snapshot with the "create new volume" option. The first interesting thing is that cinder-scheduler chose to create the corresponding volume on an agent that was down, although other cinder-volume agents were up. As a result, the status of the volume is stuck in "creating":
 cinder list
+--------------------------------------+----------------+------------------+------+-------------+----------+--------------------------------------+
| ID | Status | Display Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+----------------+------------------+------+-------------+----------+--------------------------------------+
| 0c7332a2-4ddd-413c-828b-3ba93b2f98da | in-use | bla_ubuntu | 10 | None | false | 55840e8b-7cc8-43ea-b8c0-bfcdf82c769d |
| 20e0debb-55b8-4bde-9395-2e4179559536 | available | debian73 | 10 | None | false | |
| 46b9846c-c8ad-4847-a6b9-e8bc086b41b0 | available | ubuntu1204 | 10 | None | false | |
| 62b58aff-ce6e-4b7f-883e-589103ecaa87 | in-use | bla_00 | 100 | None | false | 55840e8b-7cc8-43ea-b8c0-bfcdf82c769d |
| 9a68af42-e73f-4fb7-8f87-621b2b4375f4 | error_deleting | ssdf | 1 | None | false | |
| a7728dbf-3752-438a-9395-f14efad98ccf | available | ubuntu1204_unity | 10 | None | false | |
| e9247a44-cb65-4164-8fa2-9c2433949278 | creating | | 10 | None | false | |
+--------------------------------------+----------------+------------------+------+-------------+----------+--------------------------------------+

Revision history for this message
Huang Zhiteng (zhiteng-huang) wrote :

Gabriel, creating a volume from a snapshot doesn't go through cinder-scheduler at all. In your case, the back-end serving the snapshot is down, so the RPC message that the Cinder API service sent to Cinder Volume is left sitting in the message queue. That's why the volume status is stuck in 'creating'. It's a different issue.

Revision history for this message
Huang Zhiteng (zhiteng-huang) wrote :

Dafna,

I think this bug has been fixed.

Let me explain the workflow of creating a volume in a bit more detail: 1) the user sends a request to the Cinder API service; 2) the API creates a DB entry for the volume and marks its status 'creating' (https://github.com/openstack/cinder/blob/stable/havana/cinder/volume/flows/create_volume/__init__.py#L545), then sends an RPC message to the scheduler; 3) the scheduler picks up the message, makes a placement decision and, if a back-end is available, sends the request via RPC to the volume service; 4) the volume service picks up the message and performs the real job of creating the volume for the user. (A simplified sketch of this flow is shown below.)
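
To make it concrete where things can get stuck, here is a minimal, self-contained toy of that flow; every name below is an illustrative stand-in (it is not Cinder code), the dict stands in for the volumes table, and a direct function call stands in for the RPC cast:

    # Toy model of the flow above (API -> scheduler -> volume service),
    # showing where a volume can end up stuck in 'creating'.

    class NoValidHost(Exception):
        pass

    VOLUMES = {}             # stands in for the volumes DB table
    AVAILABLE_BACKENDS = []  # an empty list simulates "no valid host"

    def api_create_volume(volume_id, size):
        # Step 2: the API writes the row as 'creating' and casts to the scheduler.
        VOLUMES[volume_id] = {'size': size, 'status': 'creating'}
        scheduler_create_volume(volume_id)   # in reality an async RPC cast

    def scheduler_create_volume(volume_id):
        # Step 3: placement decision. With the fix, a failure here marks the
        # volume 'error'; without it, the row would stay 'creating' forever.
        if not AVAILABLE_BACKENDS:
            VOLUMES[volume_id]['status'] = 'error'
            raise NoValidHost("No valid host was found.")
        # Step 4 would cast to the chosen volume service, which does the real work.

    if __name__ == '__main__':
        try:
            api_create_volume('vol-1', size=10)
        except NoValidHost as exc:
            print(exc, '->', VOLUMES['vol-1']['status'])   # error, not creating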

There are multiple cases in which a volume's status can be stuck in 'creating':

a) something goes wrong while the RPC message is being processed by the scheduler, e.g. the scheduler service is down (related to this change and bug: https://review.openstack.org/#/c/64014/), the message is lost, or the scheduler service goes down while it is processing the message;

b) something goes wrong AFTER the back-end is chosen, i.e. the scheduler successfully sends the message to the target back-end, but the message isn't picked up by the target volume service, or there is an unhandled exception while the volume service is handling the request.

If this bug somehow happens again, can you describe the steps to reproduce it?

Changed in cinder:
status: New → Fix Released