Boot-from-volume instance fails because the volume is deleted during reschedule

Bug #1427179 reported by YaoZheng_ZTE
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Expired
Importance: Low
Assigned to: Unassigned

Bug Description

1. Create a volume with "nova volume-create --display-name test_volume 1":
[root@controller51 nova(keystone_admin)]# nova volume-list
+--------------------------------------+-----------+-------------------------+------+-------------+---------------------------------------------------------------------------+
| ID | Status | Display Name | Size | Volume Type | Attached to |
+--------------------------------------+-----------+-------------------------+------+-------------+---------------------------------------------------------------------------+
| a740ca7b-6881-4e28-9fdb-eb0d80336757 | available | test_volume | 1 | None | |
| 1f1c19c7-a5f9-4683-a1f6-e339f02e1410 | in-use | NFVO_system_disk2 | 30 | None | 6fa391f8-bd8b-483d-9286-3cebc9a93d55 |
| d868710e-30d4-4095-bd8f-fea9f16fe8ea | in-use | NFVO_data_software_disk | 30 | None | a07abdd5-07a6-4b41-a285-9b825f7b5623;6fa391f8-bd8b-483d-9286-3cebc9a93d55 |
| b03a39ca-ebc1-4472-9a04-58014e67b37c | in-use | NFVO_system_disk1 | 30 | None | a07abdd5-07a6-4b41-a285-9b825f7b5623 |
+--------------------------------------+-----------+-------------------------+------+-------------+---------------------------------------------------------------------------+
2. Use the following command to boot a new instance and attach the volume at the same time:
[root@controller51 nova(keystone_admin)]# nova boot --flavor 1 --image 1736471c-3530-49f2-ad34-6ef7da285050 --block-device-mapping vdb=a740ca7b-6881-4e28-9fdb-eb0d80336757:blank:1:1 --nic net-id=31fce69e-16b9-4114-9fa9-589763e58fb0 test
+--------------------------------------+-----------------------------------------------------------------------------------+
| Property | Value |
+--------------------------------------+-----------------------------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | - |
| OS-EXT-SRV-ATTR:hypervisor_hostname | - |
| OS-EXT-SRV-ATTR:instance_name | instance-00000082 |
| OS-EXT-STS:power_state | 0 |
| OS-EXT-STS:task_state | scheduling |
| OS-EXT-STS:vm_state | building |
| OS-SRV-USG:launched_at | - |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| adminPass | sWTuKqzrpS32 |
| config_drive | |
| created | 2015-03-02T11:34:29Z |
| flavor | m1.tiny (1) |
| hostId | |
| id | 868cfd12-eb36-4140-b7b3-98cfcec627cd |
| image | VMB_X86_64_LX_2.6.32_64_REL_2014_12_26.img (1736471c-3530-49f2-ad34-6ef7da285050) |
| key_name | - |
| metadata | {} |
| name | test |
| os-extended-volumes:volumes_attached | [{"id": "547aae0e-455e-4d18-9c3c-e86bdc6c62e7"}] |
| progress | 0 |
| security_groups | default |
| serial_type | file |
| status | BUILD |
| tenant_id | df86efb4c5264f3c9bbe3df6717f8654 |
| updated | 2015-03-02T11:34:30Z |
| user_id | 7d376e69fc5d4697a1edb2600815de3f |
+--------------------------------------+-----------------------------------------------------------------------------------+
3. The instance is scheduled to host1, but host1's network service is inactive, so Nova reschedules the build to another host.
    Before the reschedule, because the boot command set delete-on-termination to 1 (the trailing ":1" in the --block-device-mapping argument), the cleanup path deletes the volume.
    The problem is that after the reschedule to the other host the volume is already deleted, so the instance cannot build successfully.
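
The reported sequence can be boiled down to a minimal sketch (the function and argument names below are illustrative only, not the actual Nova code):

def cleanup_before_reschedule(volume_api, context, bdms):
    # Cleanup as described in the report: every block device mapping
    # marked delete_on_termination is deleted, even though the build is
    # only being rescheduled, not terminated.
    for bdm in bdms:
        if bdm.volume_id and bdm.delete_on_termination:
            volume_api.delete(context, bdm.volume_id)
            # The rescheduled build on the next host can no longer attach
            # this volume, so the instance cannot build successfully.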

Tags: volumes
Revision history for this message
YaoZheng_ZTE (zheng-yao1) wrote :

The core of this bug is the delete-on-termination parameter being 1: when the first scheduled host fails to allocate resources and the instance is rescheduled to another host, the volume should not be deleted.

Revision history for this message
YaoZheng_ZTE (zheng-yao1) wrote :

The parameter delete-on-termination is documented as: "A boolean to indicate whether the volume should be deleted when the instance is terminated. True can be specified as True or 1. False can be specified as False or 0." So the volume should only be deleted when the instance is terminated. Currently, however, when instance creation fails on the first host and the instance is rescheduled to another host, the volume is still needed, so it should not be deleted.
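
As a side note, the accepted spellings of the flag can be checked with the oslo string helper (a quick illustration only; oslo_utils is assumed to be available, and this is not the Nova BDM parsing code itself):

from oslo_utils import strutils

for raw in ("True", "1", "False", "0"):
    print(raw, "->", strutils.bool_from_string(raw))
# Prints: True -> True, 1 -> True, False -> False, 0 -> False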

Revision history for this message
YaoZheng_ZTE (zheng-yao1) wrote :

I am using the Icehouse 2014.1.3 release, but I reviewed the code in the K (Kilo) version and the issue is present there as well.

Revision history for this message
jichenjc (jichenjc) wrote :

I am wondering whether the K release already fixes this, because the code that deletes the volume is:

def _cleanup_volumes(self, context, instance_uuid, bdms, raise_exc=True):
    exc_info = None

    for bdm in bdms:
        LOG.debug("terminating bdm %s", bdm,
                  instance_uuid=instance_uuid)
        if bdm.volume_id and bdm.delete_on_termination:
            try:
                self.volume_api.delete(context, bdm.volume_id)

This function is called either when the instance is deleted or when building the instance fails; it is used to clean up the environment.
I think the function _reschedule_or_error is not used any more, so do you have a stack trace for Kilo,
or can you point to where you think the volume is deleted? Thanks

Changed in nova:
status: New → Incomplete
status: Incomplete → New
Revision history for this message
YaoZheng_ZTE (zheng-yao1) wrote :

Hi jichenjc:
 I reviewed the latest version, kilo-2. The call relationship is:
 _build_instance() -> _reschedule_or_error() -> _cleanup_volumes()
Note: whether a reschedule is attempted depends on the configuration item scheduler_max_attempts (default value 3).

def _build_instance(self, context, request_spec, filter_properties,
        requested_networks, injected_files, admin_password, is_first_time,
        node, instance, image_meta, legacy_bdm_in_spec):
    original_context = context
    context = context.elevated()

    try:
        ...  # instance build steps elided in this quote
    except Exception:
        exc_info = sys.exc_info()
        # try to re-schedule instance:
        # Make sure the async call finishes
        if network_info is not None:
            network_info.wait(do_raise=False)
        rescheduled = self._reschedule_or_error(original_context, instance,
                exc_info, requested_networks, admin_password,
                injected_files_orig, is_first_time, request_spec,
                filter_properties, bdms, legacy_bdm_in_spec)

def _reschedule_or_error(self, context, instance, exc_info,
        requested_networks, admin_password, injected_files, is_first_time,
        request_spec, filter_properties, bdms=None,
        legacy_bdm_in_spec=True):
    """Try to re-schedule the build or re-raise the original build error to
    error out the instance.
    """
    original_context = context
    context = context.elevated()

    try:
        LOG.debug("Clean up resource before rescheduling.",
                  instance=instance)
        if bdms is None:
            bdms = objects.BlockDeviceMappingList.get_by_instance_uuid(
                    context, instance.uuid)

        self._shutdown_instance(context, instance,
                                bdms, requested_networks)
        self._cleanup_volumes(context, instance.uuid, bdms)
    except Exception:
        # do not attempt retry if clean up failed:
        with excutils.save_and_reraise_exception():
            self._log_original_error(exc_info, instance_uuid)
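
One possible shape of a change in this path (a sketch only, reusing the names from the snippet above; this is not the actual proposal from review 169097): run the volume cleanup only when the build is being given up on, not when a reschedule will follow.

def _reschedule_or_error(self, context, instance, exc_info,
        requested_networks, bdms, will_retry):
    # Sketch only: "will_retry" is a hypothetical flag; real code would
    # derive it from the retry information in filter_properties.
    context = context.elevated()
    try:
        self._shutdown_instance(context, instance, bdms, requested_networks)
        if not will_retry:
            # Honour delete_on_termination only when the build is being
            # abandoned; keep the volumes for the rescheduled attempt.
            self._cleanup_volumes(context, instance.uuid, bdms)
    except Exception:
        with excutils.save_and_reraise_exception():
            self._log_original_error(exc_info, instance.uuid)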

Revision history for this message
Sean Dague (sdague) wrote :

Looks like a race on the delete of the volume getting queued but not executed until the guest is started.

tags: added: volumes
Changed in nova:
status: New → Confirmed
importance: Undecided → Low
Revision history for this message
jichenjc (jichenjc) wrote :

OK, I got it. Thanks YaoZheng_ZTE for your explanation; bdm.delete_on_termination can't simply be honoured here.

Sean, I think it's not a race: we can't delete the volume if the instance is going to be rescheduled, otherwise the next reschedule attempt will not be able to use the volume.

jichenjc (jichenjc)
Changed in nova:
assignee: nobody → jichenjc (jichenjc)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/169097

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
YaoZheng_ZTE (zheng-yao1) wrote :

Hi Sean:

    It's not a race: whenever a reschedule happens and bdm.delete_on_termination is True, this problem is certain to occur.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by jichenjc (<email address hidden>) on branch: master
Review: https://review.openstack.org/169097
Reason: this whole function is gone, no need to update it now

Revision history for this message
jichenjc (jichenjc) wrote :

The whole function discussed above is gone now; should I submit a patch for Kilo instead?

Changed in nova:
assignee: jichenjc (jichenjc) → nobody
status: In Progress → Confirmed
Revision history for this message
Dan Smith (danms) wrote :

Can someone confirm if this is still a bug in master (rocky)? If not, we should close this.

Changed in nova:
status: Confirmed → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired