A mistake caused by temporary_mutation reentry

Bug #1941819 reported by fionali
This bug affects 1 person
Affects                    Status    Importance  Assigned to  Milestone
OpenStack Compute (nova)   Invalid   Undecided   Unassigned   -
  Stein                    New       Undecided   Unassigned   -

Bug Description

nova/virt/libvirt/driver.py:LibvirtDriver._live_migration() spawns a thread to execute _live_migration_operation (hereafter thread A), while the original thread executes _live_migration_monitor (hereafter thread B).

In _live_migration_operation, the assignment inst_type = instance.flavor calls nova/objects/instance.py:obj_load_attr.

_live_migration_monitor calls _live_migration_data_gb, where the assignment ram_gb = instance.flavor.memory_mb * units.Mi / units.Gi also calls nova/objects/instance.py:obj_load_attr.

obj_load_attr in turn calls utils.temporary_mutation. The mistake happens when temporary_mutation is entered by the two threads simultaneously:

Time 0: self._context['read_deleted'] is 'no'.
Time 1: Thread A calls temporary_mutation; self._context['read_deleted'] is set to 'yes'. The saved old value is 'no'.
Time 2: Thread B calls temporary_mutation; self._context['read_deleted'] is set to 'yes'. The saved old value is 'yes'.
Time 3: Thread A executes the finally block of temporary_mutation; self._context['read_deleted'] is restored to 'no'.
Time 4: Thread B executes the finally block of temporary_mutation; self._context['read_deleted'] is restored to 'yes'.

Result: the two overlapping calls to temporary_mutation leave self._context['read_deleted'] changed from 'no' to 'yes'. When the source host later calls update_available_resource(ctxt) in _post_live_migration, grabbing all instances assigned to this node also reads deleted instances, which is time-consuming.
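
For reference, the save/restore pattern at the heart of the problem looks roughly like this (a simplified sketch of nova.utils.temporary_mutation; the real helper also handles dict-like objects and attribute deletion):

    import contextlib

    @contextlib.contextmanager
    def temporary_mutation(obj, **kwargs):
        # Save the current values, then apply the temporary ones.
        old_values = {attr: getattr(obj, attr) for attr in kwargs}
        for attr, value in kwargs.items():
            setattr(obj, attr, value)
        try:
            yield
        finally:
            # Restore whatever was saved on entry.  If a second,
            # overlapping call saved the already-mutated value ('yes'),
            # that mutated value is what gets "restored", so the
            # temporary state leaks.
            for attr, old_value in old_values.items():
                setattr(obj, attr, old_value)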

fionali (fionali)
affects: fuel-plugin-contrail → nova
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Do you have a way to reliably reproduce the issue? That would help a lot to fix the issue.

Just to avoid confusion: in general nova uses eventlet for parallelism (there are some exceptions). Eventlet / greenlet implements cooperative parallelism, so there is no preemption between threads; eventlet only switches threads at certain points in the execution, mostly around external I/O operations. As obj_load_attr potentially calls RPC back to the controller node to read data from the database, and that RPC is I/O, a thread switch can happen in the place you described. So I do think what you described is a potential bug, but we would need a reproduction to actually fix it. Such a reproduction would probably also be useful for finding other, similar parallelism issues.
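
For what it's worth, the interleaving can be shown with a minimal self-contained sketch (a plain dict stands in for the nova RequestContext, and eventlet.sleep(0) stands in for the RPC/database I/O that yields to the hub; the final value assumes eventlet's FIFO hub scheduling):

    import eventlet

    context = {'read_deleted': 'no'}

    def lazy_load():
        old = context['read_deleted']      # save the current value
        context['read_deleted'] = 'yes'    # temporary mutation
        try:
            eventlet.sleep(0)              # stand-in for the RPC/DB I/O
        finally:
            context['read_deleted'] = old  # restore the saved value

    pool = eventlet.GreenPool()
    pool.spawn(lazy_load)  # "thread A"
    pool.spawn(lazy_load)  # "thread B"
    pool.waitall()

    # B saved A's mutated value, so 'yes' leaks out of both calls.
    print(context['read_deleted'])  # prints: yes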

Revision history for this message
fionali (fionali) wrote (last edit):

I encountered this bug while running the nova host-evacuate-live command on a host with about 20-30 VMs. The OpenStack version is Stein.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

First I tried to reproduce the issue on master, but it turned out that instance.flavor is no longer lazy-loaded in nova-compute at all: [1] added an instance.flavor call as part of the compute API live_migrate [2] call, in the block_accelerator decorator [3]. That call runs in the nova-api service, long before the live migration hits the compute service.

Right now on master there is no double lazy load on the instance during the live migration, so the problem cannot be reproduced there any more; it has been fixed by accident since Ussuri. The nova branches older than Ussuri are in extended maintenance mode [4]. You can still propose fixes, but there won't be any point release from Stein any more.

I think a fix you can try is to simply trigger the instance.flavor lazy load before _live_migration() spawns the new thread.
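
Something like this (a sketch only; the exact placement inside LibvirtDriver._live_migration() is hypothetical):

    # Hypothetical sketch: at the top of LibvirtDriver._live_migration(),
    # before the _live_migration_operation thread is spawned.  Accessing
    # the attribute is enough: obj_load_attr (and with it
    # temporary_mutation) runs exactly once, in this thread.
    instance.flavor

With the attribute populated up front, both code paths later read the cached value and never enter temporary_mutation concurrently during the migration.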

[1] https://review.opendev.org/c/openstack/nova/+/674726
[2] https://github.com/openstack/nova/blob/edaaa97d9911b849f3b5731746274b44a08ce14c/nova/compute/api.py#L5241
[3] https://github.com/openstack/nova/blob/edaaa97d9911b849f3b5731746274b44a08ce14c/nova/compute/api.py#L328
[4] https://docs.openstack.org/project-team-guide/stable-branches.html

Changed in nova:
status: New → Invalid
Revision history for this message
fionali (fionali) wrote (last edit):

Steps to reproduce
==================
1. Deploy the stable/stein release of OpenStack, with at least two compute nodes.
2. Add logging to nova-compute on src-node by editing nova/objects/instance.py:
    def obj_load_attr(self, attrname):
        # NOTE(danms): We can't lazy-load anything without a context and a uuid
        if not self._context:
            raise exception.OrphanedObjectError(method='obj_load_attr',
                                                objtype=self.obj_name())
        if 'uuid' not in self:
            raise exception.ObjectActionError(
                action='obj_load_attr',
                reason=_('attribute %s not lazy-loadable') % attrname)

 --     LOG.debug("Lazy-loading '%(attr)s' on %(name)s uuid %(uuid)s",
 --               {'attr': attrname,
 --                'name': self.obj_name(),
 --                'uuid': self.uuid,
 --                })
 ++     LOG.debug("Lazy-loading '%(attr)s' on %(name)s uuid %(uuid)s "
 ++               "context %(context)s",
 ++               {'attr': attrname,
 ++                'name': self.obj_name(),
 ++                'uuid': self.uuid,
 ++                'context': self._context,
 ++                })

        with utils.temporary_mutation(self._context, read_deleted='yes'):
            self._obj_load_attr(attrname)
 ++     LOG.debug("Exit obj_load_attr context %(context)s",
 ++               {'context': self._context})
3. Enable debug output on src-node by editing nova.conf:
       debug=true
4. Restart the nova-compute service on src-node.
5. Launch 25 instances on src-node.
6. Run: nova host-evacuate-live --target-host dest-node src-node

Logs
====
2021-09-09 17:22:15.170 17078 INFO nova.compute.manager [req-6d7e1979-2d0c-4df0-809c-ba9af187532b - - - - -] [instance: 37633020-5f24-4098-9468-f356015dae39] Took 14.54 seconds for pre_live_migration on destination host computer2.
2021-09-09 17:22:16.264 17078 DEBUG nova.compute.manager [req-6d7e1979-2d0c-4df0-809c-ba9af187532b - - - - -] live_migration data is LibvirtLiveMigrateData(bdms=[LibvirtLiveMigrateBDMInfo],block_migration=False,disk_available_mb=453632,disk_over_commit=<?>,dst_numa_info=<?>,dst_supports_numa_live_migration=<?>,dst_wants_file_backed_memory=False,file_backed_memory_discard=False,filename='tmpD3AV46',graphics_listen_addr_spice=127.0.0.1,graphics_listen_addr_vnc=0.0.0.0,image_type='default',instance_relative_path='37633020-5f24-4098-9468-f356015dae39',is_shared_block_storage=True,is_shared_instance_path=False,is_volume_backed=True,migration=Migration(5fbbf27a-f41d-4b1e-b90b-0137f296143f),old_vol_attachment_ids={},serial_listen_addr=None,serial_listen_ports=[],src_supports_native_luks=True,src_supports_numa_live_migration=<?>,supported_perf_events=[],target_connect_addr=None,vifs=[VIFMigrateData],wait_for_vif_plugged=True) _do_live_migration /usr/lib/python2.7/site-packages/nova/compute/manager.py:7161
2021-09-09 17:22:16.269 17078 DEBUG nova.objects.instance [req-6d7e1979-2d0c-4df0-809c-ba9af187532b - - - - -] Lazy-loading 'flavor' on Instance uuid 37633020-5f24-4098-9468-f356015dae39 context <Context {'domain': None, 'project_name': u'admin', 'global_request_id': u'req-b136c56a-c216-4ffc-99fa-e24956ddb7f6', 'project_domain': u'default', 'timestamp': '2021-09-09T09:21:10.608536', 'auth_token': u'gAAAAABhOdIDE-y03KqlKdyroMg7LJHVjH7Ot-B3PqLTz5fg...
