Comment 2 for bug 1329546

Alex Xu (xuhj) wrote :

@Salvatore, I think there is a race between nova and the neutron L2 agent.

Rebuild destroys the instance first and then spawns it again, as the code below shows:

        if not recreate:
            self.driver.destroy(context, instance, network_info,
                                block_device_info=block_device_info)
        instance.task_state = task_states.REBUILD_BLOCK_DEVICE_MAPPING
        instance.save(expected_task_state=[task_states.REBUILDING])

        new_block_device_info = attach_block_devices(context, instance, bdms)

        instance.task_state = task_states.REBUILD_SPAWNING
        instance.save(
            expected_task_state=[task_states.REBUILD_BLOCK_DEVICE_MAPPING])

        self.driver.spawn(context, instance, image_meta, injected_files,
                          admin_password, network_info=network_info,
                          block_device_info=new_block_device_info)

So the race happens between destroy and spawn.

For example:
I am using the linuxbridge agent here, and the vif's active flag is currently false in the network info cache.

The normal flow for a rebuild is:
1. nova destroys the instance.
2. The tap device is removed from the bridge.
3. The agent polls the devices, finds the removed device, and updates the port status to down.
4. nova spawns the instance.
5. The tap device is added to the bridge.
6. The agent polls the devices, finds the new device, and sets the port status to up.
7. Neutron sends the network vif plugged event.
8. nova finishes the rebuild.

But if the polling interval of the neutron agent is too long, the problem shows up here.
At step 3, if the agent does not poll the devices between step 1 and step 4, it does not know that the device was removed.
After step 5 the device has been added back, so the agent never learns that it was removed, and the port status is never updated.
As a result nova never receives the network vif plugged event, and the instance is stuck in the rebuilding state.
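
To illustrate the missed update, here is a minimal Python sketch. It is a simplified model and not the linuxbridge agent's actual code: it assumes the agent only compares the set of devices seen at one poll with the set seen at the previous poll. The names `poll`, `run` and `tap0` are made up for the example.

```python
def poll(previous, current):
    """Diff two device snapshots, as a polling agent might (simplified)."""
    return {"added": current - previous, "removed": previous - current}


def run(events_between_polls):
    """Replay bridge changes; the agent only sees the state at each poll.

    events_between_polls is a list of lists. Each inner list holds the
    ("add" | "remove", device) changes that happen before the next poll.
    Returns the diff observed by the agent at each poll.
    """
    bridge = {"tap0"}    # the device is on the bridge before the rebuild
    known = set(bridge)  # what the agent saw at its last poll
    observed = []
    for events in events_between_polls:
        for action, dev in events:
            if action == "add":
                bridge.add(dev)
            else:
                bridge.discard(dev)
        observed.append(poll(known, bridge))
        known = set(bridge)
    return observed


# Short interval: one poll lands between destroy and spawn.
fast = run([[("remove", "tap0")], [("add", "tap0")]])
# Long interval: destroy and spawn both happen before the next poll.
slow = run([[("remove", "tap0"), ("add", "tap0")]])

print(fast)  # the agent sees a removal, then an addition
print(slow)  # the agent sees no change at all
```

In the long-interval case the diff is empty, so in this model nothing would trigger a port status update, which matches the behaviour described above.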

This can also be reproduced with a stopped instance.

When an instance is stopped, the port status becomes down and the vif's active flag in the network info cache becomes false. If you then rebuild the instance, it gets stuck in the rebuilding state.

So we may need neutron to send the network vif unplugged event, and nova should wait for this event when it destroys the instance. This would ensure that neutron knows the port has been removed.
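
A rough sketch of that idea in plain Python, only to show the ordering. This is not nova code: `UnplugWaiter`, `rebuild`, the port id and the timeout are all hypothetical, and the timer stands in for neutron delivering the unplugged event.

```python
import threading


class UnplugWaiter:
    """Hypothetical helper that tracks one 'unplugged' event per port."""

    def __init__(self):
        self._events = {}
        self._lock = threading.Lock()

    def _event_for(self, port_id):
        with self._lock:
            return self._events.setdefault(port_id, threading.Event())

    def notify_unplugged(self, port_id):
        # Would be called when the unplugged event for this port arrives.
        self._event_for(port_id).set()

    def wait_unplugged(self, port_id, timeout):
        # True if the event arrived within the timeout, False otherwise.
        return self._event_for(port_id).wait(timeout)


def rebuild(waiter, port_id, destroy, spawn, timeout=5.0):
    """Destroy, wait until the port is reported unplugged, then spawn."""
    destroy()
    if not waiter.wait_unplugged(port_id, timeout):
        raise TimeoutError("no unplugged event for port %s" % port_id)
    spawn()


waiter = UnplugWaiter()
calls = []


def fake_destroy():
    calls.append("destroy")
    # Stand-in for neutron reporting the unplug a little later.
    threading.Timer(0.05, waiter.notify_unplugged, args=("port-1",)).start()


rebuild(waiter, "port-1", fake_destroy, lambda: calls.append("spawn"))
print(calls)
```

With this ordering, spawn cannot start until the unplug has been acknowledged, so the agent cannot miss the removal even with a long polling interval. What should happen on timeout is left open here.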