Upon rebuild, instances might never reach the ACTIVE state

Bug #1329546 reported by Salvatore Orlando
Affects    Status        Importance  Assigned to         Milestone
neutron    Fix Released  High        Salvatore Orlando
Havana     Fix Released  High        Ihar Hrachyshka
Icehouse   Fix Released  High        Ihar Hrachyshka

Bug Description

VMware mine sweeper for Neutron (*) recently showed a 100% failure rate on tempest.api.compute.v3.servers.test_server_actions

Logs for two instances of these failures are available at [1] and [2]
The failure manifested as an instance unable to go active after a rebuild.
A bit of instrumentation and log analysis revealed no obvious error on the neutron side, and also that the instance was actually in the "running" state even though its task state was "rebuilding/spawning".

N-API logs [3] revealed that the instance spawn was timing out waiting for a notification from neutron regarding the VIF plug; however, the same log showed that such a notification had been received [4].

It turns out that, after the rebuild, the instance network info cache still had 'active': False for the instance's VIF, even though the status of the corresponding port was 'ACTIVE'. This happened because, after the network-vif-plugged event was received, nothing triggered a refresh of the instance network info. For this reason the VM, after a rebuild, kept waiting for an event which, obviously, was never going to be sent again by neutron.

While this manifested only on mine sweeper, this appears to be a nova bug; it shows up on the VMware minesweeper only because of the way the plugin synchronizes with the backend when reporting the operational status of a port.
A simple solution for this problem would be to reload the instance network info cache when network-vif-plugged events are received by nova. (But as the reporter knows nothing about nova, this might be a very bad idea as well.)
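
A minimal sketch of that idea follows, with illustrative names only (the handler name and its arguments are hypothetical; this is not nova's actual external-event plumbing, just where a cache refresh would fit):

    # Hypothetical sketch of the proposed fix: when nova receives a
    # network-vif-plugged event, refresh the instance's network info cache
    # from neutron so the cached vif['active'] flag cannot stay False while
    # the port is already ACTIVE.

    def on_network_vif_plugged(context, instance, network_api):
        """Hypothetical handler; 'network_api' stands in for nova's neutron API."""
        # Re-query neutron for the instance's VIFs (assumes a
        # get_instance_nw_info-style call, similar to what nova exposes).
        fresh_nw_info = network_api.get_instance_nw_info(context, instance)

        # Overwrite the stale cache so later waits see vif['active'] == True.
        instance.info_cache.network_info = fresh_nw_info
        instance.info_cache.save()
        return fresh_nw_info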

[1] http://208.91.1.172/logs/neutron/98278/2/413209/testr_results.html
[2] http://208.91.1.172/logs/neutron/73234/34/413213/testr_results.html
[3] http://208.91.1.172/logs/neutron/73234/34/413213/logs/screen-n-cpu.txt.gz?level=WARNING#_2014-06-06_01_46_36_219
[4] http://208.91.1.172/logs/neutron/73234/34/413213/logs/screen-n-cpu.txt.gz?level=DEBUG#_2014-06-06_01_41_31_767

(*) runs libvirt/KVM + NSX

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

I'm not sure why gerrit did not update it. Commit msg looks fine.

https://review.openstack.org/#/c/99182/

Changed in nova:
assignee: nobody → Salvatore Orlando (salvatore-orlando)
status: New → In Progress
Revision history for this message
Alex Xu (xuhj) wrote :

@Salvatore, I feel there is a race between nova and the neutron L2 agent.

Rebuild destroys the instance first and then spawns it again; the code is as below:

        if not recreate:
            self.driver.destroy(context, instance, network_info,
                                block_device_info=block_device_info)
        instance.task_state = task_states.REBUILD_BLOCK_DEVICE_MAPPING
        instance.save(expected_task_state=[task_states.REBUILDING])

        new_block_device_info = attach_block_devices(context, instance, bdms)

        instance.task_state = task_states.REBUILD_SPAWNING
        instance.save(
            expected_task_state=[task_states.REBUILD_BLOCK_DEVICE_MAPPING])

        self.driver.spawn(context, instance, image_meta, injected_files,
                          admin_password, network_info=network_info,
                          block_device_info=new_block_device_info)

So the race happens between destroy and spawn.

For example:
I'm using the linuxbridge agent here, and currently the VIF's 'active' flag is false in the network info cache.

The normal sequence for a rebuild would be:
1. nova destroys the instance.
2. The tap device is removed from the bridge.
3. The agent polls the devices, finds the removed device, and updates the port status to DOWN.
4. nova spawns the instance.
5. The tap device is added to the bridge.
6. The agent polls the devices, finds the new device, and sets the port status to UP.
7. Neutron sends the network-vif-plugged event.
8. nova finishes the rebuild.

But if the polling interval of the neutron agent is too long, the problem shows up here.
At step 3, if the agent didn't poll the devices between step 1 and step 4, it doesn't know the device was removed.
After step 5 the device is added back, so the agent never learns that the device had been removed, and the port status won't be updated.
Then nova won't receive the network-vif-plugged event, so the instance is stuck in the building status.

This can also be reproduced with a stopped instance.

When an instance is stopped, the port status becomes DOWN and the VIF's 'active' flag in the network info cache is false. If you then rebuild the instance, it will get stuck in the rebuilding status.

So we may need neutron to send a network-vif-unplugged event, and nova should wait for that event when destroying an instance. That would ensure neutron knows the port has been removed.
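
A self-contained toy of the timing described above (generic set arithmetic, not the actual agent code): a diff-based poller only reacts to changes between consecutive scans, so a tap device removed and re-added entirely between two polls produces neither a "removed" nor an "added" entry, and the port status is never touched.

    # Toy model of diff-based device polling: a remove + re-add that happens
    # entirely between two polls is invisible to the agent.

    def poll(previous, current):
        added = current - previous
        removed = previous - current
        return added, removed

    known = {"tap1234"}                 # agent's view after the last poll

    # Instance is destroyed (tap removed) and respawned (tap re-added)
    # before the next poll fires, so the bridge looks unchanged.
    on_bridge_at_next_poll = {"tap1234"}

    added, removed = poll(known, on_bridge_at_next_poll)
    print(added, removed)               # set() set() -> no port status update,
                                        # no network-vif-plugged event for nova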

Revision history for this message
Alex Xu (xuhj) wrote :

@Salvatore, please disregard my previous comment; I thought about it again and I think your fix is right. Sorry...

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

No problem.
Anyway, your analysis looks to me the same as https://bugs.launchpad.net/nova/+bug/1240849, which was for the OVS agent.
If you reckon the lb agent might be affected by the same issue, then we should push a neutron patch.

tags: added: icehouse-backport-potential
Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

The same situation also appears in the neutron full job.
Adding the neutron-full-job tag

tags: added: neutron-full-job
Revision history for this message
Attila Fazekas (afazekas) wrote :

message:"failed to reach VERIFY_RESIZE status" AND message:"Current task state\: resize_finish" AND tags:"console.html"

http://logstash.openstack.org/#eyJmaWVsZHMiOltdLCJzZWFyY2giOiJtZXNzYWdlOlwiZmFpbGVkIHRvIHJlYWNoIFZFUklGWV9SRVNJWkUgc3RhdHVzXCIgQU5EIG1lc3NhZ2U6XCJDdXJyZW50IHRhc2sgc3RhdGVcXDogcmVzaXplX2ZpbmlzaFwiIEFORCB0YWdzOlwiY29uc29sZS5odG1sXCIiLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsIm9mZnNldCI6MCwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjE0MDQyODA3NTYxNDN9

Is the above failure caused by this bug? (137 hits/week; the build name always contains neutron.)
117/week happened on the neutron full job.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

I checked a few occurrences and there is always something like the following: http://logs.openstack.org/75/103775/2/check/check-tempest-dsvm-neutron-full/a3984cd/logs/screen-n-cpu.txt.gz#_2014-07-01_12_43_38_687 which makes me think the failures retrieved with Attila's query have this root cause.

I am updating the patch. I will push it soon.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Contrary to what was claimed in the bug description, the actual root cause is a different one, and it's in neutron.

For events like rebuilding or rebooting an instance, a VIF disappears and reappears rather quickly.
In this case the OVS agent loop starts processing the VIF, and then skips it when it realizes the VIF is no longer on the integration bridge.

However, the agent keeps the VIF in the set of 'current' VIFs. This means that when the VIF is plugged again it is not processed, and hence the problem.
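
A toy illustration of that bookkeeping error (simplified sets, not the agent's real data structures): once the vanished VIF is left in the 'current' set, the next loop's added-devices diff is empty when the VIF reappears, so it is never wired up and the port never goes ACTIVE.

    current = set()   # devices the agent believes it has already processed

    def agent_loop(scanned, still_on_bridge):
        """One simplified agent iteration; returns the devices actually wired up."""
        global current
        added = scanned - current
        processed = set()
        for dev in added:
            if dev in still_on_bridge:
                processed.add(dev)    # here the port would be set ACTIVE
            # else: the device vanished mid-loop and is skipped
        current = scanned             # bug: skipped devices are recorded too
        return processed

    # Loop 1: the VIF shows up in the scan but is gone before it can be wired up.
    print(agent_loop({"tapX"}, still_on_bridge=set()))       # set()
    # Loop 2: the VIF is back on the bridge, but the diff against 'current'
    # is empty because the skipped device was kept, so it is never processed.
    print(agent_loop({"tapX"}, still_on_bridge={"tapX"}))    # set()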

Removing nova from affected projects. Patch will follow up soon.

no longer affects: nova
Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Also, it might be argued that a VIF should not just 'disappear' and then 'appear' again all of a sudden when the network-vif-* events are enabled. This can actually happen because the VIF status on the nova side is not updated when these events are received; see bug 1338672. Neutron should be resilient to this situation.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/105239

Changed in neutron:
assignee: nobody → Salvatore Orlando (salvatore-orlando)
status: New → In Progress
Kyle Mestery (mestery)
Changed in neutron:
milestone: none → juno-2
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/105239
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=90fedbe44ca6bfccce5d71465532fbdc85ee3814
Submitter: Jenkins
Branch: master

commit 90fedbe44ca6bfccce5d71465532fbdc85ee3814
Author: Salvatore Orlando <email address hidden>
Date: Sat Jul 5 09:17:54 2014 -0700

    Do not mark device as processed if it wasn't

    Currently treat_devices_added_or_updated in the OVS agent skips
    processing devices which disappeared from the integration bridge
    during the agent loop.
    This is fine, however the agent should not mark these devices as
    processed. Otherwise they won't be processed, should they appear
    again on the bridge.

    This patch ensures these devices are not added to the current
    device set.

    The patch also changes treat_devices_added_or_updated. The
    function now will return the list of skipped devices and not
    anymore a flag signalling whether a resync is required.
    With the current logic a resync would be required if retrieval
    of device details fails. With this change, the function
    treat_devices_added_or_updated will raise in this case and the
    exception will be handled in process_network_ports.

    For the sake of consistency, this patch also updates the
    similar function treat_ancillary_devices_added in order to
    use the same logic.

    Finally, this patch amends an inaccurate related comment.

    Closes-Bug: #1329546

    Change-Id: Icc744f32494c7a76004ff161536316924594fbdb
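
Roughly, the corrected pattern looks like the sketch below (illustrative names, not the actual ovs_neutron_agent code): treat_devices_added_or_updated reports which devices it skipped, and the caller drops them from the current set so they are re-processed if they reappear on the bridge.

    # Simplified sketch of the fix: report skipped devices to the caller and
    # drop them from the 'current' set so a reappearing VIF is processed again.

    def treat_devices_added_or_updated(devices, still_on_bridge):
        """Return the devices skipped because they vanished mid-loop."""
        skipped = set()
        for dev in devices:
            if dev not in still_on_bridge:
                skipped.add(dev)      # disappeared from the integration bridge
                continue
            # ... normal wiring and port-status update would happen here ...
        return skipped

    def process_network_ports(current, scanned, still_on_bridge):
        added = scanned - current
        skipped = treat_devices_added_or_updated(added, still_on_bridge)
        # Key change: devices that were skipped are not marked as processed.
        return (current | added) - skipped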

Changed in neutron:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/icehouse)

Fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/108991

Changed in neutron:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/icehouse)

Reviewed: https://review.openstack.org/108991
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=231010bdf22a30b2034d8221048c958544f19547
Submitter: Jenkins
Branch: stable/icehouse

commit 231010bdf22a30b2034d8221048c958544f19547
Author: Salvatore Orlando <email address hidden>
Date: Sat Jul 5 09:17:54 2014 -0700

    Do not mark device as processed if it wasn't

    Currently treat_devices_added_or_updated in the OVS agent skips
    processing devices which disappeared from the integration bridge
    during the agent loop.
    This is fine, however the agent should not mark these devices as
    processed. Otherwise they won't be processed, should they appear
    again on the bridge.

    This patch ensures these devices are not added to the current
    device set.

    The patch also changes treat_devices_added_or_updated. The
    function now will return the list of skipped devices and not
    anymore a flag signalling whether a resync is required.
    With the current logic a resync would be required if retrieval
    of device details fails. With this change, the function
    treat_devices_added_or_updated will raise in this case and the
    exception will be handled in process_network_ports.

    For the sake of consistency, this patch also updates the
    similar function treat_ancillary_devices_added in order to
    use the same logic.

    Finally, this patch amends an inaccurate related comment.

    Closes-Bug: #1329546

    Conflicts:
     neutron/plugins/openvswitch/agent/ovs_neutron_agent.py

    Required changes:
    - fetch all device details first before proceeding with handling ports
      to reflect Juno behaviour.
    - unit test was modified to run with get_device_details since
      get_devices_details_list is not available in Icehouse.
    - fixed E128 violation in the backported code.

    Change-Id: Icc744f32494c7a76004ff161536316924594fbdb
    (cherry picked from commit 90fedbe44ca6bfccce5d71465532fbdc85ee3814)

tags: added: in-stable-icehouse
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/havana)

Fix proposed to branch: stable/havana
Review: https://review.openstack.org/115984

Alan Pevec (apevec)
tags: removed: icehouse-backport-potential in-stable-icehouse
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/havana)

Reviewed: https://review.openstack.org/115984
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3c44da1a968d157eeea7677b39d5d9f9d84365c7
Submitter: Jenkins
Branch: stable/havana

commit 3c44da1a968d157eeea7677b39d5d9f9d84365c7
Author: Salvatore Orlando <email address hidden>
Date: Sat Jul 5 09:17:54 2014 -0700

    Do not mark device as processed if it wasn't

    Currently treat_devices_added_or_updated in the OVS agent skips
    processing devices which disappeared from the integration bridge
    during the agent loop.
    This is fine, however the agent should not mark these devices as
    processed. Otherwise they won't be processed, should they appear
    again on the bridge.

    This patch ensures these devices are not added to the current
    device set.

    The patch also changes treat_devices_added_or_updated. The
    function now will return the list of skipped devices and not
    anymore a flag signalling whether a resync is required.
    With the current logic a resync would be required if retrieval
    of device details fails. With this change, the function
    treat_devices_added_or_updated will raise in this case and the
    exception will be handled in process_network_ports.

    For the sake of consistency, this patch also updates the
    similar function treat_ancillary_devices_added in order to
    use the same logic.

    Finally, this patch amends an inaccurate related comment.

    Closes-Bug: #1329546

    Conflicts:
     neutron/plugins/openvswitch/agent/ovs_neutron_agent.py
     neutron/tests/unit/openvswitch/test_ovs_neutron_agent.py

    Required changes:
    - fetch all device details first before proceeding with handling ports
      to reflect Juno behaviour.
    - unit test was modified to run with get_device_details since
      get_devices_details_list is not available in Icehouse.
    - fixed E128 violation in the backported code.

    Additional changes in Havana:
    - modified patch not to pass ovs_restarted argument into
      treat_devices_added_or_updated() since it's not present in Havana.
    - disabled test_schedule_pool_with_down_agent that fails in gate.

    Change-Id: Icc744f32494c7a76004ff161536316924594fbdb
    (cherry picked from commit 90fedbe44ca6bfccce5d71465532fbdc85ee3814)
    (cherry picked from commit 231010bdf22a30b2034d8221048c958544f19547)

Thierry Carrez (ttx)
Changed in neutron:
milestone: juno-2 → 2014.2
norman shen (jshen28)
description: updated