Upon rebuild, instances might never reach the ACTIVE state

Bug #1329546 reported by Salvatore Orlando
Affects    Status        Importance  Assigned to         Milestone
neutron    Fix Released  High        Salvatore Orlando
Havana     Fix Released  High        Ihar Hrachyshka
Icehouse   Fix Released  High        Ihar Hrachyshka

Bug Description

VMware mine sweeper for Neutron (*) recently showed a 100% failure rate on tempest.api.compute.v3.servers.test_server_actions

Logs for two instances of these failures are available at [1] and [2]
The failure manifested as an instance unable to go active after a rebuild.
A bit of instrumentation and log analysis revealed no obvious error on the neutron side, and also that the instance was actually in the "running" state even though its task state was "rebuilding/spawning".

N-API logs [3] revealed that the instance spawn was timing out waiting for a notification from neutron regarding the VIF plug; however, the same log showed that such a notification had been received [4].

It turns out that, after the rebuild, the instance network info cache still had 'active': False for the instance's VIF, even though the status of the corresponding port was 'ACTIVE'. This happened because, after the network-vif-plugged event was received, nothing triggered a refresh of the instance network info. For this reason the VM, after a rebuild, kept waiting for an event which, obviously, was never going to be sent again by neutron.

While this manifested only on mine sweeper, this appears to be a nova bug; it shows up on the VMware minesweeper only because of the way the plugin synchronizes with the backend when reporting the operational status of a port.
A simple solution for this problem would be to reload the instance network info cache when network-vif-plugged events are received by nova. (But as the reporter knows nothing about nova, this might be a very bad idea as well.)
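
A minimal sketch of that idea follows, with illustrative names only (the handler name and its arguments are hypothetical; this is not nova's actual external-event plumbing, just where a cache refresh would fit):

    # Hypothetical sketch of the proposed fix: when nova receives a
    # network-vif-plugged event, refresh the instance's network info cache
    # from neutron so the cached vif['active'] flag cannot stay False while
    # the port is already ACTIVE.

    def on_network_vif_plugged(context, instance, network_api):
        """Hypothetical handler; 'network_api' stands in for nova's neutron API."""
        # Re-query neutron for the instance's VIFs (assumes a
        # get_instance_nw_info-style call, similar to what nova exposes).
        fresh_nw_info = network_api.get_instance_nw_info(context, instance)

        # Overwrite the stale cache so later waits see vif['active'] == True.
        instance.info_cache.network_info = fresh_nw_info
        instance.info_cache.save()
        return fresh_nw_info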

[1] http://208.91.1.172/logs/neutron/98278/2/413209/testr_results.html
[2] http://208.91.1.172/logs/neutron/73234/34/413213/testr_results.html
[3] http://208.91.1.172/logs/neutron/73234/34/413213/logs/screen-n-cpu.txt.gz?level=WARNING#_2014-06-06_01_46_36_219
[4] http://208.91.1.172/logs/neutron/73234/34/413213/logs/screen-n-cpu.txt.gz?level=DEBUG#_2014-06-06_01_41_31_767

(*) runs libvirt/KVM + NSX

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

I'm not sure why gerrit did not update it. Commit msg looks fine.

https://review.openstack.org/#/c/99182/

Changed in nova:
assignee: nobody → Salvatore Orlando (salvatore-orlando)
status: New → In Progress
Revision history for this message
Alex Xu (xuhj) wrote :

@Salvatore, I feel there is a race between nova and the neutron L2 agent.

Rebuild destroys the instance first and then spawns it again; the code is as below:

        if not recreate:
            self.driver.destroy(context, instance, network_info,
                                block_device_info=block_device_info)
        instance.task_state = task_states.REBUILD_BLOCK_DEVICE_MAPPING
        instance.save(expected_task_state=[task_states.REBUILDING])

        new_block_device_info = attach_block_devices(context, instance, bdms)

        instance.task_state = task_states.REBUILD_SPAWNING
        instance.save(
            expected_task_state=[task_states.REBUILD_BLOCK_DEVICE_MAPPING])

        self.driver.spawn(context, instance, image_meta, injected_files,
                          admin_password, network_info=network_info,
                          block_device_info=new_block_device_info)

So the race happens between destroy and spawn.

For example:
I'm using the linuxbridge agent here, and currently the VIF's 'active' flag is false in the network info cache.

The normal sequence for a rebuild would be:
1. nova destroys the instance.
2. The tap device is removed from the bridge.
3. The agent polls the devices, finds the removed device, and updates the port status to DOWN.
4. nova spawns the instance.
5. The tap device is added to the bridge.
6. The agent polls the devices, finds the new device, and sets the port status to UP.
7. Neutron sends the network-vif-plugged event.
8. nova finishes the rebuild.

But if the polling interval of the neutron agent is too long, the problem shows up here.
At step 3, if the agent didn't poll the devices between step 1 and step 4, it doesn't know the device was removed.
After step 5 the device is added back, so the agent never learns that the device had been removed, and the port status won't be updated.
Then nova won't receive the network-vif-plugged event, so the instance is stuck in the building status.

This can also be reproduced with a stopped instance.

When an instance is stopped, the port status becomes DOWN and the VIF's 'active' flag in the network info cache is false. If you then rebuild the instance, it will get stuck in the rebuilding status.

So we may need neutron to send a network-vif-unplugged event, and nova should wait for that event when destroying an instance. That would ensure neutron knows the port has been removed.
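
A self-contained toy of the timing described above (generic set arithmetic, not the actual agent code): a diff-based poller only reacts to changes between consecutive scans, so a tap device removed and re-added entirely between two polls produces neither a "removed" nor an "added" entry, and the port status is never touched.

    # Toy model of diff-based device polling: a remove + re-add that happens
    # entirely between two polls is invisible to the agent.

    def poll(previous, current):
        added = current - previous
        removed = previous - current
        return added, removed

    known = {"tap1234"}                 # agent's view after the last poll

    # Instance is destroyed (tap removed) and respawned (tap re-added)
    # before the next poll fires, so the bridge looks unchanged.
    on_bridge_at_next_poll = {"tap1234"}

    added, removed = poll(known, on_bridge_at_next_poll)
    print(added, removed)               # set() set() -> no port status update,
                                        # no network-vif-plugged event for nova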

Revision history for this message
Alex Xu (xuhj) wrote :

@Salvatore, please disregard my previous comment; I thought about it again and I think your fix is right. Sorry...

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

No problem.
Anyway, your analysis looks to me the same as https://bugs.launchpad.net/nova/+bug/1240849, which was for the OVS agent.
If you reckon the lb agent might be affected by the same issue, then we should push a neutron patch.

tags: added: icehouse-backport-potential
Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

The same situation also appears in the neutron full job.
Adding the neutron-full-job tag

tags: added: neutron-full-job
Revision history for this message
Attila Fazekas (afazekas) wrote :

message:"failed to reach VERIFY_RESIZE status" AND message:"Current task state\: resize_finish" AND tags:"console.html"

http://logstash.openstack.org/#eyJmaWVsZHMiOltdLCJzZWFyY2giOiJtZXNzYWdlOlwiZmFpbGVkIHRvIHJlYWNoIFZFUklGWV9SRVNJWkUgc3RhdHVzXCIgQU5EIG1lc3NhZ2U6XCJDdXJyZW50IHRhc2sgc3RhdGVcXDogcmVzaXplX2ZpbmlzaFwiIEFORCB0YWdzOlwiY29uc29sZS5odG1sXCIiLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsIm9mZnNldCI6MCwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjE0MDQyODA3NTYxNDN9

Is the above failure caused by this bug? (137 hits/week; the build name always contains neutron.)
117/week happened on the neutron full job.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

I checked a few occurrences and there is always something like the following: http://logs.openstack.org/75/103775/2/check/check-tempest-dsvm-neutron-full/a3984cd/logs/screen-n-cpu.txt.gz#_2014-07-01_12_43_38_687 which makes me think the failures retrieved with Attila's query have this root cause.

I am updating the patch. I will push it soon.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Contrary to what was claimed in the bug description, the actual root cause is a different one, and it's in neutron.

For events like rebuilding or rebooting an instance, a VIF disappears and reappears rather quickly.
In this case the OVS agent loop starts processing the VIF, and then skips it when it realizes the VIF is no longer on the integration bridge.

However, the agent keeps the VIF in the set of 'current' VIFs. This means that when the VIF is plugged again it is not processed, and hence the problem.
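
A toy illustration of that bookkeeping error (simplified sets, not the agent's real data structures): once the vanished VIF is left in the 'current' set, the next loop's added-devices diff is empty when the VIF reappears, so it is never wired up and the port never goes ACTIVE.

    current = set()   # devices the agent believes it has already processed

    def agent_loop(scanned, still_on_bridge):
        """One simplified agent iteration; returns the devices actually wired up."""
        global current
        added = scanned - current
        processed = set()
        for dev in added:
            if dev in still_on_bridge:
                processed.add(dev)    # here the port would be set ACTIVE
            # else: the device vanished mid-loop and is skipped
        current = scanned             # bug: skipped devices are recorded too
        return processed

    # Loop 1: the VIF shows up in the scan but is gone before it can be wired up.
    print(agent_loop({"tapX"}, still_on_bridge=set()))       # set()
    # Loop 2: the VIF is back on the bridge, but the diff against 'current'
    # is empty because the skipped device was kept, so it is never processed.
    print(agent_loop({"tapX"}, still_on_bridge={"tapX"}))    # set()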

Removing nova from affected projects. Patch will follow up soon.

no longer affects: nova
Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Also, it might be argued that a VIF should not just 'disappear' and then 'appear' again all of a sudden when the network-vif-* events are enabled. This can actually happen because the VIF status on the nova side is not updated when these events are received; see bug 1338672. Neutron should be resilient to this situation.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/105239

Changed in neutron:
assignee: nobody → Salvatore Orlando (salvatore-orlando)
status: New → In Progress
Kyle Mestery (mestery)
Changed in neutron:
milestone: none → juno-2
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/105239
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=90fedbe44ca6bfccce5d71465532fbdc85ee3814
Submitter: Jenkins
Branch: master

commit 90fedbe44ca6bfccce5d71465532fbdc85ee3814
Author: Salvatore Orlando <email address hidden>
Date: Sat Jul 5 09:17:54 2014 -0700

    Do not mark device as processed if it wasn't

    Currently treat_devices_added_or_updated in the OVS agent skips
    processing devices which disappeared from the integration bridge
    during the agent loop.
    This is fine, however the agent should not mark these devices as
    processed. Otherwise they won't be processed, should they appear
    again on the bridge.

    This patch ensures these devices are not added to the current
    device set.

    The patch also changes treat_devices_added_or_updated. The
    function now will return the list of skipped devices and not
    anymore a flag signalling whether a resync is required.
    With the current logic a resync would be required if retrieval
    of device details fails. With this change, the function
    treat_devices_added_or_updated will raise in this case and the
    exception will be handled in process_network_ports.

    For the sake of consistency, this patch also updates the
    similar function treat_ancillary_devices_added in order to
    use the same logic.

    Finally, this patch amends an inaccurate related comment.

    Closes-Bug: #1329546

    Change-Id: Icc744f32494c7a76004ff161536316924594fbdb
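
Roughly, the corrected pattern looks like the sketch below (illustrative names, not the actual ovs_neutron_agent code): treat_devices_added_or_updated reports which devices it skipped, and the caller drops them from the current set so they are re-processed if they reappear on the bridge.

    # Simplified sketch of the fix: report skipped devices to the caller and
    # drop them from the 'current' set so a reappearing VIF is processed again.

    def treat_devices_added_or_updated(devices, still_on_bridge):
        """Return the devices skipped because they vanished mid-loop."""
        skipped = set()
        for dev in devices:
            if dev not in still_on_bridge:
                skipped.add(dev)      # disappeared from the integration bridge
                continue
            # ... normal wiring and port-status update would happen here ...
        return skipped

    def process_network_ports(current, scanned, still_on_bridge):
        added = scanned - current
        skipped = treat_devices_added_or_updated(added, still_on_bridge)
        # Key change: devices that were skipped are not marked as processed.
        return (current | added) - skipped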

Changed in neutron:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/icehouse)

Fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/108991

Changed in neutron:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/icehouse)

Reviewed: https://review.openstack.org/108991
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=231010bdf22a30b2034d8221048c958544f19547
Submitter: Jenkins
Branch: stable/icehouse

commit 231010bdf22a30b2034d8221048c958544f19547
Author: Salvatore Orlando <email address hidden>
Date: Sat Jul 5 09:17:54 2014 -0700

    Do not mark device as processed if it wasn't

    Currently treat_devices_added_or_updated in the OVS agent skips
    processing devices which disappeared from the integration bridge
    during the agent loop.
    This is fine, however the agent should not mark these devices as
    processed. Otherwise they won't be processed, should they appear
    again on the bridge.

    This patch ensures these devices are not added to the current
    device set.

    The patch also changes treat_devices_added_or_updated. The
    function now will return the list of skipped devices and not
    anymore a flag signalling whether a resync is required.
    With the current logic a resync would be required if retrieval
    of device details fails. With this change, the function
    treat_devices_added_or_updated will raise in this case and the
    exception will be handled in process_network_ports.

    For the sake of consistency, this patch also updates the
    similar function treat_ancillary_devices_added in order to
    use the same logic.

    Finally, this patch amends an inaccurate related comment.

    Closes-Bug: #1329546

    Conflicts:
     neutron/plugins/openvswitch/agent/ovs_neutron_agent.py

    Required changes:
    - fetch all device details first before proceeding with handling ports
      to reflect Juno behaviour.
    - unit test was modified to run with get_device_details since
      get_devices_details_list is not available in Icehouse.
    - fixed E128 violation in the backported code.

    Change-Id: Icc744f32494c7a76004ff161536316924594fbdb
    (cherry picked from commit 90fedbe44ca6bfccce5d71465532fbdc85ee3814)

tags: added: in-stable-icehouse
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/havana)

Fix proposed to branch: stable/havana
Review: https://review.openstack.org/115984

Alan Pevec (apevec)
tags: removed: icehouse-backport-potential in-stable-icehouse
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/havana)

Reviewed: https://review.openstack.org/115984
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3c44da1a968d157eeea7677b39d5d9f9d84365c7
Submitter: Jenkins
Branch: stable/havana

commit 3c44da1a968d157eeea7677b39d5d9f9d84365c7
Author: Salvatore Orlando <email address hidden>
Date: Sat Jul 5 09:17:54 2014 -0700

    Do not mark device as processed if it wasn't

    Currently treat_devices_added_or_updated in the OVS agent skips
    processing devices which disappeared from the integration bridge
    during the agent loop.
    This is fine, however the agent should not mark these devices as
    processed. Otherwise they won't be processed, should they appear
    again on the bridge.

    This patch ensures these devices are not added to the current
    device set.

    The patch also changes treat_devices_added_or_updated. The
    function now will return the list of skipped devices and not
    anymore a flag signalling whether a resync is required.
    With the current logic a resync would be required if retrieval
    of device details fails. With this change, the function
    treat_devices_added_or_updated will raise in this case and the
    exception will be handled in process_network_ports.

    For the sake of consistency, this patch also updates the
    similar function treat_ancillary_devices_added in order to
    use the same logic.

    Finally, this patch amends an inaccurate related comment.

    Closes-Bug: #1329546

    Conflicts:
     neutron/plugins/openvswitch/agent/ovs_neutron_agent.py
     neutron/tests/unit/openvswitch/test_ovs_neutron_agent.py

    Required changes:
    - fetch all device details first before proceeding with handling ports
      to reflect Juno behaviour.
    - unit test was modified to run with get_device_details since
      get_devices_details_list is not available in Icehouse.
    - fixed E128 violation in the backported code.

    Additional changes in Havana:
    - modified patch not to pass ovs_restarted argument into
      treat_devices_added_or_updated() since it's not present in Havana.
    - disabled test_schedule_pool_with_down_agent that fails in gate.

    Change-Id: Icc744f32494c7a76004ff161536316924594fbdb
    (cherry picked from commit 90fedbe44ca6bfccce5d71465532fbdc85ee3814)
    (cherry picked from commit 231010bdf22a30b2034d8221048c958544f19547)

Thierry Carrez (ttx)
Changed in neutron:
milestone: juno-2 → 2014.2
norman shen (jshen28)
description: updated