Tacker enabled with health monitoring fails to respawn vnf instance with error: port vdu 1 still in use

Bug #1509465 reported by Sripriya on 2015-10-23
This bug affects 1 person
Affects: tacker
Importance: Undecided
Assigned to: Sripriya

Bug Description

This is an intermittent issue, observed in approximately 1 of every 15 iterations.

Workflow:

Create a VNFD from a template configured with a monitoring policy and user_data that runs ifdown eth0 (see https://review.openstack.org/#/c/234543/ for the template).
Create a VNF from that VNFD; the VNF comes up in the ACTIVE state.
The user_data immediately brings the eth0 interface down.
Tacker health monitoring immediately kicks in to respawn the VNF.
The VNF respawn fails, with Heat reporting that port vdu1 is still in use.

NOTE: This issue occurs intermittently (1 in 15-20 runs). It has been observed on the gate and also when running tox -e functional in a continuous loop.

Sripriya (sseetha) wrote :

Observations:

1. A new instance is created in nova and a request is sent to neutron to create all 3 ports
2. Once the instance spawns successfully, we disable the Ethernet interface, which causes the tacker health monitor to delete the instance and immediately respawn it
3. During instance deletion, nova sends a request to neutron to unplug the ports and eventually to delete the interfaces
4. Nova receives a 204 response from neutron for that request
5. Nova goes ahead with termination of the instance and logs the following:
2015-10-14 04:12:24.031 INFO nova.compute.manager [req-64f878fe-fa84-4a91-8e04-7eb65540e80f tacker service] [instance: ee87f1c6-e99d-4b28-81bc-5d68711a1903] Took 0.60 seconds to deallocate network for instance.
6. Nova immediately spins up a new instance and sends requests to neutron to create 3 ports
7. While the requests to create 3 ports are arriving at neutron, neutron is still deleting the previous ports and sends ‘network-vif-deleted’ events at:
2015-10-14 04:12:25.838 DEBUG neutron.notifiers.nova [-] Sending events: [{'tag': u'2bb6db9f-cea5-4b1b-8a64-4a92d701c1b2', 'name': 'network-vif-deleted', 'server_uuid': u'ee87f1c6-e99d-4b28-81bc-5d68711a1903'}, {'tag': u'94ace298-7b17-4c14-83cf-0826e30d0e34', 'name': 'network-vif-deleted', 'server_uuid': u'ee87f1c6-e99d-4b28-81bc-5d68711a1903'}, {'name': 'network-changed', 'server_uuid': u'33168619-ae37-4e68-a7cb-0c70f1ed8ce3'}] from (pid=19524) send_events /opt/stack/neutron/neutron/notifiers/nova.py:244

This all happened within a second or two, as the timestamps in the two log lines above show.
8. On the nova side, the network info cache has not yet been updated for the 3 port creation requests it sent, so nova has not yet taken the refresh_cache lock to update the network info.
9. While those requests are in flight, nova receives neutron’s network-vif-deleted and network-changed events, assumes it needs to update its network info, and immediately acquires the refresh_cache lock.
10. Nova again queries neutron for networks and ports. Once the response is received, it updates the network info cache and releases the lock.
11. In parallel, the original 3 port creation requests complete; tacker’s request then acquires the refresh_cache lock and again queries neutron for network and port info for the new ports, along with the existing cached ports.
12. Eventually, the nova network info cache ends up holding two sets of network info, and nova builds the qemu XML from it.
13. Since the ports are now duplicated, libvirt complains that the resource is busy. In this case, it can be the very first management port.
14. The Heat API interprets the busy resource or tap device belonging to a port as "vdu1: port still in use", and the VNF goes to ERROR.
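
The interleaving in steps 8-12 can be illustrated with a small toy model. This is only a sketch of the sequence described above, not Nova's actual code: the function names, the port names, and the assumption that the allocation path appends to whatever is already cached are all hypothetical, chosen to mirror the observation that the cache ends up with two sets of network info.

```python
# Toy model of the cache interleaving described in the observations.
# Nothing here is Nova code; every name is hypothetical.

def neutron_list_ports(neutron_ports):
    """Stand-in for querying neutron for the instance's current ports."""
    return list(neutron_ports)


def refresh_from_event(cache, neutron_ports):
    """Event-driven refresh (steps 9-10): rebuild the cache from neutron."""
    cache[:] = neutron_list_ports(neutron_ports)


def refresh_after_allocate(cache, neutron_ports):
    """Allocation path (step 11): add the fetched ports to the cached ones."""
    cache.extend(neutron_list_ports(neutron_ports))


def simulate(event_arrives_during_allocation):
    """Return the network info cache after one respawn."""
    neutron_ports = ["mgmt", "pkt_in", "pkt_out"]  # 3 ports of the new instance
    cache = []
    if event_arrives_during_allocation:
        # The late network-vif-deleted / network-changed events trigger a
        # refresh before the allocation path has populated the cache.
        refresh_from_event(cache, neutron_ports)
    refresh_after_allocate(cache, neutron_ports)
    return cache


if __name__ == "__main__":
    print(simulate(False))
    print(simulate(True))
```

In this model the cache holds each port once when the events do not arrive mid-allocation, and holds every port twice when they do, which corresponds to the duplicated ports in step 12.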

Sripriya (sseetha) on 2015-10-23
Changed in tacker:
assignee: nobody → Sripriya (sseetha)
status: New → Confirmed

Fix proposed to branch: master
Review: https://review.openstack.org/239104

Changed in tacker:
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/239104
Committed: https://git.openstack.org/cgit/openstack/tacker/commit/?id=7ae2388903fb998f467c8b57846544c6864c3c5b
Submitter: Jenkins
Branch: master

commit 7ae2388903fb998f467c8b57846544c6864c3c5b
Author: Sripriya <email address hidden>
Date: Fri Oct 23 13:30:45 2015 -0700

    Fixes port still in use for monitoring workflow

    Nova neutron interaction for port deletion of old instance and immediate
    port addition of new instance is failing the vnf respawn by throwing
    libvirt error: tap device or resource busy.

    Introduced a sleep after instance deletion giving nova enough time
    to clean up the old instance before spinning up a new instance.

    Change-Id: I385d2e6e19da2ad2bfe8c77aba51dae2b922242b
    Closes-Bug: #1509465
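
The approach in the commit message (pause after deleting the old instance, then create the new one) might look roughly like the sketch below. It is not the code from the commit: respawn, delete_instance, create_instance, and settle_seconds are hypothetical names, and the default delay is an arbitrary placeholder, since the actual sleep duration is not stated in this report.

```python
import time


def respawn(delete_instance, create_instance, settle_seconds=10.0,
            sleep=time.sleep):
    """Delete the old instance, wait, then create its replacement.

    delete_instance and create_instance are caller-supplied callables
    (hypothetical stand-ins for the real delete/create requests).
    settle_seconds is a placeholder value, not the one used in the fix.
    """
    delete_instance()
    # Give nova and neutron time to finish tearing down the old ports
    # before requests for the new instance's ports are issued.
    sleep(settle_seconds)
    return create_instance()
```

A fixed sleep is simple but only narrows the race window; polling until the old instance and its ports are confirmed gone would be a stricter alternative, at the cost of more API calls.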

Changed in tacker:
status: In Progress → Fix Committed
Changed in tacker:
status: Fix Committed → Fix Released