DPDK: Not able to launch any VM after add/del of VMs in a loop

Bug #1495002 reported by Vinod Nair
Affects: Juniper Openstack (status tracked in Trunk)

  Series   Status   Importance   Assigned to
  R3.0     New      Medium       Hari Prasad Killi
  R3.1     New      Medium       Hari Prasad Killi
  Trunk    New      Medium       Hari Prasad Killi

Bug Description

Not able to launch any VMs after adding/deleting VMs in a loop.

VIFs leak after each add/delete operation.
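For reference, a minimal sketch of the kind of add/delete loop used here (image, flavor and network names are taken from this report; the net-id lookup and the sleep intervals are assumptions, so adjust them for your setup):

NET_ID=$(neutron net-list | awk '/ADM-NET1/ {print $2}')   # network used in this report
for i in $(seq 1 50); do
    nova boot --image SRV --flavor 2G --nic net-id="$NET_ID" DELETE-VM$i
    sleep 30      # crude wait for the VM to reach ACTIVE
    nova delete DELETE-VM$i
    sleep 10      # give nova/agent time to clean up the port
done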

The following error is seen:

root@cs-scale-4:/store# nova list
+--------------------------------------+------------+--------+------------+-------------+--------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+------------+--------+------------+-------------+--------------------+
| 8db99cfc-7624-4006-8035-c869c9d4a929 | CS8-VM1 | ACTIVE | - | Running | ADM-NET1=144.1.0.3 |
| c9e85302-26e5-4dd3-a1cd-ffc246aebf74 | DELETE-VM2 | ERROR | - | NOSTATE | |
+--------------------------------------+------------+--------+------------+-------------+--------------------+
root@cs-scale-4:/store# nova show c9e85302-26e5-4dd3-a1cd-ffc246aebf74
+--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Property | Value |
+--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | cs-scale-7 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | cs-scale-7 |
| OS-EXT-SRV-ATTR:instance_name | instance-00000995 |
| OS-EXT-STS:power_state | 0 |
| OS-EXT-STS:task_state | - |
| OS-EXT-STS:vm_state | error |
| OS-SRV-USG:launched_at | - |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| config_drive | |
| created | 2015-09-11T22:30:44Z |
| fault | {"message": "internal error: process exited while connecting to monitor: libust[19463/19463]: Warning: HOME environment variable not set. Disabling LTTng-UST per-user tracing. (in setup_local_apps() at lttng-ust-comm.c:305) |
| | 2015-09-11T22:30:47.237168Z qemu-system-x86_", "code": 500, "details": " File \"/usr/lib/python2.7/dist-packages/nova/compute/manager.py\", line 293, in decorated_function |
| | return function(self, context, *args, **kwargs) |
| | File \"/usr/lib/python2.7/dist-packages/nova/compute/manager.py\", line 2114, in run_instance |
| | do_run_instance() |
| | File \"/usr/lib/python2.7/dist-packages/nova/openstack/common/lockutils.py\", line 254, in inner |
| | return f(*args, **kwargs) |
| | File \"/usr/lib/python2.7/dist-packages/nova/compute/manager.py\", line 2113, in do_run_instance |
| | legacy_bdm_in_spec) |
| | File \"/usr/lib/python2.7/dist-packages/nova/compute/manager.py\", line 1227, in _run_instance |
| | notify(\"error\", fault=e) # notify that build failed |
| | File \"/usr/lib/python2.7/dist-packages/nova/openstack/common/excutils.py\", line 68, in __exit__ |
| | six.reraise(self.type_, self.value, self.tb) |
| | File \"/usr/lib/python2.7/dist-packages/nova/compute/manager.py\", line 1211, in _run_instance |
| | instance, image_meta, legacy_bdm_in_spec) |
| | File \"/usr/lib/python2.7/dist-packages/nova/compute/manager.py\", line 1331, in _build_instance |
| | set_access_ip=set_access_ip) |
| | File \"/usr/lib/python2.7/dist-packages/nova/compute/manager.py\", line 396, in decorated_function |
| | return function(self, context, *args, **kwargs) |
| | File \"/usr/lib/python2.7/dist-packages/nova/compute/manager.py\", line 1743, in _spawn |
| | LOG.exception(_('Instance failed to spawn'), instance=instance) |
| | File \"/usr/lib/python2.7/dist-packages/nova/openstack/common/excutils.py\", line 68, in __exit__ |
| | six.reraise(self.type_, self.value, self.tb) |
| | File \"/usr/lib/python2.7/dist-packages/nova/compute/manager.py\", line 1740, in _spawn |
| | block_device_info) |
| | File \"/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py\", line 2297, in spawn |
| | block_device_info) |
| | File \"/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py\", line 3721, in _create_domain_and_network |
| | power_on=power_on) |
| | File \"/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py\", line 3622, in _create_domain |
| | domain.XMLDesc(0)) |
| | File \"/usr/lib/python2.7/dist-packages/nova/openstack/common/excutils.py\", line 68, in __exit__ |
| | six.reraise(self.type_, self.value, self.tb) |
| | File \"/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py\", line 3617, in _create_domain |
| | domain.createWithFlags(launch_flags) |
| | File \"/usr/lib/python2.7/dist-packages/eventlet/tpool.py\", line 179, in doit |
| | result = proxy_call(self._autowrap, f, *args, **kwargs) |
| | File \"/usr/lib/python2.7/dist-packages/eventlet/tpool.py\", line 139, in proxy_call |
| | rv = execute(f,*args,**kwargs) |
| | File \"/usr/lib/python2.7/dist-packages/eventlet/tpool.py\", line 77, in tworker |
| | rv = meth(*args,**kwargs) |
| | File \"/usr/lib/python2.7/dist-packages/libvirt.py\", line 900, in createWithFlags |
| | if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self) |
| | ", "created": "2015-09-11T22:30:48Z"} |
| flavor | 2G (9) |
| hostId | 51797437a6890da1318cde0f1ff3ed4ae569cf588fa55f1ce98cdd1d |
| id | c9e85302-26e5-4dd3-a1cd-ffc246aebf74 |
| image | SRV (5ea4a477-b305-4ad0-b4d9-d45f19251448) |
| key_name | - |
| metadata | {} |
| name | DELETE-VM2 |
| os-extended-volumes:volumes_attached | [] |
| status | ERROR |
| tenant_id | 8e1b50878f9148ee99358af10e8641f5 |
| updated | 2015-09-11T22:30:48Z |
| user_id | d0ca1b46438141ebb3002b0ad51f1cb1 |
+--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

root@cs-scale-8:/var/log/nova# vif --list | grep PMD | wc -l
883
root@cs-scale-7:/var/log/nova# vif --list | grep PMD | wc -l
2

root@cs-scale-7:/var/log/nova# cat /proc/meminfo | grep -i Huge
AnonHugePages: 116736 kB
HugePages_Total: 25792
HugePages_Free: 25503
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB

Version: 3.0-2645 (Icehouse, Ubuntu 14.04)

Tags: dpdk
Vinod Nair (vinodnair)
Changed in juniperopenstack:
milestone: none → r3.0-fcs
Vinod Nair (vinodnair)
tags: added: beta-blocker
Revision history for this message
Vinod Nair (vinodnair) wrote :

Seen with the latest (beta) binary as well.

root@cs-scale-7:/var/crashes# virsh list
 Id Name State
----------------------------------------------------

root@cs-scale-7:/var/crashes# virsh list --ll
error: command 'list' doesn't support option --ll
root@cs-scale-7:/var/crashes# virsh list --all
 Id Name State
----------------------------------------------------

root@cs-scale-7:/var/crashes# vif --list
Vrouter Interface Table

Flags: P=Policy, X=Cross Connect, S=Service Chain, Mr=Receive Mirror
       Mt=Transmit Mirror, Tc=Transmit Checksum Offload, L3=Layer 3, L2=Layer 2
       D=DHCP, Vp=Vhost Physical, Pr=Promiscuous, Vnt=Native Vlan Tagged
       Mnp=No MAC Proxy, Dpdk=DPDK PMD Interface, Rfl=Receive Filtering Offload, Mon=Interface is Monitored
       Uuf=Unknown Unicast Flood, Vof=VLAN insert/strip offload

vif0/0 PCI: 0:129:0.0 (Speed 10000, Duplex 1)
            Type:Physical HWaddr:90:e2:ba:4c:63:f0 IPaddr:0
            Vrf:0 Flags:TcL3L2VpVof MTU:1514 Ref:32
            RX device packets:1042372 bytes:1289801315 errors:1247
            RX port packets:11757 errors:0
            RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
            RX packets:11757 bytes:2495683 errors:0
            TX packets:14506 bytes:4400725 errors:0
            TX port packets:14506 errors:0
            TX device packets:326815 bytes:214375497 errors:0

vif0/1 PMD: vhost0
            Type:Host HWaddr:90:e2:ba:4c:63:f0 IPaddr:d010007
            Vrf:0 Flags:L3L2 MTU:1514 Ref:18
            RX port packets:14455 errors:0
            RX queue packets:14455 errors:0
            RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
            RX packets:14455 bytes:4398583 errors:0
            TX packets:11757 bytes:2495683 errors:0
            TX queue packets:11757 errors:0
            TX port packets:11757 errors:0

vif0/2 Socket: unix
            Type:Agent HWaddr:00:00:5e:00:01:00 IPaddr:0
            Vrf:65535 Flags:L3 MTU:1514 Ref:2
            RX port packets:121 errors:0
            RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
            RX packets:121 bytes:9922 errors:0
            TX packets:124 bytes:12400 errors:0
            TX queue packets:124 errors:0
            TX port packets:124 errors:0 syscalls:138

vif0/3 PMD: tap916dbaab-e0
            Type:Virtual HWaddr:00:00:5e:00:01:00 IPaddr:0
            Vrf:65535 Flags:L3L2D MTU:9160 Ref:16
            RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
            RX packets:0 bytes:0 errors:0
            TX packets:0 bytes:0 errors:0

vif0/4 PMD: tap7282c161-fa
            Type:Virtual HWaddr:00:00:5e:00:01:00 IPaddr:0
            Vrf:65535 Flags:L3L2D MTU:9160 Ref:16
            RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
            RX packets:0 bytes:0 errors:0
            TX packets:0 bytes:0 errors:0

vif0/11 PMD: tap38f25f2f-4f
            Type:Virtual HWaddr:00:00:5e:00:01:00 IPaddr:0
            Vrf:65535 Flags:L3L2D MTU:9160 Ref:16
            RX queue errors to lcore 0 0 0...


Revision history for this message
Andriy Berestovskyy (aber-k) wrote :

The bug is confirmed. vRouter does not get the vif delete message from the Agent.

Here is an example:
root@l3:~/log# cat contrail-vrouter-dpdk-stdout.log | grep "vif 38"
// This is a normal add/delete sequence:
2015-12-15 17:25:43,318 VROUTER: Adding vif 38 (gen. 405) virtual device tapb3f27625-22
2015-12-15 17:25:43,319 UVHOST: Adding vif 38 virtual device tapb3f27625-22
2015-12-15 17:25:43,319 UVHOST: vif 38 socket tapb3f27625-22 FD is 167
2015-12-15 17:25:47,465 VROUTER: Deleting vif 38 virtual device
// Another one:
2015-12-15 17:27:23,847 VROUTER: Adding vif 38 (gen. 423) virtual device tapda29cbdb-f3
2015-12-15 17:27:23,847 UVHOST: Adding vif 38 virtual device tapda29cbdb-f3
2015-12-15 17:27:23,847 UVHOST: vif 38 socket tapda29cbdb-f3 FD is 167
2015-12-15 17:27:27,989 VROUTER: Deleting vif 38 virtual device
// No delete message, for some reason:
2015-12-15 17:28:14,291 VROUTER: Adding vif 38 (gen. 432) virtual device tapbed8df27-35
2015-12-15 17:28:14,291 UVHOST: Adding vif 38 virtual device tapbed8df27-35
2015-12-15 17:28:14,291 UVHOST: vif 38 socket tapbed8df27-35 FD is 165
// After a restart, the Agent adds the interface back again:
2015-12-16 09:06:50,221 VROUTER: Adding vif 38 (gen. 39) virtual device tapbedb6892-36
2015-12-16 09:06:50,223 UVHOST: Adding vif 38 virtual device tapbedb6892-36
2015-12-16 09:06:50,223 UVHOST: vif 38 socket tapbedb6892-36 FD is 90

So basically, there are two issues here:
1) The vif delete message gets lost, so the "orphan" vifs appear.

2) After a restart, the Agent keeps adding those "orphan" vifs back to vRouter.
In the kernel version we do not add a vif if there is no tap device in the kernel, but there is no such check in the DPDK version.
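A rough way to spot the leak on a DPDK compute node, based on the commands already used in this report (a sketch; it assumes one PMD tap interface per VM, as in this test):

pmd_vifs=$(vif --list | grep -c 'PMD: tap')              # virtual (tap) interfaces known to vrouter
running_vms=$(virsh list --name | sed '/^$/d' | wc -l)   # domains actually running
echo "PMD tap vifs: $pmd_vifs  running VMs: $running_vms"
[ "$pmd_vifs" -gt "$running_vms" ] && echo "WARNING: $((pmd_vifs - running_vms)) vif(s) appear to be leaked"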

Revision history for this message
Andriy Berestovskyy (aber-k) wrote :

The only difference in logs between "normal" vif add/delete and "orphan" vif add/delete is the SANDESH Queue Drop message:

/var/log/contrail/contrail-control.log:2015-12-17 Thu 17:13:18:091.419 CET l3.contrail.semihalf.com [Thread 140644162922240, Pid 4501]: SANDESH: Queue Drop: IFMap [SYS_DEBUG]: IFMapClientSendInfo: Sent Update of link <virtual-machine-interface-routing-instance:attr(default-domain:admin:f62dc606-2609-46d4-a92c-00ee0b602edc,default-domain:admin:net:net),routing-instance:default-domain:admin:net:net> to vRouter default-global-system-config:l3.contrail.semihalf.com l3.contrail.semihalf.com:172.16.0.1 controller/src/ifmap/ifmap_update_sender.cc 318

Revision history for this message
Ashok Singh (ashoksr) wrote :

Nova-compute has not sent the delete to the agent in this case. The file for the orphan vif was still observed under /var/lib/contrail/ports.

Hi Andriy,

Thanks for your analysis. Since the file for the orphan vif tapbfe87886-e0 is still present in /var/lib/contrail/ports/, nova-compute has not sent the delete to the agent, and that is why the delete has not been sent to vrouter. We need to check why nova-compute didn't send the delete.

Regards,
Ashok

On 2/10/16, 2:46 PM, "Andriy Berestovskyy" <email address hidden> wrote:

Hi Ashok,
Thanks for the very detailed instructions. Answering your questions:

(1)
I ran the script and got one orphan vif, tapbfe87886-e0.
The Sandesh Trace Request Config page shows only add messages for this vif,
while for the other interfaces there are adds and deletes.

(2)
Sure, the file for the vif is present in /var/lib/contrail/ports. The
instance id is 1cd58568-a830-4f45-ac0a-a42e88229516

In the nova-compute.log for that instance id there are a few strange logs
(I cut out the dates):

WARNING nova.compute.manager [-] [instance:
1cd58568-a830-4f45-ac0a-a42e88229516] Instance shutdown by itself.
Calling the stop API. Current vm_state: active, current task_state:
None, current DB power_state: 4, current VM power_state: 4

INFO nova.compute.manager [req-d0242ecf-6bf8-4cc1-9836-edb3ca8c0c29
None] [instance: 1cd58568-a830-4f45-ac0a-a42e88229516] Instance is
already powered off in the hypervisor when stop is called.

WARNING nova.virt.libvirt.driver [-] [instance:
1cd58568-a830-4f45-ac0a-a42e88229516] During wait destroy, instance
disappeared

(3)
I guess due to (2) there are no delete messages sent to the vRouter.

Andriy

On Tue, Feb 9, 2016 at 3:30 PM, Ashok Singh R <email address hidden> wrote:
Hi Andriy,

I didn’t get chance to run your script.

Would it be possible for you to check the following in your setup?

(1)
http://<compute-node-ip>:8085/Snh_SandeshTraceRequest?x=Config

This page should have logs which indicate whether the agent is receiving the add/delete requests.

(2) For the orphan vif, you can also check whether an entry is present in the directory
/var/lib/contrail/ports/
This directory should have one file for each port. In most cases this file is created/deleted by nova-compute on port add/delete events. It should create/delete the file even when the agent is dead.

If the entry for the orphan vif is still present, please check the log file /var/log/nova/nova-compute.log to see if there are any errors/exceptions etc.

(3)
http://<compute-node-ip>:8085/Snh_SandeshTraceRequest?x=KSync%20Interface

This page should have information on whether the agent sent the request to vrouter and whether vrouter acked it.
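In shell form, these three checks boil down to the following (a sketch, run on the compute node; replace 127.0.0.1 with the compute node address if querying remotely):

# (1) Config trace: does the agent see both the add and the delete from nova-compute?
curl -s 'http://127.0.0.1:8085/Snh_SandeshTraceRequest?x=Config'

# (2) One file per port, created/deleted by nova-compute:
ls -l /var/lib/contrail/ports/

# (3) KSync trace: did the agent send the vif request to vrouter, and was it acked?
curl -s 'http://127.0.0.1:8085/Snh_SandeshTraceRequest?x=KSync%20Interface'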

Regards,
Ashok

On 2/8/16, 9:19 PM, "Andriy Berestovskyy" <email address hidden> wrote:

Hi Ashok,
I just ran my old script and got the "orphan" vifs again, so the issue
is still there. Please find the script attached.

How do we check the messages Agent receives/sends?

Andriy

On Sun, Feb 7, 2016 at 9:32 AM, Ashok Singh R <email address hidden> wrote:
Hi Andriy,

I am working on bug
https://bugs.launchpad.net/juniperopenstack/+bug/1495002. One of the issues
mentioned in the bug is that the vif delete message gets lost. Do you have more
i...


Revision history for this message
Vinod Nair (vinodnair) wrote :

Still seeing this. I have attached the sandesh trace logs and vif list; there are no active VMs.

Revision history for this message
Ashok Singh (ashoksr) wrote :

Nova-compute does not send the delete to the agent when add/delete is performed continuously. A script to recreate the issue is attached. The script does add and delete of VMs repeatedly. (net-id has to be replaced appropriately when reusing the script.)

Some logs associated with the instance UUID of the interface for which the delete was not received are pasted below.

15522:2016-03-01 14:37:12.426 22068 AUDIT nova.compute.manager [req-4879a34c-ecc5-4a59-943c-d98d65b25eda None] [instance: 025fab58-a8e2-40a9-978c-bef78d1d1d3c] Terminating instance
15538:2016-03-01 14:37:13.079 22068 INFO nova.virt.libvirt.driver [-] [instance: 025fab58-a8e2-40a9-978c-bef78d1d1d3c] Instance destroyed successfully.
15650:2016-03-01 14:37:14.444 22068 DEBUG nova.compute.manager [-] [instance: 025fab58-a8e2-40a9-978c-bef78d1d1d3c] Synchronizing instance power state after lifecycle event "Started"; current vm_state: active, current task_state: None, current DB power_state: 1, VM power_state: 1 handle_lifecycle_event /usr/lib/python2.7/dist-packages/nova/compute/manager.py:1126
15835:2016-03-01 14:37:16.744 22068 INFO nova.compute.manager [-] [instance: 025fab58-a8e2-40a9-978c-bef78d1d1d3c] VM Stopped (Lifecycle Event)
15862:2016-03-01 14:37:18.640 22068 WARNING nova.compute.manager [-] [instance: 025fab58-a8e2-40a9-978c-bef78d1d1d3c] Instance shutdown by itself. Calling the stop API. Current vm_state: active, current task_state: None, current DB power_state: 4, current VM power_state: 4

Revision history for this message
Vinod Nair (vinodnair) wrote :

The issue is still seen in the 3.0 GA image as well.
The workaround to clean up the leaked vifs is as below (a shell sketch follows the steps):

1) Stop/delete all the VMs on the compute.
2) Delete the files in /var/lib/contrail/ports.
3) Reboot the compute.
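Rough shell rendering of that workaround (destructive: it removes every VM on the affected compute; the instance names are just the ones from this report):

nova delete CS8-VM1 DELETE-VM2        # delete every instance still on this compute
rm -f /var/lib/contrail/ports/*       # drop the stale per-port files left by nova-compute
reboot                                # restart the compute so vrouter comes up with a clean vif table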

Jeba Paulaiyan (jebap)
tags: added: blocker
information type: Proprietary → Public
Revision history for this message
Raja Sivaramakrishnan (raja-u) wrote :

This is a race condition in nova and is seen with the kernel-mode vrouter as well. It is a day-one issue, so it probably shouldn't be a blocker.

Jeba Paulaiyan (jebap)
tags: removed: blocker
tags: removed: beta-blocker