NFV: VM becomes unavailable after migration or resize.

Bug #1619565 reported by Alexander Koryagin
Affects: Mirantis OpenStack
Status: Invalid
Importance: High
Assigned to: Roman Podoliaka
Milestone: (no value listed)

Bug Description

NFV: VM becomes unavailable after migration or resize.

TESTS:

[VLAN] Check resize vm which uses cpu pinning (# 838339)
    https://mirantis.testrail.com/index.php?/tests/view/14598027
    http://paste.openstack.org/show/565235/

    Steps:
    1. Create net1 with subnet, net2 with subnet and router1 with interfaces to both nets
    2. Launch vm1 using m1.small.performance-1 flavor on compute-1 and vm2 on compute-2 with m1.small.old flavor.
    3. Resize vm1 to m1.small.performance-2
    4. Ping vm1 from vm2
    ERR:
    Exception: Can't connect to server

[VLAN] Check connectivity between VMs with flavor for 1 NUMA-node after cold migration on node with feature (# 838338)
    https://mirantis.testrail.com/index.php?/tests/view/14598028
    http://paste.openstack.org/show/565236/

    Steps:
    1. Create net1 with subnet, net2 with subnet and router1 with interfaces to both nets
    2. Launch vm1 using m1.small.performance flavor on compute-1 and vm2 on compute-2.
    3. Migrate vm1 from compute-1
    4. Check CPU Pinning
    5. Ping vm1 from vm2
    ERR:
    Exception: Can't connect to server
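
The "Check CPU Pinning" step above could be automated roughly as follows. This is only a sketch, not the check used in mos-integration-tests: it assumes the libvirt domain XML of the instance is already available as a string (for example from `virsh dumpxml`), and the function names and the sample XML are made up for illustration. It treats an instance as pinned when every vCPU has a `<vcpupin>` entry that points to exactly one host CPU and no two vCPUs share a host CPU.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a 2-vCPU domain with each vCPU pinned to one host CPU.
SAMPLE_DOMAIN_XML = """
<domain type='kvm'>
  <name>instance-00000001</name>
  <vcpu placement='static'>2</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='2'/>
    <vcpupin vcpu='1' cpuset='3'/>
  </cputune>
</domain>
"""


def parse_cpuset(cpuset):
    """Expand a cpuset string such as '0-2,5' into a set of host CPU ids.

    Exclusions written with '^' are not handled in this sketch.
    """
    cpus = set()
    for part in cpuset.split(","):
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def parse_vcpu_pinning(domain_xml):
    """Return {vcpu_id: set_of_host_cpus} read from <cputune>/<vcpupin>."""
    root = ET.fromstring(domain_xml)
    return {
        int(pin.get("vcpu")): parse_cpuset(pin.get("cpuset"))
        for pin in root.findall("./cputune/vcpupin")
    }


def is_strictly_pinned(domain_xml):
    """True if every vCPU is pinned to its own single host CPU."""
    root = ET.fromstring(domain_xml)
    vcpu_count = int(root.findtext("vcpu"))
    pinning = parse_vcpu_pinning(domain_xml)
    if len(pinning) != vcpu_count:
        return False
    if any(len(cpus) != 1 for cpus in pinning.values()):
        return False
    host_cpus = [next(iter(cpus)) for cpus in pinning.values()]
    return len(set(host_cpus)) == vcpu_count


if __name__ == "__main__":
    print(parse_vcpu_pinning(SAMPLE_DOMAIN_XML))
    print(is_strictly_pinned(SAMPLE_DOMAIN_XML))
```

Running the same check before and after the migration would show whether the pinning survived the move, independently of the connectivity result.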

Revision history for this message
Alexander Koryagin (akoryagin) wrote :
Changed in mos:
assignee: nobody → Sergey Nikitin (snikitin)
importance: Undecided → Medium
Changed in mos:
status: New → Confirmed
milestone: none → 9.2
tags: added: area-nova
description: updated
Changed in mos:
assignee: Sergey Nikitin (snikitin) → Roman Podoliaka (rpodolyaka)
Roman Podoliaka (rpodolyaka) wrote :

Alexander,

I checked the logs and do not see any obvious problems with Nova. As this is a connectivity issue, it would help a lot if you could hand an environment over to us for root cause analysis (RCA).

Sam Stoelinga (sammiestoel) wrote :

Is this only reproducible when CPU pinning is used and the flavor is changed?

Timur Nurlygayanov (tnurlygayanov) wrote :

The priority of this issue has been changed to High; we need to fix it.

Changed in mos:
importance: Medium → High
Roman Podoliaka (rpodolyaka) wrote :

Timur,

Could you please provide any reasoning behind that decision?

At this point we are not even sure it *is* reproduced at all. I asked Alexander to give it another try and hand an environment to us, should he hit the same error again (we don't have hardware with NUMA and can only emulate it in VMs).

Changed in mos:
status: Confirmed → Incomplete
assignee: Roman Podoliaka (rpodolyaka) → Timur Nurlygayanov (tnurlygayanov)
milestone: 9.2 → 9.1
Gregory Elkinbard (gelkinbard) wrote :

There are two types of workloads for NFV.
1) SR-IOV. These workloads cannot be migrated.
2) DPDK. These workloads can be migrated, and this is important for moving DPDK from experimental to production status.

In addition to NFV, I expect that pinning will be used for other important workloads such as databases, where both migration and resize would be very important.

If there is a question about the reproducibility of this bug, then please try to reproduce it; if it is reproducible, it needs to be fixed.

Timur Nurlygayanov (tnurlygayanov) wrote :

Roman, the priority is High because any migration or resize of VMs with NFV features enabled will break VM connectivity. This is not acceptable for production environments, and it will be an important issue for customers who use SR-IOV / DPDK.

Changed in mos:
status: Incomplete → Confirmed
assignee: Timur Nurlygayanov (tnurlygayanov) → MOS Nova (mos-nova)
Roman Podoliaka (rpodolyaka) wrote :

Gregory,

SR-IOV is a different problem. There is a patch series under review for upstream master that fixes cold migration (https://review.openstack.org/#/q/topic:bug/1512880+status:open), but those patches are not ready yet. Merging them into 9.1 at this point is risky (it's not even clear whether the patches help or not). We'll be working on getting them merged into 9.2, though.

Changed in mos:
assignee: MOS Nova (mos-nova) → Roman Podoliaka (rpodolyaka)
Changed in mos:
status: Confirmed → Incomplete
Roman Podoliaka (rpodolyaka) wrote :

I'm still waiting for an env to be prepared by the QA team.

I checked the logs from the recent failure (http://cz7776.bud.mirantis.net:8080/jenkins/job/9.x_SR-IOV_Ceph_baremetal/24/testReport/junit/mos_tests.nfv.test_cpu_pinning/TestCpuPinningResize/test_cpu_pinning_resize__838339__/) and everything seems to be fine from Nova's perspective: the instance becomes ACTIVE and the qemu process is started successfully (http://paste.openstack.org/show/573546/). Unfortunately, the instance's console logs are missing from the diagnostic snapshot, so it's not clear, for example, whether the VM received an IP address or not.

It's not clear why there is no connectivity to the instance. SSH access to the environment, so that the networking stack can be inspected, would probably help.

Roman Podoliaka (rpodolyaka) wrote :

Based on the Nova and Neutron logs, the VM was resized properly:

http://paste.openstack.org/show/573572/

^ the VM is restarted on the new host (node-5) and the port is correctly bound.

dnsmasq logs look suspicious, though:

http://paste.openstack.org/show/573571/

^ there is a DHCPDISCOVER when the VM starts and dnsmasq actually issues a DHCPOFFER, but for some reason the guest ignores it.

Unfortunately, console logs are missing, so it's not clear why that happens.
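
The pattern described above (a DHCPDISCOVER answered by a DHCPOFFER that the guest never follows up with a DHCPREQUEST) can be searched for in dnsmasq logs with a small script. The sketch below is illustrative only: the sample log lines, MAC addresses, and function names are invented, and it assumes the usual dnsmasq DHCP log shape of `EVENT(interface) [ip] mac`.

```python
import re
from collections import defaultdict

EVENT = re.compile(r"\b(DHCPDISCOVER|DHCPOFFER|DHCPREQUEST|DHCPACK)\b")
MAC = re.compile(r"\b(?:[0-9a-f]{2}:){5}[0-9a-f]{2}\b", re.IGNORECASE)

# Invented sample: the first client never answers its offers, the second
# completes the normal DISCOVER/OFFER/REQUEST/ACK exchange.
SAMPLE_LOG = """\
Sep  2 10:00:01 dnsmasq-dhcp[4242]: DHCPDISCOVER(tap0) fa:16:3e:00:00:01
Sep  2 10:00:01 dnsmasq-dhcp[4242]: DHCPOFFER(tap0) 10.0.0.5 fa:16:3e:00:00:01
Sep  2 10:00:04 dnsmasq-dhcp[4242]: DHCPDISCOVER(tap0) fa:16:3e:00:00:01
Sep  2 10:00:04 dnsmasq-dhcp[4242]: DHCPOFFER(tap0) 10.0.0.5 fa:16:3e:00:00:01
Sep  2 10:00:05 dnsmasq-dhcp[4242]: DHCPDISCOVER(tap0) fa:16:3e:00:00:02
Sep  2 10:00:05 dnsmasq-dhcp[4242]: DHCPOFFER(tap0) 10.0.0.6 fa:16:3e:00:00:02
Sep  2 10:00:05 dnsmasq-dhcp[4242]: DHCPREQUEST(tap0) 10.0.0.6 fa:16:3e:00:00:02
Sep  2 10:00:05 dnsmasq-dhcp[4242]: DHCPACK(tap0) 10.0.0.6 fa:16:3e:00:00:02 host-10-0-0-6
""".splitlines()


def dhcp_events_by_mac(lines):
    """Group DHCP event names by client MAC, in log order."""
    events = defaultdict(list)
    for line in lines:
        event = EVENT.search(line)
        mac = MAC.search(line)
        if event and mac:
            events[mac.group(0).lower()].append(event.group(1))
    return dict(events)


def unanswered_offers(lines):
    """Return MACs whose most recent DHCPOFFER was not followed by a DHCPREQUEST."""
    stalled = []
    for mac, events in dhcp_events_by_mac(lines).items():
        if "DHCPOFFER" not in events:
            continue
        last_offer = len(events) - 1 - events[::-1].index("DHCPOFFER")
        if "DHCPREQUEST" not in events[last_offer + 1:]:
            stalled.append(mac)
    return sorted(stalled)


if __name__ == "__main__":
    print(unanswered_offers(SAMPLE_LOG))
```

A MAC reported by `unanswered_offers` only shows that the server side did its part; the reason the guest did not respond still has to come from the guest itself, for example from its console log.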

tags: added: blocker-for-qa
Timur Nurlygayanov (tnurlygayanov) wrote :

QA env was provided, status changed to confirmed.

Changed in mos:
status: Incomplete → Confirmed
Changed in mos:
status: Confirmed → Incomplete
Roman Podoliaka (rpodolyaka) wrote :

I asked Kristina to leave a comment here: we did not manage to reproduce the problem on the environment.

I suggest we add logging to mos-integration-tests, so that we save the instance console log when the connectivity check fails.

I provided the results of my investigation above: from the Nova and Neutron standpoint the instance is booted correctly, but for some reason the guest OS ignores the DHCPOFFER from the dnsmasq DHCP server. Without the VM console logs it's not clear why. That's why we need a change to mos-integration-tests, as we are not able to reproduce the problem manually.

I ran a similar test case locally, and after all resizes the connectivity check succeeds. I suggest we add more logging and continue working on this in 9.2. At the same time, I don't see how this is a blocker for you, and there is an easy workaround: just reboot the instance (note that both resize and cold migration power the instance off anyway).
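
The suggestion above to save the instance console log when the connectivity check fails could look roughly like this. It is a sketch, not the actual mos-integration-tests code: `check` and `fetch_console_log` are hypothetical callables supplied by the test (the latter could, for instance, wrap python-novaclient's `servers.get_console_output`), and the file naming is an assumption.

```python
import os


def _save_console_log(fetch_console_log, instance_name, log_dir):
    """Write the console log to <log_dir>/<instance_name>-console.log.

    Returns the file path, or None if the log could not be fetched or saved.
    """
    try:
        os.makedirs(log_dir, exist_ok=True)
        path = os.path.join(log_dir, "{}-console.log".format(instance_name))
        with open(path, "w") as log_file:
            log_file.write(fetch_console_log())
        return path
    except Exception as dump_error:
        # Never hide the original connectivity failure behind a dump error.
        print("could not save console log: {}".format(dump_error))
        return None


def check_connectivity_with_console_dump(check, fetch_console_log,
                                         instance_name, log_dir):
    """Run check(); if it raises, save the console log and re-raise."""
    try:
        check()
    except Exception:
        _save_console_log(fetch_console_log, instance_name, log_dir)
        raise


if __name__ == "__main__":
    import tempfile

    def failing_check():
        raise RuntimeError("Can't connect to server")

    out_dir = tempfile.mkdtemp()
    try:
        check_connectivity_with_console_dump(
            failing_check, lambda: "guest boot output", "vm1", out_dir)
    except RuntimeError as error:
        print("check failed:", error)
        print("saved files:", os.listdir(out_dir))
```

Because the original exception is re-raised, the test still fails exactly as before; the only difference is that the console log is kept next to the other test artifacts.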

Kristina Berezovskaia (kkuznetsova) wrote :

I checked these cases manually on the environment and they work fine. Changing the status to Invalid.

Changed in mos:
status: Incomplete → Invalid