Changing a VM's fixed IPs leaves it unable to communicate with VMs in other networks

Bug #1512199 reported by yujie on 2015-11-02
This bug affects 1 person
Affects   Status   Importance   Assigned to   Milestone
neutron            Medium       Brian Haley
Kilo               Undecided    Unassigned

Bug Description

I am using DVR on Kilo with VXLAN. The environment looks like this:

vm2-2<- compute1 ----------vxlan------------- compute2 ->vm2-1
vm3-1<-

vm2-1<- net2 ---------router1--------- net3 ->vm3-1
vm2-2<-

vm2-1 (192.168.2.3) and vm2-2 (192.168.2.4) are in the same network (net2, 192.168.2.0/24) but on different compute nodes. vm3-1 is in net3 (192.168.3.0/24). net2 and net3 are connected by router1. All three VMs are in the default security group. No firewall is used.

1. Use the command below to change the IP of vm2-1:
neutron port-update portID --fixed-ip subnet_id=subnetID,ip_address=192.168.2.10 --fixed-ip subnet_id=subnetID,ip_address=192.168.2.20
Running "sudo udhcpc" in vm2-1 (CirrOS) to get an IP shows a correct DHCP message, but the IP does not change.
After rebooting vm2-1, its IP becomes 192.168.2.20.

2. vm2-2 can ping 192.168.2.20 successfully, but vm3-1 cannot.
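
For anyone scripting the reproduction, the command in step 1 can be assembled programmatically. This is only a minimal sketch: the helper name build_port_update is hypothetical, and portID and subnetID are the same placeholders used in the report, not real identifiers.

```python
def build_port_update(port_id, subnet_id, ip_addresses):
    """Return the argv list for a `neutron port-update` call that passes
    one --fixed-ip option per address, mirroring the command in step 1."""
    argv = ["neutron", "port-update", port_id]
    for ip in ip_addresses:
        argv += ["--fixed-ip", f"subnet_id={subnet_id},ip_address={ip}"]
    return argv


# Placeholders taken from the report; substitute real IDs before running.
cmd = build_port_update("portID", "subnetID", ["192.168.2.10", "192.168.2.20"])
print(" ".join(cmd))
```

The list form can be handed to subprocess.run without going through a shell.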

From packet captures and related information, the cause may be:
1. The new IP (192.168.2.20) and the MAC of vm2-1 were not written to the ARP cache in the router1 namespace on the compute1 node.
2. In DVR mode, the ARP request from the gateway port (192.168.2.1) on compute1 to vm2-1 is dropped by the flow table on compute2, so the ARP request (192.168.2.1->192.168.2.20) never reaches vm2-1.
3. For vm2-2, the ARP request (192.168.2.4->192.168.2.20) is not dropped, so it can connect to vm2-1.

In my opinion, if both new fixed IPs of vm2-1 (192.168.2.10 and 192.168.2.20) and its MAC were written to the ARP cache in the router1 namespace on the compute1 node, the problem would be resolved. But only one IP (192.168.2.10) and the MAC are written.

BTW, if only one fixed IP is set for vm2-1, it works fine. If two fixed IPs are set for vm2-1, the problem above most probably happens.
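
The suspected cause (only one of the two fixed IPs reaching the router namespace's ARP cache) can be checked mechanically. The sketch below assumes `arp -n` style output captured from the router namespace; the MAC addresses, the interface name, and the function names are made up for illustration and do not come from the reporter's environment.

```python
import re

MAC_RE = re.compile(r"^(?:[0-9a-f]{2}:){5}[0-9a-f]{2}$", re.IGNORECASE)


def resolved_ips(arp_n_output):
    """Map IP -> MAC for every row of `arp -n` output that carries a MAC.
    The header row and (incomplete) rows are skipped."""
    table = {}
    for line in arp_n_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and MAC_RE.match(fields[2]):
            table[fields[0]] = fields[2].lower()
    return table


def missing_fixed_ips(arp_n_output, fixed_ips, mac):
    """Return the fixed IPs that do not resolve to the port's MAC."""
    table = resolved_ips(arp_n_output)
    return [ip for ip in fixed_ips if table.get(ip) != mac.lower()]


# Hypothetical capture: only 192.168.2.10 was written for the VM's MAC.
SAMPLE = """\
Address HWtype HWaddress Flags Mask Iface
192.168.2.10 ether fa:16:3e:00:00:01 CM qr-aaaaaaaa-aa
192.168.2.4 ether fa:16:3e:00:00:02 CM qr-aaaaaaaa-aa
192.168.2.20 (incomplete) qr-aaaaaaaa-aa
"""

missing = missing_fixed_ips(
    SAMPLE, ["192.168.2.10", "192.168.2.20"], "fa:16:3e:00:00:01"
)
print(missing)
```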

yujie (16189455-d) on 2015-11-02
description: updated
Gary Kotton (garyk) on 2015-11-03
tags: added: l3-dvr-backlog

Will triage it and update the results.

Changed in neutron:
assignee: nobody → Swaminathan Vasudevan (swaminathan-vasudevan)

Not able to reproduce. I could see the ARP table update in the router namespaces on both nodes.

I tried modifying ports on both subnets, 10.2.0.X and 10.0.0.X.
In this example I changed 10.2.0.4 to 10.2.0.25 and 10.0.0.8 to 10.0.0.20. In both cases I saw that the ARP entry was updated.

One thing is true in both our tests: the VM is not able to get the new IP until I reboot the VM. (This might be filed as a different bug in nova.)

ARP output from Node 2:
root@ubuntu-new-compute:~/devstack# arp -a
? (10.2.0.4) at fa:16:3e:7e:0b:48 [ether] PERM on qr-b25bad4f-5f
? (10.0.0.6) at fa:16:3e:7a:78:fe [ether] PERM on qr-66c29926-29
? (10.2.0.3) at fa:16:3e:b6:19:da [ether] PERM on qr-b25bad4f-5f
? (10.0.0.2) at fa:16:3e:91:1a:d2 [ether] PERM on qr-b2b8c9a4-68
? (10.0.0.2) at fa:16:3e:91:1a:d2 [ether] PERM on qr-66c29926-29
? (10.0.0.6) at fa:16:3e:7a:78:fe [ether] PERM on qr-b2b8c9a4-68
? (10.2.0.25) at fa:16:3e:7e:0b:48 [ether] PERM on qr-b25bad4f-5f ( changed arp info)
? (10.0.0.7) at fa:16:3e:5d:12:fd [ether] PERM on qr-66c29926-29
? (10.2.0.2) at fa:16:3e:b6:84:91 [ether] PERM on qr-b25bad4f-5f
? (10.0.0.8) at fa:16:3e:a1:cc:87 [ether] PERM on qr-66c29926-29
? (10.0.0.8) at fa:16:3e:a1:cc:87 [ether] PERM on qr-b2b8c9a4-68
? (10.0.0.20) at fa:16:3e:a1:cc:87 [ether] PERM on qr-66c29926-29 ( changed arp info)
? (10.0.0.7) at fa:16:3e:5d:12:fd [ether] PERM on qr-b2b8c9a4-68
? (10.0.0.3) at fa:16:3e:fd:a1:d6 [ether] PERM on qr-66c29926-29
root@ubuntu-new-compute:~/devstack#

ARP Info from Node 1:
root@ubuntu-ctlr:~/devstack# arp -a
? (10.0.0.3) at fa:16:3e:fd:a1:d6 [ether] PERM on qr-66c29926-29
? (10.2.0.3) at fa:16:3e:b6:19:da [ether] PERM on qr-b25bad4f-5f
? (10.0.0.7) at fa:16:3e:5d:12:fd [ether] PERM on qr-66c29926-29
? (10.0.0.2) at fa:16:3e:91:1a:d2 [ether] PERM on qr-b2b8c9a4-68
? (10.0.0.2) at fa:16:3e:91:1a:d2 [ether] PERM on qr-66c29926-29
? (10.2.0.4) at fa:16:3e:7e:0b:48 [ether] PERM on qr-b25bad4f-5f
? (10.0.0.6) at fa:16:3e:7a:78:fe [ether] PERM on qr-66c29926-29
? (10.2.0.25) at fa:16:3e:7e:0b:48 [ether] PERM on qr-b25bad4f-5f
? (10.2.0.5) at <incomplete> on qr-b25bad4f-5f
? (10.0.0.5) at <incomplete> on qr-66c29926-29
? (10.0.0.20) at fa:16:3e:a1:cc:87 [ether] PERM on qr-66c29926-29
? (10.0.0.8) at fa:16:3e:a1:cc:87 [ether] PERM on qr-66c29926-29
? (10.2.0.2) at fa:16:3e:b6:84:91 [ether] PERM on qr-b25bad4f-5f
? (10.0.0.4) at <incomplete> on qr-66c29926-29
root@ubuntu-ctlr:~/devstack#
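
To spot which ports were changed in listings like the two above, the entries can be grouped by MAC address. This is a rough sketch, assuming the `arp -a` line layout shown in this comment; the sample lines are copied from the outputs above, and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Matches lines such as "? (10.2.0.25) at fa:16:3e:7e:0b:48 [ether] PERM on qr-...".
ARP_A_RE = re.compile(
    r"\((?P<ip>[\d.]+)\) at (?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2}) ",
    re.IGNORECASE,
)


def ips_by_mac(arp_a_output):
    """Group resolved IPs by MAC; <incomplete> rows do not match and are skipped."""
    groups = defaultdict(set)
    for line in arp_a_output.splitlines():
        match = ARP_A_RE.search(line)
        if match:
            groups[match.group("mac").lower()].add(match.group("ip"))
    return groups


def macs_with_multiple_ips(arp_a_output):
    """Return only the MACs that appear with more than one IP."""
    return {
        mac: sorted(ips)
        for mac, ips in ips_by_mac(arp_a_output).items()
        if len(ips) > 1
    }


SAMPLE = """\
? (10.2.0.4) at fa:16:3e:7e:0b:48 [ether] PERM on qr-b25bad4f-5f
? (10.2.0.3) at fa:16:3e:b6:19:da [ether] PERM on qr-b25bad4f-5f
? (10.2.0.25) at fa:16:3e:7e:0b:48 [ether] PERM on qr-b25bad4f-5f
? (10.0.0.8) at fa:16:3e:a1:cc:87 [ether] PERM on qr-66c29926-29
? (10.0.0.20) at fa:16:3e:a1:cc:87 [ether] PERM on qr-66c29926-29
? (10.2.0.5) at <incomplete> on qr-b25bad4f-5f
"""

changed = macs_with_multiple_ips(SAMPLE)
print(changed)
```

In both listings the old and the new address appear side by side for each changed port, which is what this grouping surfaces.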

Changed in neutron:
status: New → Invalid

Please let me know if you have further questions on this. This was tested on top of the branch today.

Kevin Benton (kevinbenton) wrote :

Swami, when you say the VM was not able to get a new IP, do you mean that its DHCP client could not get a lease? If so, that sounds like we might have a bug on the Neutron side.

yujie (16189455-d) wrote :

The DHCP agent gets this change, and the file /var/lib/neutron/dhcp/uuid/host changes as we expect.
When the VM sends a DHCP discover, the DHCP server offers the correct new IP to the VM, but the VM does not change its IP.

Besides, I tested the bug as described 3 times, and the MAC was not written to the ARP cache in two of them.

yujie (16189455-d) wrote :

If only one fixed IP is assigned, the bug won't happen.

Kevin, yes, I was not waiting long enough for the DHCP lease to time out and refresh. I thought the expected behaviour was that a change to the VM port's fixed IP should be reflected in the running instance, but that may be tied to the DHCP lease.

yujie (16189455-d) wrote :

1. If DHCP is not enabled in a subnet, a VM created in that subnet gets no IP.
2. If DHCP is then enabled in the subnet and the VM runs "sudo udhcpc", the VM still gets no IP.

This situation has nothing to do with the DHCP lease.

Kevin Benton (kevinbenton) wrote :

@Swami, yeah, we don't currently have anything that can force the VM to refresh its IP address.

yujie (16189455-d) wrote :

@Swaminathan, I used code from stable/kilo. Maybe it differs from yours.
When I change the vm1 fixed IP from 10.0.20.4 to 10.0.20.22+10.0.20.24, the ARP cache on the local compute node A is:
# ip netns exec qrouter-ddda3633-da7e-4fb1-9bc9-7f762a40f962 arp -n
Address HWtype HWaddress Flags Mask Iface
10.0.30.11 ether fa:16:3e:4c:3c:2e CM qr-08365e9f-76
10.0.20.22 ether fa:16:3e:38:bb:ea CM qr-8cb7b03f-77
10.0.20.7 ether fa:16:3e:43:06:0f CM qr-8cb7b03f-77
10.0.30.2 (incomplete) qr-08365e9f-76
10.0.30.10 ether fa:16:3e:1f:50:9d CM qr-08365e9f-76
10.0.20.6 ether fa:16:3e:7a:82:03 CM qr-8cb7b03f-77
10.0.20.4 ether fa:16:3e:38:bb:ea CM qr-8cb7b03f-77
10.0.20.2 ether fa:16:3e:8b:c3:4b CM qr-8cb7b03f-77
10.0.20.24 ether fa:16:3e:38:bb:ea C qr-8cb7b03f-77

The ARP cache on another compute node B is:
# ip netns exec qrouter-ddda3633-da7e-4fb1-9bc9-7f762a40f962 arp -n
Address HWtype HWaddress Flags Mask Iface
10.0.30.11 ether fa:16:3e:4c:3c:2e CM qr-08365e9f-76
10.0.20.24 (incomplete) qr-8cb7b03f-77
10.0.20.6 ether fa:16:3e:7a:82:03 CM qr-8cb7b03f-77
10.0.20.4 ether fa:16:3e:38:bb:ea CM qr-8cb7b03f-77
10.0.20.22 ether fa:16:3e:38:bb:ea CM qr-8cb7b03f-77
10.0.20.2 ether fa:16:3e:8b:c3:4b CM qr-8cb7b03f-77
10.0.20.20 (incomplete) qr-8cb7b03f-77
10.0.30.10 ether fa:16:3e:1f:50:9d CM qr-08365e9f-76
10.0.20.7 ether fa:16:3e:43:06:0f CM qr-8cb7b03f-77
10.0.20.10 (incomplete) qr-8cb7b03f-77
10.0.30.2 (incomplete) qr-08365e9f-76

The MAC for IP 10.0.20.24 could not be learnt by ARP request in DVR. So a VM from another subnet on compute B could not connect to vm1, but a VM in the same subnet as vm1 on compute B could connect to vm1.
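
The difference between the two nodes is easier to see when each row is reduced to a state. In net-tools `arp` output the C flag marks a complete entry and M marks a permanent one, so "CM" is a statically written entry and a bare "C" is one that was learnt dynamically. The sketch below is an illustration only: the function name and state labels are hypothetical, and the sample rows are a subset of the two tables above.

```python
def classify(arp_n_output):
    """Return {ip: 'permanent' | 'learned' | 'incomplete'} for `arp -n` rows."""
    states = {}
    for line in arp_n_output.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0] == "Address":
            continue  # blank line or header row
        if fields[1] == "(incomplete)":
            states[fields[0]] = "incomplete"
        elif len(fields) >= 5:
            # fields: address, hwtype, hwaddress, flags, iface (mask is empty)
            states[fields[0]] = "permanent" if "M" in fields[3] else "learned"
    return states


NODE_A = """\
Address HWtype HWaddress Flags Mask Iface
10.0.20.22 ether fa:16:3e:38:bb:ea CM qr-8cb7b03f-77
10.0.20.4 ether fa:16:3e:38:bb:ea CM qr-8cb7b03f-77
10.0.20.24 ether fa:16:3e:38:bb:ea C qr-8cb7b03f-77
"""

NODE_B = """\
Address HWtype HWaddress Flags Mask Iface
10.0.20.24 (incomplete) qr-8cb7b03f-77
10.0.20.4 ether fa:16:3e:38:bb:ea CM qr-8cb7b03f-77
10.0.20.22 ether fa:16:3e:38:bb:ea CM qr-8cb7b03f-77
"""

for name, output in (("A", NODE_A), ("B", NODE_B)):
    states = classify(output)
    print(name, {ip: states.get(ip) for ip in ("10.0.20.22", "10.0.20.24")})
```

On this data only 10.0.20.22 has a permanent entry on both nodes; 10.0.20.24 is merely learnt on node A and unresolved on node B.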

Besides, if I assign one more IP address to the VM using allowed_address_pairs, the result is the same: the newly assigned IP cannot be used.

yujie (16189455-d) on 2015-11-10
Changed in neutron:
status: Invalid → New

I have tested it on the master branch and was not able to reproduce it. But if this is seen in kilo, then we need to tag it as a kilo bug.
I will try to triage this in kilo and will update the bug report based on my findings.

Miguel Angel Ajo (mangelajo) wrote :

Thanks Swami, ping me via IRC if you end up needing the bug targeted to kilo.

Changed in neutron:
importance: Undecided → Medium
yujie (16189455-d) wrote :

Thanks Swaminathan. When you test in kilo, could you also test allowed_address_pairs? The behaviour looks the same.

John Schwarz (jschwarz) wrote :

I can confirm this is happening in stable/kilo.

Changed in neutron:
status: New → Confirmed
John Schwarz (jschwarz) on 2015-12-31
Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → John Schwarz (jschwarz)
John Schwarz (jschwarz) wrote :

This reproduced for me on a multinode deployment of the latest master. I will try to reproduce it again next week and work out a simplified test case in which it reproduces consistently.

Fix proposed to branch: master
Review: https://review.openstack.org/263772

Changed in neutron:
status: Confirmed → In Progress
John Schwarz (jschwarz) wrote :

The patch fixes the upstream master issue I noticed last week. I have yet to see if it fixes the case for stable/kilo.

John Schwarz (jschwarz) wrote :

The patch I posted fixes the case for stable/kilo (as far as manual testing shows).

yujie (16189455-d) wrote :

Thanks John.
When a DVR kilo environment is available in our company, I will test.

Changed in neutron:
assignee: John Schwarz (jschwarz) → Swaminathan Vasudevan (swaminathan-vasudevan)
Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Brian Haley (brian-haley)

Reviewed: https://review.openstack.org/263772
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5535a71e753d7c6ef679437ee93faffc6bc31f62
Submitter: Jenkins
Branch: master

commit 5535a71e753d7c6ef679437ee93faffc6bc31f62
Author: John Schwarz <email address hidden>
Date: Tue Jan 5 17:21:30 2016 +0200

    DVR: when updating port's fixed_ips, update arp

    Currently, when updating a port's fixed_ips, the l3 agents fail to
    update the arp tables of this change, which can lead to east-west
    connectivity issues when a router is connected to more than one tenant
    network.

    Closes-Bug: #1512199
    Change-Id: Ic7a4bbfca8b477c41b233235d2e2a2864f7af411
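
The idea in the commit message can be illustrated by diffing a port's old and new fixed_ips and deriving the ARP entries that would need to be added or removed. This is a sketch of the concept only, not the merged patch: the function name and the ("add"/"del", ip, mac) tuple format are assumptions, and the addresses are taken from the stable/kilo reproduction earlier in this report.

```python
def arp_ops_for_fixed_ip_change(mac, old_fixed_ips, new_fixed_ips):
    """Return ('add' | 'del', ip, mac) tuples describing how a router's ARP
    table would have to change when a port's fixed_ips are updated."""
    old = {entry["ip_address"] for entry in old_fixed_ips}
    new = {entry["ip_address"] for entry in new_fixed_ips}
    ops = [("add", ip, mac) for ip in sorted(new - old)]
    ops += [("del", ip, mac) for ip in sorted(old - new)]
    return ops


# vm1's change from the report: 10.0.20.4 -> 10.0.20.22 + 10.0.20.24.
ops = arp_ops_for_fixed_ip_change(
    "fa:16:3e:38:bb:ea",
    [{"ip_address": "10.0.20.4"}],
    [{"ip_address": "10.0.20.22"}, {"ip_address": "10.0.20.24"}],
)
for op in ops:
    print(op)
```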

Changed in neutron:
status: In Progress → Fix Released

This issue was fixed in the openstack/neutron 8.0.0.0b2 development milestone.

Reviewed: https://review.openstack.org/270459
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=df0609f923765b08a50b7a3fc688a16762f90d6e
Submitter: Jenkins
Branch: stable/liberty

commit df0609f923765b08a50b7a3fc688a16762f90d6e
Author: John Schwarz <email address hidden>
Date: Tue Jan 5 17:21:30 2016 +0200

    DVR: when updating port's fixed_ips, update arp

    Currently, when updating a port's fixed_ips, the l3 agents fail to
    update the arp tables of this change, which can lead to east-west
    connectivity issues when a router is connected to more than one tenant
    network.

    Conflicts:
     neutron/db/l3_dvr_db.py
     neutron/db/l3_dvrscheduler_db.py

    Closes-Bug: #1512199
    Change-Id: Ic7a4bbfca8b477c41b233235d2e2a2864f7af411
    (cherry picked from commit 5535a71e753d7c6ef679437ee93faffc6bc31f62)

tags: added: in-stable-liberty

This issue was fixed in the openstack/neutron 7.0.3 release.

Reviewed: https://review.openstack.org/270463
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=38dc0818f20f6b0995618ad151090f2e645ea7e5
Submitter: Jenkins
Branch: stable/kilo

commit 38dc0818f20f6b0995618ad151090f2e645ea7e5
Author: John Schwarz <email address hidden>
Date: Tue Jan 5 17:21:30 2016 +0200

    DVR: when updating port's fixed_ips, update arp

    Currently, when updating a port's fixed_ips, the l3 agents fail to
    update the arp tables of this change, which can lead to east-west
    connectivity issues when a router is connected to more than one tenant
    network.

    Conflicts:
     neutron/db/l3_dvr_db.py
     neutron/db/l3_dvrscheduler_db.py

    Closes-Bug: #1512199
    Change-Id: Ic7a4bbfca8b477c41b233235d2e2a2864f7af411
    (cherry picked from commit 5535a71e753d7c6ef679437ee93faffc6bc31f62)

tags: added: in-stable-kilo
John Schwarz (jschwarz) wrote :

@yujie, the patch that was merged 30 minutes ago (https://review.openstack.org/#/c/270463/) fixes this problem for stable/kilo (the version for which the original bug report was opened). Please let us know if there are more issues with this :)

yujie (16189455-d) wrote :

Thanks John, this patch helps a lot. I do not have a test environment available now, so I will try this patch at the next upgrade of our shared environment.
From reviewing the code, it seems a change in fixed_ips will trigger dvr_vmarp_table_update. What about a port updating its allowed_address_pairs property?
: )
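
To make the question concrete, the sketch below collects every (IP, MAC) pair a port could answer for, combining fixed_ips with allowed_address_pairs. It assumes the port dictionary layout of the Networking API, where an allowed address pair may omit mac_address and then uses the port's own MAC. Whether the merged patch covers these pairs is not stated in this report; the function name and sample values are hypothetical.

```python
def port_ip_mac_pairs(port):
    """Return the set of (ip, mac) pairs implied by a port's fixed_ips
    and allowed_address_pairs."""
    port_mac = port["mac_address"]
    pairs = {(entry["ip_address"], port_mac) for entry in port.get("fixed_ips", [])}
    for pair in port.get("allowed_address_pairs", []):
        # A pair without its own MAC falls back to the port's MAC.
        pairs.add((pair["ip_address"], pair.get("mac_address") or port_mac))
    return pairs


# Hypothetical port: one fixed IP plus one extra address via allowed_address_pairs.
port = {
    "mac_address": "fa:16:3e:38:bb:ea",
    "fixed_ips": [{"subnet_id": "subnetID", "ip_address": "10.0.20.22"}],
    "allowed_address_pairs": [{"ip_address": "10.0.20.30"}],
}
print(sorted(port_ip_mac_pairs(port)))
```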

This issue was fixed in the openstack/neutron 2015.1.4 release.

