L3-agent restart causes VM connectivity loss

Bug #1519926 reported by Stephen Ma
This bug affects 1 person
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: Hong Hui Xiao

Bug Description

L3-agent restart causes VM connectivity loss

To test whether the L3-agent on a network node can recover after it is stopped and then restarted, I ran this test on a devstack setup using the latest neutron code on the master branch. The L3-agent is running in legacy mode.

1. Create a network, subnetwork.
2. Create a router, tie the router to the subnetwork and the external network.
3. Create a VM using the network and assign a floating IP to the VM. The VM can be pinged and ssh'ed using the floating IP.
4. On the controller node, kill the L3 agent.
5. Delete the qrouter namespace of the router created in (2) on the controller node.
6. Start up the L3-agent again.
7. Now the VM can no longer be ssh'ed using the FIP.

Connectivity to the VM is lost because the L3-agent failed to reconstruct all the interfaces in the qrouter namespace. For example:

Before running steps 4-6, the qrouter namespace on the controller node looks like this (router-id=e86b277a-5f49-4fcb-8d85-241594db418e, VM's FIP=10.127.10.5):
stack@Ubuntu-38:~/DEVSTACK/demo$ sudo ip netns exec qrouter-e86b277a-5f49-4fcb-8d85-241594db418e ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
33: qr-50b99abf-a4: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
    link/ether fa:16:3e:17:3e:b0 brd ff:ff:ff:ff:ff:ff
    inet 10.1.2.1/24 brd 10.1.2.255 scope global qr-50b99abf-a4
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe17:3eb0/64 scope link
       valid_lft forever preferred_lft forever
34: qg-3d1a888a-33: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
    link/ether fa:16:3e:60:9a:43 brd ff:ff:ff:ff:ff:ff
    inet 10.127.10.4/24 brd 10.127.10.255 scope global qg-3d1a888a-33
       valid_lft forever preferred_lft forever
    inet 10.127.10.5/32 brd 10.127.10.5 scope global qg-3d1a888a-33
       valid_lft forever preferred_lft forever
    inet6 2001:db8::3/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe60:9a43/64 scope link
       valid_lft forever preferred_lft forever

After deleting the qrouter-e86b277a-5f49-4fcb-8d85-241594db418e namespace and then restarting the L3-agent on the controller node, the L3-agent did recreate the namespace. However, not all the interfaces and IP addresses were recreated:

stack@Ubuntu-38:~/DEVSTACK/demo$ sudo ip netns exec qrouter-e86b277a-5f49-4fcb-8d85-241594db418e ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever

So the VM can't be reached over ssh because the required plumbing (the qr- and qg- interfaces and their addresses) is not re-created.
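
The difference between the two "ip a" listings above can also be checked mechanically. The following is a minimal Python sketch, not part of the original report. The helper names parse_ip_addr and missing_after_restart are hypothetical, the sample text is abbreviated from the outputs above, and the parser assumes the "ip a" output format shown in this report.

```python
import re

# Interface header, e.g. "33: qr-50b99abf-a4: <BROADCAST,UP,LOWER_UP> mtu 1500 ..."
IFACE_RE = re.compile(r"^(\d+):\s+([^:@\s]+)[^:]*:\s+<")
# Address line, e.g. "    inet 10.1.2.1/24 brd ..." or "    inet6 ::1/128 scope host"
ADDR_RE = re.compile(r"^\s+inet6?\s+(\S+)")


def parse_ip_addr(output):
    """Map each interface name to the list of addresses shown under it."""
    interfaces = {}
    current = None
    for line in output.splitlines():
        match = IFACE_RE.match(line)
        if match:
            current = match.group(2)
            interfaces[current] = []
            continue
        match = ADDR_RE.match(line)
        if match and current is not None:
            interfaces[current].append(match.group(1))
    return interfaces


def missing_after_restart(before, after):
    """Return interfaces/addresses present in `before` but absent in `after`."""
    old, new = parse_ip_addr(before), parse_ip_addr(after)
    missing = {}
    for name, addrs in old.items():
        lost = [a for a in addrs if a not in new.get(name, [])]
        if name not in new or lost:
            missing[name] = lost
    return missing


# Abbreviated copies of the two listings in this report.
BEFORE = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
33: qr-50b99abf-a4: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
    inet 10.1.2.1/24 brd 10.1.2.255 scope global qr-50b99abf-a4
34: qg-3d1a888a-33: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
    inet 10.127.10.4/24 brd 10.127.10.255 scope global qg-3d1a888a-33
    inet 10.127.10.5/32 brd 10.127.10.5 scope global qg-3d1a888a-33
"""

AFTER = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
"""

missing = missing_after_restart(BEFORE, AFTER)
for name, addrs in missing.items():
    print(name, addrs)
```

For the abbreviated sample, the result contains the qr- and qg- interfaces with their addresses (including the floating IP 10.127.10.5/32) and omits lo, which matches the observation that only the loopback interface survives the restart.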

When the L3 agent is running in dvr-snat mode on the controller and dvr mode on the compute node, if I do steps 4-6 on the compute node, the VM can no longer be reached over ssh either. The qrouter namespace there is also missing the needed interfaces.

Manjeet Singh Bhatia (manjeet-s-bhatia) wrote :

Hi, I tried this on master. I am able to ping and ssh if I:
- bring down the l3-agent
- delete the qrouter namespace
- restart the l3-agent

I was able to ping and ssh after the l3-agent restart, but the weird thing I am noticing is the state of the qrouter namespace after it recovered.

After recovery, ip a in the qrouter namespace shows http://paste.openstack.org/show/480037/

Before I stopped the l3-agent it was http://paste.openstack.org/show/480036/

The only issue I see here is that it does not completely update the new router namespace.

l3-agent logs:

http://paste.openstack.org/show/480038/

Manjeet Singh Bhatia (manjeet-s-bhatia) wrote :

I am using a single-node devstack, where I was able to ping and ssh the VM
after bringing down the l3-agent, deleting the qrouter namespace and restarting the l3-agent.

Hong Hui Xiao (xiaohhui) wrote :

I can reproduce it in my devstack with the latest code. I will look into it.

Changed in neutron:
assignee: nobody → Hong Hui Xiao (xiaohhui)
Changed in neutron:
status: New → Confirmed
importance: Undecided → Medium
Changed in neutron:
importance: Medium → High
tags: added: l3-ipam-dhcp
Stephen Ma (stephen-ma) wrote :

I tried to reproduce this problem by rebooting the network node instead of killing the L3-agent and deleting the qrouter namespace. I cannot reproduce the problem. This time I am able to ping and ssh the VMs after the L3-agent is started up again.

The steps taken are:

Repeat steps 1-3. The node has the neutron-server, neutron-dhcp-agent, neutron-metadata-agent, neutron-openvswitch-agent, as well as the neutron-l3-agent running.

Then
4. reboot the node.
5. After the node comes back up, start all the neutron components that were previously running:
    a. start the neutron-server,
    b. start the neutron-openvswitch-agent
    c. start the neutron-dhcp-agent
    d. start the neutron-metadata-agent
    e. start the neutron-l3-agent.
6. Afterwards, ping and ssh to the VM works. The qrouter namespace has the same interfaces and IP addresses as before the node reboot.

Carl Baldwin (carl-baldwin) wrote :

I'm going to reduce this severity to Medium based on Stephen's report. To me, it feels a bit contrived though I agree that the L3 agent should handle this case.

Changed in neutron:
importance: High → Medium
Hong Hui Xiao (xiaohhui) wrote :

Sorry I missed the discussion about this bug on IRC last night. I did a basic investigation last week. I think this bug is an avoidable issue; the root cause might be that some ovs commands don't take effect when rebuilding the router's ports. I have not had time to continue working on it since that investigation.
Since the severity is lowered, I will work on it next week.

Hong Hui Xiao (xiaohhui) wrote :

I just tested it in a kilo env following the steps in the description, and it has the same problem. The log is at [1], which is the same as what @manjeet-s-bhatia posted in comment #1.

So, @stephen-ma, can you confirm that it will not happen in kilo?

[1] http://paste.openstack.org/show/481028/

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/254579

Changed in neutron:
status: Confirmed → In Progress
Stephen Ma (stephen-ma) wrote :

@xiaohhui Yes, the same problem is reproduced with stable/kilo.

description: updated
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/254579
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8b7e5997dae54a03ad2850c43b7070bc00c90273
Submitter: Jenkins
Branch: master

commit 8b7e5997dae54a03ad2850c43b7070bc00c90273
Author: Hong Hui Xiao <email address hidden>
Date: Tue Dec 8 01:17:54 2015 -0500

    Separate the command for replace_port to delete and add

    When a port has been added to router namespace, trying to replace the
    port by adding del-port and add-port in one command, will not bring
    the new port to kernel. Even if the port is updated in ovs db and can
    be found on br-int, system can't see the new port. This will break
    the following actions, which will manipulate the new port by ip
    commands. A mail list has been filed to discuss this issue at [1].

    The problem here will break the scenario that namespace is deleted
    unexpectedly, and l3-agent tries to rebuild the namespace at restart.

    Separating replace_port to two commands: del-port and add-port,
    matches the original logic and has been verified that it can resolve
    the problem here.

    [1] http://openvswitch.org/pipermail/discuss/2015-December/019667.html

    Change-Id: If36bcf5a0cccb667f3087aea1e9ea9f20eb3a563
    Closes-Bug: #1519926
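
To illustrate the change the commit message describes, here is a minimal Python sketch that only builds the ovs-vsctl argument lists for both variants. It is not the neutron code: the function names are hypothetical, the port name is taken from the bug description, and type=internal is an assumption about how the router port is configured. Nothing is executed, so it can be run without Open vSwitch installed.

```python
import shlex


def combined_replace_port(bridge, port):
    """One ovs-vsctl invocation: del-port and add-port in a single command.

    This mirrors the pattern the commit message reports as problematic.
    """
    return [[
        "ovs-vsctl",
        "--", "--if-exists", "del-port", bridge, port,
        "--", "add-port", bridge, port,
        # Assumption: the router port is an OVS internal port.
        "--", "set", "Interface", port, "type=internal",
    ]]


def separate_replace_port(bridge, port):
    """Two ovs-vsctl invocations: delete first, then add.

    This mirrors the approach the commit message says resolved the problem.
    """
    return [
        ["ovs-vsctl", "--", "--if-exists", "del-port", bridge, port],
        [
            "ovs-vsctl",
            "--", "add-port", bridge, port,
            # Assumption: the router port is an OVS internal port.
            "--", "set", "Interface", port, "type=internal",
        ],
    ]


def render(commands):
    """Format argv lists as shell-quoted command lines for inspection."""
    return [" ".join(shlex.quote(arg) for arg in argv) for argv in commands]


BRIDGE = "br-int"          # integration bridge named in the commit message
PORT = "qr-50b99abf-a4"    # router port name from the bug description

print("combined:")
for line in render(combined_replace_port(BRIDGE, PORT)):
    print("  " + line)

print("separate:")
for line in render(separate_replace_port(BRIDGE, PORT)):
    print("  " + line)
```

The only difference between the two variants is whether the delete and the add reach ovs-vsctl as one command or as two. According to the commit message, it is the single-command form that leaves the new port invisible to the kernel.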

Changed in neutron:
status: In Progress → Fix Released
Thierry Carrez (ttx) wrote : Fix included in openstack/neutron 8.0.0.0b2

This issue was fixed in the openstack/neutron 8.0.0.0b2 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/270271

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/270271
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=54f8819935fa2439220704cb128d1e4335422dd0
Submitter: Jenkins
Branch: stable/liberty

commit 54f8819935fa2439220704cb128d1e4335422dd0
Author: Hong Hui Xiao <email address hidden>
Date: Tue Dec 8 01:17:54 2015 -0500

    Separate the command for replace_port to delete and add

    When a port has been added to router namespace, trying to replace the
    port by adding del-port and add-port in one command, will not bring
    the new port to kernel. Even if the port is updated in ovs db and can
    be found on br-int, system can't see the new port. This will break
    the following actions, which will manipulate the new port by ip
    commands. A mail list has been filed to discuss this issue at [1].

    The problem here will break the scenario that namespace is deleted
    unexpectedly, and l3-agent tries to rebuild the namespace at restart.

    Separating replace_port to two commands: del-port and add-port,
    matches the original logic and has been verified that it can resolve
    the problem here.

    [1] http://openvswitch.org/pipermail/discuss/2015-December/019667.html

    Conflicts:
     neutron/tests/functional/agent/l3/test_legacy_router.py

    Change-Id: If36bcf5a0cccb667f3087aea1e9ea9f20eb3a563
    Closes-Bug: #1519926
    (cherry picked from commit 8b7e5997dae54a03ad2850c43b7070bc00c90273)

tags: added: in-stable-liberty
Thierry Carrez (ttx) wrote : Fix included in openstack/neutron 7.1.0

This issue was fixed in the openstack/neutron 7.1.0 release.
