VPNaaS: Active VPN connection goes down after controller shutdown/start

Bug #1506794 reported by Elena Ezhova
This bug affects 1 person
Affects: neutron
Status: Fix Released
Importance: Low
Assigned to: Elena Ezhova

Bug Description

Ubuntu 14.04 + OpenSwan 1:2.6.38-1

Environment with 3 controllers and 2 computes

Steps to reproduce:
1. Create a VPN connection between tenant1 and tenant2 and check that it is active
2. Find the controller on which one of the routers participating in the VPN connection is scheduled (tenant1's router, for example)
3. Shut down this controller, wait a while, and check that tenant1's router is rescheduled successfully and the VPN connection is restored
4. Start the controller that was shut down and wait until it has booted completely
5. Reschedule tenant1's router back to its original controller (the one that was shut down and started), wait a while, and check that tenant1's router is rescheduled successfully and the VPN connection is restored

Actual result: tenant1's router is rescheduled and VMs can ping external hosts, but the VPN connection goes to the DOWN state on tenant1's side, with the following error in vpn-agent.log on the controller to which tenant1's router was rescheduled back in step 5: http://paste.openstack.org/show/476459/

Analysis:
Pluto processes run in the qrouter namespace (or the snat namespace in the case of DVR). When a controller is shut down, all namespaces are deleted (as they are stored in tmpfs), but the pluto .pid and .ctl files remain, because they are stored in /opt/stack/data/neutron/ipsec/<router-id>/var/run/.

Then, when the router is rescheduled back to the original controller, the vpn agent attempts to start the pluto process, and pluto fails when it finds that a .pid file already exists. This behavior is determined by the flags pluto uses to open the file [1],[2], and is most probably a defense against accidentally overwriting the .pid file.
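To illustrate the failure mode described above, here is a minimal Python sketch of an exclusive-create lock file. It assumes the flags in question are O_CREAT | O_EXCL, which is what the linked lines appear to use; the function name create_lock and the file name pluto.pid are used here for illustration only, and this is not pluto's actual code.

```python
import os
import tempfile


def create_lock(pid_path):
    """Try to create a pid file exclusively; return True on success.

    With O_CREAT | O_EXCL the open fails if the file already exists,
    even when the process that wrote it is long gone.
    """
    try:
        fd = os.open(pid_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write("%d\n" % os.getpid())
    return True


with tempfile.TemporaryDirectory() as run_dir:
    # Hypothetical stand-in for <router-id>/var/run/pluto.pid
    pid_path = os.path.join(run_dir, "pluto.pid")
    first_start = create_lock(pid_path)   # fresh directory: lock is created
    second_start = create_lock(pid_path)  # stale file left behind: start fails
```

The second call fails purely because the file exists; nothing checks whether the recorded process is still alive, which matches the stale-file situation after a controller reboot.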

As this is not a pluto bug, the solution might be to add a workaround to VPNaaS that cleans up the .ctl and .pid files on start-up.
Essentially the same approach was used for the LibreSwan driver [3], so we just need to do some refactoring to share this approach between OpenSwan and LibreSwan.
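The proposed workaround could look roughly like the sketch below. This is an illustration under stated assumptions, not the neutron-vpnaas implementation: the function name cleanup_control_files, its signature, and the file name pluto.pid are hypothetical (pluto.ctl is the name mentioned in the fix's commit message).

```python
import os
import tempfile


def cleanup_control_files(run_dir, names=("pluto.ctl", "pluto.pid")):
    """Remove stale pluto control files from run_dir before starting pluto.

    Returns the names that were actually removed. Missing files are
    ignored so the call is safe on a clean start.
    """
    removed = []
    for name in names:
        path = os.path.join(run_dir, name)
        try:
            os.remove(path)
            removed.append(name)
        except FileNotFoundError:
            pass  # nothing stale to clean up for this file
    return removed


with tempfile.TemporaryDirectory() as run_dir:
    # Simulate files left over from a pluto process that died uncleanly.
    for stale in ("pluto.ctl", "pluto.pid"):
        with open(os.path.join(run_dir, stale), "w") as f:
            f.write("stale\n")
    removed_first = cleanup_control_files(run_dir)
    removed_again = cleanup_control_files(run_dir)  # second call is a no-op
    leftover = os.listdir(run_dir)
```

A cleanup like this is only safe when no live pluto process is still using the files, so it belongs on the start and restart path of the process manager rather than in a periodic task.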

[1] https://github.com/xelerance/Openswan/blob/master/programs/pluto/plutomain.c#L258-L259
[2] https://github.com/libreswan/libreswan/blob/master/programs/pluto/plutomain.c#L231-L232
[3] https://github.com/openstack/neutron-vpnaas/commit/00b633d284f0f21aa380fa47a270c612ebef0795

P.S.
Another way to reproduce this failure is to replace steps 3-5 with:
3. Send kill -9 to the pluto process on that controller
4. Remove tenant1's router from the agent running on that controller and then schedule it back.

Tags: vpnaas
Elena Ezhova (eezhova)
tags: added: vpnaas
Elena Ezhova (eezhova)
Changed in neutron:
assignee: nobody → Elena Ezhova (eezhova)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron-vpnaas (master)

Fix proposed to branch: master
Review: https://review.openstack.org/235817

Changed in neutron:
status: New → In Progress
Ryan Moats (rmoats)
Changed in neutron:
importance: Undecided → Low
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron-vpnaas (master)

Reviewed: https://review.openstack.org/235817
Committed: https://git.openstack.org/cgit/openstack/neutron-vpnaas/commit/?id=a71f30b232c7b2f44cb1d7512407bb9ec75564c4
Submitter: Jenkins
Branch: master

commit a71f30b232c7b2f44cb1d7512407bb9ec75564c4
Author: Elena Ezhova <email address hidden>
Date: Fri Oct 16 12:31:48 2015 +0300

    Cleanup .ctl/.pid files for both OpenSwan and LibreSwan

    Change I5c215d70c348524979b740f882029f74e400e6d7 introduced cleanup
    of pluto ctl/pid files on starting and restarting of pluto daemon
    for LibreSwan driver. But the problem with managing these files is
    also common for the OpenSwan driver: pluto daemon fails to start if
    a pid file it tries to create already exists (see bug report for
    details).

    This change moves the cleanup functionality to the OpenSwanProcess so
    that it will be used by both OpenSwan and LibreSwan drivers.
    Also fixed a typo in _cleanup_control_files where it was attempted to
    remove pluto.ctl.ctl file instead of pluto.ctl

    Changed the name of 'libreswan' configuration section to 'pluto'.

    DocImpact

    Change-Id: I717e8fcc1add35b7099c977235e4eff5da9e093b
    Closes-Bug: #1506794

Changed in neutron:
status: In Progress → Fix Committed
Thierry Carrez (ttx) wrote : Fix included in openstack/neutron-vpnaas 8.0.0.0b1

This issue was fixed in the openstack/neutron-vpnaas 8.0.0.0b1 development milestone.

Changed in neutron:
status: Fix Committed → Fix Released