A hostname change breaks neutron-openvswitch-agent / neutron-server tunneling updates.

Bug #1464178 reported by Miguel Angel Ajo on 2015-06-11
This bug affects 7 people

Affects: neutron
Importance: Undecided
Assigned to: Miguel Angel Ajo

Bug Description

When using tunnelling, if a host changes its hostname and then tries to sync tunnels with neutron-server, the server throws an exception because of an unnecessary constraint, and networking breaks.

neutron-server should be able to survive hostname changes. A log warning is probably enough, and the endpoint registered under the old hostname should be deleted.

This was found in HA deployments with pacemaker, where the hostname roams to the active node or is set dynamically on each node based on the clone ID provided by pacemaker. That setup allows architectures like A/A/A/P/P for neutron: when one of the active nodes dies, a passive node takes over the resources of the old active node by taking over its hostname (the logical ID that neutron agent resources are tied to).

neutron-server log:

2015-06-10 05:44:48.151 24546 ERROR oslo_messaging._drivers.common [req-751f3392-9915-49b9-bb0b-2dec63a6649a ] Returning exception Invalid input for operation: (u'Tunnel IP %(ip)s in use with host %(host)s', {'ip': u'192.168.16.105', 'host': u'neutron-n-2'}). to caller
2015-06-10 05:44:48.152 24546 ERROR oslo_messaging._drivers.common [req-751f3392-9915-49b9-bb0b-2dec63a6649a ] ['Traceback (most recent call last):\n', ' File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply\n executor_callback))\n', ' File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch\n executor_callback)\n', ' File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 130, in _do_dispatch\n result = func(ctxt, **new_args)\n', ' File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/type_tunnel.py", line 248, in tunnel_sync\n raise exc.InvalidInput(error_message=msg)\n', "InvalidInput: Invalid input for operation: (u'Tunnel IP %(ip)s in use with host %(host)s', {'ip': u'192.168.16.105', 'host': u'neutron-n-2'}).\n"]
2015-06-10 05:44:52.152 24546 ERROR oslo_messaging.rpc.dispatcher [req-751f3392-9915-49b9-bb0b-2dec63a6649a ] Exception during message handling: Invalid input for operation: (u'Tunnel IP %(ip)s in use with host %(host)s', {'ip': u'192.168.16.105', 'host': u'neutron-n-2'}).
2015-06-10 05:44:52.152 24546 TRACE oslo_messaging.rpc.dispatcher Traceback (most recent call last):
2015-06-10 05:44:52.152 24546 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply
2015-06-10 05:44:52.152 24546 TRACE oslo_messaging.rpc.dispatcher executor_callback))
2015-06-10 05:44:52.152 24546 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch
2015-06-10 05:44:52.152 24546 TRACE oslo_messaging.rpc.dispatcher executor_callback)
2015-06-10 05:44:52.152 24546 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 130, in _do_dispatch
2015-06-10 05:44:52.152 24546 TRACE oslo_messaging.rpc.dispatcher result = func(ctxt, **new_args)
2015-06-10 05:44:52.152 24546 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/type_tunnel.py", line 248, in tunnel_sync
2015-06-10 05:44:52.152 24546 TRACE oslo_messaging.rpc.dispatcher raise exc.InvalidInput(error_message=msg)
2015-06-10 05:44:52.152 24546 TRACE oslo_messaging.rpc.dispatcher InvalidInput: Invalid input for operation: (u'Tunnel IP %(ip)s in use with host %(host)s', {'ip': u'192.168.16.105', 'host': u'neutron-n-2'}).
2015-06-10 05:44:52.152 24546 TRACE oslo_messaging.rpc.dispatcher
2015-06-10 05:44:52.152 24546 ERROR oslo_messaging._drivers.common [req-751f3392-9915-49b9-bb0b-2dec63a6649a ] Returning exception Invalid input for operation: (u'Tunnel IP %(ip)s in use with host %(host)s', {'ip': u'192.168.16.105', 'host': u'neutron-n-2'}). to caller
2015-06-10 05:44:52.153 24546 ERROR oslo_messaging._drivers.common [req-751f3392-9915-49b9-bb0b-2dec63a6649a ] ['Traceback (most recent call last):\n', ' File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply\n executor_callback))\n', ' File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch\n executor_callback)\n', ' File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 130, in _do_dispatch\n result = func(ctxt, **new_args)\n', ' File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/type_tunnel.py", line 248, in tunnel_sync\n raise exc.InvalidInput(error_message=msg)\n', "InvalidInput: Invalid input for operation: (u'Tunnel IP %(ip)s in use with host %(host)s', {'ip': u'192.168.16.105', 'host': u'neutron-n-2'}).\n"]
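
A side note on the message text: the log prints the raw template and the substitution dictionary side by side instead of a filled-in sentence. That is what Python produces when a (template, values) tuple is converted to text rather than the template being interpolated with `%`. A minimal illustration in plain Python 3, independent of neutron (the variable names are illustrative, and Python 2.7 would additionally show the `u'...'` prefixes seen in the log):

```python
# Illustration only: plain Python, not neutron code.
template = "Tunnel IP %(ip)s in use with host %(host)s"
values = {"ip": "192.168.16.105", "host": "neutron-n-2"}

# A (template, values) tuple turned into text keeps the placeholders.
as_tuple = str((template, values))

# Interpolating with % fills the placeholders in.
interpolated = template % values

print(as_tuple)
print(interpolated)
```

This does not affect the bug itself; it only explains why the placeholders `%(ip)s` and `%(host)s` show up literally in the error.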

How to reproduce:

1) Install a single node AIO with tunnelling for tenant networks.
2) openstack-config --set /etc/neutron/neutron.conf DEFAULT host newhostname
3) service neutron-openvswitch-agent restart
4) neutron-server keeps throwing the exception in a loop as the agent repeatedly tries, and fails, to sync the tunnel.

This new behaviour was introduced in Kilo by this patch:
https://github.com/openstack/neutron/commit/3db0e619c83892a7aab61807969205253833ff8d
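
The reproduction above can be modelled with a small sketch. This is a simplified stand-in, assuming the server keeps a mapping from tunnel IP to hostname; `InvalidInput`, `tunnel_sync` and `endpoints` here are hypothetical names for illustration and not the actual neutron implementation. It only models the hostname-change case from the report:

```python
# Simplified model of the failure; not the actual neutron code.
class InvalidInput(Exception):
    """Stand-in for the exception seen in the traceback."""


def tunnel_sync(endpoints, tunnel_ip, host):
    """Register (tunnel_ip, host); `endpoints` maps tunnel IP -> hostname."""
    known_host = endpoints.get(tunnel_ip)
    if known_host is not None and known_host != host:
        # The constraint from the report: same IP, different hostname.
        raise InvalidInput(
            "Tunnel IP %s in use with host %s" % (tunnel_ip, known_host))
    endpoints[tunnel_ip] = host
    return endpoints


endpoints = {}
tunnel_sync(endpoints, "192.168.16.105", "neutron-n-2")  # first registration

try:
    # Same node, same tunnel IP, after `host` was changed in neutron.conf.
    tunnel_sync(endpoints, "192.168.16.105", "newhostname")
    rejected = False
except InvalidInput as exc:
    rejected = True
    print(exc)
```

Because the agent retries the sync and the stored entry never changes, a check of this shape would reject every retry, which matches the loop described in step 4.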

Changed in neutron:
assignee: nobody → Miguel Angel Ajo (mangelajo)
status: New → Confirmed

Fix proposed to branch: master
Review: https://review.openstack.org/190595

Changed in neutron:
status: Confirmed → In Progress

Change abandoned by Miguel Angel Ajo (<email address hidden>) on branch: feature/qos
Review: https://review.openstack.org/192624
Reason: Wrong branch, mess up...

Oguz Yarimtepe (oguzy) wrote :

I am having the same problem in my Kilo environment. Is there any workaround for this issue? Applying the patch didn't help me.


Reviewed: https://review.openstack.org/196097
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1cfed745d54a6ce9cb3dd4e6f454666d9e6676c2
Submitter: Jenkins
Branch: feature/qos

commit ba7d673d1ddd5bfa5aa1be5b26a59e9a8cd78a9f
Author: Kevin Benton <email address hidden>
Date: Thu Jun 25 18:31:38 2015 -0700

    Remove duplicated call to setup_coreplugin

    The test case for vlan_transparent was calling setup_coreplugin
    before calling the super setUp method which already calls
    setup_coreplugin. This was causing duplicate core plugin fixtures
    which resulted in patching the dhcp periodic check twice.

    Change-Id: Ide4efad42748e799d8e9c815480c8ffa94b27b38
    Partial-Bug: #1468998

commit e64062efa3b793f7c4ce4ab9e62918af4f1bfcc9
Author: Kevin Benton <email address hidden>
Date: Thu Jun 25 18:29:37 2015 -0700

    Remove double mock of dhcp agent periodic check

    The test case for the periodic check was patching a target
    that the core plugin fixture already patched out. This removes
    that and exposes the mock from the fixture so the test case
    can reference it.

    Change-Id: I3adee6a875c497e070db4198567b52aa16b81ce8
    Partial-Bug: #1468998

commit 25ae0429a713143d42f626dd59ed4514ba25820c
Author: Kevin Benton <email address hidden>
Date: Thu Jun 25 18:24:10 2015 -0700

    Remove double fanout mock

    The test_mech_driver was duplicating a fanout mock already setup
    in the setUp routine.

    Change-Id: I5b88dff13113d55c72241d3d5025791a76672ac2
    Partial-Bug: #1468998

commit 993771556332d9b6bbf7eb3f0300cf9d8a2cb464
Author: Kevin Benton <email address hidden>
Date: Thu Jun 25 17:55:16 2015 -0700

    Remove double callback manager mocks

    setup_test_registry_instance() in the base test case class gives
    each test its own registry by mocking out the get_callback_manager.
    The L3 agent test cases were duplicating this.

    Partial-Bug: #1468998
    Change-Id: I7356daa846524611e9f92365939e8ad15d1e1cd8

commit 0be1efad93734f11cd63fb3b7bd2983442ce1268
Author: Kevin Benton <email address hidden>
Date: Thu Jun 25 16:57:30 2015 -0700

    Remove ensure_dirs double-patch

    test_spawn_radvd called mock.patch on ensure_dirs after the
    setup method already patched it out. This causes issues when
    mock.patch.stopall() is called because the mocks are stored
    as a set and are unwound in a non-deterministic fashion.[1]
    So some of the time they will be undone correctly, but others
    will leave a monkey-patched in mock, causing the ensure_dir
    test to fail.

    1. http://bugs.python.org/issue21239

    Closes-Bug: #1467908
    Change-Id: I321b5fed71dc73bd19b5099311c6f43640726cd4

commit 0a2238e34e72c17ca8a75e36b1f56e41a3ece74e
Author: Sukhdev Kapur <email address hidden>
Date: Thu Jun 25 15:11:28 2015 -0700

    Fix tenant-id in Arista ML2 driver to support HA router

    When HA router is created, the framework creates a network and does
    not specify the tenant-id. This causes Arista ML2 driver to fail.
    This patch sets the tenant-id when it is not passed explicitly
    by the network_create() call from the HA r...

tags: added: in-feature-qos

Change abandoned by Kyle Mestery (<email address hidden>) on branch: feature/pecan
Review: https://review.openstack.org/196701
Reason: This is lacking the functional fix [1], so I'll propose a new merge commit which includes that one.

[1] https://review.openstack.org/#/c/196711/

Ashraf (ashru-moh-misc) wrote :

Looks like I am hitting the same issue. I am on Ubuntu 14.04 and trying to set up Kilo (with Neutron networking). Is there a workaround for this issue? Also, is there any ETA for when a resolution for this defect will be available for the Kilo release?

thanks
ashraf

Salman (salman-toor-d) wrote :

Hi,

I was also hit by the same problem:

2015-08-11 10:50:38.804 6584 TRACE oslo_messaging.rpc.dispatcher
2015-08-11 10:50:38.804 6584 ERROR oslo_messaging._drivers.common [req-6fd71f06-3a83-4341-b4ac-89ee7c18300e ] Returning exception Invalid input for operation: (u'Tunnel IP %(ip)s in use with host %(host)s', {'ip': u'10.0.102.2', 'host': None}). to caller
2015-08-11 10:50:38.805 6584 ERROR oslo_messaging._drivers.common [req-6fd71f06-3a83-4341-b4ac-89ee7c18300e ] ['Traceback (most recent call last):\n', ' File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply\n executor_callback))\n', ' File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch\n executor_callback)\n', ' File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 130, in _do_dispatch\n result = func(ctxt, **new_args)\n', ' File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/type_tunnel.py", line 248, in tunnel_sync\n raise exc.InvalidInput(error_message=msg)\n', "InvalidInput: Invalid input for operation: (u'Tunnel IP %(ip)s in use with host %(host)s', {'ip': u'10.0.102.2', 'host': None}).\n"]
2015-08-11 10:50:39.085 6584 INFO neutron.wsgi [-] (6584) accepted ('10.0.109.204', 60124)
….

Any workaround?

/Salman,

I ran into this issue as well in Kilo: my node initially came up with the hostname "localhost.localdomain", and I then corrected it to the proper name.

These are the steps I took to correct my setup. They may not be complete, so use with caution.

I first ran "neutron agent-list" and then "neutron agent-delete $id" with the id associated with localhost.localdomain.
That didn't fully correct it, and I am not sure whether this step was needed.

I then accessed the neutron database and ran

MariaDB [neutron]> select * from ml2_gre_endpoints;
+--------------+-----------------------+
| ip_address   | host                  |
+--------------+-----------------------+
| 172.20.20.70 | localhost.localdomain |
+--------------+-----------------------+

This was the incorrect entry, mapping the IP to localhost.localdomain instead of the correct hostname.

So I ran
delete from ml2_gre_endpoints where host='localhost.localdomain';

Next, on the affected compute node, I ran "systemctl restart neutron-openvswitch-agent.service".

That made everything work for me. After the restart, OpenStack updated the database table to contain the new, correct entry.

MariaDB [neutron]> select * from ml2_gre_endpoints;
+--------------+-------------+
| ip_address   | host        |
+--------------+-------------+
| 172.20.20.70 | icbm70.mgmt |
+--------------+-------------+

You may have to look in ml2_vxlan_endpoints, depending on your setup.
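
The cleanup described above could be scripted roughly as follows. This is only a sketch: it uses an in-memory sqlite3 database as a stand-in for the real neutron database so that it runs anywhere, the two-column schema is an assumption based on the query output above, and `delete_stale_endpoints` is a hypothetical helper name. Against a real deployment the same DELETE would be issued through a MariaDB/MySQL client, after taking a backup.

```python
import sqlite3

# Table names taken from the comment above; sqlite3 is only a stand-in here.
ENDPOINT_TABLES = ("ml2_gre_endpoints", "ml2_vxlan_endpoints")


def delete_stale_endpoints(conn, stale_host):
    """Delete rows registered under stale_host; return rows removed per table."""
    removed = {}
    for table in ENDPOINT_TABLES:
        # Table names come from the fixed tuple above, never from user input.
        cur = conn.execute(
            "DELETE FROM %s WHERE host = ?" % table, (stale_host,))
        removed[table] = cur.rowcount
    conn.commit()
    return removed


# Build a throwaway database with the assumed two-column schema.
conn = sqlite3.connect(":memory:")
for table in ENDPOINT_TABLES:
    conn.execute(
        "CREATE TABLE %s (ip_address TEXT PRIMARY KEY, host TEXT)" % table)
conn.execute("INSERT INTO ml2_gre_endpoints VALUES (?, ?)",
             ("172.20.20.70", "localhost.localdomain"))

removed = delete_stale_endpoints(conn, "localhost.localdomain")
remaining = conn.execute(
    "SELECT COUNT(*) FROM ml2_gre_endpoints").fetchone()[0]
print(removed, remaining)
```

A row whose host is NULL (the `'host': None` case in the log posted earlier in this thread) would not match `host = ?`; it needs `host IS NULL` instead. As in the manual steps, the agent on the affected node still has to be restarted afterwards so that it re-registers under the new hostname.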

Salman (salman-toor-d) wrote :

Hi,

Thanks Keith!

I can confirm that your solution works for me too. I had "NULL" in the table, and once I set the correct hostname it's working fine.

Thanks for your suggestion.

Regards..
Salman.

tags: added: kilo-backport-potential

Hello,

FWIW, I can confirm this worked using ml2_vxlan_endpoints also on Kilo with CentOS 7.1

Thanks for the fix.

tags: removed: in-feature-qos

Reviewed: https://review.openstack.org/190595
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=61d26b8b745ae431e21c22d1d82688708098171b
Submitter: Jenkins
Branch: master

commit 61d26b8b745ae431e21c22d1d82688708098171b
Author: Miguel Angel Ajo <email address hidden>
Date: Thu Jun 11 13:15:17 2015 +0200

    Fix hostname roaming for ml2 tunnel endpoints.

    Change I75c6581fcc9f47a68bde29cbefcaa1a2a082344e introduced
    a bug where host name changes broke tunneling endpoint updates.
    Tunneling endpoint updates roaming a hostname from IP to IP
    are a common method for active/passive HA with pacemaker and
    should happen automatically without the need for API/CLI calls [1].

    delete_endpoint_by_host_or_ip is introduced to allow cleanup of
    endpoints that potentially belonged to the newly registered agent,
    while preventing the race condition found when deleting ip1 & ip2
    in the next situation at step 4:

    1) we have hostA: ip1
    2) hostA goes offline
    3) hostB goes online, with ip1, and registers
    4) hostA goes online, with ip2, and registers

    [1] https://bugs.launchpad.net/python-neutronclient/+bug/1381664

    Change-Id: I04d08d5b82ce9911f3af555b5776fc9823e0e5b6
    Closes-Bug: #1464178
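
The four-step scenario in the commit message can be walked through with a small model. This is a sketch of the idea only, assuming endpoints are kept as a mapping from tunnel IP to hostname; the function name `delete_endpoint_by_host_or_ip` is borrowed from the commit message, but the body below and the `register` helper are illustrative and not the merged patch.

```python
# Simplified model of the idea in the commit message; not the merged patch.
def delete_endpoint_by_host_or_ip(endpoints, host, ip):
    """Drop every endpoint that matches either the host or the IP."""
    for known_ip, known_host in list(endpoints.items()):
        if known_host == host or known_ip == ip:
            del endpoints[known_ip]


def register(endpoints, host, ip):
    """Register host at ip, replacing stale entries instead of raising."""
    delete_endpoint_by_host_or_ip(endpoints, host, ip)
    endpoints[ip] = host


endpoints = {"ip1": "hostA"}         # 1) we have hostA: ip1
                                     # 2) hostA goes offline
register(endpoints, "hostB", "ip1")  # 3) hostB goes online with ip1
register(endpoints, "hostA", "ip2")  # 4) hostA goes online with ip2
print(endpoints)
```

In this model, step 3 replaces the stale hostA entry for ip1, and step 4 adds hostA under ip2 without touching hostB's entry, so both hosts end up registered and no exception is raised.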

Changed in neutron:
status: In Progress → Fix Committed

Thierry Carrez (ttx) on 2015-09-24
Changed in neutron:
milestone: none → liberty-rc1
status: Fix Committed → Fix Released

Thierry Carrez (ttx) on 2015-10-15
Changed in neutron:
milestone: liberty-rc1 → 7.0.0

Nick Jones (yankcrime) wrote :

We're also seeing this on Kilo - a backport of the fix would be hugely appreciated.
