Comment 3 for bug 1174591

Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

  I know the approach you're trying to implement, and I know it's used
in a few companies via the AT&T scripts.

    I once started writing the same implementation you're proposing,
but I stopped after talking to our HA experts; they gave me very
good reasons. Imagine this scenario/failure mode:

    1) An agent crashes, therefore it stops answering heartbeats;
       imagine that it also has some kind of resource exhaustion that
       won't let it restart automatically even if you configured that...

    2) Neutron server (or the AT&T scripts) detects that, and moves
       the vrouters to a different host.

    3) The new L3 agent starts the vrouters, good...

    But now we have the same IP addresses and vrouters *working* on both
    nodes: the old one, which is "headless" but still has all the qrouter-*
    namespaces, and the new one, which actually has an agent.

    This leaves you with IP conflicts on the network, duplicate packets,
    etc...
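To make the failure mode concrete, here is a minimal Python sketch (all names are hypothetical, not actual Neutron code) of why naive rescheduling produces duplicates: the crashed agent's namespaces keep their IPs, and moving the router only *adds* it on the new host:

```python
# Hypothetical model of the failure mode above -- NOT Neutron code.
# Namespaces survive an agent crash, so naive rescheduling leaves the
# same router IP active on two hosts at once.

class Host:
    def __init__(self, name):
        self.name = name
        self.agent_alive = True
        self.router_ips = set()   # IPs held by qrouter-* namespaces

    def crash_agent(self):
        # The agent process dies, but its namespaces (and IPs) stay up.
        self.agent_alive = False


def naive_reschedule(router_ip, old, new):
    """Move a router after missed heartbeats -- WITHOUT cleaning old host."""
    if not old.agent_alive:
        new.router_ips.add(router_ip)   # old.router_ips is never cleaned


def duplicated_ips(hosts):
    seen, dupes = set(), set()
    for h in hosts:
        for ip in h.router_ips:
            (dupes if ip in seen else seen).add(ip)
    return dupes


node1, node2 = Host("node1"), Host("node2")
node1.router_ips.add("10.0.0.1")

node1.crash_agent()                          # step 1: agent crashes
naive_reschedule("10.0.0.1", node1, node2)   # steps 2-3: router "moved"

print(duplicated_ips([node1, node2]))        # -> {'10.0.0.1'}
```

The duplicate set is exactly the IP-conflict condition described above; cleanup or fencing of the old host is what removes it.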

    We solved this in Red Hat (for the time being) using pacemaker + an
active/passive configuration with two nodes (but you could run several
pairs), and adding two things to the setup:
     1) neutron-netns-cleanup --force and neutron-ovs-cleanup (when we manually
        migrate an agent from the active to the passive host)

     2) fencing (that means using the IPMI API or LOM to reboot the failed
        host) via pacemaker
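As a rough sketch, the two additions look like this (hostnames, IPMI address, and credentials are placeholders, and fence-agent parameter names vary by version, so check your distro's docs before use):

```shell
# 1) On the host we are migrating AWAY from, wipe leftover qrouter-*
#    namespaces and stale OVS ports so no duplicate IPs remain:
neutron-netns-cleanup --config-file /etc/neutron/neutron.conf --force
neutron-ovs-cleanup --config-file /etc/neutron/neutron.conf

# 2) Configure fencing in pacemaker so a failed host is power-cycled
#    over IPMI before its routers come up elsewhere (placeholder values):
pcs stonith create fence-node1 fence_ipmilan \
    pcmk_host_list=node1 ipaddr=192.0.2.10 \
    login=admin passwd=secret action=reboot
```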

    We use the agent's "host=" parameter, which sets a logical identifier
towards neutron-server:

node1: we set host=neutron-network-node
node2: we set host=neutron-network-node

Also, you need to set the exact same host= logical ID in the openvswitch
plugin.ini file on the host, otherwise they won't play well together.
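Concretely, the matching identifier would look something like this in both files (file paths follow a typical RDO layout and are an assumption; adjust for your distro):

```ini
# /etc/neutron/neutron.conf -- identical value on node1 AND node2
[DEFAULT]
host = neutron-network-node
```

```ini
# /etc/neutron/plugin.ini (openvswitch plugin) -- must match the value above
[DEFAULT]
host = neutron-network-node
```

Because both nodes present the same logical host, neutron-server sees a single network node regardless of which physical machine is currently active.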

More details here:
https://github.com/fabbione/rhos-ha-deploy/blob/master/rhos5-rhel7/mrgcloud-setup/RHOS-RHEL-HA-how-to-mrgcloud-rhos5-on-rhel7-neutron-n-latest.txt

The solution is not perfect; for example, if you have lots of resources on the
network node, it will take some time to rebuild them on the new host, but that
problem also happens with automatic router migration.