Neutron Gateway services need HA support

Bug #1379607 reported by Kaya LIU
This bug affects 1 person
Affects: quantum-gateway (Juju Charms Collection)
Status: Fix Released
Importance: Medium
Assigned to: Xiang Hui
Milestone: 15.01

Bug Description

Support Active/Passive deployment using pacemaker and corosync.
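
For context, a minimal sketch of the kind of deployment being requested, assuming the standard hacluster subordinate charm and juju 1.x syntax; the service names and unit count are illustrative only:

# Pair quantum-gateway with the hacluster subordinate so pacemaker/corosync
# can manage the gateway agents in active/passive mode (illustrative commands).
juju deploy -n 2 quantum-gateway
juju deploy hacluster quantum-gateway-hacluster
juju add-relation quantum-gateway quantum-gateway-hacluster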

Tags: openstack cts
Kaya LIU (kayaliu)
tags: added: cts openstack
Changed in hacluster (Juju Charms Collection):
importance: Undecided → Medium
Changed in neutron-gateway (Juju Charms Collection):
importance: Undecided → Medium
Changed in quantum-gateway (Juju Charms Collection):
importance: Undecided → Medium
summary: - Juju Charm supports Neutron Gateway HA deployment
+ Neutron Gateway services need HA support
Changed in hacluster (Juju Charms Collection):
assignee: nobody → Xiang Hui (xianghui)
Changed in neutron-gateway (Juju Charms Collection):
assignee: nobody → Xiang Hui (xianghui)
Changed in quantum-gateway (Juju Charms Collection):
assignee: nobody → Xiang Hui (xianghui)
Changed in hacluster (Juju Charms Collection):
status: New → In Progress
Changed in neutron-gateway (Juju Charms Collection):
status: New → In Progress
Changed in quantum-gateway (Juju Charms Collection):
status: New → In Progress
Revision history for this message
Xiang Hui (xianghui) wrote :

In Icehouse, neutron-l3-agent HA in ACTIVE/PASSIVE mode can work with pacemaker/corosync; however, there are some problems that should be considered:
1. Whether to use the same hostname on the active and standby nodes, or to migrate the virtual routers directly from the active node to the standby node so that neutron-l3-agent rebuilds the virtual routers/firewall rules, needs to be weighed with more test cases (a manual router-migration sketch follows the references below).
reference:
related bug: https://bugs.launchpad.net/openstack-manuals/+bug/1318912
chef migrate vr code: https://github.com/stackforge/cookbook-openstack-network/blob/master/files/default/neutron-ha-tool.py
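
For reference, the manual equivalent of what a router-migration tool such as neutron-ha-tool.py performs, sketched with standard neutron CLI commands; the agent and router IDs are placeholders:

# Move routers off a dead l3-agent onto a live one (placeholder IDs throughout).
neutron agent-list | grep 'L3 agent'                  # find the dead and the live l3-agent IDs
neutron router-list-on-l3-agent <dead-agent-id>       # routers stranded on the dead agent
neutron l3-agent-router-remove <dead-agent-id> <router-id>
neutron l3-agent-router-add <live-agent-id> <router-id>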

Revision history for this message
Kaya LIU (kayaliu) wrote :

Another possible solution is to change the hostname in the upstart scripts.

Revision history for this message
Kaya LIU (kayaliu) wrote :

The sample configuration used in the deployment:

ubuntu@p01-neutron-a2-d5c2f1:/etc/init$ less neutron-vpn-agent.override
# vim:set ft=upstart ts=2 et:
description "Neutron VPN Agent"
author "Chuck Short <email address hidden>"

manual

respawn

chdir /var/run

pre-start script
  mkdir -p /var/run/neutron
  chown neutron:root /var/run/neutron
  # Check to see if openvswitch plugin in use by checking
  # status of cleanup upstart configuration
  hostname network-controller
  if status neutron-ovs-cleanup; then
    start wait-for-state WAIT_FOR=neutron-ovs-cleanup WAIT_STATE=running WAITER=neutron-vpn-agent
  fi
end script

exec start-stop-daemon --start --chuid neutron --exec /usr/bin/neutron-vpn-agent -- \
       --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/vpn_agent.ini \
       --config-file=/etc/neutron/l3_agent.ini --config-file=/etc/neutron/fwaas_driver.ini \
       --log-file=/var/log/neutron/vpn_agent.log

post-stop script
  hostname p01-neutron-a2-d5c2f1
end script

ubuntu@p01-neutron-a2-d5c2f1:/etc/init$ less neutron-dhcp-agent.override
# vim:set ft=upstart ts=2 et:
description "Neutron DHCP Agent"
author "Chuck Short <email address hidden>"

manual

respawn

chdir /var/run

pre-start script
  mkdir -p /var/run/neutron
  chown neutron:root /var/run/neutron
  # Check to see if openvswitch plugin in use by checking
  # status of cleanup upstart configuration
  hostname network-controller
  if status neutron-ovs-cleanup; then
    start wait-for-state WAIT_FOR=neutron-ovs-cleanup WAIT_STATE=running WAITER=neutron-dhcp-agent
  fi
end script

exec start-stop-daemon --start --chuid neutron --exec /usr/bin/neutron-dhcp-agent -- --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/dhcp_agent.ini --log-file=/var/log/neutron/dhcp-agent.log

post-stop script
  hostname p01-neutron-a2-d5c2f1
end script

ubuntu@p01-neutron-a2-d5c2f1:/etc/init$ less neutron-metadata-agent.override
# vim:set ft=upstart ts=2 et:
description "Neutron Metadata Plugin Agent"
author "Yolanda Robla <email address hidden>"

manual

respawn

chdir /var/run

pre-start script
  mkdir -p /var/run/neutron
  chown neutron:root /var/run/neutron
  hostname network-controller
end script

exec start-stop-daemon --start --chuid neutron --exec /usr/bin/neutron-metadata-agent -- \
       --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/metadata_agent.ini \
       --log-file=/var/log/neutron/metadata-agent.log

post-stop script
  hostname p01-neutron-a2-d5c2f1
end script

ubuntu@p01-neutron-a2-d5c2f1:/etc/init.d$ less pacemaker
.....
start()
{
        hostname p01-neutron-a2-d5c2f1
        echo -n "Starting $desc: "

........

Revision history for this message
Xiang Hui (xianghui) wrote :

@kayaliu
Currently we still keep the hostname change in the OCF files, since changing the upstart scripts would require repackaging quantum-gateway and maintaining that package ourselves; a small change that can be controlled by the charm is the better option.

no longer affects: hacluster (Juju Charms Collection)
no longer affects: neutron-gateway (Juju Charms Collection)
Revision history for this message
Kaya LIU (kayaliu) wrote :

That's fine. I am just wondering whether the cluster will get confused by the same hostname in some scenarios, for example, stopping pacemaker and starting it again on the active node.

Revision history for this message
Xiang Hui (xianghui) wrote : Re: [Bug 1379607] Re: Neutron Gateway services need HA support

Yeah, that would be a problem, so what we are doing is testing to cover all
the cases, so that we can finally decide which approach works most
gracefully, e.g. setting the hostname in the OCF/upstart scripts or using router migration.

On Wed, Oct 29, 2014 at 9:37 PM, Kaya LIU <email address hidden> wrote:

> that's fine..just wondering if the cluster will be confused with the
> same hostname in some scenarios...for example, stop the pacemaker and
> start it again on the active node.

--
Best Regards.
Hui.

OpenStack Engineer

Revision history for this message
Hua Zhang (zhhuabj) wrote :

Changing the hostname breaks corosync/crm commands, for example:
ubuntu@juju-zhhuabj-machine-5:~$ sudo crm node online
sudo: unable to resolve host neutron-gateway
Could not map name=neutron-gateway to a UUID
Please choose from one of the matches above and suppy the 'id' with --attr-id

This is because we change the hostname to neutron-gateway, but corosync still uses the initial hostname juju-zhhuabj-machine-5.
See also the data at https://pastebin.canonical.com/119639/

This is only one problem I found caused by using the same hostname; I am not sure whether other problems exist. Using the same hostname is too expensive, so I think router migration would be the better choice.

Revision history for this message
Kaya LIU (kayaliu) wrote :

We fixed this issue by adding the neutron-gateway hostname to /etc/hosts:

<management ip> neutron-gateway

Try testing with the following procedure to see whether you hit the problem bringing pacemaker back up (rough commands are sketched below):
1. On the active node (node 1), stop pacemaker; the hostname will still be neutron-gateway.
2. The passive node (node 2) becomes the active node; its hostname is changed to neutron-gateway.
3. Go back to node 1 and start pacemaker; the duplicate hostname will cause trouble starting pacemaker,
   although it can be fixed manually by changing node 1's hostname back.
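
Roughly the commands behind that procedure (node names follow the sample deployment above; the exact service handling may differ):

# Step 1: on node 1 (the active node), stop pacemaker; the hostname stays neutron-gateway.
sudo service pacemaker stop
# Step 2: node 2 takes over and its scripts rename it to neutron-gateway.
# Step 3: back on node 1, restore the original hostname before starting pacemaker again.
sudo hostname p01-neutron-a2-d5c2f1
sudo service pacemaker start
sudo crm status    # check that node 1 rejoined the cluster cleanly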

Revision history for this message
Hua Zhang (zhhuabj) wrote :

ubuntu@juju-zhhuabj-machine-5:~$ grep 'neutron-gateway' /etc/hosts
10.5.0.21 neutron-gateway

ubuntu@juju-zhhuabj-machine-5:~$ sudo hostname neutron-gateway

ubuntu@juju-zhhuabj-machine-5:~$ sudo crm node standby
Could not map name=neutron-gateway to a UUID
Please choose from one of the matches above and suppy the 'id' with --attr-id

ubuntu@juju-zhhuabj-machine-5:~$ sudo hostname juju-zhhuabj-machine-5

ubuntu@juju-zhhuabj-machine-5:~$ sudo crm node standby
ubuntu@juju-zhhuabj-machine-5:~$

Revision history for this message
Hua Zhang (zhhuabj) wrote :

Found 3 bugs in the first round of testing; test case link: http://goo.gl/wObkxQ
1. Using the same hostname to help reschedule routers onto the online l3-agent;
   to avoid the error when executing the command 'crm node standby/online', restore the hostname to its previous value. See the fix:
   http://bazaar.launchpad.net/~cts-engineering/charms/trusty/quantum-gateway/quantum-gateway-ha/revision/88
2. Fix deleting the namespace and the qr-/qg- devices and their peer devices. See the fix:
   http://bazaar.launchpad.net/~cts-engineering/charms/trusty/quantum-gateway/quantum-gateway-ha/revision/89
3. dhcp_agents_per_network does not equal the number of neutron-gateway nodes (see the config sketch below).
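
For point 3, an illustrative neutron.conf excerpt on the controller; the value 3 simply matches a three-gateway deployment here and is not taken from the charm:

# /etc/neutron/neutron.conf (illustrative excerpt)
[DEFAULT]
# schedule each network to as many DHCP agents as there are gateway nodes
dhcp_agents_per_network = 3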

Revision history for this message
Xiang Hui (xianghui) wrote :

Summing up:
[bug description: neutron-vpn-agent hostname for HA]
There are two ways to make neutron-vpn-agent HA work with pacemaker/corosync: 1) all quantum-gateway nodes share the same hostname; 2) router migration. Currently we try the hostname approach, since with many virtual routers router migration may take much more time; the hostname approach seems easier and simpler.

[bug description: pacemaker reports errors if all quantum-gateway nodes use the same hostname]
To fix this problem, we set the hostname to the unified name "neutron-gateway" in the start() function of the neutron-l3-agent OCF resource agent, and set it back to the original hostname assigned by juju in the stop() function, so that no two quantum-gateway nodes carry the hostname "neutron-gateway" at the same time. A minimal sketch of this follows.
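
A minimal sketch of that set/restore logic (not the actual resource agent from the charm branch; the function names, ORIG_HOSTNAME_FILE and the pidfile path are illustrative):

# Hedged sketch of the hostname handling in a neutron-l3-agent OCF-style agent.
ORIG_HOSTNAME_FILE=/var/run/neutron/orig-hostname     # hypothetical path

neutron_l3_agent_start() {
    hostname > $ORIG_HOSTNAME_FILE        # remember the juju-assigned hostname
    hostname neutron-gateway              # unified hostname while this node is active
    start-stop-daemon --start --background --make-pidfile \
        --pidfile /var/run/neutron/neutron-l3-agent.pid \
        --chuid neutron --exec /usr/bin/neutron-l3-agent -- \
        --config-file=/etc/neutron/neutron.conf \
        --config-file=/etc/neutron/l3_agent.ini \
        --log-file=/var/log/neutron/l3-agent.log
}

neutron_l3_agent_stop() {
    start-stop-daemon --stop --oknodo --retry 30 \
        --pidfile /var/run/neutron/neutron-l3-agent.pid
    hostname "$(cat $ORIG_HOSTNAME_FILE)" # hand the original hostname back
}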

Revision history for this message
Hua Zhang (zhhuabj) wrote :

[bug: corosync can't return to sync after directly executing 'kill -9 <l3-agent-process-id>']
This bug can be fixed by moving the hostname-restore step to the front of the stop() method of the OCF file.
http://bazaar.launchpad.net/~cts-engineering/charms/trusty/quantum-gateway/quantum-gateway-ha/revision/91
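
Sketching that reordering (illustrative, reusing the hypothetical ORIG_HOSTNAME_FILE from the earlier sketch, not the actual charm code): restore the hostname first, so a kill -9'd agent does not leave the node stuck with the shared hostname.

neutron_l3_agent_stop() {
    hostname "$(cat $ORIG_HOSTNAME_FILE)"   # restore before anything else, even if the agent already died
    start-stop-daemon --stop --oknodo --retry 30 \
        --pidfile /var/run/neutron/neutron-l3-agent.pid
}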

Revision history for this message
Hua Zhang (zhhuabj) wrote :

[bug description: sometimes the ovs process is not started on the master node]
To fix this problem, we restart the ovs process in the start() method of the OCF file (a sketch follows the link below):
http://bazaar.launchpad.net/~cts-engineering/charms/trusty/quantum-gateway/quantum-gateway-ha/revision/92
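
Presumably along these lines in the OCF start() method (illustrative only; the actual change is in revision 92 above):

# Make sure ovsdb-server/ovs-vswitchd are up before the l3-agent comes up.
service openvswitch-switch restart
# ...then proceed with the hostname change and agent start as sketched earlier.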

[bug description: ovs monitor will fail if the ovs process is restarted after changing the hostname]
http://bazaar.launchpad.net/~cts-engineering/charms/trusty/quantum-gateway/quantum-gateway-ha/revision/93

Revision history for this message
Hua Zhang (zhhuabj) wrote :

Comment #13 said that sometimes the ovs process is not started; that is because the ovs process is killed by a SIGALRM signal. The initial status of ovs is good after applying the patch comment #13 mentioned, but the ovs process died again after half a day. I captured the log below.
Nov 7 08:47:35 juju-zhhuabj-machine-6 lrmd[18682]: notice: operation_finished: res_neutron-l3-agent_monitor_30000:9295:stderr [ /usr/lib/ocf/resource.d/openstack/neutron-agent-l3: 30: /usr/lib/ocf/resource.d/openstack/neutron-agent-l3: source: not found ]
Nov 7 08:47:42 juju-zhhuabj-machine-6 ovs-vsctl: ovs|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
Nov 7 08:47:53 juju-zhhuabj-machine-6 ovs-vsctl: ovs|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
Nov 7 08:47:53 juju-zhhuabj-machine-6 ovs-vsctl: ovs|00001|reconnect|WARN|unix:/var/run/openvswitch/db.sock: connection attempt failed (Protocol error)
Nov 7 08:47:53 juju-zhhuabj-machine-6 ovs-vsctl: ovs|00002|vsctl|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (Protocol error)
Nov 7 08:47:55 juju-zhhuabj-machine-6 ovs-vsctl: ovs|00001|reconnect|WARN|unix:/var/run/openvswitch/db.sock: connection attempt failed (Protocol error)

Revision history for this message
Hua Zhang (zhhuabj) wrote :

[bug description: ovsdb_monitor doesn't work ]

2014-11-10 08:57:03.819 18985 CRITICAL neutron [-] Trying to re-send() an already-triggered event.
2014-11-10 08:57:07.798 30149 ERROR neutron.agent.linux.ovsdb_monitor [req-53520eed-601f-45af-ad6d-c036f7b25964 None] Error received from ovsdb monitor: sudo: unable to resolve host neutron-gateway

It still reports the CRITICAL error below after applying this patch (http://bazaar.launchpad.net/~cts-engineering/charms/trusty/quantum-gateway/quantum-gateway-ha/revision/94), which first fixes the 'unable to resolve host neutron-gateway' error above.

2014-11-10 09:59:05.183 603 ERROR neutron.agent.linux.ovsdb_monitor [-] Error received from ovsdb monitor: 2014-11-10T09:59:05Z|00001|fatal_signal|WARN|terminating with signal 15 (Terminated)
2014-11-10 09:59:05.366 603 CRITICAL neutron [-] Trying to re-send() an already-triggered event.

Revision history for this message
Hua Zhang (zhhuabj) wrote :

[bug description: master node has two metadata-agent processes]
The charm starts one metadata-agent process, and corosync starts one as well, so the charm should stop its metadata-agent process when the ha relation is joined (the manual equivalent is sketched below).
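
Manually, the equivalent of that charm change would be roughly:

# Illustrative only; the real change belongs in the charm's ha relation hook,
# leaving the pacemaker-managed res_neutron-metadata-agent as the only copy.
sudo service neutron-metadata-agent stop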

[bug description: l3-agent resource and metadata-agent resource are scheduled onto different nodes]
l3-agent and metadata-agent use a local unix socket to communicate, so they should run on the same node:
sudo crm configure group res_group_l3_metadata res_neutron-l3-agent res_neutron-metadata-agent

Revision history for this message
Hua Zhang (zhhuabj) wrote :

[bug description: dhcp doesn't work]
Today I used the latest patches to create a fresh environment and found that DHCP does not work; in previous tests the DHCP function was usually OK. I also cannot find any relevant entries in dhcp-agent.log.

Revision history for this message
Xiang Hui (xianghui) wrote :

[bug description: neutron-dhcp-agent]
First, here we are talking about neutron-dhcp-agent's native HA; the steps are:
1. Deploy three quantum-gateway nodes.
2. Create a network and boot a VM.
3. Confirm that there are three DHCP agents hosting the same network.
4. Reboot the master quantum-gateway node.
5. The master role moves to a different node as a result of step 4.
6. List the DHCP agents hosting the network.
7. The DHCP agent on the original master quantum-gateway node fails to host the network.

This is caused by the hostname change: on failover, the hostname of the failed DHCP agent changes back to its original value rather than 'neutron-gateway', which did not exist when the network was created.

But this will not affect failover; we are still looking for a possible workaround (one idea is sketched below).
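
One possible manual workaround, sketched with standard neutron CLI agent-scheduler commands (not what the charm ended up doing; the IDs are placeholders): rebind the network to the DHCP agent under its current hostname.

# Placeholder IDs throughout; run against the affected network after a failover.
neutron dhcp-agent-list-hosting-net <net-id>                # shows the stale agent binding
neutron dhcp-agent-network-remove <stale-agent-id> <net-id> # drop the binding to the old hostname
neutron dhcp-agent-network-add <current-agent-id> <net-id>  # rebind under the agent's current hostname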

Revision history for this message
Xiang Hui (xianghui) wrote :

[bug description: juju set neutron-gateway 'ext-port=eth1' failed]

If one of the neutron-gateway nodes has changed its hostname, juju set 'ext-port=eth1' will fail on that node: the new external port cannot be added to br-ex. A possible workaround is to set 'ext-port=eth1' before changing the hostname.
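
Roughly, using the juju 1.x syntax from the comment above (the point is the ordering, not the exact commands):

# Configure the external port while the node still has its juju-assigned hostname...
juju set neutron-gateway ext-port=eth1
# ...and only afterwards let pacemaker take over and apply the shared neutron-gateway hostname.
sudo service pacemaker start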

Revision history for this message
James Page (james-page) wrote :

Just for the record I'm not comfortable with this proposed feature for the gateway charm; it might fix HA but it breaks horizontal scalability.

We're due to deliver router HA (alongside DVR) in the next charm dev cycle, implemented using the neutron capabilities provided in >= juno.

Xiang Hui (xianghui)
Changed in quantum-gateway (Juju Charms Collection):
status: In Progress → Fix Released
milestone: none → 15.01