neutron DHCP server running on a different controller than shown in crm status; metadata server will not run

Bug #1469889 reported by Timothy Browne
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: Fuel Library (Deprecated)

Bug Description

Fuel 6.0

A three-controller setup is running into the following:

Issue:

If the node running my private VLAN network's DHCP/metadata server is restarted, the DHCP server moves to another node in the cluster. However, if the node actually running the DHCP server is not the one reported as master, the metadata server will not be listening. Additionally, rebooting the new master moves the master role back to the previous node, but it still does not work; only rebooting both nodes simultaneously resolves the issue.

Symptom:
Instantiating a system in Horizon fails to inject the SSH keypair and repeatedly logs:
2015-06-29 20:35:14,099 - url_helper.py[WARNING]: Calling 'http://($IPADDRESS)//latest/meta-data/instance-id' failed [24/120s]: request error [[Errno 111] Connection refused]
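
A quick way to confirm this from a controller is to query the metadata endpoint from inside the DHCP namespace (a sketch; the network UUID here is the one from the snippets below):

root@node-2:~# ip netns exec qdhcp-54d9418f-6d46-4800-b9b7-ee65e2146954 curl -s http://169.254.169.254/latest/meta-data/instance-id

When the metadata proxy is not listening, this fails with the same "Connection refused" seen in the cloud-init log above.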

Steps to reproduce:

1. Find the controller with the DHCP/metadata server listening (one way is sketched below).
2. Reboot it.
3. Try to instantiate a host; it fails.
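
For step 1, one way to locate that controller (a sketch; crm resource status is standard crmsh, and qdhcp-<network-uuid> is a placeholder for the namespace shown by ip netns):

root@node-1:~# crm resource status p_neutron-dhcp-agent
root@node-1:~# ip netns show | grep qdhcp
root@node-1:~# ip netns exec qdhcp-<network-uuid> netstat -nltp | grep ':80 '

The node where something is listening on port 80 inside the namespace is the one actually serving metadata.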

Snippets:

Online: [ node-1 node-2 node-3 ]
p_neutron-dhcp-agent (ocf::mirantis:neutron-agent-dhcp): Started [ node-2 ]

root@node-1:~# ip netns show
haproxy

root@node-2:~# ip netns show
qdhcp-54d9418f-6d46-4800-b9b7-ee65e2146954
haproxy
root@node-2:~# ip netns exec qdhcp-54d9418f-6d46-4800-b9b7-ee65e2146954 netstat -nltp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN 43435/python
tcp 0 0 10.3.17.26:53 0.0.0.0:* LISTEN 43386/dnsmasq
tcp 0 0 169.254.169.254:53 0.0.0.0:* LISTEN 43386/dnsmasq
tcp6 0 0 fe80::f816:3eff:fe90:53 :::*

root@node-3:~# ip netns show
haproxy

Instances create successfully.

---reboot node-2---

Online: [ node-1 node-2 node-3 ]
p_neutron-dhcp-agent (ocf::mirantis:neutron-agent-dhcp): Started [ node-3 ]

root@node-1:~# ip netns show
haproxy

root@node-2:~# ip netns show
qdhcp-652b83ac-4edc-4af3-9952-f040e5211c57
haproxy
root@node-2:~# ip netns exec qdhcp-652b83ac-4edc-4af3-9952-f040e5211c57 netstat -nltp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 10.3.19.21:53 0.0.0.0:* LISTEN 27238/dnsmasq
tcp 0 0 169.254.169.254:53 0.0.0.0:* LISTEN 27238/dnsmasq
tcp6 0 0 fe80::f816:3eff:fed0:53 :::*

root@node-3:~# ip netns show
qdhcp-652b83ac-4edc-4af3-9952-f040e5211c57
haproxy
root@node-3:~# ip netns exec qdhcp-652b83ac-4edc-4af3-9952-f040e5211c57 netstat -nltp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN 31665/python
tcp 0 0 10.3.19.22:53 0.0.0.0:* LISTEN 31603/dnsmasq
tcp 0 0 169.254.169.254:53 0.0.0.0:* LISTEN 31603/dnsmasq
tcp6 0 0 fe80::f816:3eff:fe0c:53 :::* LISTEN

Instances fail to create correctly (DHCP was provided by node-2, which has no listening metadata server; the host entry was found in /var/lib/neutron/dhcp/652b83ac-4edc-4af3-9952-f040e5211c57/host on node-2).
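
To verify which node's dnsmasq handed out the lease, the instance's MAC address can be looked up in each agent's host file (a sketch; the MAC is a placeholder):

root@node-2:~# grep -i 'fa:16:3e:aa:bb:cc' /var/lib/neutron/dhcp/652b83ac-4edc-4af3-9952-f040e5211c57/host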

---reboot node-2---
The cluster says node-3 is the DHCP master.
node-2 retains the DHCP server but has no metadata server.
node-3 does not have any DHCP servers running.

Instances fail to create correctly (as expected)

---reboot node-3---
No change: node-3 is still listed as master; node-2 retains the DHCP server but has no metadata server.

---reboot both nodes---

node-2 is now listed as the DHCP master.
node-2 has the metadata server listening.
Instances can be instantiated again.
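
A less disruptive recovery than rebooting both nodes might be to restart the Pacemaker resource itself, though this is untested in this scenario (crm resource restart is standard crmsh):

root@node-1:~# crm resource restart p_neutron-dhcp-agent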

* node-1 is the cluster master in this scenario; however, this was iterated in a similar fashion with different master / DHCP-master node combinations, with no change in this strange behavior.

The only manual edit was to set "enable_isolated_metadata = True" in /etc/neutron/dhcp_agent.ini, because we have set up provider VLANs and the Cisco switch is our gateway for all VLANs.
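
For reference, the edit in full (the option lives in the [DEFAULT] section of dhcp_agent.ini):

[DEFAULT]
# Spawn a neutron-ns-metadata-proxy inside each isolated network's
# qdhcp namespace, since the provider VLAN gateway (the Cisco switch)
# bypasses Neutron routers.
enable_isolated_metadata = True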

Is this a bug, or am I just completely missing something?
I'm new to OpenStack, and I believe I have searched thoroughly, but likely have not.

Tim

Oleksiy Molchanov (omolchanov) wrote:

Hi, please attach a diagnostic snapshot.

Changed in fuel:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Fuel Library Team (fuel-library)
milestone: none → 7.0
tags: added: customer-found
Changed in fuel:
status: Confirmed → Incomplete
milestone: 7.0 → 6.0.2
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Aleksandr Didenko (adidenko)
Aleksandr Didenko (adidenko) wrote:

This has been in the Incomplete state for a while, so I am closing it as Invalid.

Tim, if you cannot provide a diagnostic snapshot, please provide at least more information about your environment configuration: whether it runs CentOS or Ubuntu, the network segmentation details (it looks like you used VLAN segmentation, but we still need to be sure), and the list of nodes and their roles. We need this information in order to reproduce the issue.
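
A sketch of how that information can be collected on the Fuel master node, assuming the standard Fuel 6.x CLI:

root@fuel:~# fuel node       # lists node IDs, roles and statuses
root@fuel:~# fuel snapshot   # generates the diagnostic snapshot tarball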

tags: removed: customer-found
Changed in fuel:
status: Incomplete → Invalid
assignee: Aleksandr Didenko (adidenko) → Fuel Library Team (fuel-library)
Changed in fuel:
milestone: 6.0.2 → 6.0.1