dhcp dnsmasq lost port in host config file

Bug #1192381 reported by Wu Wenxiang on 2013-06-18
This bug affects 19 people
Affects            Importance  Assigned to  Milestone
neutron            Critical    Maru Newby
neutron (havana)   Critical    Maru Newby   Havana

Bug Description

Version: Stable/Grizzly

Sometimes, when starting 5 VMs each with 2 NICs, some of the NICs cannot get an IP address from the dhcp-agent.
Running dhclient (or ipconfig /renew) does not work around it.
I found that the related port MACs are missing from the dnsmasq config file (=> host), which causes this issue.

The workaround is to delete the VM and create it again,
or to kill all of the dnsmasq processes and the dhcp-agent, then start them again.

Wu Wenxiang (wu-wenxiang) wrote :

While debugging, I found that the function port_update_end() is not called when quantum is busy, so the agent cache is not updated when a port is created along with a VM. As a result the host file (used by dnsmasq) does not contain the new VM's MAC and IP address.

As a change in the dhcp-agent, I run a periodic thread that sets the sync flag to True every 10 minutes.
The cache is then refreshed every 10 minutes, so any VM that did not get an IP from the dhcp-agent will get one within 10 minutes.

My solution does not fix this issue gracefully, but it is an improvement.
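A minimal sketch of the periodic-resync workaround described above. The class and attribute names (DhcpAgentStub, needs_resync, force_resync) are illustrative, not the actual Grizzly dhcp-agent internals; the real agent checks its sync flag in its main loop and performs a full state sync when it is set.

```python
import threading
import time


class DhcpAgentStub:
    """Stand-in for the dhcp-agent's sync state (names are illustrative)."""

    def __init__(self):
        self.needs_resync = False
        self._lock = threading.Lock()

    def force_resync(self):
        # In the real agent, setting this flag makes the next loop
        # iteration re-read all networks/ports and rewrite the dnsmasq
        # hosts files, picking up any missed port notifications.
        with self._lock:
            self.needs_resync = True


def periodic_resync(agent, interval=600, iterations=None):
    """Set the resync flag every `interval` seconds (600 s = 10 min)."""
    n = 0
    while iterations is None or n < iterations:
        time.sleep(interval)
        agent.force_resync()
        n += 1
```

In practice this would run as a background thread, e.g. `threading.Thread(target=periodic_resync, args=(agent,), daemon=True).start()`, which matches the "period thread" the comment describes: missed notifications are eventually repaired, at the cost of up to a 10-minute delay.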

Mark McClain (markmcclain) wrote :

Is there a set of reproducible steps to trigger this bug?

Changed in quantum:
status: New → Incomplete
tags: added: l3-ipam-dhcp
Wu Wenxiang (wu-wenxiang) wrote :

This issue does not always happen, but it happens very often in our production environment, which is:
1. Quantum in GRE mode; DHCP uses dnsmasq.
2. One L3 node (24 cores, 32 GB) running the dhcp-agent, metadata-agent, l3-agent, and OVS plugin agent.
3. Three compute nodes (24 cores, 96 GB each) running nova-compute and the OVS plugin agent.
4. More than 50 VMs already exist; each VM has 2 NICs belonging to 2 networks.
5. 10 tenants, 2 networks per tenant.

Then:
6. Boot 5 VMs with 2 NICs each, and repeat 10 times. Check all 50 new VMs: you will find that some of them did not get an IP from dnsmasq. Check the related host file: the MAC/IP entries are missing.

I think it is definitely a bug in quantum.

Changed in quantum:
status: Incomplete → New
Mark McClain (markmcclain) wrote :

Might have been a function of refresh speed in Grizzly. Need to look at the changes in stable/havana and compare against Icehouse.

Changed in neutron:
status: New → Triaged
importance: Undecided → Medium
Maru Newby (maru) on 2013-11-25
Changed in neutron:
assignee: nobody → Maru Newby (maru)
Maru Newby (maru) wrote :

I can reproduce the bug on icehouse with devstack:

- get the tenant id of the demo user (keystone tenant-list)
- update the cores and instances quota to 50+ for that user (nova-manage project quota [tenant-id] --key=cores --value=50)
- boot 50+ nano (64mb) instances (will need lots of ram - 4gb+), e.g.

for i in $(seq 1 50); do
  nova boot --flavor 42 --image cirros-0.3.1-x86_64-disk stress-$i &
done

- check the line count of the dnsmasq hosts file ('ps aux | grep dnsmasq' and take the path passed to --dhcp-hostsfile)
  - the number of lines should equal the number of VMs booted

I see anywhere from 5-10 missing leases, and the VMs for those missing leases are not assigned an IP address. Restarting the agent ensures the entries are created, but by then it is too late for the VMs.

This issue has also been reported in RHOS 3.0: https://bugzilla.redhat.com/show_bug.cgi?id=1023818

Maru Newby (maru) on 2013-12-03
Changed in neutron:
importance: Medium → Critical
tags: added: neutron-parallel
Maru Newby (maru) wrote :

I instrumented the dhcp notifier and found that under high cpu load a call to _get_dhcp_agents() (https://github.com/openstack/neutron/blob/master/neutron/api/rpc/agentnotifiers/dhcp_rpc_agent_api.py#L79) was returning no agents, preventing port_create_end notifications from being sent to the agent. _get_dhcp_agents() filters by active=True, so a failure to deliver any results indicated that the agent heartbeat was not being received by the neutron service in a timely fashion. Since the notification of port creation was never sent to the agent, the dnsmasq instance for the network was not updated with the port's ip assignment and the VM associated with the port was unable to configure itself for connectivity.

Reproduction is as per the instructions in the previous comment, booting 75 VMs instead of 50 to ensure consistent failure. This required setting the quota for cores and instances accordingly. Many of the VMs had multiple ports allocated for them (as per https://bugs.launchpad.net/neutron/+bug/1160442), up to the neutron port quota of 100. An example result was that 59 notifications were sent to the agent, and 41 were not sent because the agent was not reported as 'active'. Of the 59 notifications sent, 43 resulted in host entries in the dhcp server, and 16 were part of VM creations that failed due to Neutron timeouts. Out of the 75 boot attempts, 54 reported as successful, but many of the successful VMs were subsequently unable to configure connectivity due to missing entries in the dhcp hosts file (only 43 ip addresses were configured).

The clear result of this testing is that the current approach of 'spray and pray' notification is not sufficient to ensure reliable operation. If a notification that triggers a necessary action cannot be sent or delivered in a timely fashion, an error would ideally be propagated or at least logged. This bug also suggests that there needs to be better coordination between Nova and Neutron so that VM error state can be more accurately reported (and perhaps remedied).
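The liveness test implied by the analysis above can be sketched as follows: an agent counts as "active" only if its last heartbeat arrived within agent_down_time seconds, so a heartbeat delayed by CPU load makes _get_dhcp_agents() return nothing and the notification is silently dropped. The function names, dict shape, and the 75-second down_time default are assumptions for illustration, not the actual neutron code.

```python
import time

AGENT_DOWN_TIME = 75  # illustrative value, in seconds


def is_agent_active(last_heartbeat, now=None, down_time=AGENT_DOWN_TIME):
    """An agent is 'active' if its heartbeat is recent enough."""
    now = time.time() if now is None else now
    return (now - last_heartbeat) <= down_time


def filter_active_agents(agents, now, down_time=AGENT_DOWN_TIME):
    """Mimics the active=True filter: drop agents with stale heartbeats.

    If this returns an empty list, the notifier has nobody to send
    port_create_end to, and the new port never reaches dnsmasq.
    """
    return [a for a in agents
            if is_agent_active(a["heartbeat"], now, down_time)]
```

The failure mode is that the agent is healthy but slow to report: the filter treats "late heartbeat" as "dead agent" and the notification is lost rather than queued or retried.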

Snow Cherry (tuhongj) wrote :

It seems that with agent_down_time set to twice report_interval, the problem does not occur. For example, with agent_down_time = 120 and report_interval = 60, the dhcp-agent is not so easily regarded as dead.
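The tuning above would look roughly like this in the neutron configuration (values from the comment; the exact section each option lives in varies between the server and agent config files, so treat this as a sketch):

```ini
[DEFAULT]
# Each agent sends a heartbeat every report_interval seconds.
report_interval = 60
# The server marks an agent dead if no heartbeat arrives within
# agent_down_time seconds. Keeping this at least twice report_interval
# tolerates one delayed or missed heartbeat under load.
agent_down_time = 120
```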


Hi Maru, here are the details of the issues I was facing after installing Havana, compared to Grizzly.

Setup
  - One physical box with 16 cores and 256 GB memory; 2 VMs created on this box, one for the controller and one for the network node
  - 16 compute nodes (each with 16 cores and 256 GB memory)
  - All systems installed with Ubuntu Precise + Havana bits from the Ubuntu Cloud Archive

Steps to simulate the issue
  1) Concurrently create 30 instances (m1.small) using the REST API with mincount=30
  2) Sleep for 20 minutes and repeat step (1)

Issue 1
-------
In Havana, once we cross 150 instances (5 batches x 30), during the 6th batch some instances go into ERROR state
because their network port cannot be created, and some instances get duplicate IP addresses.

Per Maru Newby this issue might related to this bug
https://bugs.launchpad.net/bugs/1192381

I did the same with Grizzly in the same environment 2 months back, where I was able to deploy close to 240 instances without any errors.
Initially Grizzly showed the same behaviour, but with these tunings based on bug
https://bugs.launchpad.net/neutron/+bug/1160442, we never had issues (tested more than 10 times):
       sqlalchemy_pool_size = 60
       sqlalchemy_max_overflow = 120
       sqlalchemy_pool_timeout = 2
       agent_down_time = 60
       report_interval = 20

In Havana, I applied the same tunables but could never get past 150 instances; without the tunables I could not get past
100 instances. We are getting many timeout errors from the DHCP agent and neutron clients.

NOTE: After tuning agent_down_time to 60 and report_interval to 20, we no longer get these error messages:
   2013-12-02 11:44:43.421 28201 WARNING neutron.scheduler.dhcp_agent_scheduler [-] No more DHCP agents
   2013-12-02 11:44:43.439 28201 WARNING neutron.scheduler.dhcp_agent_scheduler [-] No more DHCP agents
   2013-12-02 11:44:43.452 28201 WARNING neutron.scheduler.dhcp_agent_scheduler [-] No more DHCP agents

In the compute node's openvswitch agent logs, these errors repeat continuously:

2013-12-04 06:46:02.081 3546 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent Timeout: Timeout while waiting on RPC response - topic: "q-plugin", RPC method: "security_group_rules_for_devices" info: "<unknown>"
and WARNING neutron.openstack.common.rpc.amqp [-] No calling threads waiting for msg_id

DHCP agent has below errors

2013-12-02 15:35:19.557 22125 ERROR neutron.agent.dhcp_agent [-] Unable to reload_allocations dhcp.
2013-12-02 15:35:19.557 22125 TRACE neutron.agent.dhcp_agent Timeout: Timeout while waiting on RPC response - topic: "q-plugin", RPC method: "get_dhcp_port" info: "<unknown>"

2013-12-02 15:35:34.266 22125 ERROR neutron.agent.dhcp_agent [-] Unable to sync network state.
2013-12-02 15:35:34.266 22125 TRACE neutron.agent.dhcp_agent Timeout: Timeout while waiting on RPC response - topic: "q-plugin", RPC method: "get_active_networks_info" info: "<unknown>"

In Havana, I merged the code from this patch and set api_workers to 8:
https://review.openstack.org/#/c/37131/

After this patch and starting 8 neutron-server worker threads, during the batch crea...


Thierry Carrez (ttx) on 2013-12-10
Changed in neutron:
milestone: none → icehouse-2

Fix proposed to branch: master
Review: https://review.openstack.org/61168

Changed in neutron:
status: Triaged → In Progress
Maru Newby (maru) on 2013-12-11
tags: added: havana-backport-potential
Maru Newby (maru) wrote :

The committed fix is intended to serve as a stop-gap until the dhcp agent is updated to be eventually consistent. That effort will be tracked by the following blueprint: https://blueprints.launchpad.net/neutron/+spec/eventually-consistent-dhcp-agent

Changed in neutron:
status: In Progress → Fix Committed
Thierry Carrez (ttx) on 2014-01-22
Changed in neutron:
status: Fix Committed → Fix Released
Alan Pevec (apevec) on 2014-02-02
tags: removed: havana-backport-potential
Roey Dekel (rdekel) wrote :

Tried to check on Havana with:
openstack-neutron-2013.2.2-1.el6ost.noarch

First Try:
=======

Reproduce Steps:
----------------
1. Set up an environment with a tenant network (internal VLAN) and update quotas:
    instances - 100
    cores - 100
    ports - 150
2. Boot a vm to verify working setup
3. Boot 90 VMs in parallel:
    # for i in $(seq 1 90); do nova boot stress-${i} --flavor 1 --image cirros-0.3.1 --nic net-id=`neutron net-list | grep netInt | cut -d" " -f2` & done
4. Verify the VMs are working, with non-identical IPs
5. Delete the 90 VMs in parallel:
    # for i in $(seq 1 90); do nova delete stress-${i} & done

Expected Results:
-----------------
The environment returns to the same state as before the stress boot.

Results:
--------
26 VMs in ERROR
3 VMs in ACTIVE

Comments:
---------
1. Step 4 was verified with the following command, which showed 92 entries (91 VMs + DHCP):
    # cat /var/lib/neutron/dhcp/cef6f793-0b5f-429f-9e4b-f9e4e70dbca3/host | cut -d"," -f3 | sort -n | uniq | wc -l
2. Attached is a capture from a moment after sending the deletion command (step 5) - attachment 866663 [details].
3. Attached is the nova-compute log, which indicates a problem with a deleted port - it was probably deleted before the VM was deleted.

Second Try:
==========
Cleared the undeleted VMs and tried again, deleting sequentially this time.

Reproduce Steps:
----------------
1. Boot 90 VMs in parallel:
    # for i in $(seq 1 90); do nova boot stress-${i} --flavor 1 --image cirros-0.3.1 --nic net-id=`neutron net-list | grep netInt | cut -d" " -f2` & done
2. Delete the 90 VMs sequentially:
    # for i in $(seq 1 90); do nova delete stress-${i} ; done

Expected Results:
-----------------
1. 90 new ACTIVE VMs with valid IPs.
2. All 90 VMs deleted.

Results:
--------
1. 3 VMs stuck in BUILD.
2. The 3 VMs in BUILD were not deleted:
   1 VM - ERROR
   1 VM - ACTIVE (as if nothing had happened)

Thierry Carrez (ttx) on 2014-04-17
Changed in neutron:
milestone: icehouse-2 → 2014.1
