Hi Maru, here are the details of the issues I have been facing since installing Havana, compared to Grizzly.
Setup
- One physical box with 16 cores and 256 GB memory. 2 VMs were created on this box: one for the Controller and one for the Network Node
- 16x compute nodes (each with 16 cores and 256 GB memory)
- All systems are installed with Ubuntu Precise + Havana bits from the Ubuntu Cloud Archive
Steps to simulate the issue
1) Concurrently create 30 instances (m1.small) using the REST API with mincount=30
2) Sleep for 20 min and repeat step (1)
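The two steps above could be scripted roughly as follows. This is only a sketch, not the script used for these tests: the `min_count`/`max_count` body fields are my assumption of how "mincount=30" maps onto the Nova create-server request, and `post`, the server names and the batch count of 8 (8 x 30 = 240) are hypothetical placeholders.

```python
import json
import time


def build_boot_request(name, image_ref, flavor_ref, count):
    """Build a create-server body asking for `count` instances in one call.

    Assumption: min_count/max_count are the multi-create fields of the
    Nova servers API; adjust if your API version names them differently.
    """
    return {
        "server": {
            "name": name,
            "imageRef": image_ref,
            "flavorRef": flavor_ref,
            "min_count": count,
            "max_count": count,
        }
    }


def run_batches(post, image_ref, flavor_ref, batches=8, count=30,
                pause_s=20 * 60, sleep=time.sleep):
    """Send one multi-create request per batch, pausing between batches.

    `post` is a caller-supplied (hypothetical) function that sends the JSON
    body to the Nova servers endpoint and takes care of authentication.
    """
    sent = []
    for batch in range(1, batches + 1):
        body = build_boot_request("scale-b%02d" % batch, image_ref,
                                  flavor_ref, count)
        post(json.dumps(body))
        sent.append(body)
        if batch < batches:
            sleep(pause_s)  # step (2): wait before repeating step (1)
    return sent
```

Passing `sleep=lambda s: None` lets the loop be dry-run without the 20 minute pauses.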
Issue 1
-------
In Havana, once we cross 150 instances (5 batches x 30), some instances in the 6th batch go into ERROR state
because their network ports cannot be created, and some instances get duplicate IP addresses.
Per Maru Newby, this issue might be related to this bug:
https://bugs.launchpad.net/bugs/1192381
I ran a similar test with Grizzly on the same environment 2 months ago, where I was able to deploy close to 240 instances without any errors.
Initially I saw the same behaviour on Grizzly too, but with these tunings, based on this bug
https://bugs.launchpad.net/neutron/+bug/1160442, I never had issues (tested more than 10 times):
sqlalchemy_pool_size = 60
sqlalchemy_max_overflow = 120
sqlalchemy_pool_timeout = 2
agent_down_time = 60
report_interval = 20
In Havana, I have set the same tunables but could never get past 150 instances. Without the tunables I could not get past
100 instances. We are getting many timeout errors from the DHCP agent and the neutron clients.
NOTE: After tuning agent_down_time to 60 and report_interval to 20, we no longer get these error messages:
2013-12-02 11:44:43.421 28201 WARNING neutron.scheduler.dhcp_agent_scheduler [-] No more DHCP agents
2013-12-02 11:44:43.439 28201 WARNING neutron.scheduler.dhcp_agent_scheduler [-] No more DHCP agents
2013-12-02 11:44:43.452 28201 WARNING neutron.scheduler.dhcp_agent_scheduler [-] No more DHCP agents
In the openvswitch agent logs on the compute nodes, we see these errors repeating continuously:
2013-12-04 06:46:02.081 3546 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent Timeout: Timeout while waiting on RPC response - topic: "q-plugin", RPC method: "security_group_rules_for_devices" info: "<unknown>"
and WARNING neutron.openstack.common.rpc.amqp [-] No calling threads waiting for msg_id
The DHCP agent logs the errors below:
2013-12-02 15:35:19.557 22125 ERROR neutron.agent.dhcp_agent [-] Unable to reload_allocations dhcp.
2013-12-02 15:35:19.557 22125 TRACE neutron.agent.dhcp_agent Timeout: Timeout while waiting on RPC response - topic: "q-plugin", RPC method: "get_dhcp_port" info: "<unknown>"
2013-12-02 15:35:34.266 22125 ERROR neutron.agent.dhcp_agent [-] Unable to sync network state.
2013-12-02 15:35:34.266 22125 TRACE neutron.agent.dhcp_agent Timeout: Timeout while waiting on RPC response - topic: "q-plugin", RPC method: "get_active_networks_info" info: "<unknown>"
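To see which RPC methods time out most often, the agent logs can be tallied with something like the sketch below. It is not part of the original test tooling; it only assumes the "Timeout while waiting on RPC response" line format shown in the excerpts above, and the function name is mine.

```python
import re
from collections import Counter

# Matches the timeout lines quoted above and captures topic and RPC method.
RPC_TIMEOUT = re.compile(
    r'Timeout while waiting on RPC response - topic: "([^"]+)", '
    r'RPC method: "([^"]+)"'
)


def count_rpc_timeouts(lines):
    """Return a Counter mapping RPC method name to number of timeout lines."""
    counts = Counter()
    for line in lines:
        match = RPC_TIMEOUT.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts
```

For example, `count_rpc_timeouts(open(path))` can be run over each agent log file (the path is whatever your deployment uses) and the results compared between the Grizzly and Havana runs.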
In Havana, I have merged the code from this patch and set api_workers to 8:
https://review.openstack.org/#/c/37131/
After applying this patch and starting 8 neutron-server workers, during the batch creation of 240 instances with 30 concurrent requests in each batch,
238 instances became active and 2 instances went into error. Interestingly, the 2 instances that went into error state are from the same compute node.
Unlike earlier, this time the errors are due to 'Too many connections' to the MySQL database:
2013-12-04 17:07:59.877 21286 AUDIT nova.compute.manager [req-26d64693-d1ef-40f3-8350-659e34d5b1d7 c4d609870d4447c684858216da2f8041 9b073211dd5c4988993341cc955e200b] [instance: c14596fd-13d5-482b-85af-e87077d4ed9b] Terminating instance
2013-12-04 17:08:00.578 21286 ERROR nova.compute.manager [req-26d64693-d1ef-40f3-8350-659e34d5b1d7 c4d609870d4447c684858216da2f8041 9b073211dd5c4988993341cc955e200b] [instance: c14596fd-13d5-482b-85af-e87077d4ed9b] Error: Remote error: OperationalError (OperationalError) (1040, 'Too many connections') None None
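A rough back-of-the-envelope check of why 'Too many connections' can show up with these settings is sketched below. It assumes each neutron-server worker process keeps its own SQLAlchemy pool (so the limit is pool_size + max_overflow per process) and that MySQL is running with its stock max_connections of 151; I have not confirmed the value on the tested server, and other services such as nova also hold connections that are not counted here.

```python
def max_db_connections(pool_size, max_overflow, processes):
    """Upper bound on open connections across all worker processes.

    A SQLAlchemy queue pool can open at most pool_size + max_overflow
    connections; the assumption here is one such pool per process.
    """
    return (pool_size + max_overflow) * processes


# MySQL's stock default; the server used in the tests may be configured differently.
MYSQL_DEFAULT_MAX_CONNECTIONS = 151

# Tunables from this report: pool_size=60, max_overflow=120, api_workers=8.
worst_case = max_db_connections(pool_size=60, max_overflow=120, processes=8)
print(worst_case, worst_case > MYSQL_DEFAULT_MAX_CONNECTIONS)  # 1440 True
```

Under these assumptions the workers alone could ask for far more connections than the database accepts, so either max_connections or the pool settings would need adjusting.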
Need to backport the patch 'https://review.openstack.org/#/c/37131/' to address the Neutron scaling issues in Havana.
Issue 2
--------
Grizzly :
During concurrent instance creation in Grizzly, once we cross 210 instances, some instances in the subsequent batch of 30
could not get their IP address within the first few minutes of the first boot. The instance MAC and IP address details
were written to the dnsmasq hosts file, but with a delay, so the instances eventually got their IP addresses.
If we rebooted an instance using 'nova reboot', it got its IP address.
* The delay depends on the number of network ports and ranges from 8 seconds to 2 minutes
Havana :
But in Havana only 81 instances could get an IP during the first boot. Ports are created and IP addresses are allocated
very quickly, but it takes a long time for the port to come UP. Once the port is UP, instances are able to send the DHCP request
and get an IP address.
During network port create and network port update, there are a lot of 'security_group_rules_for_devices' messages. The OVS agents on the
compute nodes are getting timeouts during "security_group_rules_for_devices".
This issue also exists in Grizzly, but there we observed it only after 200+ active instances (200 network ports), whereas in Havana
we see it with fewer than 100 active ports.
In Havana, if we reboot an instance, it is not able to get its IP address even though its network port entry already exists in the
dnsmasq hosts file. We can no longer even ping the IP address that we could ping before the instance reboot. After restarting the
'neutron-dhcp-agent' service, which restarts 'dnsmasq', and rebooting the instance, it could get the IP.
This clearly shows a performance regression in neutron/havana compared to quantum/grizzly.
I am happy to share the results of my Grizzly tests and the logs from the recent Havana tests.