when boot many vms with quantum, nova sometimes allocates two quantum ports rather than one

Bug #1160442 reported by dan wendlandt on 2013-03-26
38
This bug affects 4 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Medium
Unassigned
neutron
High
Jakub Libosvar
Grizzly
High
Gary Kotton
oslo-incubator
Medium
Gary Kotton
Ubuntu
Undecided
Unassigned

Bug Description

i have install grizzly-g3, but quantum does not work well, when i boot 128 instances, i found one of instances got more than one fixed ip, howerver, when i boot 64 intances, it nerver happened, besides that , sometimes i can not ping vm with floatingip, i did not find any error message in my quantum log( all the files in the /var/log/quantum), follows are the error output and configurations

| 97a93600-38e2-4700-9851-15ef56c1d628 | slave | ACTIVE | demo-int-net=172.16.100.4 |
| 99aeb6b8-4252-4839-a7d1-f87853116100 | slave | ACTIVE | demo-int-net=172.16.100.117 |
| 9aa82a35-c9f1-4f44-a108-d14e74eec231 | slave | ACTIVE | demo-int-net=172.16.100.108, 172.16.100.109 |
| 9b6b1289-c450-4614-b647-e5ebdffff80a | slave | ACTIVE | demo-int-net=172.16.100.5 |
| 9e0d3aa5-0f15-4b24-944a-6d6c3e18ce64 | slave | ACTIVE | demo-int-net=172.16.100.35 |
| 9ea62124-9128-43cc-acdd-142f1e7743d6 | slave | ACTIVE | demo-int-net=172.16.100.132 |

my setup : one db host(db service), one glance host(glance service), on api host(keystone,nova-api,nova-scheduler, nova-conductor, quantum-server,quantum-dhcp, quantum-l3-agent,quantum-plugin-openvswitch-agent), eight compute host(each host with nova-compute, quantum-plugin-openvswitch-agent), i check that all the service on all hosts works well

i used vlan type network and openvswitch plugin:

my quantum.conf

[DEFAULT]
# Default log level is INFO
# verbose and debug has the same result.
# One of them will set DEBUG log level output
debug = True

# Address to bind the API server
bind_host = 0.0.0.0

# Port the bind the API server to
bind_port = 9696

# Quantum plugin provider module
# core_plugin =
core_plugin = quantum.plugins.openvswitch.ovs_quantum_plugin.OVSQuantumPluginV2
# Advanced service modules
# service_plugins =

# Paste configuration file
api_paste_config = /etc/quantum/api-paste.ini

# The strategy to be used for auth.
# Supported values are 'keystone'(default), 'noauth'.
auth_strategy = keystone

# Modules of exceptions that are permitted to be recreated
# upon receiving exception data from an rpc call.
# allowed_rpc_exception_modules = quantum.openstack.common.exception, nova.exception
# AMQP exchange to connect to if using RabbitMQ or QPID
control_exchange = quantum

# RPC driver. DHCP agents needs it.
notification_driver = quantum.openstack.common.notifier.rpc_notifier

# default_notification_level is used to form actual topic name(s) or to set logging level
default_notification_level = INFO

# Defined in rpc_notifier, can be comma separated values.
# The actual topic names will be %s.%(default_notification_level)s
notification_topics = notifications

[QUOTAS]
# resource name(s) that are supported in quota features
# quota_items = network,subnet,port

# default number of resource allowed per tenant, minus for unlimited
# default_quota = -1

# number of networks allowed per tenant, and minus means unlimited
# quota_network = 10

# number of subnets allowed per tenant, and minus means unlimited
# quota_subnet = 10

# number of ports allowed per tenant, and minus means unlimited
quota_port = 5000
quota_floatingip = 5000

# default driver to use for quota checks
# quota_driver = quantum.quota.ConfDriver

# =========== items for agent management extension =============
# Seconds to regard the agent as down.
# agent_down_time = 5
# =========== end of items for agent management extension =====

[DEFAULT_SERVICETYPE]
# Description of the default service type (optional)
# description = "default service type"
# Enter a service definition line for each advanced service provided
# by the default service type.
# Each service definition should be in the following format:
# <service>:<plugin>[:driver]

[SECURITYGROUP]
# If set to true this allows quantum to receive proxied security group calls from nova
# proxy_mode = False

[AGENT]
root_helper = sudo quantum-rootwrap /etc/quantum/rootwrap.conf

# =========== items for agent management extension =============
# seconds between nodes reporting state to server, should be less than
# agent_down_time
# report_interval = 4

# =========== end of items for agent management extension =====

[keystone_authtoken]
auth_host = host-keystone
auth_port = 35357
auth_protocol = http
admin_tenant_name = demoTenant
admin_user = test
admin_password = 123456
signing_dir = /var/lib/quantum/keystone-signing

my dhcp_agent.ini

[DEFAULT]
# Where to store dnsmasq state files. This directory must be writable by the
# user executing the agent.
state_path = /var/lib/quantum

# OVS based plugins(OVS, Ryu, NEC, NVP, BigSwitch/Floodlight)
interface_driver = quantum.agent.linux.interface.OVSInterfaceDriver

# The agent can use other DHCP drivers. Dnsmasq is the simplest and requires
# no additional setup of the DHCP server.
dhcp_driver = quantum.agent.linux.dhcp.Dnsmasq

my ovs_quantum_plugin.ini configure file

[DATABASE]
# This line MUST be changed to actually run the plugin.
sql_connection = mysql://quantum:quantum@host-db/quantum

# Database reconnection interval in seconds - if the initial connection to the
# database fails
reconnect_interval = 2

[OVS]
# (StrOpt) Type of network to allocate for tenant networks. The
# default value 'local' is useful only for single-box testing and
# provides no connectivity between hosts. You MUST either change this
# to 'vlan' and configure network_vlan_ranges below or change this to
# 'gre' and configure tunnel_id_ranges below in order for tenant
# networks to provide connectivity between hosts. Set to 'none' to
# disable creation of tenant networks.

tenant_network_type=vlan

network_vlan_ranges = DemoNet:1:4094

bridge_mappings = DemoNet:DemoBridge

[AGENT]
# Agent's polling interval in seconds
polling_interval = 2

[SECURITYGROUP]

when i execute "quantum router-gateway-set" to add external network id as the router gateway, i found that the status of port for external network id in the router is DOWN. does it matters, if it does, how can i fixed it.

this blocked me for serveral days , can someone help me sovle it, any help will be appreciated.

dan wendlandt (danwent) wrote :

Note: most of the interesting details are in the thread of: https://answers.launchpad.net/quantum/+question/225158

Basically, when a large number of vms are booted at once, each of which should have a single nic (and thus a single quantum port), every once and a while one of the VMs gets two quantum ports allocated. This is then visible to the tenant as two fixed IPs on the VM (since both quantum ports have a device_id of the same VM).

It may be that the allocate_for_instance logic is running twice for the same VM-id. unclear if this is happening on the same nova-compute host, or multiple. Since the ports are created back-to-back, my best guess is that its on the same nova-compute.

xinxin-shu, can you also provide the quantum-server log that covers the time period when the two ports are created? I'm trying to understand if there is an error on the quantum side that prompts nova to try to create the port again.

Changed in quantum:
status: New → Incomplete
dan wendlandt (danwent) wrote :

From the info provided on the question, it seems likely that the required change will be in nova.

tags: added: grizzly-backport-potential quantum

Hi Gary & Dan,

Can you guys take a look at it? I have other stuff depending on this in
order to review/commit, etc. Can't find contacts of other core devstack
developers.

Thanks,
Robert

On 3/18/13 4:38 PM, "Robert Li (baoli)" <email address hidden> wrote:

>Public bug reported:
>
>Currently, lib/quantum supports two config files on the command line:
>quantum config file and core plugin config file. Ideally it should be
>able to support as many as indicated by the plugin script. For example,
>cisco plugin may need to load openvswitch plugin configuration file as
>well.
>
>** Affects: devstack
> Importance: Undecided
> Assignee: Baodong (Robert) Li (baoli)
> Status: New
>
>** Changed in: devstack
> Assignee: (unassigned) => Baodong (Robert) Li (baoli)
>
>--
>You received this bug notification because you are subscribed to the bug
>report.
>https://bugs.launchpad.net/bugs/1156831
>
>Title:
> support multiple config files from cmd line
>
>Status in devstack - openstack dev environments:
> New
>
>Bug description:
> Currently, lib/quantum supports two config files on the command line:
> quantum config file and core plugin config file. Ideally it should be
> able to support as many as indicated by the plugin script. For
> example, cisco plugin may need to load openvswitch plugin
> configuration file as well.
>
>To manage notifications about this bug go to:
>https://bugs.launchpad.net/devstack/+bug/1156831/+subscriptions

Changed in nova:
status: New → Triaged
importance: Undecided → Medium
Edgar Magana (emagana) wrote :
Download full text (3.4 KiB)

I am not able to reproduce this bug in my setup. I am using almost the same network configuration that the one posted in this bug report but I am using stable/grizzly with only one compute node, I am able to reach up to 200 VMs with no problems at all:

| 284a318b-84df-4008-a386-d4a7b0510dee | vm-178 | ACTIVE | None | Running | private=10.0.0.178 |
| b2e1bbec-e93c-4b48-9843-b45a45a33b87 | vm-179 | ACTIVE | None | Running | private=10.0.0.182 |
| c8f5d908-79c3-4020-adca-6348f76f44cf | vm-18 | ACTIVE | None | Running | private=10.0.0.22 |
| 2f452943-92f5-41c1-a294-f5e52ef97052 | vm-180 | ACTIVE | None | Running | private=10.0.0.184 |
| 43314b6c-7f6f-4da3-8054-5a8158a48ef9 | vm-181 | ACTIVE | None | Running | private=10.0.0.183 |
| 3821636d-c94b-4030-95d4-47c183893227 | vm-182 | ACTIVE | None | Running | private=10.0.0.187 |
| f6b27836-acf3-49a9-a960-a2a6835df534 | vm-183 | ACTIVE | None | Running | private=10.0.0.185 |
| c798b778-b471-4a1f-b43c-f9d69873354d | vm-184 | ACTIVE | None | Running | private=10.0.0.186 |
| 99a023ec-aeff-4451-9e3c-27316d1ffd78 | vm-185 | ACTIVE | None | Running | private=10.0.0.188 |
| 4475818f-801a-4a59-9a85-1113e54e0c18 | vm-186 | ACTIVE | None | Running | private=10.0.0.192 |
| 56145c8c-a645-4bdf-84b8-5078c19c39a4 | vm-187 | ACTIVE | None | Running | private=10.0.0.189 |
| f2c3074a-43aa-4506-b3a9-43f2ff3f461c | vm-188 | ACTIVE | None | Running | private=10.0.0.190 |
| f12c882a-c25b-46f2-8550-7ec807d3b0b7 | vm-189 | ACTIVE | None | Running | private=10.0.0.191 |
| 9a40b604-0f9e-44d0-a080-59bfb9adedf5 | vm-19 | ACTIVE | None | Running | private=10.0.0.20 |
| 952a25a6-501b-44de-9149-0cb25b217d96 | vm-190 | ACTIVE | None | Running | private=10.0.0.194 |
| 5012700f-83a7-4f9d-85a7-796a65043aed | vm-191 | ACTIVE | None | Running | private=10.0.0.193 |
| 9d7bbc67-55e1-49e8-8c7b-6f6bd5b2183a | vm-192 | ACTIVE | None | Running | private=10.0.0.195 |
| 13192f4b-541f-4e92-b3f5-51e51d28408a | vm-193 | ACTIVE | None | Running | private=10.0.0.198 |
| 4ff125dc-8f1d-40b1-9db6-738665cdb8e7 | vm-194 | ACTIVE | None | Running | private=10.0.0.197 |
| d8f366e2-5867-421c-ab86-c15c06802f48 | vm-195 | ACTIVE | None | Running | private=10.0.0.196 |
| 5ea852aa-02a8-409a-9dc0-8abd5dcd6dc0 | vm-196 | ACTIVE | None | Running | private=10.0.0.201 |
| 2c60b14a-d60a-4fbc-9f1c-509c5fb5ed8c | vm-197 | ACTIVE | None | Running | private=10.0.0.199 |
| eacddd5d-0e3d-4c7d-9bdb-2f9010ff83e3 | vm-198 | ACTIVE | None | Running | private=10.0.0.202 |
| 5dc1f542-abef-4d28-946c-e87b8f5cd48f | vm-199 | ACTIVE | None | Running | private=10.0.0.200 |
| bba098a3-de0a-40b0-b09e-4b6916287a97 | vm-2 | ACTIVE | None | Running | private=10.0.0.4 |
| 5192cdf0-d4ce-4c02-948e-c5659271da80 | vm-20 | ACTIVE | None | Running | private=10.0.0.23 |
| c13cccd3-9d43-4511-93ea-d023e03f8f36 | vm-200 | ACTIVE | None | Running | private=10.0.0.203 |

I am using a very simple scripts to boot all these VMs:
...

Read more...

Download full text (10.8 KiB)

Recently I upgrade my openstack setup from grizzly-g3 to grizzly stable, I have not got this bug yet.

-----Original Message-----
From: <email address hidden> [mailto:<email address hidden>] On Behalf Of Edgar Magana
Sent: Thursday, May 23, 2013 5:56 AM
To: Shu, Xinxin
Subject: [Bug 1160442] Re: when boot many vms with quantum, nova sometimes allocates two quantum ports rather than one

I am not able to reproduce this bug in my setup. I am using almost the same network configuration that the one posted in this bug report but I am using stable/grizzly with only one compute node, I am able to reach up to 200 VMs with no problems at all:

| 284a318b-84df-4008-a386-d4a7b0510dee | vm-178 | ACTIVE | None | Running | private=10.0.0.178 |
| b2e1bbec-e93c-4b48-9843-b45a45a33b87 | vm-179 | ACTIVE | None | Running | private=10.0.0.182 |
| c8f5d908-79c3-4020-adca-6348f76f44cf | vm-18 | ACTIVE | None | Running | private=10.0.0.22 |
| 2f452943-92f5-41c1-a294-f5e52ef97052 | vm-180 | ACTIVE | None | Running | private=10.0.0.184 |
| 43314b6c-7f6f-4da3-8054-5a8158a48ef9 | vm-181 | ACTIVE | None | Running | private=10.0.0.183 |
| 3821636d-c94b-4030-95d4-47c183893227 | vm-182 | ACTIVE | None | Running | private=10.0.0.187 |
| f6b27836-acf3-49a9-a960-a2a6835df534 | vm-183 | ACTIVE | None | Running | private=10.0.0.185 |
| c798b778-b471-4a1f-b43c-f9d69873354d | vm-184 | ACTIVE | None | Running | private=10.0.0.186 |
| 99a023ec-aeff-4451-9e3c-27316d1ffd78 | vm-185 | ACTIVE | None | Running | private=10.0.0.188 |
| 4475818f-801a-4a59-9a85-1113e54e0c18 | vm-186 | ACTIVE | None | Running | private=10.0.0.192 |
| 56145c8c-a645-4bdf-84b8-5078c19c39a4 | vm-187 | ACTIVE | None | Running | private=10.0.0.189 |
| f2c3074a-43aa-4506-b3a9-43f2ff3f461c | vm-188 | ACTIVE | None | Running | private=10.0.0.190 |
| f12c882a-c25b-46f2-8550-7ec807d3b0b7 | vm-189 | ACTIVE | None | Running | private=10.0.0.191 |
| 9a40b604-0f9e-44d0-a080-59bfb9adedf5 | vm-19 | ACTIVE | None | Running | private=10.0.0.20 |
| 952a25a6-501b-44de-9149-0cb25b217d96 | vm-190 | ACTIVE | None | Running | private=10.0.0.194 |
| 5012700f-83a7-4f9d-85a7-796a65043aed | vm-191 | ACTIVE | None | Running | private=10.0.0.193 |
| 9d7bbc67-55e1-49e8-8c7b-6f6bd5b2183a | vm-192 | ACTIVE | None | Running | private=10.0.0.195 |
| 13192f4b-541f-4e92-b3f5-51e51d28408a | vm-193 | ACTIVE | None | Running | private=10.0.0.198 |
| 4ff125dc-8f1d-40b1-9db6-738665cdb8e7 | vm-194 | ACTIVE | None | Running | private=10.0.0.197 |
| d8f366e2-5867-421c-ab86-c15c06802f48 | vm-195 | ACTIVE | None | Running | private=10.0.0.196 |
| 5ea852aa-02a8-409a-9dc0-8abd5dcd6dc0 | vm-196 | ACTIVE | None | Running | private=10.0.0.201 |
| 2c60b14a-d60a-4fbc-9f1c-509c5fb5ed8c | vm-197 | ACTIVE | None | Running | private=10.0.0.199 |
| eacddd5d-0e3d-4c7d-9bdb-2f9010ff83e3 | vm-198 | ACTIVE | None | Running | private=10.0.0.202 |
| 5dc1f542-abef-4d28-946c-e87b8f5cd48f | vm-199 | ACTIVE | None ...

Aaron Rosen (arosen) wrote :

Edger how about :

#!/bin/bash
COUNTER=0
for i in {1..200}
do
   (
   echo "Deploy VM vm-$i"
   nova boot --flavor 42 --image tty-quantum --nic net-id=fced80ba-100f-43f0-ab2e-a2bf657d15ab vm-$i
   let COUNTER=$COUNTER+1
   if [ $COUNTER -eq 300 ]
      then
      sleep 5
      COUNTER=0
   fi
) &
done

:)

Edgar Magana (emagana) wrote :
Download full text (38.5 KiB)

When running the test pushing all VMs at the same time instead of sequencial as Aaron suggested I was able to reproduce the bug:

 22bdea7f-0215-4810-82e6-1292fbbddadd | vm-54 | ACTIVE | None | Running | private=10.0.0.56, 10.0.0.78 |
| 5041f4ce-52d4-40f9-b9cc-0e8eda5bb1bd | vm-55 | ACTIVE | None | Running | private=10.0.0.102 |
| 9f105432-a74d-4fe7-acac-8ed423f99ef8 | vm-58 | ACTIVE | None | Running | private=10.0.0.115 |

quantum port-list |grep 10.0.0.56
| b39a4766-d8e9-4f86-98a5-41697bed30db | | fa:16:3e:5e:1d:5d | {"subnet_id": "b9ded0f5-faa7-40f8-abde-a1a400b056c7", "ip_address": "10.0.0.56"} |

Both Ports have the same device_id:
emagana@rock:/opt/stack/logs/screen$ quantum port-show b39a4766-d8e9-4f86-98a5-41697bed30db
+-----------------+----------------------------------------------------------------------------------+
| Field | Value |
+-----------------+----------------------------------------------------------------------------------+
| admin_state_up | True |
| device_id | 22bdea7f-0215-4810-82e6-1292fbbddadd |
| device_owner | compute:None |
| fixed_ips | {"subnet_id": "b9ded0f5-faa7-40f8-abde-a1a400b056c7", "ip_address": "10.0.0.56"} |
| id | b39a4766-d8e9-4f86-98a5-41697bed30db |
| mac_address | fa:16:3e:5e:1d:5d |
| name | |
| network_id | aedb8668-e23a-463b-932a-7e95f0d005b1 |
| security_groups | eea80a63-eb12-4d77-bac9-95a5195c98fb |
| status | DOWN |
| tenant_id | 3bb1aac4f1a547f48af504db0a6d64e2 |
+-----------------+----------------------------------------------------------------------------------+
emagana@rock:/opt/stack/logs/screen$ quantum port-show d8aac4ec-e22b-4547-98d5-ce4c0c88ba2c
+-----------------+----------------------------------------------------------------------------------+
| Field | Value |
+-----------------+----------------------------------------------------------------------------------+
| admin_state_up | True |
| device_id | 22bdea7f-0215-4810-82e6-1292fbbddadd |
| device_owner | compute:None |
| fixed_ips | {"subnet_id": "b9ded0f5-faa7-40f8-abde-a1a400b056c7", "ip_address": "10.0.0.78"} |
| id | d8aac4ec-e22b-4547-9...

Aaron Rosen (arosen) wrote :

Awsome! (yea i as definitely guessing things would timeout trying to launch 100 instances at once :) ) So, we know that this can occur with just one nova-compute node. Is there any chance we could get a copy of the full logs uploaded to dropbox or w/e? I think if we add some print statements to allocate_for_instance() showing the time and instance id we should be able to figure out what's going on.

Edgar Magana (emagana) wrote :

Ok, I will add the print statement to allocate_for_instance(), in the meantime I have posted the logs on Dropbox:

https://www.dropbox.com/sh/rbwkqj3t9zzkjbu/Il6UKsuw9t

Aaron Rosen (arosen) on 2013-05-30
Changed in quantum:
importance: Undecided → High

Fix proposed to branch: master
Review: https://review.openstack.org/31657

Changed in quantum:
assignee: nobody → Gary Kotton (garyk)
status: Incomplete → In Progress
Gary Kotton (garyk) wrote :

After investigation: Problem would be a timout in the quantum client accessing the quantum service. Once this was increased the issues disappeared but it exposed the actual problem:

2013-06-04 10:46:20 ERROR [quantum.api.v2.resource] index failed
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/quantum/api/v2/resource.py", line 82, in resource
    result = method(request=request, **args)
  File "/usr/lib/python2.6/site-packages/quantum/api/v2/base.py", line 239, in index
    return self._items(request, True, parent_id)
  File "/usr/lib/python2.6/site-packages/quantum/api/v2/base.py", line 192, in _items
    obj_list = obj_getter(request.context, **kwargs)
  File "/usr/lib/python2.6/site-packages/quantum/plugins/openvswitch/ovs_quantum_plugin.py", line 597, in get_ports
    page_reverse)
  File "/usr/lib/python2.6/site-packages/quantum/db/db_base_plugin_v2.py", line 1433, in get_ports
    items = [self._make_port_dict(c, fields) for c in query.all()]
  File "/usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.8-py2.6-linux-x86_64.egg/sqlalchemy/orm/query.py", line 2115, in all
    return list(self)
  File "/usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.8-py2.6-linux-x86_64.egg/sqlalchemy/orm/query.py", line 2227, in __iter__
    return self._execute_and_instances(context)
  File "/usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.8-py2.6-linux-x86_64.egg/sqlalchemy/orm/query.py", line 2240, in _execute_and_instances
    close_with_result=True)
  File "/usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.8-py2.6-linux-x86_64.egg/sqlalchemy/orm/query.py", line 2231, in _connection_from_session
    **kw)
  File "/usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.8-py2.6-linux-x86_64.egg/sqlalchemy/orm/session.py", line 730, in connection
    close_with_result=close_with_result)
  File "/usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.8-py2.6-linux-x86_64.egg/sqlalchemy/orm/session.py", line 734, in _connection_for_bind
    return self.transaction._connection_for_bind(engine)
  File "/usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.8-py2.6-linux-x86_64.egg/sqlalchemy/orm/session.py", line 267, in _connection_for_bind
    conn = bind.contextual_connect()
  File "/usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.8-py2.6-linux-x86_64.egg/sqlalchemy/engine/base.py", line 2490, in contextual_connect
    self.pool.connect(),
  File "/usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.8-py2.6-linux-x86_64.egg/sqlalchemy/pool.py", line 224, in connect
    return _ConnectionFairy(self).checkout()
  File "/usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.8-py2.6-linux-x86_64.egg/sqlalchemy/pool.py", line 387, in __init__
    rec = self._connection_record = pool._do_get()
  File "/usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.8-py2.6-linux-x86_64.egg/sqlalchemy/pool.py", line 729, in _do_get
    (self.size(), self.overflow(), self._timeout))
TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30

did you get to the bottom of why a timeout in python-quantumclient would have caused a port to be allocated twice for an instance?

Gary Kotton (garyk) wrote :

i am still investigating that. the "TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30" on the quantum service has me thinking that the service continues to proceed with the opertion. i am still investigating

Gary Kotton (garyk) wrote :

When setting the timeout to a value that is very small then the problem does not occur. for example:
_ENGINE = create_engine(sql_connection, pool_timeout=2, **engine_args)
In addition to this when the I increase the pool size and add in a overflow size the problem does not occur

For a temporary fix we can add the following configuration parameters:
   - pool_size
   - max_overflow
   - pool_timeout

I'll do this tomorrow

Gary Kotton (garyk) wrote :

When the patch is through it is a matter of tuning the QueuePool parameters.
Thanks
Gary

Changed in quantum:
milestone: none → havana-2

Reviewed: https://review.openstack.org/31657
Committed: http://github.com/openstack/quantum/commit/725d636da5959aee1954a8e8d5ff383d2a8b7bb9
Submitter: Jenkins
Branch: master

commit 725d636da5959aee1954a8e8d5ff383d2a8b7bb9
Author: Gary Kotton <email address hidden>
Date: Tue Jun 4 15:14:39 2013 +0000

    Expose most common QueuePoll parameters from SQLAlchemy

    Fixes bug 1160442

    The defualt parameters are used unless otherwise configured. This
    is applicable for:
     - pool_size
     - max_overflow
     - pool_timeout

    By increasing the pool_size and the max_overflow and lowering the timeout
    the issue reported in the bug is resolved.

    NOTE: the configuration values enable the user to make use of the default
    QueuePool tuning values.

    Change-Id: Ic258442531a2caf2db4f05c4ddd6fbfb0c180f17

Changed in quantum:
status: In Progress → Fix Committed
Mark McLoughlin (markmc) wrote :
Changed in oslo:
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → Gary Kotton (garyk)

Reviewed: https://review.openstack.org/31753
Committed: http://github.com/openstack/oslo-incubator/commit/ebaa578351c9c6b47c2f28ef6d74451e1483036b
Submitter: Jenkins
Branch: master

commit ebaa578351c9c6b47c2f28ef6d74451e1483036b
Author: Gary Kotton <email address hidden>
Date: Wed Jun 5 08:06:15 2013 +0000

    Enable user to configure pool_timeout

    Fixes bug 1160442

    In addition to this there are the following changes:
    1. The set_default method can configure the QueuePool parameters
    2. The max_pool_size is defaulted to the QueuePool default

    Change-Id: Ie99f7fc4edba68127c4af508290d8074d7094be9

Changed in oslo:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/32691
Committed: http://github.com/openstack/quantum/commit/8ad0295015d8d1077de4610a3e10fc25858f65b9
Submitter: Jenkins
Branch: stable/grizzly

commit 8ad0295015d8d1077de4610a3e10fc25858f65b9
Author: Gary Kotton <email address hidden>
Date: Tue Jun 4 15:14:39 2013 +0000

    Expose most common QueuePoll parameters from SQLAlchemy

    Fixes bug 1160442

    The defualt parameters are used unless otherwise configured. This
    is applicable for:
     - pool_size
     - max_overflow
     - pool_timeout

    By increasing the pool_size and the max_overflow and lowering the timeout
    the issue reported in the bug is resolved.

    NOTE: the configuration values enable the user to make use of the default
    QueuePool tuning values.

    (cherry picked from commit 725d636da5959aee1954a8e8d5ff383d2a8b7bb9)

    Change-Id: Ic258442531a2caf2db4f05c4ddd6fbfb0c180f17

tags: added: in-stable-grizzly
Gary Kotton (garyk) on 2013-06-16
tags: removed: grizzly-backport-potential
Thierry Carrez (ttx) on 2013-07-17
Changed in neutron:
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2013-07-17
Changed in oslo:
milestone: none → havana-2
status: Fix Committed → Fix Released
Alan Pevec (apevec) on 2013-08-04
tags: removed: in-stable-grizzly
Thierry Carrez (ttx) on 2013-10-17
Changed in oslo:
milestone: havana-2 → 2013.2
Thierry Carrez (ttx) on 2013-10-17
Changed in neutron:
milestone: havana-2 → 2013.2
Maru Newby (maru) on 2013-12-04
Changed in neutron:
assignee: Gary Kotton (garyk) → Maru Newby (maru)
Maru Newby (maru) wrote :

It's not clear that multiple fixed ips being allocated for a VM when Nova and Neutron are experiencing heavy load is solved by the merged patch or whether it was merely masked. At the least, there needs to be documentation added that makes it clear how to set max_pool_size, max_overflow and pool_timeout according to expected load.

Suggestions via irc suggest max_pool_size should be set to x+1 where x is the expected maximum concurrent expected requests, with pool_timeout = 2.

Ian Wells (ijw-ubuntu) wrote :

We sure this isn't the issue Aaron's reported whereby if a machine is scheduled, attempts to start, fails to start and is rescheduled we can sometimes create a second port? (To be solved, but last I heard not yet solved, by creating the port before scheduling, rather than in the compute node.)

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in ubuntu:
status: New → Confirmed
Changed in neutron:
assignee: Maru Newby (maru) → Jakub Libosvar (libosvar)
Maru Newby (maru) wrote :
Download full text (3.6 KiB)

I can confirm Ian Wells' suspicion that the problem is related to scheduling. If there is an error interacting with Neutron during scheduling, the vm will be rescheduled but a port that was created before the scheduling failure will persist. Nova needs to be updated to remove or reuse the port in question.

screen-n-sch.log:2013-12-18 10:00:27.458 ERROR nova.scheduler.filter_scheduler [req-2e3e4274-8348-471a-b9cb-69d0fd8f4180 demo demo] [instance: 45d132e7-bf85-4c18-b751-738767e49eb5] Error from last host: precise-devstack (node precise-devstack): Traceback (most recent call last):

  File "/opt/stack/nova/nova/compute/manager.py", line 1058, in _build_instance
    set_access_ip=set_access_ip)

  File "/opt/stack/nova/nova/compute/manager.py", line 1462, in _spawn
    LOG.exception(_('Instance failed to spawn'), instance=instance)

  File "/opt/stack/nova/nova/openstack/common/excutils.py", line 68, in __exit__
    six.reraise(self.type_, self.value, self.tb)

  File "/opt/stack/nova/nova/compute/manager.py", line 1459, in _spawn
    block_device_info)

  File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 2168, in spawn
    admin_pass=admin_password)

  File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 2529, in _create_image
    content=files, extra_md=extra_md, network_info=network_info)

  File "/opt/stack/nova/nova/api/metadata/base.py", line 165, in __init__
    ec2utils.get_ip_info_for_instance_from_nw_info(network_info)

  File "/opt/stack/nova/nova/api/ec2/ec2utils.py", line 149, in get_ip_info_for_instance_from_nw_info
    fixed_ips = nw_info.fixed_ips()

  File "/opt/stack/nova/nova/network/model.py", line 368, in _sync_wrapper
    self.wait()

  File "/opt/stack/nova/nova/network/model.py", line 400, in wait
    self[:] = self._gt.wait()

  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 168, in wait
    return self._exit_event.wait()

  File "/usr/local/lib/python2.7/dist-packages/eventlet/event.py", line 120, in wait
    current.throw(*self._exc)

  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 194, in main
    result = function(*args, **kwargs)

  File "/opt/stack/nova/nova/compute/manager.py", line 1243, in _allocate_network_async
    dhcp_options=dhcp_options)

  File "/opt/stack/nova/nova/network/neutronv2/api.py", line 352, in allocate_for_instance
    LOG.exception(msg, port_id)

  File "/opt/stack/nova/nova/openstack/common/excutils.py", line 68, in __exit__
    six.reraise(self.type_, self.value, self.tb)

  File "/opt/stack/nova/nova/network/neutronv2/api.py", line 329, in allocate_for_instance
    security_group_ids, available_macs, dhcp_opts))

  File "/opt/stack/nova/nova/network/neutronv2/api.py", line 180, in _create_port
    port_id = port_client.create_port(port_req_body)['port']['id']

  File "/opt/stack/python-neutronclient/neutronclient/v2_0/client.py", line 112, in with_params
    ret = self.function(instance, *args, **kwargs)

  File "/opt/stack/python-neutronclient/neutronclient/v2_0/client.py", line 317, in create_port
    return self.post(self.ports_path, body=body)

  File "/opt/stack/python-neutronclient/neutronclient/v2_0/client.py...

Read more...

Sudheendra Murthy (sudhi-vm) wrote :

There is a retry logic in httplib2 used by neutron that automatically retries requests. One instance when this kicks in is: client makes a connection —> issues request to the server —> server takes too long to respond (if busy) —> client socket timesout waiting for response and client goes for a retry. The end result is that the request may get processed twice at the server. There is a parameter RETRIES used in _conn_request(…) defined in __init__.py. This parameter is not configurable, hard-coded to 2.

One possible way to deal with this is to increase the timeout value (which is configurable) to a larger value. The default is 30 seconds and increasing it to larger value helps. This will not solve the issue, but could make it less likely to occur.

Here are Google references to the retry logic in httplib2:
https://code.google.com/p/httplib2/issues/detail?id=109
https://groups.google.com/forum/#!msg/google-api-python-client/VE1d7a7Wevg/elmt1MDsjmAJ
http://httplib2.googlecode.com/hg/doc/html/libhttplib2.html

ashwin raveendran (ashw7n) wrote :

Saw this report just now, but we had the same issue and were caused by slow quantum server. In addition to port created by scheduler retrying request on a different compute, we also have a bug in the quantum client as reported here https://bugs.launchpad.net/neutron/+bug/1267649. We were able to fix this temporarily by increasing the timeout value for the quantum_url_timeout property in nova compute to larger values, 2 minutes in our case.

Maru Newby (maru) wrote :

Preventing the timeout is a stop-gap, but a better solution might be for port creation to be idempotent such that multiple creates for the same port do not result in multiple ports. Perhaps it should be possible for Nova to pass in the maximum number of ports that should be created for an instance in the creation request so that Neutron could avoid exceeding that limit? Or Nova could provide the uuid for port creation and a duplicate index could indicate that one already exists? There are drawbacks to either approach, but I think it's worth thinking about alternatives to the current approach.

Joe Gordon (jogo) wrote :
Changed in nova:
status: Triaged → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.