OpenStack Compute (Nova)

Encountering sporadic AMQPChannelException

Reported by David Kranz on 2011-09-20
20
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Critical
Vish Ishaya
Diablo
Undecided
Unassigned
nova (Ubuntu)
High
Unassigned
Oneiric
High
Unassigned
Precise
High
Unassigned

Bug Description

Running one controller and one compute, using nova-network version 2011.3~rc~20110909.r1155-0ub on Oneiric . Repeating the following:

Create four instances, terminate them.

This mostly works but one in every three or four times, one of the vms does not come up and I see this error in the nova-network log. This can occur both in the controller node (also running compute) or the compute node. (contents of nova.conf follows)

2011-09-20 13:22:59,295 DEBUG nova.utils [-] Attempting to grab semaphore "iptables" for method "apply"... from (pid=1082) inner /usr/lib/pymodules/python2.7/nova/utils.py:672
2011-09-20 13:22:59,295 DEBUG nova.utils [-] Attempting to grab file lock "iptables" for method "apply"... from (pid=1082) inner /usr/lib/pymodules/python2.7/nova/utils.py:677
2011-09-20 13:22:59,296 DEBUG nova.utils [-] Running cmd (subprocess): sudo iptables-save -t filter from (pid=1082) execute /usr/lib/pymodules/python2.7/nova/utils.py:165
2011-09-20 13:22:59,311 DEBUG nova.utils [-] Running cmd (subprocess): sudo iptables-restore from (pid=1082) execute /usr/lib/pymodules/python2.7/nova/utils.py:165
2011-09-20 13:22:59,350 DEBUG nova.utils [-] Running cmd (subprocess): sudo iptables-save -t nat from (pid=1082) execute /usr/lib/pymodules/python2.7/nova/utils.py:165
2011-09-20 13:22:59,366 DEBUG nova.utils [-] Running cmd (subprocess): sudo iptables-restore from (pid=1082) execute /usr/lib/pymodules/python2.7/nova/utils.py:165
2011-09-20 13:22:59,424 ERROR nova.rpc [-] Exception during message handling
(nova.rpc): TRACE: Traceback (most recent call last):
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 628, in _process_data
(nova.rpc): TRACE: ctxt.reply(None, None)
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 673, in reply
(nova.rpc): TRACE: msg_reply(self.msg_id, *args, **kwargs)
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 781, in msg_reply
(nova.rpc): TRACE: conn.direct_send(msg_id, msg)
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 562, in __exit__
(nova.rpc): TRACE: self._done()
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 547, in _done
(nova.rpc): TRACE: self.connection.reset()
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 382, in reset
(nova.rpc): TRACE: self.channel.close()
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py", line 194, in close
(nova.rpc): TRACE: (20, 41), # Channel.close_ok
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/abstract_channel.py", line 97, in wait
(nova.rpc): TRACE: return self.dispatch_method(method_sig, args, content)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/abstract_channel.py", line 115, in dispatch_method
(nova.rpc): TRACE: return amqp_method(self, args)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py", line 273, in _close
(nova.rpc): TRACE: (class_id, method_id))
(nova.rpc): TRACE: AMQPChannelException: (404, u"NOT_FOUND - no exchange '3ff1ba7e274a4ec2a2b0a217d7532c70' in vhost '/'", (60, 40), 'Channel.basic_publish')
(nova.rpc): TRACE:
2011-09-20 13:22:59,451 ERROR nova.rpc [-] Returning exception (404, u"NOT_FOUND - no exchange '3ff1ba7e274a4ec2a2b0a217d7532c70' in vhost '/'", (60, 40), 'Channel.basic_publish') to caller
2011-09-20 13:22:59,452 ERROR nova.rpc [-] ['Traceback (most recent call last):\n', ' File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 628, in _process_data\n ctxt.reply(None, None)\n', ' File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 673, in reply\n msg_reply(self.msg_id, *args, **kwargs)\n', ' File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 781, in msg_reply\n conn.direct_send(msg_id, msg)\n', ' File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 562, in __exit__\n self._done()\n', ' File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 547, in _done\n self.connection.reset()\n', ' File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 382, in reset\n self.channel.close()\n', ' File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py", line 194, in close\n (20, 41), # Channel.close_ok\n', ' File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/abstract_channel.py", line 97, in wait\n return self.dispatch_method(method_sig, args, content)\n', ' File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/abstract_channel.py", line 115, in dispatch_method\n return amqp_method(self, args)\n', ' File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py", line 273, in _close\n (class_id, method_id))\n', 'AMQPChannelException: (404, u"NOT_FOUND - no exchange \'3ff1ba7e274a4ec2a2b0a217d7532c70\' in vhost \'/\'", (60, 40), \'Channel.basic_publish\')\n']

I also see this at the same time in the rabbitmq log:

=ERROR REPORT==== 20-Sep-2011::13:22:59 ===
connection <0.1061.0>, channel 1 - error:
{amqp_error,not_found,
            "no exchange '3ff1ba7e274a4ec2a2b0a217d7532c70' in vhost '/'",
            'basic.publish'}

--flagfile=/etc/nova/nova-compute.conf
--use_deprecated_auth
--dhcpbridge_flagfile=/etc/nova/nova.conf
--dhcpbridge=/usr/bin/nova-dhcpbridge
--sql_connection=mysql://nova:notnova@172.18.0.141/nova
--s3_host=172.18.0.141
--rabbit_host=172.18.0.141
--glance_api_servers=172.18.0.141:9292
--logdir=/var/log/nova
--state_path=/var/lib/nova
--lock_path=/var/lock/nova
--verbose
--ec2_url=http://172.18.0.141:8773/services/Cloud
--dmz_cidr=172.18.0.141/32
--fixed_range=172.18.77.0/27
--network_size=8
--flat_network_dhcp_start=172.18.7.6
--image_service=nova.image.glance.GlanceImageService
--bridge_interface=eth1
--flat_network_bridge=br100
--network_manager=nova.network.manager.FlatDHCPManager
--public_interface=eth0
--multi_host
--osapi_host=172.18.0.141

Chris Behrens (cbehrens) wrote :

Pretty interesting. This has to be from an rpc.call. The caller creates a queue for the response from the worker. It appears the worker is not finding this queue when it goes to 'msg_reply'.

The only reasons I can think of how this could happen:

1) After the caller publishes a request for a worker.... it is getting an exception during waiting for the response. That would cause the queue to disappear. (Seems unlikely)
2) After the caller publishes a request for a worker... it gets disconnected from rabbit, which causes the queue to be deleted momentarily before it reconnects and re-declares it... and the worker tries to respond in the middle of this. (Seems unlikely)
3) rabbit is returning an ACK to the caller (amqplib) before it is actually finished creating the queue... and there's a race condition where the worker tries to reply before the queue is created. (Seems unlikely)
4) kombu+amqplib randomly doesn't send a declare_queue() ? Seems unlikely without an exception thrown. And an exception would abort the publish to the worker.

I wonder what I'm missing. What version of kombu and what version of rabbit? Have you tried falling back to carrot? You can specify: 'rpc_backend=nova.rpc.impl_carrot' to try it.

David Kranz (david-kranz) wrote :

ii python-kombu 1.0.4-2 AMQP Messaging Framework for Python
ii rabbitmq-server 2.5.0-1ubuntu2 An AMQP server written in Erlang

I added the carrot suggestion to nova.conf and it still fails about the same 1 in 20 or so vms not coming up but with a different error. I would be happy to help diagnose this if you have any other suggestions.

2011-09-21 09:17:24,546 DEBUG nova.network.manager [db549763-93f2-444f-bdd8-d4931b2fa6c7 6deb27e2-4b73-4d73-8196-6ccdbcee6859 kranzproje
ct] network allocations for instance 108 from (pid=1096) allocate_for_instance /usr/lib/pymodules/python2.7/nova/network/manager.py:431
2011-09-21 09:17:24,705 ERROR nova.rpc [-] Exception during message handling
(nova.rpc): TRACE: Traceback (most recent call last):
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_carrot.py", line 285, in _process_data
(nova.rpc): TRACE: rval = node_func(context=ctxt, **node_args)
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/network/manager.py", line 216, in allocate_for_instance
(nova.rpc): TRACE: ips = super(FloatingIP, self).allocate_for_instance(context, **kwargs)
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/network/manager.py", line 438, in allocate_for_instance
(nova.rpc): TRACE: requested_networks=requested_networks)
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/network/manager.py", line 166, in _allocate_fixed_ips
(nova.rpc): TRACE: vpn=vpn, address=address)
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/network/manager.py", line 609, in allocate_fixed_ip
(nova.rpc): TRACE: instance_id)
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/db/api.py", line 343, in fixed_ip_associate_pool
(nova.rpc): TRACE: instance_id, host)
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/db/sqlalchemy/api.py", line 101, in wrapper
(nova.rpc): TRACE: return f(*args, **kwargs)
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/db/sqlalchemy/api.py", line 720, in fixed_ip_associate_pool
(nova.rpc): TRACE: raise exception.NoMoreFixedIps()
(nova.rpc): TRACE: NoMoreFixedIps: None
(nova.rpc): TRACE:
2011-09-21 09:17:24,739 ERROR nova.rpc [-] Returning exception None to caller
2011-09-21 09:17:24,739 ERROR nova.rpc [-] ['Traceback (most recent call last):\n', ' File "/usr/lib/pymodules/python2.7/nova/rpc/impl_

David Kranz (david-kranz) wrote :

My apologies. The carrot test was not valid. The last error was caused by the fact that I was running out of fixed ip addresses. My guess is that when the instances that fail to come up are terminated, their IP addresses are not returned. I will continue to investigate on a clean install.

David Kranz (david-kranz) wrote :
Download full text (5.8 KiB)

I tried again on freshly reinstalled OS and the behavior is repeatable. I also saw one case where there was a slightly different AMPQ error that happened during instance termination:

headline: AMQPConnectionException: (503, u"COMMAND_INVALID - second 'channel.open' seen", (20, 10), 'Channel.open')

2011-09-21 11:33:05,739 DEBUG nova.utils [-] Attempting to grab semaphore "get_dhcp" for method "_get_dhcp_ip"... from (pid=1094) inner /usr/lib/pymodules/python2.7/nova/utils.py:672
2011-09-21 11:33:05,746 ERROR nova.rpc [-] Exception during message handling
(nova.rpc): TRACE: Traceback (most recent call last):
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 626, in _process_data
(nova.rpc): TRACE: ctxt.reply(rval, None)
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 673, in reply
(nova.rpc): TRACE: msg_reply(self.msg_id, *args, **kwargs)
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 781, in msg_reply
(nova.rpc): TRACE: conn.direct_send(msg_id, msg)
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 562, in __exit__
(nova.rpc): TRACE: self._done()
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 547, in _done
(nova.rpc): TRACE: self.connection.reset()
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 383, in reset
(nova.rpc): TRACE: self.channel = self.connection.channel()
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/kombu/connection.py", line 99, in channel
(nova.rpc): TRACE: return self.transport.create_channel(self.connection)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/kombu/transport/pyamqplib.py", line 197, in create_channel
(nova.rpc): TRACE: return connection.channel()
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/kombu/transport/pyamqplib.py", line 130, in channel
(nova.rpc): TRACE: return Channel(self, channel_id)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py", line 82, in __init__
(nova.rpc): TRACE: self._x_open()
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py", line 471, in _x_open
(nova.rpc): TRACE: (20, 11), # Channel.open_ok
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/abstract_channel.py", line 95, in wait
(nova.rpc): TRACE: self.channel_id, allowed_methods)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/connection.py", line 231, in _wait_method
(nova.rpc): TRACE: self.wait()
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/abstract_channel.py", line 97, in wait
(nova.rpc): TRACE: return self.dispatch_method(method_sig, args, content)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/abstract_channel.py", line 115, in dispatch_method
(nova.rpc): TRACE: return amqp_method(self, args)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/connection.py", line 380, in _close...

Read more...

Download full text (9.2 KiB)

Looks likeyou are running out of ips. Maybe you are running /27 which only has 32 addresses?

On Sep 21, 2011, at 6:22 AM, David Kranz wrote:

> ii python-kombu 1.0.4-2 AMQP Messaging Framework for Python
> ii rabbitmq-server 2.5.0-1ubuntu2 An AMQP server written in Erlang
>
> I added the carrot suggestion to nova.conf and it still fails about the
> same 1 in 20 or so vms not coming up but with a different error. I would
> be happy to help diagnose this if you have any other suggestions.
>
>
>
> 2011-09-21 09:17:24,546 DEBUG nova.network.manager [db549763-93f2-444f-bdd8-d4931b2fa6c7 6deb27e2-4b73-4d73-8196-6ccdbcee6859 kranzproje
> ct] network allocations for instance 108 from (pid=1096) allocate_for_instance /usr/lib/pymodules/python2.7/nova/network/manager.py:431
> 2011-09-21 09:17:24,705 ERROR nova.rpc [-] Exception during message handling
> (nova.rpc): TRACE: Traceback (most recent call last):
> (nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_carrot.py", line 285, in _process_data
> (nova.rpc): TRACE: rval = node_func(context=ctxt, **node_args)
> (nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/network/manager.py", line 216, in allocate_for_instance
> (nova.rpc): TRACE: ips = super(FloatingIP, self).allocate_for_instance(context, **kwargs)
> (nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/network/manager.py", line 438, in allocate_for_instance
> (nova.rpc): TRACE: requested_networks=requested_networks)
> (nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/network/manager.py", line 166, in _allocate_fixed_ips
> (nova.rpc): TRACE: vpn=vpn, address=address)
> (nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/network/manager.py", line 609, in allocate_fixed_ip
> (nova.rpc): TRACE: instance_id)
> (nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/db/api.py", line 343, in fixed_ip_associate_pool
> (nova.rpc): TRACE: instance_id, host)
> (nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/db/sqlalchemy/api.py", line 101, in wrapper
> (nova.rpc): TRACE: return f(*args, **kwargs)
> (nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/db/sqlalchemy/api.py", line 720, in fixed_ip_associate_pool
> (nova.rpc): TRACE: raise exception.NoMoreFixedIps()
> (nova.rpc): TRACE: NoMoreFixedIps: None
> (nova.rpc): TRACE:
> 2011-09-21 09:17:24,739 ERROR nova.rpc [-] Returning exception None to caller
> 2011-09-21 09:17:24,739 ERROR nova.rpc [-] ['Traceback (most recent call last):\n', ' File "/usr/lib/pymodules/python2.7/nova/rpc/impl_
>
> --
> You received this bug notification because you are a member of Nova Bug
> Team, which is subscribed to OpenStack Compute (nova).
> https://bugs.launchpad.net/bugs/855030
>
> Title:
> Encountering sporadic AMQPChannelException
>
> Status in OpenStack Compute (Nova):
> New
>
> Bug description:
> Running one controller and one compute, using nova-network version
> 2011.3~rc~20110909.r1155-0ub on Oneiric . Repeating the following:
>
> Create four instances, terminate them.
>
> This mostly works but one in every thre...

Read more...

David Kranz (david-kranz) wrote :

Yes, I am. I did not realize that the DHCP leases were not immediately marked as reusable when an instance is terminated. It seems that the default time is 2min and when I was doing rapid testing of this problem I exceeded 32 available. Anyway, I replaced kombu with carrot and was able to start up many, many instances without problems. I then switched back to kombu and the problem happened on the first shot after that. Not sure what more I can offer at this point...

Chris Behrens (cbehrens) wrote :

Hmm.. Does this happen under load? Is it just 4 instances one after another?

Another thing to check: amqplib version:

>>>
>>> from amqplib.client_0_8 import connection
>>> connection.LIBRARY_PROPERTIES
{'library': 'Python amqplib', 'library_version': '1.0.0'} 

It's possible carrot works with an older version, but kombu wants a newer version...

Chris Behrens (cbehrens) wrote :

Oh, and I was going to also suggest trying to remove the current kombu version and do a 'pip install kombu', which will grab the latest from pypi... and see if the issue goes away. kombu is up to 1.3.* or something now.

David Kranz (david-kranz) wrote :

I'm kind of at a loss here. The ampqlib version is the same as you said. I used pip to uninstall kombu and upgrade to 1.3.5. That was the latest version this morning though the latest seems to be 1.4.0 right how. Anyway it made no difference. I am no expert on the mess that is the relationship between python package versions between pip and dpkg so it is possible nova was still using the old kombu after my upgrade. I don't think so because just invoking python from the command line showed the new version of kombu when I imported it.

This does happen under load in the sense that I used the euca command to create 4 instances at once. Has any one else done something like this with Oneiric? On the one hand it is hard to see how I am having this problem if others are not, but I don't see what I could be doing wrong either given that everything works for instance creations where this does not happen.

Vish Ishaya (vishvananda) wrote :
Download full text (7.8 KiB)

Hmm,

One further question. Are you using mysql or sqlite? Sqlite doesn't really support concurrency well, so you could be getting db locking errors that are bubbling up to the messaging layer. If you are using sqlite, could you see if you get the same error with mysql?

On Sep 22, 2011, at 11:11 AM, David Kranz wrote:

> I'm kind of at a loss here. The ampqlib version is the same as you said.
> I used pip to uninstall kombu and upgrade to 1.3.5. That was the latest
> version this morning though the latest seems to be 1.4.0 right how.
> Anyway it made no difference. I am no expert on the mess that is the
> relationship between python package versions between pip and dpkg so it
> is possible nova was still using the old kombu after my upgrade. I don't
> think so because just invoking python from the command line showed the
> new version of kombu when I imported it.
>
> This does happen under load in the sense that I used the euca command to
> create 4 instances at once. Has any one else done something like this
> with Oneiric? On the one hand it is hard to see how I am having this
> problem if others are not, but I don't see what I could be doing wrong
> either given that everything works for instance creations where this
> does not happen.
>
> --
> You received this bug notification because you are a member of Nova Bug
> Team, which is subscribed to OpenStack Compute (nova).
> https://bugs.launchpad.net/bugs/855030
>
> Title:
> Encountering sporadic AMQPChannelException
>
> Status in OpenStack Compute (Nova):
> New
>
> Bug description:
> Running one controller and one compute, using nova-network version
> 2011.3~rc~20110909.r1155-0ub on Oneiric . Repeating the following:
>
> Create four instances, terminate them.
>
> This mostly works but one in every three or four times, one of the vms
> does not come up and I see this error in the nova-network log. This
> can occur both in the controller node (also running compute) or the
> compute node. (contents of nova.conf follows)
>
> 2011-09-20 13:22:59,295 DEBUG nova.utils [-] Attempting to grab semaphore "iptables" for method "apply"... from (pid=1082) inner /usr/lib/pymodules/python2.7/nova/utils.py:672
> 2011-09-20 13:22:59,295 DEBUG nova.utils [-] Attempting to grab file lock "iptables" for method "apply"... from (pid=1082) inner /usr/lib/pymodules/python2.7/nova/utils.py:677
> 2011-09-20 13:22:59,296 DEBUG nova.utils [-] Running cmd (subprocess): sudo iptables-save -t filter from (pid=1082) execute /usr/lib/pymodules/python2.7/nova/utils.py:165
> 2011-09-20 13:22:59,311 DEBUG nova.utils [-] Running cmd (subprocess): sudo iptables-restore from (pid=1082) execute /usr/lib/pymodules/python2.7/nova/utils.py:165
> 2011-09-20 13:22:59,350 DEBUG nova.utils [-] Running cmd (subprocess): sudo iptables-save -t nat from (pid=1082) execute /usr/lib/pymodules/python2.7/nova/utils.py:165
> 2011-09-20 13:22:59,366 DEBUG nova.utils [-] Running cmd (subprocess): sudo iptables-restore from (pid=1082) execute /usr/lib/pymodules/python2.7/nova/utils.py:165
> 2011-09-20 13:22:59,424 ERROR nova.rpc [-] Exception during message handling
> (nova.rpc): TRACE: Traceback (most recent call l...

Read more...

David Kranz (david-kranz) wrote :

I am using mysql. In case it matters I am running nova-network on both nodes and also nova-api on both nodes as suggested by you here http://osdir.com/ml/openstack-cloud-computing/2011-09/msg00283.html

I did a fresh install using lucid and the same thing happens with one exception. On oneiric, the AMQP trace showing in the log is always associated with a vm never transitioning from pending to running. On lucid, I still sometimes see the error in the log but the VM does come up.

There must be some kind of race condition. Is there currently a test for nova that

1. Deploys a multinode configuration
2. Starts and terminates groups of vms for a long time, making sure they all end up running
3. Also fails if any error is written to a log file

Also, I did the latest install using ppa:openstack-release/2011.3 but I sometimes see things like this in the log. I find these messages in bug tickets that seem like they were fixed for diablo so I am not sure why I see them:

DBError: exceptions must be old-style classes or derived from BaseException, not NoneType

Chris Behrens (cbehrens) wrote :

Hm. I haven't used the multi host network stuff, myself, so I can't comment much on the above right now.

There _is_ a bug filed for the DBError above. Unclear exactly what it is, but sqlalchemy is catching an exception, doing transaction cleanup, and then doing a re-raise via a bare raise. Somewhere in sqlalchemy's transaction cleanup, the original exception is getting eaten, so the bare raise does a raise of NoneType.

BTW, there is 1 case I can think of that could happen in the real world where the queue doesn't exist when doing a reply. This can happen if a service makes an rpc.call that results in a long-running transaction.... and the original service is restarted before the long-running transaction completes. I know this probably isn't what is happening in your case, but we probably should catch the AMQPChannelException for this case...

Vish Ishaya (vishvananda) wrote :

381 def cursor(self, *args, **kwargs):
382 try:
383 c = self.connection.cursor(*args, **kwargs)
384 return _CursorFairy(self, c)
385 except Exception, e:
386 self.invalidate(e=e)
387 raise

I expect the bare raise is here (sqlalchemy/pool.py). Pretty sure self.invalidate is making socket connections, so the exception is getting eaten by eventlet. I'm trying to investigate if there is a way to make eventlet not clear exception info. This could be especially dangerous if the higher level code is trying to catch a specific type of exception. Might be worth it to try one of the following things:

1) Try to make eventlet stop eating exception info:

diff -r f30a2fa65f30 eventlet/hubs/hub.py
--- a/eventlet/hubs/hub.py Wed Jun 08 23:47:26 2011 -0700
+++ b/eventlet/hubs/hub.py Sat Sep 24 05:04:30 2011 -0700
@@ -173,7 +173,7 @@
                 cur.parent = self.greenlet
         except ValueError:
             pass # gets raised if there is a greenlet parent cycle
- clear_sys_exc_info()
+ #clear_sys_exc_info()
         return self.greenlet.switch()

     def squelch_exception(self, fileno, exc_info):

or 2) modify the above code to reraise the exception

diff --git a/Library/Python/2.7/site-packages/sqlalchemy/pool.py b/pool.py
index 632d955..21e7055 100644
--- a/Library/Python/2.7/site-packages/sqlalchemy/pool.py
+++ b/pool.py
@@ -383,8 +383,10 @@ class _ConnectionFairy(object):
             c = self.connection.cursor(*args, **kwargs)
             return _CursorFairy(self, c)
         except Exception, e:
+ import sys
+ exc = sys.exc_info()
             self.invalidate(e=e)
- raise
+ raise exc

     def __getattr__(self, key):
         return getattr(self.connection, key)

I will do some more investigation into this when I can. We need a clear repro case for this.

David Kranz (david-kranz) wrote :

I created a small script that calls euca-create-instance four times, creating one instance each time, and then waits for them all to be running. It then terminates them and tries again. I started running this against a single nova node running all services and after a few times through got this error in nova-network:

 ResourceClosedError: This result object does not return rows. It has been closed automatically.

I see that there is a recent bug 857714 about that error. At this point I am forced to conclude that either I have no idea what I am doing (I am pretty new to Nova) , or there are some serious issues with Diablo. I think what I am doing is the most elementary thing one could do as part of testing Nova so I don't know what to make of this. Has this kind of rudimentary stress test been successfully run by any one?

Chris Behrens (cbehrens) wrote :

Give this a try now. This was marked a duplicate of lp:838581 and a fix was merged earlier.

David Kranz (david-kranz) wrote :

I will, but I have been installing from oneiric packages. How do I know when an updated package is available?

David Kranz (david-kranz) wrote :
Download full text (9.6 KiB)

I built and installed packages from 2012.1~e1~20110909.1546-0ubuntu1 and now I just get these errors about some bad SQL syntax:

2011-09-30 13:36:59,829 ERROR nova.rpc [-] Exception during message handling
(nova.rpc): TRACE: Traceback (most recent call last):
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/rpc/impl_kombu.py", line 620, in _process_data
(nova.rpc): TRACE: rval = node_func(context=ctxt, **node_args)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/network/manager.py", line 317, in get_floating_ips_by_fixed_address
(nova.rpc): TRACE: fixed_address)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/db/api.py", line 295, in floating_ip_get_by_fixed_address
(nova.rpc): TRACE: return IMPL.floating_ip_get_by_fixed_address(context, fixed_address)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/db/sqlalchemy/api.py", line 120, in wrapper
(nova.rpc): TRACE: return f(*args, **kwargs)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/db/sqlalchemy/api.py", line 668, in floating_ip_get_by_fixed_address
(nova.rpc): TRACE: fixed_ip = fixed_ip_get_by_address(context, fixed_address, session)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/db/sqlalchemy/api.py", line 120, in wrapper
(nova.rpc): TRACE: return f(*args, **kwargs)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/db/sqlalchemy/api.py", line 838, in fixed_ip_get_by_address
(nova.rpc): TRACE: options(joinedload('instance')).\
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 1652, in first
(nova.rpc): TRACE: ret = list(self[0:1])
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 1560, in __getitem__
(nova.rpc): TRACE: return list(res)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 1721, in __iter__
(nova.rpc): TRACE: return self._execute_and_instances(context)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 1726, in _execute_and_instances
(nova.rpc): TRACE: mapper=self._mapper_zero_or_none())
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 724, in execute
(nova.rpc): TRACE: clause, params or {})
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1191, in execute
(nova.rpc): TRACE: params)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1271, in _execute_clauseelement
(nova.rpc): TRACE: return self.__execute_context(context)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1302, in __execute_context
(nova.rpc): TRACE: context.parameters[0], context=context)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1401, in _cursor_execute
(nova.rpc): TRACE: context)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1394, in _cursor_execute
(nova.rpc): TRACE: context)
(nova.rpc): TRACE: File ...

Read more...

David Kranz (david-kranz) wrote :

I tested again using the devstack install to create two nodes (after verifying that the fix for bug 838581 is in session.py) and still see the problem, but less frequently. It is not quite the same because this is running from source rather than packages, and using 'nova boot' instead of euca-run-instances. I don't know how to get euca-run-instances to work when using keystone to manage users. 'nova boot' doesn't seem to allow multiple instances to be created in one call but I changed the min_count default in the source to 8 for this test.

diff --git a/novaclient/v1_1/shell.py b/novaclient/v1_1/shell.py
index 6b83292..21ee0c9 100644
--- a/novaclient/v1_1/shell.py
+++ b/novaclient/v1_1/shell.py
@@ -33,7 +33,7 @@ AUTO_KEY = object()
 def _boot(cs, args, reservation_id=None, min_count=None, max_count=None):
     """Boot a new server."""
     if min_count is None:
- min_count = 1
+ min_count = 8
     if max_count is None:
         max_count = min_count
     if min_count > max_count:

This bug really needs to be reopened but I am not sure how to do that.

David Kranz (david-kranz) wrote :

My nova.conf for the last test:

--logdir=/var/log/nova
--verbose
--nodaemon
--scheduler_driver=nova.scheduler.simple.SimpleScheduler
--dhcpbridge_flagfile=/opt/stack/nova/bin/nova.conf
--network_manager=nova.network.manager.FlatDHCPManager
--my_ip=172.18.0.141
--public_interface=eth0
--vlan_interface=eth0
--sql_connection=mysql://root:mysql@172.18.0.141/nova
--libvirt_type=kvm
--osapi_extensions_path=/opt/stack/openstackx/extensions
--vncproxy_url=http://172.18.0.141:6080
--vncproxy_wwwroot=/opt/stack/noVNC/
--api_paste_config=/opt/stack/keystone/examples/paste/nova-api-paste.ini
--image_service=nova.image.glance.GlanceImageService
--ec2_dmz_host=172.18.0.141
--rabbit_host=172.18.0.141
--rabbit_password=rabbit
--glance_api_servers=172.18.0.141:9292
--flat_network_bridge=br100
--flat_interface=eth1
--multi_host=1

Scott Moser (smoser) wrote :

Per David's comments, I've removed the 'duplicate' flag for this bug.

Scott Moser (smoser) wrote :

David,
  I talked some with Vish in IRC today. It seems like we're in need of some more data to diagnose. I was unable to reproduce this issue in a little "all in one" nova.sh install that I did and Vish hasn't seen anything like it.
  We're assuming that the bug history and traces pasted represents different failures.

  Since you're able to reproduce with devstack, could you set up one on ubuntu either Natty or Oneiric, recreate and report:
 * git commit tip
 * nova.conf (as you have above)
 * traces that you see in the logs when the issue occurs

 I think you've addressed it, but make sure you're not just running out of IP addresses.

  Thanks in advance.

[1] http://eavesdrop.openstack.org/irclogs/%23openstack-dev/%23openstack-dev.2011-10-19.log

David Kranz (david-kranz) wrote :
Download full text (4.8 KiB)

OK. I just did it again. I am not sure what "git commit tip" means. The running out of addresses was a one-time issue I had before understanding that the leased IP addresses for the vms are not returned immediately when the vms are terminated.
I am assuming that this kind of error and stacktrace in the log would be considered a bug even if it does not cause any obvious misbehavior immediately.

 I used the current version of devstack but with a modified stack.sh from last week. I modified it to log to files in /var/log/nova instead of stdout so you have to create that directory before running. I used one compute node running api,network,compute and one controller node running everything. After running the modified stack.sh I exported the shell variables in exercise.sh, modified the min count in nova client to create 10 vms, and did

nova boot --flavor 1 --image 3 testvms

In this case the vms seen to have started but there is the AMQP error stacktrace in the log. When this happens, sometimes all the vms come up and sometimes not. I have attached:

1. nova-network log with stacktrace
2. modified stack.sh
3. localrc for both controller and compute
4. nova.conf (even though it is generated)

(nova.rpc): TRACE: Traceback (most recent call last):
(nova.rpc): TRACE: File "/opt/stack/nova/nova/rpc/impl_kombu.py", line 628, in _process_data
(nova.rpc): TRACE: ctxt.reply(None, None)
(nova.rpc): TRACE: File "/opt/stack/nova/nova/rpc/impl_kombu.py", line 673, in reply
(nova.rpc): TRACE: msg_reply(self.msg_id, *args, **kwargs)
(nova.rpc): TRACE: File "/opt/stack/nova/nova/rpc/impl_kombu.py", line 781, in msg_reply
(nova.rpc): TRACE: conn.direct_send(msg_id, msg)
(nova.rpc): TRACE: File "/opt/stack/nova/nova/rpc/impl_kombu.py", line 562, in __exit__
(nova.rpc): TRACE: self._done()
(nova.rpc): TRACE: File "/opt/stack/nova/nova/rpc/impl_kombu.py", line 547, in _done
(nova.rpc): TRACE: self.connection.reset()
(nova.rpc): TRACE: File "/opt/stack/nova/nova/rpc/impl_kombu.py", line 382, in reset
(nova.rpc): TRACE: self.channel.close()
(nova.rpc): TRACE: File "/usr/local/lib/python2.7/dist-packages/kombu/transport/pyamqplib.py", line 201, in close
(nova.rpc): TRACE: super(Channel, self).close()
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py", line 194, in close
(nova.rpc): TRACE: (20, 41), # Channel.close_ok
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/abstract_channel.py", line 97, in wait
(nova.rpc): TRACE: return self.dispatch_method(method_sig, args, content)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/abstract_channel.py", line 115, in dispatch_method
(nova.rpc): TRACE: return amqp_method(self, args)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py", line 273, in _close
(nova.rpc): TRACE: (class_id, method_id))
(nova.rpc): TRACE: AMQPChannelException: (404, u"NOT_FOUND - no exchange '5b9a19d1b5e849d7ba749fada648a375' in vhost '/'", (60, 40), 'Channel.basic_publish')
(nova.rpc): TRACE:
2011-10-20 10:45:24,724 ERROR nova.rpc [-] Returning except...

Read more...

David Kranz (david-kranz) wrote :
David Kranz (david-kranz) wrote :
David Kranz (david-kranz) wrote :
David Kranz (david-kranz) wrote :
David Kranz (david-kranz) wrote :
David Kranz (david-kranz) wrote :

One more thing. Since this seems to be some kind of race condition issue, it might be relevant that I was using two real Ubuntu 11.10 machines with the OS reinstalled prior to running this experiment. I left these machines as is so if there is any other state of the machines that is of interest I can provide it.

Mark McLoughlin (markmc) wrote :

By the way, I saw a similar exception today when disassociating a floating IP with diablo https://bugzilla.redhat.com/747568

Vish Ishaya (vishvananda) wrote :

Thank you for the extensive report. This should help us debug. If it is easy for you to reproduce I'm very curious if you get the same exception using the carrot driver instead of kombu.

--rpc_backend=nova.rpc.impl_carrot

David Kranz (david-kranz) wrote :

I tried this before and didn't see the problem as you can see in comment #6. However, given the randomness of when this bug happens and not I can't be sure it is definitive. I can try it again tomorrow.

Thierry Carrez (ttx) on 2011-10-21
Changed in nova:
status: New → Incomplete
David Kranz (david-kranz) wrote :

I tried creating 10 vms 5 times with carrot and all was well. Back to kombu, it failed the second time.

Ask Solem (asksol) wrote :

Hey,

Are you using eventlet with both kombu and carrot?

The difference then would be that Kombu synchronized *_declare calls by delegating them to another greenthread
and having the current thread wait for that thread to finish (kombu.syn.blocking)
So in a sense it *should* be safer than carrot, why it wouldn't be I'm not sure.
Also are you declaring entities both when publishing the request *and* when publishing the reply?

Ask Solem (asksol) wrote :

s/synchronized/synchronizes/

Chris Behrens (cbehrens) wrote :

Ask: Hm. I implemented the code in nova that uses kombu. There's probably something funny about what I did, because I probably didn't use what one would call the 'common interfaces'. :) No offense, but the default interfaces didn't provide great enough flexibility.

Here's a summary of how things work:

To create a consumer, I'm essentially doing this:

exchange = entity.Exchange()
queue = entity.Queue(...., exchange=exchange)
queue.declare().
queue.consume() ***

*** In some cases we declare multiple queues on the same channel. But, what I don't do is use messaging.Consumer(). What did not appear to be supported in kombu was the ability to specify a different callback for each of those queues if I were to pass them all to the same Consumer(). If I did that, Comsumer.consume() would always call the same callback (the ones provided to Consumer()). I think carrot actually supported that, in a way, via ConsumerSet(). But anyway... point is here.. I'm not using Consumer() at all. To consume, I'm essentially calling queue.consume() similar to what Consumer.consume() does... except each Queue has its own callback. (I keep a list of the queues that are declared on a channel and loop through them). I can't find anything wrong with this... as I'm essentially just doing the same work that Consumer() does.

For publishing:

exchange = entity.Exchange()
producer = entity.Producer(..., exchange=exchange)
producer.publish()

There was also some sort of issue i ran into with the reconnecting code within kombu, so all of this stuff is wrapped by our own classes that handle re-creating the appropriate kombu instances.

Now.. back to the bug. If you're not familiar with nova RPC... what's happening in this bug report is:

1) generate a unique uuid to use as exchange and queue names. call queue.declare() which declares the exchange and the queue and binds them. This queue will be used to get a response from remote worker.
2) Using the same connection, publish a message into some other queue for the remote service to pick up. Put the unique uuid in the message so it knows what exchange/queue to respond to.
3) Using the same connection, start consuming on the previously declared queue
4) the remote worker does some work... and tries to respond into the queue we created in #1. The above Exception is reported... sometimes and apparently very rarely for most people. I've never seen the above exception at all... nor have any of the people I work with closely. However, David seems to see it quite easily.

#4 can actually happen before #3... which is fine. The queue was declared in #1, so it certainly should exist, should it not?

Any other ideas?

Have the same issue when using live migration. Change rpc to the "carrot" fix this.

Vish Ishaya (vishvananda) wrote :

I've been trying very hard to reproduce this bug. On my test system with oneiric (which has kombu 1.4.3), I cannot get it to fail. I can run 6 instances simultaneously with euca (works on current devstack) and change max_count to 6 and do a nova boot successfully. I've also tested with current trunk and I don't get any issue there either.

David Kranz (david-kranz) wrote :

OK, I will try it again but will be on a trip next week. Unfortunately the fix for the db pool bug was removed from ubuntu6.2 when it was released so I cannot easily do this at the moment because our automated testing is based on packages. This is another reason why it would be good for the automated build process to produce ppas for both trunk and diablo stable.

Download full text (7.1 KiB)

Is it possible to use the oneiric packages? They are roughly based on diablo stable.

On Nov 4, 2011, at 6:24 AM, David Kranz wrote:

> OK, I will try it again but will be on a trip next week. Unfortunately
> the fix for the db pool bug was removed from ubuntu6.2 when it was
> released so I cannot easily do this at the moment because our automated
> testing is based on packages. This is another reason why it would be
> good for the automated build process to produce ppas for both trunk and
> diablo stable.
>
> --
> You received this bug notification because you are subscribed to
> OpenStack Compute (nova).
> https://bugs.launchpad.net/bugs/855030
>
> Title:
> Encountering sporadic AMQPChannelException
>
> Status in OpenStack Compute (Nova):
> Incomplete
>
> Bug description:
> Running one controller and one compute, using nova-network version
> 2011.3~rc~20110909.r1155-0ub on Oneiric . Repeating the following:
>
> Create four instances, terminate them.
>
> This mostly works but one in every three or four times, one of the vms
> does not come up and I see this error in the nova-network log. This
> can occur both in the controller node (also running compute) or the
> compute node. (contents of nova.conf follows)
>
> 2011-09-20 13:22:59,295 DEBUG nova.utils [-] Attempting to grab semaphore "iptables" for method "apply"... from (pid=1082) inner /usr/lib/pymodules/python2.7/nova/utils.py:672
> 2011-09-20 13:22:59,295 DEBUG nova.utils [-] Attempting to grab file lock "iptables" for method "apply"... from (pid=1082) inner /usr/lib/pymodules/python2.7/nova/utils.py:677
> 2011-09-20 13:22:59,296 DEBUG nova.utils [-] Running cmd (subprocess): sudo iptables-save -t filter from (pid=1082) execute /usr/lib/pymodules/python2.7/nova/utils.py:165
> 2011-09-20 13:22:59,311 DEBUG nova.utils [-] Running cmd (subprocess): sudo iptables-restore from (pid=1082) execute /usr/lib/pymodules/python2.7/nova/utils.py:165
> 2011-09-20 13:22:59,350 DEBUG nova.utils [-] Running cmd (subprocess): sudo iptables-save -t nat from (pid=1082) execute /usr/lib/pymodules/python2.7/nova/utils.py:165
> 2011-09-20 13:22:59,366 DEBUG nova.utils [-] Running cmd (subprocess): sudo iptables-restore from (pid=1082) execute /usr/lib/pymodules/python2.7/nova/utils.py:165
> 2011-09-20 13:22:59,424 ERROR nova.rpc [-] Exception during message handling
> (nova.rpc): TRACE: Traceback (most recent call last):
> (nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 628, in _process_data
> (nova.rpc): TRACE: ctxt.reply(None, None)
> (nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 673, in reply
> (nova.rpc): TRACE: msg_reply(self.msg_id, *args, **kwargs)
> (nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 781, in msg_reply
> (nova.rpc): TRACE: conn.direct_send(msg_id, msg)
> (nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 562, in __exit__
> (nova.rpc): TRACE: self._done()
> (nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 547, in _done
> (nova.rpc): TRACE: self.co...

Read more...

David Kranz (david-kranz) wrote :

As far as I can see, the current oneiric packages are based on a security update that does not have any of the diablo fixes. Please correct me if I am wrong. In any event we just built our own packages based on diable/stable and I am installing now.

David Kranz (david-kranz) wrote :

OK, I tried this with packages generated from diablo stable and do not see the problem any more. I think this ticket could be closed for now. It would be a lot of work to figure out exactly which source change made the difference and there are more important things to do. I am working on a more general torture test that will trigger this if it still exists.

David Kranz (david-kranz) wrote :

I am sorry but I was a dork and forgot to remove the carrot directive from nova.conf. When I remove that, the same problem easily happens. I really don't know what to suggest at this point but this is a real bug still there. I was again using two nodes, both running compute, with euca-run-instances creating 16 at a time.

Vish Ishaya (vishvananda) wrote :
Download full text (7.1 KiB)

Ok good to know. Perhaps this only occurs on multiple machines. Are you running just one nova-network?

If I can reproduce this I'm going to turn on amqplib debug logging and see if I can track down what is happening.

Vish

On Nov 4, 2011, at 11:04 AM, David Kranz wrote:

> As far as I can see, the current oneiric packages are based on a
> security update that does not have any of the diablo fixes. Please
> correct me if I am wrong. In any event we just built our own packages
> based on diable/stable and I am installing now.
>
> --
> You received this bug notification because you are subscribed to
> OpenStack Compute (nova).
> https://bugs.launchpad.net/bugs/855030
>
> Title:
> Encountering sporadic AMQPChannelException
>
> Status in OpenStack Compute (Nova):
> Incomplete
>
> Bug description:
> Running one controller and one compute, using nova-network version
> 2011.3~rc~20110909.r1155-0ub on Oneiric . Repeating the following:
>
> Create four instances, terminate them.
>
> This mostly works but one in every three or four times, one of the vms
> does not come up and I see this error in the nova-network log. This
> can occur both in the controller node (also running compute) or the
> compute node. (contents of nova.conf follows)
>
> 2011-09-20 13:22:59,295 DEBUG nova.utils [-] Attempting to grab semaphore "iptables" for method "apply"... from (pid=1082) inner /usr/lib/pymodules/python2.7/nova/utils.py:672
> 2011-09-20 13:22:59,295 DEBUG nova.utils [-] Attempting to grab file lock "iptables" for method "apply"... from (pid=1082) inner /usr/lib/pymodules/python2.7/nova/utils.py:677
> 2011-09-20 13:22:59,296 DEBUG nova.utils [-] Running cmd (subprocess): sudo iptables-save -t filter from (pid=1082) execute /usr/lib/pymodules/python2.7/nova/utils.py:165
> 2011-09-20 13:22:59,311 DEBUG nova.utils [-] Running cmd (subprocess): sudo iptables-restore from (pid=1082) execute /usr/lib/pymodules/python2.7/nova/utils.py:165
> 2011-09-20 13:22:59,350 DEBUG nova.utils [-] Running cmd (subprocess): sudo iptables-save -t nat from (pid=1082) execute /usr/lib/pymodules/python2.7/nova/utils.py:165
> 2011-09-20 13:22:59,366 DEBUG nova.utils [-] Running cmd (subprocess): sudo iptables-restore from (pid=1082) execute /usr/lib/pymodules/python2.7/nova/utils.py:165
> 2011-09-20 13:22:59,424 ERROR nova.rpc [-] Exception during message handling
> (nova.rpc): TRACE: Traceback (most recent call last):
> (nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 628, in _process_data
> (nova.rpc): TRACE: ctxt.reply(None, None)
> (nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 673, in reply
> (nova.rpc): TRACE: msg_reply(self.msg_id, *args, **kwargs)
> (nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 781, in msg_reply
> (nova.rpc): TRACE: conn.direct_send(msg_id, msg)
> (nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 562, in __exit__
> (nova.rpc): TRACE: self._done()
> (nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 547, in _done
> (nova.rpc):...

Read more...

David Kranz (david-kranz) wrote :

I am running two nodes, both running network, api, and compute as documented for ha mode.

Vish Ishaya (vishvananda) wrote :

ok, I have seen a few error messges, although I'm having a hard time getting it to fail consistently. Turning on logging seems to make the problem go away so i suspect a tricky race condition. Here is a log of 10 instances going with no failures.

Vish Ishaya (vishvananda) wrote :

It appears that carrot works a little bit differently, opening multiple channels as needed. (greenthread ids are logged as well).

Vish Ishaya (vishvananda) wrote :

Here is the kombu version with greenthread ids logged (note that it always uses channel 1)

Chris Behrens (cbehrens) wrote :

David:

Taking a shot in the dark here... but could you try applying the following patch to impl_kombu.py:

http://paste.openstack.org/show/3207/

And see if the problem goes away?

David Kranz (david-kranz) wrote :

I could do this but am out of the office until Monday. Since Vish can now reproduce this he could give it a try. I will do it on Monday if nothing else ha happened.

Download full text (7.0 KiB)

I will try it, although my errors are extremely inconsistent. I've managed to start and stop 20 vms regularly without errors, so I'm not sure how many i would have to do to consider it fixed.

Vish

On Nov 9, 2011, at 6:17 AM, David Kranz wrote:

> I could do this but am out of the office until Monday. Since Vish can
> now reproduce this he could give it a try. I will do it on Monday if
> nothing else ha happened.
>
> --
> You received this bug notification because you are subscribed to
> OpenStack Compute (nova).
> https://bugs.launchpad.net/bugs/855030
>
> Title:
> Encountering sporadic AMQPChannelException
>
> Status in OpenStack Compute (Nova):
> Incomplete
>
> Bug description:
> Running one controller and one compute, using nova-network version
> 2011.3~rc~20110909.r1155-0ub on Oneiric . Repeating the following:
>
> Create four instances, terminate them.
>
> This mostly works but one in every three or four times, one of the vms
> does not come up and I see this error in the nova-network log. This
> can occur both in the controller node (also running compute) or the
> compute node. (contents of nova.conf follows)
>
> 2011-09-20 13:22:59,295 DEBUG nova.utils [-] Attempting to grab semaphore "iptables" for method "apply"... from (pid=1082) inner /usr/lib/pymodules/python2.7/nova/utils.py:672
> 2011-09-20 13:22:59,295 DEBUG nova.utils [-] Attempting to grab file lock "iptables" for method "apply"... from (pid=1082) inner /usr/lib/pymodules/python2.7/nova/utils.py:677
> 2011-09-20 13:22:59,296 DEBUG nova.utils [-] Running cmd (subprocess): sudo iptables-save -t filter from (pid=1082) execute /usr/lib/pymodules/python2.7/nova/utils.py:165
> 2011-09-20 13:22:59,311 DEBUG nova.utils [-] Running cmd (subprocess): sudo iptables-restore from (pid=1082) execute /usr/lib/pymodules/python2.7/nova/utils.py:165
> 2011-09-20 13:22:59,350 DEBUG nova.utils [-] Running cmd (subprocess): sudo iptables-save -t nat from (pid=1082) execute /usr/lib/pymodules/python2.7/nova/utils.py:165
> 2011-09-20 13:22:59,366 DEBUG nova.utils [-] Running cmd (subprocess): sudo iptables-restore from (pid=1082) execute /usr/lib/pymodules/python2.7/nova/utils.py:165
> 2011-09-20 13:22:59,424 ERROR nova.rpc [-] Exception during message handling
> (nova.rpc): TRACE: Traceback (most recent call last):
> (nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 628, in _process_data
> (nova.rpc): TRACE: ctxt.reply(None, None)
> (nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 673, in reply
> (nova.rpc): TRACE: msg_reply(self.msg_id, *args, **kwargs)
> (nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 781, in msg_reply
> (nova.rpc): TRACE: conn.direct_send(msg_id, msg)
> (nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 562, in __exit__
> (nova.rpc): TRACE: self._done()
> (nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu.py", line 547, in _done
> (nova.rpc): TRACE: self.connection.reset()
> (nova.rpc): TRACE: File "/usr/lib/pymodules/python2.7/nova/rpc/impl_kombu....

Read more...

David Kranz (david-kranz) wrote :

OK, I tried this patch and it made no difference. Note that I found it even easier to see this problem by starting 16 vms with euca-run-instances. BTW, though it probably not matter, the vm image I am using is

https://uec-images.ubuntu.com/server/oneiric/current/oneiric-server-cloudimg-amd64.tar.gz

as m1.small

Michael Hempel (otaku) wrote :

Hi, this issue happend to me too, when I added my first compute-node to a devstack-install. There was no way to spin up a vm, with random disconnects/errors from rabbit and glance. Turned out I just exceeded the quota for RAM set in horizon, perhaps some meaningful error-message should be created ;-)

Changed in nova:
status: Incomplete → Confirmed
importance: Undecided → Critical
assignee: nobody → Vish Ishaya (vishvananda)
milestone: none → essex-2
Vish Ishaya (vishvananda) wrote :

We finally tracked down the cause of this. Complicated issue regarding the way that we return from rpc calls. If an rpc call was returning None, then it was closing the connection early on the receiving end. This is because we use None to signify the final return in a series of returned values via multicall.

Comstud is working on a fix for multicall so that None is a legal return. In the meantime I found the method that was erroneously returning None and will be submitting a merge (and backport request) right away.

If you want to try out the fix right away, it is pretty simple:

diff --git a/nova/network/manager.py b/nova/network/manager.py
index 96d6dee..955cca0 100644
--- a/nova/network/manager.py
+++ b/nova/network/manager.py
@@ -180,7 +180,7 @@ class RPCAllocateFixedIP(object):
         perform network lookup on the far side of rpc.
         """
         network = self.db.network_get(context, network_id)
- self.allocate_fixed_ip(context, instance_id, network, **kwargs)
+ return self.allocate_fixed_ip(context, instance_id, network, **kwargs)

Reviewed: https://review.openstack.org/1929
Committed: http://github.com/openstack/nova/commit/cabdc3b1f64b7f022a1c62a4cebce54a2deba807
Submitter: Jenkins
Branch: master

 status fixcommitted
 done

commit cabdc3b1f64b7f022a1c62a4cebce54a2deba807
Author: Vishvananda Ishaya <email address hidden>
Date: Mon Nov 28 14:42:38 2011 -0800

    Makes rpc_allocate_fixed_ip return properly

     * Fixes bug 855030
     * Includes test

    Change-Id: If5b874fb0e4abd567445e67141d61942866cc5ec

Changed in nova:
status: Confirmed → Fix Committed

Reviewed: https://review.openstack.org/1930
Committed: http://github.com/openstack/nova/commit/1e3b88ba20b9f9ab925ecd1238845adbc0157cb1
Submitter: Jenkins
Branch: stable/diablo

 tag in-stable-diablo
 done

commit 1e3b88ba20b9f9ab925ecd1238845adbc0157cb1
Author: Vishvananda Ishaya <email address hidden>
Date: Mon Nov 28 14:42:38 2011 -0800

    Makes rpc_allocate_fixed_ip return properly

     * Fixes bug 855030
     * Includes test

    Change-Id: If5b874fb0e4abd567445e67141d61942866cc5ec

Scott Moser (smoser) on 2011-11-29
Changed in nova (Ubuntu Oneiric):
importance: Undecided → High
status: New → Triaged
Changed in nova (Ubuntu Precise):
status: New → Triaged
importance: Undecided → High
Chuck Short (zulcss) wrote :

This is fixed in precise.

Chuck Short (zulcss) on 2011-12-07
Changed in nova (Ubuntu Precise):
status: Triaged → Fix Committed
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2011-12-14
Changed in nova:
status: Fix Committed → Fix Released
Tom Ellis (tellis) wrote :

Could this be pushed to oneiric too Chuck?

Thierry Carrez (ttx) on 2012-04-05
Changed in nova:
milestone: essex-2 → 2012.1
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers