Nova compute servers unable to reach AMQP servers

Bug #902234 reported by Eugene Shih
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Incomplete
Undecided
Unassigned

Bug Description

I have been running a stress test on a Nova cluster that creates VMs with m1.tiny images and destroys them randomly. Sample code is up on github (beware, the code is very rough): https://github.com/quanta-research-cambridge/openstack-stress-test

The Nova cluster has one Nova controller node and two Nova compute nodes. I have installed a recent version of Diablo on all three nodes from the stable/diablo branch of the nova repository (commit 1e3b88ba20b9f9ab925ecd1238845adbc0157cb1). Images are from https://uec-images.ubuntu.com/server/oneiric/current/oneiric-server-cloudimg-amd64.tar.gz.

In my test, I try to launch at most 32 or 64 machines with a sleep time of at least 600ms between launches. I turned off rate-limiting to simulate total load. I'm using carrot instead of kombu for the messaging. Occasionally, during these runs, on one of the two compute nodes, I will be presented with the following error in the log files:

2011-12-09 10:39:11,225 ERROR nova.rpc [d2910bd0-e7c4-4761-9ba5-ec8cb30466a9 tester testproject]
AMQP server on 172.18.0.131:5672 is unreachable: Socket closed. Trying again in 1 seconds.
2011-12-09 10:39:11,226 ERROR nova.rpc [a88ee258-f5bd-40ad-b84a-faaf7d11d9e7 tester testproject]
AMQP server on 172.18.0.131:5672 is unreachable: Socket closed. Trying again in 1 seconds.
2011-12-09 10:39:11,227 ERROR nova.rpc [204c8797-afd1-47b2-9135-7ef7cc172a25 tester testproject]
AMQP server on 172.18.0.131:5672 is unreachable: Socket closed. Trying again in 1 seconds.
2011-12-09 10:39:11,227 ERROR nova.rpc [c322da24-c733-47f7-95b8-5b4b8861cede tester testproject]
AMQP server on 172.18.0.131:5672 is unreachable: Socket closed. Trying again in 1 seconds.
2011-12-09 10:39:11,227 ERROR nova.rpc [c90f3abc-d55f-4967-a56f-71cfc6835ffc tester testproject]
AMQP server on 172.18.0.131:5672 is unreachable: Socket closed. Trying again in 1 seconds.

I believe this is some kind of bug. If it isn't a big, it should not be an ERROR, but perhaps only a WARNING?

description: updated
description: updated
Revision history for this message
Chris Behrens (cbehrens) wrote :

Hard to say what's going on from that.

Could you run the compute node under 'strace -s8192 -f -v -o compute-trace.log' and provide the compute-trace.log somehow?

Revision history for this message
Eugene Shih (eugene-shih) wrote :

I'm not familiar with strace, can you tell me what you mean by "run the compute node under ...?" I can strace the nova-compute process with -p, if that is what you mean.

Revision history for this message
Eugene Shih (eugene-shih) wrote :
Download full text (4.1 KiB)

I edited the /etc/init/nova-compute.conf files on my two compute nodes to launch nova-compute under strace as suggested by cbehrens. Unfortunately, this seemed to cause another error:

/var/log/nova/nova-compute.log:2011-12-12 13:43:49,311 ERROR nova.compute.manager [-] Instance '76' failed to spawn. Is virtualization enabled in the BIOS? Details: Unexpected error while running command.
/var/log/nova/nova-compute.log:(nova.compute.manager): TRACE: Traceback (most recent call last):
/var/log/nova/nova-compute.log:(nova.compute.manager): TRACE: File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 438, in _run_instance
/var/log/nova/nova-compute.log:(nova.compute.manager): TRACE: network_info, block_device_info)
/var/log/nova/nova-compute.log:(nova.compute.manager): TRACE: File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 131, in wrapped
/var/log/nova/nova-compute.log:(nova.compute.manager): TRACE: raise Error(str(e))
/var/log/nova/nova-compute.log:(nova.compute.manager): TRACE: Error: Unexpected error while running command.
/var/log/nova/nova-compute.log:(nova.compute.manager): TRACE: Command: sudo iptables-save -t filter
/var/log/nova/nova-compute.log:(nova.compute.manager): TRACE: Exit code: 1
/var/log/nova/nova-compute.log:(nova.compute.manager): TRACE: Stdout: ''
/var/log/nova/nova-compute.log:(nova.compute.manager): TRACE: Stderr: 'sudo: must be setuid root\n'
/var/log/nova/nova-compute.log:(nova.compute.manager): TRACE:
/var/log/nova/nova-compute.log:2011-12-12 13:43:53,039 ERROR nova.exception [-] Uncaught exception
/var/log/nova/nova-compute.log:(nova.exception): TRACE: Traceback (most recent call last):
/var/log/nova/nova-compute.log:(nova.exception): TRACE: File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 100, in wrapped
/var/log/nova/nova-compute.log:(nova.exception): TRACE: return f(*args, **kw)
/var/log/nova/nova-compute.log:(nova.exception): TRACE: File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/connection.py", line 618, in spawn
/var/log/nova/nova-compute.log:(nova.exception): TRACE: self.firewall_driver.setup_basic_filtering(instance, network_info)
/var/log/nova/nova-compute.log:(nova.exception): TRACE: File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/firewall.py", line 525, in setup_basic_filtering
/var/log/nova/nova-compute.log:(nova.exception): TRACE: self.refresh_provider_fw_rules()
/var/log/nova/nova-compute.log:(nova.exception): TRACE: File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/firewall.py", line 738, in refresh_provider_fw_rules
/var/log/nova/nova-compute.log:(nova.exception): TRACE: self.iptables.apply()
/var/log/nova/nova-compute.log:(nova.exception): TRACE: File "/usr/lib/python2.7/dist-packages/nova/utils.py", line 686, in inner
/var/log/nova/nova-compute.log:(nova.exception): TRACE: retval = f(*args, **kwargs)
/var/log/nova/nova-compute.log:(nova.exception): TRACE: File "/usr/lib/python2.7/dist-packages/nova/network/linux_net.py", line 309, in apply
/var/log/nova/nova-compute.log:(nova.exception): TRACE: attempts=5)
/var/log/nova/nova-compute.log:(nova.exception): TRACE: File "/usr/lib/pyth...

Read more...

Revision history for this message
Eugene Shih (eugene-shih) wrote :

Hi, I have now been able to reproduce that bug and capture it using strace on nova-compute. Not only do I get that original ERROR, but the additional "sudo: must be setuid root\n" error. I have attached the strace log from the compute node reporting the error here.

The following command was used to run strace in /etc/init/nova-compute.conf

exec strace -s8192 -f -v -o /var/tmp/compute-trace.log su -c "nova-compute --flagfile
=/etc/nova/nova.conf --flagfile=/etc/nova/nova-compute.conf" nova

description: updated
description: updated
Revision history for this message
Vish Ishaya (vishvananda) wrote : Re: [Bug 902234] Nova compute servers unable to reach AMQP servers
Download full text (4.9 KiB)

Do you get the same error using kombu from stable/diablo?

Carrot is considerably less robust than kombu, and since we finally fixed the error in rpc that was causing kombu to fail, I would suggest using it.

On Dec 13, 2011, at 8:23 AM, Eugene Shih wrote:

> ** Description changed:
>
> I have been running a stress test on a Nova cluster that creates VMs
> with m1.tiny images and destroys them randomly. Sample code is up on
> github (beware, the code is very rough): https://github.com/quanta-
> research-cambridge/openstack-stress-test
>
> The Nova cluster has one Nova controller node and two Nova compute
> nodes. I have installed a recent version of Diablo on all three nodes
> from the stable/diablo branch of the nova repository (commit
> 1e3b88ba20b9f9ab925ecd1238845adbc0157cb1). Images are from https://uec-
> images.ubuntu.com/server/oneiric/current/oneiric-server-cloudimg-
> amd64.tar.gz.
>
> In my test, I try to launch at most 32 or 64 machines with a sleep time
> of at least 600ms between launches. I turned off rate-limiting to
> - simulate total load. Occasionally, during these runs, on one of the two
> - compute nodes, I will be presented with the following error in the log
> - files:
> + simulate total load. I'm using carrot instead of kombu for the
> + messaging. Occasionally, during these runs, on one of the two compute
> + nodes, I will be presented with the following error in the log files:
>
> 2011-12-09 10:39:11,225 ERROR nova.rpc [d2910bd0-e7c4-4761-9ba5-ec8cb30466a9 tester testproject]
> AMQP server on 172.18.0.131:5672 is unreachable: Socket closed. Trying again in 1 seconds.
> 2011-12-09 10:39:11,226 ERROR nova.rpc [a88ee258-f5bd-40ad-b84a-faaf7d11d9e7 tester testproject]
> AMQP server on 172.18.0.131:5672 is unreachable: Socket closed. Trying again in 1 seconds.
> 2011-12-09 10:39:11,227 ERROR nova.rpc [204c8797-afd1-47b2-9135-7ef7cc172a25 tester testproject]
> AMQP server on 172.18.0.131:5672 is unreachable: Socket closed. Trying again in 1 seconds.
> 2011-12-09 10:39:11,227 ERROR nova.rpc [c322da24-c733-47f7-95b8-5b4b8861cede tester testproject]
> AMQP server on 172.18.0.131:5672 is unreachable: Socket closed. Trying again in 1 seconds.
> 2011-12-09 10:39:11,227 ERROR nova.rpc [c90f3abc-d55f-4967-a56f-71cfc6835ffc tester testproject]
> AMQP server on 172.18.0.131:5672 is unreachable: Socket closed. Trying again in 1 seconds.
>
> I believe this is some kind of bug. If it isn't a big, it should not be
> an ERROR, but perhaps only a WARNING?
>
> --
> You received this bug notification because you are subscribed to
> OpenStack Compute (nova).
> https://bugs.launchpad.net/bugs/902234
>
> Title:
> Nova compute servers unable to reach AMQP servers
>
> Status in OpenStack Compute (Nova):
> New
>
> Bug description:
> I have been running a stress test on a Nova cluster that creates VMs
> with m1.tiny images and destroys them randomly. Sample code is up on
> github (beware, the code is very rough): https://github.com/quanta-
> research-cambridge/openstack-stress-test
>
> The Nova cluster has one Nova controller node and two Nova compute
> nodes. I have installed a recent version of Diablo o...

Read more...

Thierry Carrez (ttx)
Changed in nova:
status: New → Incomplete
Revision history for this message
Eugene Shih (eugene-shih) wrote :

I have updated nova to include the bug fixes suggested and switched to kombu. I still get the ERROR in the nova-compute.log file

root : ERROR xg01: /var/log/nova/nova-compute.log:2011-12-14 13:10:48,436 ERROR nova.rpc [7d4092b3-d9c3-49f1-a04d-11ad029a48f9 tester testproject] AMQP server on 172.18.0.131:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds.

I straced nova-compute and have the log file, but it is 900 MB, so I will share it if you want me to, via another means.

Thierry Carrez (ttx)
Changed in nova:
status: Incomplete → New
Revision history for this message
Tom Fifield (fifieldt) wrote :

Do you still get the same behaviour with essex?

Changed in nova:
status: New → Incomplete
Revision history for this message
David Kranz (david-kranz) wrote :

I am a colleague of the reporter. I think this is probably a duplicate of https://bugs.launchpad.net/nova/+bug/797770. There is a comment near the end of that bug about changing the log message to be DEBUG. It would be great if that happened.

Revision history for this message
Thierry Carrez (ttx) wrote :

@David: I'll close this bug as a duplicate -- could you open a new bug about changing the message to a DEBUG ?

Revision history for this message
David Kranz (david-kranz) wrote :
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.