nova-network leaks memory over time and eventually stops responding

Bug #903199 reported by Andrew Glen-Young on 2011-12-12
This bug affects 3 people
Affects: OpenStack Compute (nova)
Importance: High
Assigned to: Johannes Erdfelt

Bug Description

I have noticed a trend in nova-network where it seems to leak memory over time and eventually stops responding to or processing messages from RabbitMQ.

Stracing the process doesn't reveal anything enlightening. Please let me know if I can provide any further information.

Stopping and starting nova-network seems to be a work-around for this bug (despite LP#785955).

= Process information =

$ ps axfuwww | grep nova-network
nova 27939 0.0 0.0 45824 580 ? Ss Nov30 0:00 su -c nova-network --flagfile=/etc/nova/nova.conf nova
nova 27940 9.1 23.1 4730344 1416820 ? Dl Nov30 1566:29 \_ /usr/bin/python /usr/bin/nova-network --flagfile=/etc/nova/nova.conf

$ sudo strace -p 27940
Process 27940 attached - interrupt to quit

Swapped (from /proc/27940/smaps):
2788140 kB - 27940 (nova-network)
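
The report does not say how the swapped figure was obtained. One possible way, sketched here as an assumption, is to sum the Swap fields of every mapping listed in /proc/<pid>/smaps. The function names are hypothetical and not part of nova.

```python
import re


def swapped_kb(smaps_text):
    """Sum every 'Swap:' field (in kB) found in the text of an smaps file."""
    # Anchoring on "Swap:" at line start skips related fields such as "SwapPss:".
    sizes = re.findall(r"^Swap:\s+(\d+) kB$", smaps_text, re.MULTILINE)
    return sum(int(kb) for kb in sizes)


def swapped_kb_for_pid(pid):
    """Read /proc/<pid>/smaps (Linux only) and return the total swapped kB."""
    with open("/proc/%d/smaps" % pid) as f:
        return swapped_kb(f.read())
```

Reading another user's smaps file normally requires root, so swapped_kb_for_pid(27940) would have to be run with sudo on the affected host.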

$ sudo rabbitmqctl list_queues | grep -E '^network'
network.cc 3056

= System Information =

$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=11.10
DISTRIB_CODENAME=oneiric
DISTRIB_DESCRIPTION="Ubuntu 11.10"

$ dpkg-query --show nova-network
nova-network 2011.3-0ubuntu6.2

Andrew Glen-Young (aglenyoung) wrote :

Added nova config below:

--dhcpbridge_flagfile=/etc/nova/nova.conf
--dhcpbridge=/usr/bin/nova-dhcpbridge
--logdir=/var/log/nova
--state_path=/var/lib/nova
--lock_path=/var/lock/nova
--default_log_levels=DEBUG
--my_ip=172.16.58.1
--rabbit_host=172.16.58.1
--sql_connection=mysql://user:pass@172.16.58.1/nova
--glance_api_servers=172.16.58.1:9292
--network_manager=nova.network.manager.FlatDHCPManager
--network_size=256
--public_interface=eth1
--flat_interface=eth0
--bridge=br100
--fixed_range=172.16.60.0/24
--flat_network_dhcp_start=172.16.60.3
--floating_range=172.16.93.64/26
--use_deprecated_auth
--force_dhcp_release=True
--iscsi_helper=tgtadm
--verbose

Andrea Rosa (andrea-rosa-m) wrote :

Are there any errors or warnings in the rabbitmq log file?

Andrew Glen-Young (aglenyoung) wrote :

@Andrea:

There are no warnings or errors. Every line in the 5352264-line rabbitmq log file matches one of the regular expressions below:

^\s*$
^=INFO REPORT==== [0-9]+-[A-Za-z]{3}-[0-9]{2}::[0-9]{2}:[0-9]{2}:[0-9]{2} ===$
^accepted TCP connection on [::]:5672 from [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}:[0-9]+$
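
A check like this could be scripted roughly as follows. This is only a sketch, not the command that was actually run: the function name is hypothetical, and the brackets in "[::]" are escaped on the assumption that the log shows the literal IPv6 wildcard address.

```python
import re

# Patterns taken from the comment above. Assumption: "[::]" is meant
# literally, so its brackets are escaped here.
EXPECTED = [re.compile(p) for p in (
    r"^\s*$",
    r"^=INFO REPORT==== [0-9]+-[A-Za-z]{3}-[0-9]{2}::[0-9]{2}:[0-9]{2}:[0-9]{2} ===$",
    r"^accepted TCP connection on \[::\]:5672 from "
    r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}:[0-9]+$",
)]


def unexpected_lines(lines):
    """Return the lines that match none of the expected patterns."""
    return [line for line in lines
            if not any(p.match(line.rstrip("\n")) for p in EXPECTED)]
```

Passing an open log file to unexpected_lines() returns whatever is left after the routine entries are filtered out; an empty list corresponds to the result reported here.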

Note that the compute nodes were processing messages from nova-api and nova-scheduler from the same rabbitmq instance while nova-network was not responding. This accounts for the ~3000 message (and growing) network queue that I discovered.

Thierry Carrez (ttx) on 2011-12-13
Changed in nova:
importance: Undecided → High
status: New → Confirmed

Dave Haynes (dave-haynes) wrote :

I have been using Meliae to monitor memory usage in nova-network.
The process accumulates objects which are never recovered by the garbage collector. They are _DummyThread objects, and they are created and documented in the Python threading.py module.

Each of these objects accounts for about 6 KB of memory. In a simple test on Nova, creating a new instance and IP every 5 minutes, several hundred _DummyThread objects accumulated during an overnight run. The pragmatic approach is to restart the process when the resident memory usage becomes significant.

The root cause is that certain operations performed from within eventlets call threading.current_thread(). The behaviour is demonstrated (not by me) here:
https://gist.github.com/1346749/
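
The underlying mechanism can also be seen without eventlet. The sketch below is not the linked gist: it uses the low-level _thread module as a stand-in for a green thread, on the assumption that what matters is calling threading.current_thread() from a thread the threading module did not create. The function name is hypothetical.

```python
import threading

try:
    import _thread            # Python 3
except ImportError:
    import thread as _thread  # Python 2


def current_thread_type_in_raw_thread():
    """Call threading.current_thread() from a thread that was not started
    through threading.Thread and report what the call hands back."""
    result = {}
    done = threading.Event()

    def worker():
        t = threading.current_thread()
        # The threading module has no record of this thread, so it
        # creates a placeholder object on the fly.
        result["type"] = type(t).__name__
        # The placeholder is registered with the module's bookkeeping.
        result["registered"] = t in threading.enumerate()
        done.set()

    _thread.start_new_thread(worker, ())
    done.wait(5)
    return result
```

This only shows how a _DummyThread comes into existence and gets registered. It does not reproduce the accumulation described in this report, which depends on many short-lived eventlets each receiving their own placeholder.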

The operations I have identified which do this are:

1. The lockfiles.synchronize decorator
2. logging.LogRecord.__init__
3. threading._after_fork which I think gets called back from C after subprocess.Popen.

It is possible to monkey-patch the first two, but the third is more difficult. The design of the Python standard libraries is not at fault here.

My feeling is that some re-engineering of Nova is needed, to lighten the load on the wsgi eventlet pool (string processing and low latency look-ups only there) and to hand over more involved operations to another subsystem which deals with lengthy tasks and subprocesses.

This would enable a clearer separation of concerns in the HA environment.

Vish Ishaya (vishvananda) wrote :

Hey Dave,

There is a potential patch for eventlet. Can you see if this patch stops the memory leak:

https://bitbucket.org/gholt/eventlet/changeset/9f3b81131ae9

Dave Haynes (dave-haynes) wrote :

Hi Vish,

Will do. Results early tomorrow.

Dave Haynes (dave-haynes) wrote :

A few hours' test shows that this patch is effective in preventing the accumulation of _DummyThread objects.

The memory footprint does still increase over time. At the moment, I think this is due to SQLAlchemy keeping hold of InstanceState objects. Some of the SQLA caching has an upper limit, but we will need to monitor this.

To go back to eventlet, though: Vish, what side effects can we expect from this patch? Does it not have the potential to change the synchronisation between eventlets during locking, logging, or waiting on subprocesses?

Vish Ishaya (vishvananda) wrote :

Dave,

I don't know the implications, because gholt never followed up with submitting it upstream with tests, etc. Seems like we should try and submit it and see what the eventlet maintainers have to say.

Vish

On Jan 12, 2012, at 1:20 AM, Dave Haynes wrote:

> Hi Vish,
>
> Will do. Results early tomorrow.

Kevin L. Mitchell (klmitch) wrote :

This would appear to additionally explain the memory leak behavior we see when running the unit tests…

Vish Ishaya (vishvananda) wrote :

I talked to gholt and asked him about upstreaming his patch, so hopefully that will fix everything. If someone wants to figure out how to work around it in our code, that would be awesome.

Changed in nova:
assignee: nobody → Johannes Erdfelt (johannes.erdfelt)

I've proposed a patch which should fix the problem in eventlet:

https://bitbucket.org/which_linden/eventlet/issue/115/monkey-patching-thread-will-cause

James Troup (elmo) on 2012-03-05
tags: added: canonistack

Thierry Carrez (ttx) wrote :

Is there anything left to do in Nova? Or should we just close this as invalid/bug-is-in-eventlet?

Changed in nova:
status: Confirmed → Incomplete

Good question. I don't know of anything else that needs to be done in nova. Closing it out as invalid seems appropriate.

Changed in nova:
status: Incomplete → Invalid

Hi all,

I faced the same problem on a setup running VlanManager: high CPU and memory usage, the service unable to contact rabbitmq, and virtual machines not spawning correctly because they did not get a fixed_ip.
I tried upgrading eventlet to trunk, but it did not solve the problem.
I found a workaround: faking the result of the function fixed_ip_disassociate_all_by_timeout in nova/db/sqlalchemy/api.py.
Basically, if that function does nothing, my nova-network runs smoothly even during stress tests.
As a consequence, I have to deallocate fixed_ips by running a SQL query from cron.

The nova version I'm running is Essex-4.
