Comment 16 for bug 565101

Thierry, some comments:

*Heap size*

0. The machines on the test rig all have 10G of main memory;
1. When we started (no -Xmx) we would get OutOfMemory when the eucalyptus (CLC, Walrus, CC) resident memory -- as reported by 'top' -- was at about 800M;
2. When I tried -Xmx384m I observed OutOfMemory earlier, when the eucalyptus resident memory was at about 650M;
3. When I tried -Xmx1024m I observed OutOfMemory later, when the eucalyptus resident memory was at about 1300M.

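So, indeed, setting -Xmx has to be done carefully. But (3) above suggests the default heap allocation is *not* 1G (the lesser of 1/4 of 10G and 1G): with an explicit 1G limit, OutOfMemory came later, and at a higher resident size, than with the default.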
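
One way to check this directly, rather than inferring it from 'top', would be to ask the JVM what maximum heap it actually picked. The following is only a sketch: the class and method names (HeapCheck, expectedDefaultMaxHeap) are made up for illustration, the 10G figure is the test rig's, and the "lesser of 1/4 of physical memory and 1G" rule is the one quoted above, not something verified against this JVM. Note that Runtime.maxMemory() may report a value slightly below the configured limit on some JVMs.

```java
public class HeapCheck {
    // Rule quoted in this comment: lesser of 1/4 of physical memory and 1 GiB.
    // This is an assumption about the JVM default, not a verified fact.
    static long expectedDefaultMaxHeap(long physicalBytes) {
        long oneGiB = 1024L * 1024L * 1024L;
        return Math.min(physicalBytes / 4, oneGiB);
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        long physical = 10L * 1024L * mib; // 10G, as on the test rig
        long expected = expectedDefaultMaxHeap(physical);
        // Maximum heap the running JVM will attempt to use.
        long actual = Runtime.getRuntime().maxMemory();
        System.out.println("expected default max heap: " + expected / mib + "M");
        System.out.println("actual max heap:           " + actual / mib + "M");
    }
}
```

Running this with the same java binary and the same options as eucalyptus (minus -Xmx) should show whether the default is really 1G on these machines.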

*Heap usage*

I have been unable to observe any reduction in memory usage (again, as reported by 'top'), even on an idle cloud (i.e., a cloud that has no instances running). I would expect memory to increase with usage (class instantiations, etc.), followed by memory being released when the system idles. All I have observed so far is that memory usage *only* increases. So either there is a code issue, or a Java issue.
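
To help tell those two cases apart, it may be worth looking at the heap from inside the JVM as well as at 'top': resident size includes memory the JVM has committed but is not currently using, and a JVM does not necessarily hand freed heap back to the OS. A minimal sketch using the standard java.lang.management API follows; the class name HeapUsageProbe and the describe helper are hypothetical, and the code would have to run inside the eucalyptus process (or the same numbers be read over JMX) to say anything about it.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapUsageProbe {
    // Formats byte counts as whole megabytes; max is -1 when undefined.
    static String describe(long used, long committed, long max) {
        long mib = 1024L * 1024L;
        String maxText = (max < 0) ? "undefined" : (max / mib + "M");
        return "used=" + used / mib + "M committed=" + committed / mib
                + "M max=" + maxText;
    }

    public static void main(String[] args) {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        // "used" is what live and not-yet-collected objects occupy;
        // "committed" is what the JVM has reserved from the OS for the heap.
        MemoryUsage heap = bean.getHeapMemoryUsage();
        MemoryUsage nonHeap = bean.getNonHeapMemoryUsage();
        System.out.println("heap:     "
                + describe(heap.getUsed(), heap.getCommitted(), heap.getMax()));
        System.out.println("non-heap: "
                + describe(nonHeap.getUsed(), nonHeap.getCommitted(), nonHeap.getMax()));
    }
}
```

If "used" drops on an idle cloud while the resident size stays flat, that would point at the JVM holding on to memory; if "used" itself only grows, that would point at the code.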

The day before yesterday I bounced *all* systems (so that I could guarantee we were starting from a clean slate), and submitted a 2,000 instance run. This run went from about 22:00 to 03:00 EDT; from about 01:30 onwards, no instance start succeeded, and OutOfMemory errors were being reported on the CLC, Walrus, and CC. The only conclusion I can draw right now is that a failed instance startup does not release memory. Remember that we are observing 40-50% failure rates right now.

Additionally, it seems that after an OutOfMemory no new instances can be run. In other words, an OutOfMemory *should* be considered a fatal error, requiring a restart of the affected components. At least right now.
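
As an illustration of what "fatal" could mean in practice, here is a rough sketch of a default uncaught-exception handler that stops the JVM when an OutOfMemoryError surfaces, so that an external supervisor can restart the component. This is not how eucalyptus behaves today as far as I know; the class name FatalOom, the isFatal helper, and the exit status 1 are all assumptions for the example, and it only catches errors that escape a thread uncaught.

```java
public class FatalOom {
    // True if the throwable or any of its causes is an OutOfMemoryError.
    // The depth limit guards against unusually long or cyclic cause chains.
    static boolean isFatal(Throwable t) {
        for (int depth = 0; t != null && depth < 32; depth++) {
            if (t instanceof OutOfMemoryError) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }

    // Installs a JVM-wide handler for exceptions that no thread caught.
    static void install() {
        Thread.setDefaultUncaughtExceptionHandler(
            new Thread.UncaughtExceptionHandler() {
                public void uncaughtException(Thread thread, Throwable error) {
                    if (isFatal(error)) {
                        // halt() skips shutdown hooks, which may themselves
                        // need memory that is no longer available.
                        Runtime.getRuntime().halt(1);
                    }
                }
            });
    }
}
```

Calling FatalOom.install() once at startup would be enough; whether a hard halt is acceptable for the CLC, Walrus, and CC is a separate question.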