Complete base performance test on GHO

Bug #484779 reported by Philippe Boucher
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Global Health Observatory
Fix Released
Critical
Jiri Dvorak

Bug Description

Basic performance testing needs to be run on GHO to ascertain performance with respect to table renders and system load. The current required performance level is based on a rough analysis of the WHOSIS performance logs done back in early summer 2009: we had decided that 50 000 table renders, evenly distributed over a 24-hour period, was acceptable, as this is roughly what we tend to see on WHOSIS.

Changed in gho:
assignee: nobody → Knut Staring (knutst)
milestone: none → 1.0
importance: Undecided → High
Revision history for this message
Philippe Boucher (boucherp) wrote :

Performance testing approach:

Analysis of a connection to the GHO through Firebug on Firefox shows that on the client side, the download and rendering of a data table for an indicator on the left navigation menu is done through a POST to http://host/ghodata/OlapPrint, which then returns HTML describing the requested table. I've collected all the POST texts required for the tables in World Health Statistics and written a script to download them sequentially. It's very basic for the moment: the script loops 50 times and calls each entry in the WHS in succession (it does not do simultaneous downloads yet). An archive with the script and the POST files is attached; Cygwin or Linux is required, with sh/bash and wget installed. The performance script is perf.sh
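For reference, the loop in perf.sh is roughly the following. This is a hypothetical reconstruction, not the attached script; the function name, host URL, and file naming are assumptions:

```shell
#!/bin/sh
# Sketch of what perf.sh does (a reconstruction, not the attached script).
# Each post-*.txt file is assumed to hold one "mdx=..." body captured via Firebug.
replay_tables() {
  host=$1; passes=$2; shift 2
  pass=1
  while [ "$pass" -le "$passes" ]; do
    for post in "$@"; do
      # POST the saved body to OlapPrint and discard the returned HTML
      wget -q -O /dev/null --post-file="$post" "$host/ghodata/OlapPrint"
    done
    pass=$((pass + 1))
  done
}

# e.g.: replay_tables http://localhost:8080 50 post-*.txt
```

Each pass replays every saved table render in sequence, so one pass simulates one user walking the whole WHS menu.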

Philippe Boucher (boucherp) wrote :
Changed in gho:
importance: High → Critical
Philippe Boucher (boucherp) wrote :

Running the script against the single-war-file version of the GHO, on a virtual machine running Red Hat Linux with 1 GB RAM and 1 GB swap:
Things start off fine; once the script gets going, it downloads each table in 0.3 to 0.9 seconds. After about 250 downloads, it starts to slow down and eventually everything stops.
The Tomcat server was originally configured to have a larger memory footprint than RAM + swap; in that case, it stopped responding, although it looks like it was issuing log messages in an infinite loop.
I corrected this so that the memory footprint was small enough to sit within RAM + swap (in this case, Java VM max memory set to 768M, which translated to an overall process image size of about 920M). Now we get an out-of-memory error (heap space) after about 350 downloads.
(Log files are attached)
We also get the following error on every table render:

[gho] 23 11 15:56:56 ERROR intl.who.oh2.olapPrinter.helper.OlapPrintModel <init> - toRow paremeter is not nummeric
[gho] 23 11 15:56:57 DEBUG mondrian.mdx execute - 360: select NON EMPTY Crossjoin({[Indicator].[All Indicators].[WHS].[WHS8_110
]}, {[Time Period].[Year].Members}) ON COLUMNS,
  NON EMPTY {Order([Location].[Countries].Members, [Location].CurrentMember.Properties("CountryName"), BASC)} ON ROWS
from [Core Health Sum Indicators]
where [Measures].[CHSI Indicator Value (String)]

This error is generated when running the attached performance script as well as when we simply click on an indicator entry in the left menu.

Philippe Boucher (boucherp) wrote :
Changed in gho:
assignee: Knut Staring (knutst) → Jiri Dvorak (jiri-dvorak)
status: New → In Progress
Philippe Boucher (boucherp) wrote :

Reassigning this to Jiri. We need to have this resolved quickly: we're still targeting a soft launch next week (handover to the IT dept on Dec 3), and we're targeting 50K hits per 24-hour period. We can't restart the application every X requests.

Jiri Dvorak (jiri-dvorak) wrote :

The error message from OlapPrintModel has been fixed - the new code is now in StarTeam, please refresh.

Jiri Dvorak (jiri-dvorak) wrote :

Based on the messages earlier in this thread, the system (especially the JVM) is running out of memory. The bad news is that there is not much we can do about it on the application side; barring any unexpected internal bugs or memory leaks, the OH/GHO code memory usage is quite modest, and very little can be gained by additional tuning.

By far the biggest memory consumer is the Mondrian software component from Pentaho, which is used by GHO as a "black box" (by including mondrian.jar in the build).

Other than adding physical memory to the machine, configuring more "generous" virtual memory in Linux, and maximizing memory availability to the JVM executable (via "java ... -ms ... -mx ..." etc.), the only method of tuning Mondrian resource usage is via the file /who_gho/resources/mondrian.properties. That tuning involves a bit of "black magic" and some testing, and the only official documentation we were able to find is at http://mondrian.pentaho.org/documentation/configuration.php .

Specifically, the following parameters may have immediate influence on Mondrian memory usage (with "sample" values):

- mondrian.query.limit = 4 ... maximum number of simultaneous queries the system will allow

- mondrian.result.limit = 50000 ... maximum size of a result set

- mondrian.rolap.evaluate.MaxEvalDepth = 10 ... maximum number of passes allowable while evaluating an MDX expression

- mondrian.rolap.queryTimeout = 10 ... limits the number of seconds a query executes before it is aborted

- mondrian.rolap.IterationLimit = 10 ... maximum number of iterations allowed when evaluating an aggregate

- mondrian.olap.fun.crossjoin.optimizer.size = ??? ... see the URL, this is pure black magic

- mondrian.rolap.star.disableCaching = true ... setting this to true will clear aggregate cache after each query; that may save a lot of memory

- mondrian.expCache.enable = false ... see the URL, may save some memory, but some MDX queries will run more slowly

- mondrian.rolap.EnableRolapCubeMemberCache ... see the URL, this one probably should stay as "true", but may be worth 1-2 tests ...

- mondrian.util.memoryMonitor.enable = true and mondrian.util.memoryMonitor.percentage.threshold = 60(or 70,80,90) can be used to see memory usage messages earlier in the log file during testing

ALSO PLEASE READ CAREFULLY THE WHOLE SECTION "MEMORY MANAGEMENT" at the URL http://mondrian.pentaho.org/documentation/configuration.php !!!!

Jiri Dvorak (jiri-dvorak) wrote :

Tuning Mondrian - well, I wish I could give you a simple, straightforward answer and a list of values, but it is not that easy. Mondrian is a stateless multi-threaded server, and if the application throws a bunch of complex MDX queries at it simultaneously, or if it expects a lot of caching to take place, Mondrian will keep grabbing memory.

I have now reviewed my notes from the time when we tuned the WHOSIS project for production, and here is the best I could come up with. It could also make sense to force a Java garbage collection periodically, via the System.gc() call. The catch is that each time we "de-tune" something at the caching level, we get lower memory usage but worse query performance.

Here are some suggestions:

# WHOSIS settings - assuming that GHO doesn't use pre-aggregated tables
mondrian.rolap.aggregates.Use=false
mondrian.rolap.aggregates.Read=false
mondrian.native.topcount.enable=true
mondrian.native.filter.enable=true
#mondrian.expCache.enable=true #see below, may be reset later
# Reduce "resource hogs" and too much concurrency, to allow base tests to complete
# If this doesn't work, try even lower values
mondrian.query.limit=4
mondrian.result.limit=50000
mondrian.rolap.evaluate.MaxEvalDepth=10
mondrian.rolap.queryTimeout=10
mondrian.rolap.IterationLimit=10
mondrian.rolap.star.disableCaching=true
mondrian.expCache.enable=false
# Try it initially with false, later change to true
mondrian.rolap.EnableRolapCubeMemberCache=false
# Hopefully get better memory diagnostics in the log
mondrian.util.memoryMonitor.enable=true
mondrian.util.memoryMonitor.percentage.threshold=75

The only other technique I can think of is tuning the MDX queries: reducing cross-joins, removing unnecessary sort operators, possibly even reducing unnecessary hierarchy levels in the model. On that side, I do not really consider myself to be an MDX expert (I know only enough to be dangerous).

Knut Staring (knutst) wrote :

The problem is that we currently don't know whether it is Mondrian or OH (olap_printer) that causes the problems. I think it would be good if we could run tests and profilers on both sides of the Atlantic. Probably best to test with Mondrian, OH, and MySQL on three different servers.

Philippe Boucher (boucherp) wrote :

Adding stack traces for all threads when the system runs out of memory, as well as sequential histograms of the heap over the run time of the performance test. (Histogram sequencing was done manually while I was doing other things, so the intervals are not constant, but they do reveal certain trends.)
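For what it's worth, the manual histogram capture could be scripted along these lines. This is a sketch assuming a Sun JDK with jmap on the PATH; the function name, count, and interval are my own choices:

```shell
#!/bin/sh
# Sketch: capture COUNT heap histograms of a running JVM as hist0.txt, hist1.txt, ...
# mirroring the naming used in the attachment. Assumes jmap is available.
snapshot_heap() {
  pid=$1; count=$2; interval=$3
  i=0
  while [ "$i" -lt "$count" ]; do
    jmap -histo "$pid" > "hist$i.txt"   # per-class instance counts and bytes
    i=$((i + 1))
    sleep "$interval"
  done
}

# e.g.: snapshot_heap <tomcat-pid> 20 60   # 20 snapshots, one per minute
```

With constant intervals, the growth of mondrian.rolap.RolapResult$CellInfo between snapshots would be directly comparable.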

You can see that the instance count and allocated memory for mondrian.rolap.RolapResult$CellInfo increase "linearly" over time, along with a few other bits that are probably related. My guess is that this is leaking somehow; however, we're not set up here to debug this correctly, not to mention we don't know this code. Can someone at T4Bi take a look? I'm getting a little tired of doing diagnostics on a black box with the cybernetic equivalent of a Ouija board as my tool set.

thanks,

Philippe Boucher (boucherp) wrote :

One additional set of comments:
We did a similar test on the WHOSIS production system and that worked fine. We then used the same Mondrian parameters as WHOSIS (provided by Jiri) on the GHO and reran the test; performance was pretty bad. Since caching was turned off, the database was being clobbered, and the system still ran out of memory after about 350 queries.

Jiri Dvorak (jiri-dvorak) wrote :

Philippe, thanks for collecting the stack trace and histogram files; after reviewing them, I am even more convinced that this is indeed a Mondrian / JPivot capacity and tuning problem.

All stack trace breakpoints are either in Apache Tomcat code (typically a TCP/IP socket wait or socket read) or, in one thread, a situation 30 levels deep inside Mondrian and JPivot code, where Mondrian is trying to allocate more memory for itself.

The histograms all follow the same pattern (I went through half of them); e.g. in hist0.txt, the first time any "intl.who.*" class shows up is in position 468 (less than 3 kbyte), while there are literally hundreds of Mondrian and JPivot classes using megabytes of data above that.

I very much sympathize with the frustration with "doing diagnostics on a black box with the cybernetic equivalent of a Ouija board", and to be totally honest, I have experienced a similar level of frustration during most of my previous work with Mondrian - it is a great tool in many respects, but it is not transparent, manageable or easily tuneable. On the WHOSIS project, we tried to open and dissect it a little bit, to avoid the black-box style of tuning ... and that created as many problems as it solved, especially for configuration, deployment and upgrades.

Back in 2008, we had a high-level visit from Pentaho, and a teleconference with one of their chief gurus (Philippe Veltsos has all the details), and we have discussed the opportunities for a closer cooperation with Pentaho, where they would give us more access to the technology and perhaps provide a fixed-budget technical support, while we could concentrate on solving the business problems and application functionality. Perhaps this offer is still on the table, and maybe it would be in everybody's interests to follow up on that.

A couple of other comments:

(1) The data in the histograms point to some *REALLY* large data sets. I know that OLAP systems deal with lots of data, but even by that standard ... e.g. hist0.txt shows 12.5 MB of ROLAP cell data (entry #3). Do we really need to be producing data sets of these sizes in the 1st production release of GHO?

(2) Given that the problem always occurs around the 350th statement: what EXACTLY is happening in that part of the script? Can we tune or restrict the MDX to produce more "reasonable" data sets, at least temporarily as part of the testing and tuning process?

(3) Let me reiterate that in previous OLAP testing, we had these kinds of problems when there were problems with cube definitions, MDX expressions or underlying data: orphaned values, algorithms spinning on recursive values in dimensions, etc. Can we perhaps try to isolate the MDX statement(s) that cause the problem, or are "very close" to it?

Jiri Dvorak (jiri-dvorak) wrote :

One thing that I should also mention: during testing, we should make sure that in the file .../resources/log4j.properties, we maintain the setting:

log4j.logger.mondrian.mdx=debug

Then all MDX expressions should be visible in the Tomcat log; and we should be looking for commonalities among the last MDX statements executed before the system runs out of memory.
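With that logger enabled, the tail of the log can be inspected with something like the following. This is a sketch; the function name is mine, and the log file name passed in is an assumption (on a default Tomcat it may be catalina.out or a dated log):

```shell
# Pull the last MDX statements logged before the crash, to look for
# commonalities among them.
last_mdx() {
  logfile=$1; n=${2:-20}
  grep 'mondrian.mdx' "$logfile" | tail -n "$n"
}

# e.g.: last_mdx catalina.out 20
```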

Philippe Boucher (boucherp) wrote :

The queries I'm using for testing are based on the MDG data set, which is not very large. All I'm doing with the test script is looping through a set of 22 queries over and over again; the queries were extracted from the menu of MDG indicators. Essentially the script simulates a user who clicks on every indicator entry in the MDG menu, waits for the page to render, then moves on to the next indicator; when it's gone through the menu, it restarts from the beginning. Nothing complicated, and essentially after the first pass, everything should pop right out of the cache.
In the script, the 350th statement is nothing special; it's just an MDX query from the MDG menu that has by that time been made about 110 times. If I change the memory parameters, the error will consistently occur earlier (if I lower the available memory) or later (if I raise it).
Take a look at perf.sh in the attached ZIP file containing the performance test. It's a really simple five-line script that calls wget against http://dkjhdkjhkdjhf/gho/OlapPrint, POSTing the MDX query (each query is stored in an individual file).
I've also tried to run this test using only one simple MDX query for the loop: exactly the same results.

Philippe Boucher (boucherp) wrote :

More information: if we run the test against the multi-war-file version of the GHO, where Mondrian is a separate war file, we get exactly the same results (it does a few hundred requests, gets bigger and bigger, then dies). The test scripts still go to gho/OlapPrint in this case. Now, if I bypass OlapPrint and go straight to Mondrian, POSTing EXACTLY the same query but now to mondrian3/mdxquery, we observe the following:
a) dramatically improved response time (queries, once cached, return a response in 0.1 s on average); we ran over 8000 queries in about 15-20 minutes
b) the JVM memory footprint is very stable (it climbed to 178 MB within the first 2000 queries, then stayed there for the next 8000 queries in our test)
The cube definition and the Mondrian properties used were copied from the current single-war-file version of the GHO that we're working from, so that we can rule out configuration differences.
Methinks this rules out Mondrian itself and points to either the OH code or JPivot.
If you want to replicate the test, just use the contents of the attached zip file and change the URL to mondrian3/mdxquery instead of ghodata/OlapPrint, and in the post-*.txt files, change mdx= to queryString=
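If it helps, that post-file conversion can be done mechanically. This is a sketch; the input file naming follows the attachment, but the function name and the "mondrian-" output prefix are my own:

```shell
# Rewrite the captured POST bodies so they can target mondrian3/mdxquery,
# which takes queryString= rather than mdx=.
convert_posts() {
  for f in "$@"; do
    sed 's/^mdx=/queryString=/' "$f" > "mondrian-$f"
  done
}

# e.g.: convert_posts post-*.txt
```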

Jiri Dvorak (jiri-dvorak) wrote :

Just a few paranoia checks:

(1) Are we 100% sure that all code and all cube definitions (in GlobalHealthObservatory.xml) currently in StarTeam are exactly in sync with the code and definitions used for the WHO performance stress test?

(2) We most likely won't have up-to-date data in our MySQL database ... can a complete database backup be placed in StarTeam (or Central Desktop, or Launchpad)?

Thx

Jiri Dvorak (jiri-dvorak) wrote :

BTW, regarding the MySQL backup (dump) ... all we really care about are the MDG data (fact table, dimensions, and whatever is needed to resolve their foreign keys). So leaving the other fact tables out (or empty) to reduce the file size is perfectly OK ... whatever is easiest.

Knut Staring (knutst) wrote :

Please find a 7zip of the SQL attached.

Knut Staring (knutst) wrote :

This is the cube definition we are testing with.

Dusan Strnal (dusan-strnal) wrote :

I was able to reproduce this issue on our t4bi server and I've made some adjustments in the code.

Please refresh from Starteam and let me know if the issue is still occurring.

Philippe Boucher (boucherp) wrote :

OK, I'll check it out as soon as I'm back in the office (currently sick at home with something flu-like).
thanks, P

Philippe Boucher (boucherp) wrote :

Still sick at home, so I gathered all the pieces to run the GHO here. I've checked out who_gho, assembled the war file locally and installed it. When I connect to the GHO and try to access any table, I get this error displayed instead of the table:

HTTP Status 500 -
type Exception report
message
description The server encountered an internal error () that prevented it from fulfilling this request.
exception
javax.servlet.ServletException: Servlet.init() for servlet Jersey REST threw exception
<<<SNIP>>>
root cause
com.sun.jersey.api.container.ContainerException: The ResourceConfig instance does not contain any root resource classes.
<<<SNIP>>>

The thing that's causing this is in the logs:

Dec 2, 2009 10:44:18 PM org.apache.catalina.core.ApplicationContext log
INFO: Initializing Spring root WebApplicationContext
Dec 2, 2009 10:44:18 PM org.apache.catalina.core.ApplicationContext log
SEVERE: StandardWrapper.Throwable
com.sun.jersey.api.container.ContainerException: The ResourceConfig instance does not contain any root resource classes.
        at com.sun.jersey.server.impl.application.WebApplicationImpl.processRootResources(WebApplicationImpl.java:613)
        at com.sun.jersey.server.impl.application.WebApplicationImpl.initiate(WebApplicationImpl.java:492)
        at com.sun.jersey.spi.spring.container.servlet.SpringServlet.initiate(SpringServlet.java:80)
        at com.sun.jersey.spi.container.servlet.ServletContainer.load(ServletContainer.java:540)
        at com.sun.jersey.spi.container.servlet.ServletContainer.init(ServletContainer.java:207)
        at org.apache.catalina.core.StandardWrapper.loadServlet(StandardWrapper.java:1172)
        at org.apache.catalina.core.StandardWrapper.load(StandardWrapper.java:992)
        at org.apache.catalina.core.StandardContext.loadOnStartup(StandardContext.java:4058)
        at org.apache.catalina.core.StandardContext.start(StandardContext.java:4371)
        at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791)
        at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771)
        at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:525)
        at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:627)
        at org.apache.catalina.startup.HostConfig.deployDescriptors(HostConfig.java:553)
        at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:488)
<<<SNIP>>>

I'm presuming that there's a config parameter missing in WEB-INF/web.xml? I don't know how this stuff works, so I don't know what should or shouldn't be there.

My setup here is Windows XP SP3, Tomcat 6.0.18, MySQL 5.0.45 (also running on the same Windows box) and JDK 6.0.13. I checked out a fresh tree from StarTeam at about 20:00 UTC today, but only the who_gho part, and I'm using the database restore script attached to this bug.

Dusan Strnal (dusan-strnal) wrote :

Reading through the previous comment, I do not see whether the who_oh source was checked out before building the war file. The error indicates that there is no servlet definition. That is defined in Java classes that are scanned during startup; the classes are located in the who_oh source.

Could you let me know if that's the case here?

Philippe Boucher (boucherp) wrote :

ah... I didn't check out the who_oh source; I thought anything that was used for the deployment was already copied into who_gho. I'll check out who_oh and let you know if that fixes it (I have a slow link at home, hence the laziness :-)

Philippe Boucher (boucherp) wrote :

Checked out the who_oh source and rebuilt the war file: same effect. Do I need to do something else with the who_oh source other than run the build_gho_xxxxx.bat / build_gho.sh files? I'm assuming that the gho.jar file that's in who_gho is built against the latest in who_oh?

Dusan Strnal (dusan-strnal) wrote :

Sorry, I did not realize that you do not recompile who_oh. Yes, gho.jar has been compiled against who_oh; however, I am able to build "gho" from the script build_gho_xxx.sh and run it under Tomcat without any issues.

I have recompiled who_gho/deployements/lib/gho.jar, just to make sure that gho.jar is OK. Please refresh from StarTeam.

Could you check a couple of things in your environment before you refresh who_gho:

-- what is the timestamp on your who_gho/deployements/lib/gho.jar?
-- can you post the output of the build_gho..xxxx.bat script?

Dusan Strnal (dusan-strnal) wrote :

Also, could you post the web.xml? Or send me the file at <email address hidden>? Thanks.

Philippe Boucher (boucherp) wrote :

Attaching the output of the build_gho_XXX.bat script (echo was on for the batch file). The timestamp on my copy of gho.jar is the local system time when I checked it out, Dec 2, 2009, 23:58, but I think I recall seeing a date of November 29, 2009 in StarTeam (I don't remember the time).

Philippe Boucher (boucherp) wrote :

I see there's a missing applicationContext.xml in the build log. I checked this further and saw that there seems to be some sort of error with my checkout of who_gho under /war/WEB-INF: no classes subdirectory. StarTeam shows an attention icon on the directory but then doesn't seem to think that anything is missing. I've deleted that part of my local tree and redone the checkout, but that gives me the same result; if I manually create the directory, StarTeam still doesn't tell me that anything should be in there.

Dusan Strnal (dusan-strnal) wrote :

Your web.xml looks OK to me, and the timestamp on gho.jar is correct. The compilation from Nov 29 should have included all the changes.

However, when I look at the build_gho_XXX.bat script output:

_CONF_WAR_DIR is never set; it should be something like: set _CONF_WAR_DIR=$DEV_HOME/who_gho/ghodata

Can you confirm that this is set in build_gho_XXX.bat? If it's not, it might be the reason why applicationContext.xml and GlobalHealthObservatory.xml are missing.

Regarding the missing "classes" directory: can you try to right-click on war/WEB-INF/classes and then choose "Create working folders" in the StarTeam client? If there is an attention icon, usually the working directory is not created. The war/WEB-INF/classes directory should be empty when you check out from StarTeam; all compiled Java classes are in the gho.jar file. Only files from who_gho/resources get copied to war/WEB-INF/classes.

Philippe Boucher (boucherp) wrote :

The build_gho_sample.bat script doesn't set _CONF_WAR_DIR; the build_gho.sh script sets it explicitly (it would override anything set in the preceding batch file).
I manually created the directory who_gho/war/WEB-INF/classes and reran the build, and now it populated the WEB-INF/classes directory. The war file seems to be starting correctly (I get data when I select a table, and there are no errors in the log), so I guess that missing directory was the root problem. I'll make a note of this in our docs in case we get this problem again. Is there some setting in StarTeam that would make it not create an empty directory? I did a few fresh checkouts on my machine at WHO and didn't run into this. Anyway, thanks for helping me out on this.

Dusan Strnal (dusan-strnal) wrote :

I am not aware of any particular setting that would not create an empty directory if it's in the repository.

The StarTeam client loses ("gets out of sync") the statuses of its files/dirs whenever files or folders get moved or changed outside the StarTeam client. The client has its own "cache" that can be purged and recreated via "Update Status" (File -> Update Status) in the client. To purge the cache/local repository, go to Tools -> Personal Options and look for the "Repository" section, which should have a "Purge" button.

Philippe Boucher (boucherp) wrote :

I've rerun the attached performance test script several times against my instance. The GHO is now responding consistently and its memory footprint appears to be very stable (131/132 MB for the last 2000 queries). I'll do a few more thorough tests on Monday, but I can confirm that the fix works for our current test scenario. Thanks, all.

Philippe Boucher (boucherp) wrote :

The MDG performance test returns an average response time of 0.447 seconds per request with a deviation of 0.072. This implies 193288 requests over a 24-hour period, which exceeds the baseline performance number of 50000 requests per 24-hour period that we're aiming for. This test was done with the GHO webapp running on a 1 GB virtual machine, with a separate machine running the test script. The VM is running both the Tomcat server and the MySQL database. So far so good.
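As a sanity check on that arithmetic (the script fires requests sequentially, one at a time), 86400 seconds divided by the 0.447 s mean response time gives the figure quoted:

```shell
# 24 h of back-to-back sequential requests at the measured mean response time.
awk 'BEGIN { printf "%d\n", 24 * 3600 / 0.447 }'   # prints 193288
```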

Philippe Boucher (boucherp) wrote :

Not going to do further performance testing for the moment; setting to Fix Released.

Changed in gho:
status: In Progress → Fix Released