Design output/export format for PMFs

Bug #856256 reported by Lars Butler
This bug affects 1 person
Affects: OpenQuake (deprecated)
Status: Fix Released
Importance: High
Assigned to: Muharem Hrnjadovic

Bug Description

DRAFT

We need a way of serializing PMF output (from the disaggregation calculator) to a file. The data should be structured using a well-known markup language (such as XML or YAML).

description: updated
tags: added: disaggregation hazard
Changed in openquake:
status: New → Confirmed
milestone: none → 0.4.4
importance: Undecided → High
Muharem Hrnjadovic (al-maisan) wrote :

We ideally want a java library that supports operations on sparse matrices as well as the serialization of the latter. This is important for handing off the disaggregation result data from the Java to the python domain where it will be serialized to file/database.

Changed in openquake:
status: Confirmed → In Progress
assignee: nobody → Muharem Hrnjadovic (al-maisan)
Muharem Hrnjadovic (al-maisan) wrote :

= Introduction =

The matrices used by the disaggregation calculator may have 1-5 dimensions and up to 50,000 entries.

For the initial implementation we plan to use dense matrices ignoring the fact that many if not most of the values will be zero. If that proves too costly in terms of resource consumption we will have to explore the usage of sparse matrices.

The general problem with sparse matrices, however, is that the implementations of interest ([1], [2]) are 2-dimensional, whereas the disaggregation calculator needs to operate on matrices of up to 5 dimensions.
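
One possible way around the 2-dimensional limitation, sketched here purely as an illustration and not as the planned implementation, is to key a dictionary by the full index tuple so that only non-zero cells are stored. The class name `SparseNDMatrix` and its methods are hypothetical and do not come from [1] or [2].

```python
class SparseNDMatrix:
    """Hypothetical sparse matrix of arbitrary rank; stores non-zero cells only."""

    def __init__(self, shape):
        self.shape = tuple(shape)
        self._cells = {}  # index tuple -> non-zero value

    def _check(self, index):
        # Reject indices of the wrong rank or outside the declared shape.
        if len(index) != len(self.shape):
            raise IndexError("expected %d indices" % len(self.shape))
        for i, n in zip(index, self.shape):
            if not 0 <= i < n:
                raise IndexError("index %r out of bounds" % (index,))

    def __setitem__(self, index, value):
        self._check(index)
        if value == 0.0:
            # Writing a zero removes the cell so the matrix stays sparse.
            self._cells.pop(index, None)
        else:
            self._cells[index] = value

    def __getitem__(self, index):
        self._check(index)
        return self._cells.get(index, 0.0)

    def nonzero_count(self):
        return len(self._cells)


# A 5-dimensional matrix with a single non-zero entry.
m = SparseNDMatrix((5, 5, 5, 5, 5))
m[0, 1, 2, 3, 4] = 0.25
```

Memory use then scales with the number of non-zero entries rather than with the product of the dimension sizes, at the cost of slower element access than a dense array.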

The current thinking is that the disaggregation calculator will be implemented in Java due to the tight coupling with the OpenSHA functions it will need to utilise.
The outcome of the calculation will be the full 5-dimensional disaggregation matrix. In the worst case it will have 50,000 entries and a storage footprint of 1.5 MB raw or 1.0 MB compressed when written to disk as an hdf5 file [3].

According to Damiano Monelli, 100,000 calculations (100 sites with 1,000 logic tree samples each) is a realistic worst case/upper bound for a single disaggregation job.
The disk storage requirement for such a job would hence be approximately 100 GB (100,000 matrices at roughly 1.0 MB each when compressed).
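
The estimate can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the function name is made up for illustration, and it assumes 1 GB = 1000 MB and that every matrix hits the worst-case size quoted above.

```python
SITES = 100
LOGIC_TREE_SAMPLES = 1000
RAW_MB_PER_MATRIX = 1.5         # worst case, uncompressed
COMPRESSED_MB_PER_MATRIX = 1.0  # worst case, compressed


def job_storage_gb(mb_per_matrix, sites=SITES, samples=LOGIC_TREE_SAMPLES):
    """Worst-case disk usage in GB for one job (assumes 1 GB = 1000 MB)."""
    calculations = sites * samples  # one full matrix per site and sample
    return calculations * mb_per_matrix / 1000.0


raw_gb = job_storage_gb(RAW_MB_PER_MATRIX)
compressed_gb = job_storage_gb(COMPRESSED_MB_PER_MATRIX)
print("raw: %.0f GB, compressed: %.0f GB" % (raw_gb, compressed_gb))
```

Under these assumptions the 100 GB figure corresponds to compressed storage; uncompressed files would need about half as much again.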

In a subsequent step we would need to derive the disaggregation matrix subsets requested by the user, e.g.:
    - magnitude/distance (2 dimensions)
    - magnitude/distance/epsilon (3 dimensions)
    etc.
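
A minimal sketch of such a subset extraction follows. It assumes the requested subset is the marginal PMF, i.e. the full matrix summed over every dimension the user did not ask for; that interpretation, the function name `extract_subset` and the sparse dict representation are assumptions made for illustration. The example uses only the three dimensions named above, but the same code works for five.

```python
from collections import defaultdict

# Axis order of the example matrix (assumed for this sketch).
AXES = ("magnitude", "distance", "epsilon")


def extract_subset(cells, axes, keep):
    """Sum a sparse PMF {index tuple: probability} over all axes not in `keep`."""
    positions = [axes.index(name) for name in keep]
    subset = defaultdict(float)
    for index, probability in cells.items():
        # Project the full index onto the kept axes and accumulate.
        subset[tuple(index[p] for p in positions)] += probability
    return dict(subset)


# Toy 3-dimensional PMF with four non-zero bins.
full = {
    (0, 0, 0): 0.1,
    (0, 0, 1): 0.2,
    (1, 2, 0): 0.3,
    (1, 2, 1): 0.4,
}
mag_dist = extract_subset(full, AXES, ("magnitude", "distance"))
```

Here `mag_dist` has one entry per (magnitude, distance) bin pair, with the epsilon bins summed out.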

We plan to implement the overall disaggregation workflow using the "Pipes and Filters" design pattern ([5]), which allows for parallel extraction of the disaggregation matrix subsets.
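
The shape of such a pipeline could look roughly like the sketch below: one stage produces the full matrix, and several independent filters each derive one subset from it in parallel. Everything here is a stand-in chosen for illustration. A thread pool replaces the celery tasks, `compute_full_matrix` returns a hard-coded toy matrix instead of calling the Java calculator, and the filter names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_full_matrix(job):
    """Stage 1 (stand-in): return a sparse PMF as {index tuple: probability}."""
    return {(0, 0, 0): 0.25, (0, 1, 0): 0.25, (1, 1, 1): 0.5}


def make_subset_filter(keep_positions):
    """Build a filter that sums the matrix over all axes not in keep_positions."""
    def subset_filter(cells):
        out = {}
        for index, p in cells.items():
            key = tuple(index[i] for i in keep_positions)
            out[key] = out.get(key, 0.0) + p
        return out
    return subset_filter


def run_pipeline(job, filters):
    """Feed the output of stage 1 to every filter, running the filters in parallel."""
    full = compute_full_matrix(job)
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(f, full) for name, f in filters.items()}
        return {name: fut.result() for name, fut in futures.items()}


results = run_pipeline("job-1", {
    "axes_0_1": make_subset_filter((0, 1)),
    "axes_0": make_subset_filter((0,)),
})
```

Because each filter only reads the full matrix and writes its own result, the filters do not need to coordinate with one another, which is what makes the parallel extraction straightforward.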

= disaggregation matrix handling =

Disaggregation matrices will be computed by worker nodes and stored as hdf5 files. The latter need to be consumed on the control node (e.g. for the purpose of serialisation to NRML) as well as on *other* worker nodes (e.g. for the purpose of subsequent disaggregation matrix subset extraction).

The best way to access the hdf5 files across the OpenQuake cluster would be NFS.

Experiments with celery tasks that simply return the disaggregation matrix via RabbitMQ showed that this is not a viable approach. After a short while RabbitMQ consumed all available memory, began swapping to disk and got bogged down.

It is envisaged that the full 5-dimensional disaggregation matrix will be calculated by the disaggregation calculator implemented in Java and stored in hdf5 format using [7].

The subsequent extraction of desired disaggregation matrix subsets can be performed by pure Python celery tasks using [4].

= why hdf5? =

hdf5 seems to be a well established and mature format with libraries available in both Java and Python. These libraries are even packaged for Ubuntu already: as libjhdf5-java and python-tables respectively.
This relieves us from inventing our own matrix storage format and gives us the needed Java-Python interoperability.

= References =

[1] http://commons.apache.org/math/apidocs/org/apache/commons/math/linear/OpenMapRealMatrix.html
[2] http://docs.scipy.org/doc/scipy/reference/sparse.html
[3] http://www.hdfgroup.org/HDF5/
[4] http://www.pytables.org/moin
[...


Changed in openquake:
status: In Progress → Fix Committed
Changed in openquake:
status: Fix Committed → Fix Released