Disaggregation Failure GEM Cluster

Bug #1180274 reported by Laurentiu D.
Affects: OpenQuake Engine
Status: Fix Released
Importance: High
Assigned to: Michele Simionato

Bug Description

The following errors occurred during a disaggregation computation in the GEM cluster:
http://pastie.org/7910299
This is a disaggregation computation for 10 sites and the entire SHARE Area Source Model.

description: updated
Changed in oq-engine:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Lars Butler (lars-butler)
milestone: none → 1.0.0
Lars Butler (lars-butler) wrote :

This failure seems to be due to the following reasons:

- PostgreSQL's `shared_buffers` was set to 1024MB
- Each disaggregation matrix INSERT is ~64MB (in this particular calculation)
- Since each query consumes about 2-3x that, we could easily max out the buffer if 5-6 workers are inserting simultaneously.

We're going to experiment with the following:
- Increase `shared_buffers` to 12288MB
- Increase `checkpoint_segments` to 64 (from the default of 3). The postgres documentation [1] recommends increasing this in tandem with any increase to `shared_buffers`. Also, we're getting a LOT of checkpoint messages in the postgres log (see below); this should help silence those messages.

LOG: checkpoints are occurring too frequently (26 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".

[1] - http://www.postgresql.org/docs/9.1/static/runtime-config-resource.html
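
To confirm that the new values are actually in effect on the cluster after the reload, a quick check like the following can help. This is only a sketch: it assumes psycopg2 is available and that the database name/credentials used below (which are placeholders) are reachable from wherever it runs.

```python
# Minimal sanity check that the tuned settings are active.
# Sketch only: the connection parameters are placeholders, not the cluster's real ones.
import psycopg2

conn = psycopg2.connect(dbname='openquake')
cur = conn.cursor()
for param in ('shared_buffers', 'checkpoint_segments'):
    cur.execute('SHOW ' + param)  # SHOW takes an identifier, not a bind parameter
    print('%s = %s' % (param, cur.fetchone()[0]))
cur.close()
conn.close()
```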

Lars Butler (lars-butler) wrote :

Update:

The configuration changes above did not fix the problem; we're still getting the exact same error: invalid memory alloc request size 1073741824.

I will note that this happens when we try to insert large disagg matrix blobs into the hzrdr.disagg_result table, but it doesn't happen every time. After the job was aborted, I noticed that 312 records had been inserted into this table, so this isn't a consistent failure.

I'm still looking into other options for addressing this issue. I'll post updates as soon as I have them.

Lars Butler (lars-butler) wrote :

I found the root cause of this error:

We're using a BYTEA field to store the disagg matrices (as pickled numpy arrays), and we're hitting a hard-coded limit on the field size in Postgres. The maximum allowed size for a BYTEA value is 1GB, and we encounter this error when we try to execute an INSERT statement containing ~530MB of data. It seems a 2-3x multiplier is required for parsing and processing the data, which is why we hit the limit.

This thread describes a similar problem: http://www.postgresql.org/message-id/20121125171904.GA4226@89-149-202-102.internetserviceteam.com

One possible solution (as the thread suggests) is to break up the binary blob into multiple chunks and store them in separate records.
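
For reference, a minimal sketch of that chunking idea follows. This is not the engine's code: the chunk table, column names and chunk size below are assumptions, and it only illustrates splitting the pickled blob across several BYTEA rows so that no single INSERT approaches the 1GB limit.

```python
# Sketch of chunked BYTEA storage. The table hzrdr.disagg_result_chunk and its
# columns (result_id, block_order, data) are hypothetical, not the actual schema.
import pickle
import psycopg2

CHUNK_SIZE = 64 * 1024 * 1024  # 64MB per row, far below the 1GB BYTEA limit


def save_matrix_in_chunks(conn, result_id, matrix):
    """Pickle the numpy matrix and store it as several small BYTEA rows."""
    blob = pickle.dumps(matrix, pickle.HIGHEST_PROTOCOL)
    cur = conn.cursor()
    for order, offset in enumerate(range(0, len(blob), CHUNK_SIZE)):
        cur.execute(
            "INSERT INTO hzrdr.disagg_result_chunk (result_id, block_order, data) "
            "VALUES (%s, %s, %s)",
            (result_id, order,
             psycopg2.Binary(blob[offset:offset + CHUNK_SIZE])))
    conn.commit()


def load_matrix_from_chunks(conn, result_id):
    """Reassemble the chunks in order and unpickle the matrix."""
    cur = conn.cursor()
    cur.execute(
        "SELECT data FROM hzrdr.disagg_result_chunk "
        "WHERE result_id = %s ORDER BY block_order", (result_id,))
    return pickle.loads(b''.join(bytes(row[0]) for row in cur.fetchall()))
```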

matley (matley)
Changed in oq-engine:
milestone: 1.0.0 → 1.0.1
Changed in oq-engine:
assignee: Lars Butler (lars-butler) → nobody
Changed in oq-engine:
status: Confirmed → In Progress
assignee: nobody → Daniele Viganò (daniele-vigano)
assignee: Daniele Viganò (daniele-vigano) → nobody
assignee: nobody → Michele Simionato (michele-simionato)
Michele Simionato (michele-simionato) wrote :

This should not appear anymore. After the change in https://github.com/gem/oq-engine/pull/1390 only the relevant portions of the disaggregation matrix are stored in the database, which means only a few kilobytes, well under the 1GB BYTEA limit.
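
For illustration only, the idea is roughly the following: instead of pickling the full 6-dimensional matrix, collapse it down to the requested projections and store only those. The axis order and the 1 - prod(1 - poe) aggregation in the sketch below are assumptions about how such an extraction typically looks for a matrix of bin PoEs, not the code of PR #1390.

```python
# Rough sketch of extracting only the small projections of the full
# disaggregation matrix (assumed axis order; not the code of PR #1390).
import numpy

AXES = ('mag', 'dist', 'lon', 'lat', 'eps', 'trt')


def extract_pmfs(matrix, kinds=('Mag', 'Dist', 'Mag_Dist', 'TRT')):
    """Collapse the 6D matrix of bin PoEs into the requested projections."""
    out = {}
    for kind in kinds:
        keep = set(part.lower() for part in kind.split('_'))
        drop = tuple(i for i, name in enumerate(AXES) if name not in keep)
        # probability that at least one of the collapsed bins contributes
        out[kind] = 1. - numpy.prod(1. - matrix, axis=drop)
    return out
```

A Mag_Dist projection over a few dozen magnitude and distance bins is only hundreds of floats, i.e. kilobytes, instead of the hundreds of megabytes of the full matrix.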

Changed in oq-engine:
status: In Progress → Fix Committed
Changed in oq-engine:
status: Fix Committed → Fix Released