Disaggregation Failure GEM Cluster

Bug #1180274 reported by Laurentiu D.
Affects: OpenQuake Engine
Status: Fix Released
Importance: High
Assigned to: Michele Simionato

Bug Description

The following errors occurred during a disaggregation computation in the GEM cluster:
http://pastie.org/7910299
This is a disaggregation computation for 10 sites and the entire SHARE Area Source Model.

description: updated
Changed in oq-engine:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Lars Butler (lars-butler)
milestone: none → 1.0.0
Lars Butler (lars-butler) wrote :

This failure seems to be due to the following reasons:

- PostgreSQL's `shared_buffers` was set to 1024MB
- Each disaggregation matrix INSERT is ~64MB (in this particular calculation)
- Since each query consumes about 2-3x that, we could easily max out the buffer if 5-6 workers are inserting simultaneously.

We're going to experiment with the following:
- Increase `shared_buffers` to 12288MB
- Increase `checkpoint_segments` to 64 (from the default of 3). The postgres documentation [1] recommends increasing this in tandem with any increase to `shared_buffers`. Also, we're getting a LOT of checkpoint messages in the postgres log (see below); this should help silence those messages.

LOG: checkpoints are occurring too frequently (26 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".

[1] - http://www.postgresql.org/docs/9.1/static/runtime-config-resource.html
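
To confirm that the new values are actually in effect on the cluster after the reload, a quick check like the following can help. This is only a sketch: it assumes psycopg2 is available and that the database name/credentials used below (which are placeholders) are reachable from wherever it runs.

```python
# Minimal sanity check that the tuned settings are active.
# Sketch only: the connection parameters are placeholders, not the cluster's real ones.
import psycopg2

conn = psycopg2.connect(dbname='openquake')
cur = conn.cursor()
for param in ('shared_buffers', 'checkpoint_segments'):
    cur.execute('SHOW ' + param)  # SHOW takes an identifier, not a bind parameter
    print('%s = %s' % (param, cur.fetchone()[0]))
cur.close()
conn.close()
```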

Lars Butler (lars-butler) wrote :

Update:

The configuration changes above did not fix the problem; we're still getting the exact same error: invalid memory alloc request size 1073741824.

I will note that this happens when we try to insert large disagg matrix blobs into the hzrdr.disagg_result table, but it doesn't happen every time. After the job was aborted, I noticed that 312 records had been inserted into this table, so this isn't a consistent failure.

I'm still looking into other options for addressing this issue. I'll post updates as soon as I have them.

Lars Butler (lars-butler) wrote :

I found the root cause of this error:

We're using a BYTEA field to store the disagg matrices (as pickled numpy arrays), and we're hitting a hard-coded limit on the field size in Postgres. The maximum allowed size for a BYTEA value is 1GB, and we encounter this error when we try to execute an INSERT statement containing ~530MB of data. It seems a 2-3x multiplier is required for parsing and processing the data, which is why we hit the limit.

This thread describes a similar problem: http://www.postgresql.org/message-id/20121125171904.GA4226@89-149-202-102.internetserviceteam.com

One possible solution (as the thread suggests) is to break up the binary blob into multiple chunks and store them in separate records.
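
For reference, a minimal sketch of that chunking idea follows. This is not the engine's code: the chunk table, column names and chunk size below are assumptions, and it only illustrates splitting the pickled blob across several BYTEA rows so that no single INSERT approaches the 1GB limit.

```python
# Sketch of chunked BYTEA storage. The table hzrdr.disagg_result_chunk and its
# columns (result_id, block_order, data) are hypothetical, not the actual schema.
import pickle
import psycopg2

CHUNK_SIZE = 64 * 1024 * 1024  # 64MB per row, far below the 1GB BYTEA limit


def save_matrix_in_chunks(conn, result_id, matrix):
    """Pickle the numpy matrix and store it as several small BYTEA rows."""
    blob = pickle.dumps(matrix, pickle.HIGHEST_PROTOCOL)
    cur = conn.cursor()
    for order, offset in enumerate(range(0, len(blob), CHUNK_SIZE)):
        cur.execute(
            "INSERT INTO hzrdr.disagg_result_chunk (result_id, block_order, data) "
            "VALUES (%s, %s, %s)",
            (result_id, order,
             psycopg2.Binary(blob[offset:offset + CHUNK_SIZE])))
    conn.commit()


def load_matrix_from_chunks(conn, result_id):
    """Reassemble the chunks in order and unpickle the matrix."""
    cur = conn.cursor()
    cur.execute(
        "SELECT data FROM hzrdr.disagg_result_chunk "
        "WHERE result_id = %s ORDER BY block_order", (result_id,))
    return pickle.loads(b''.join(bytes(row[0]) for row in cur.fetchall()))
```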

matley (matley)
Changed in oq-engine:
milestone: 1.0.0 → 1.0.1
Changed in oq-engine:
assignee: Lars Butler (lars-butler) → nobody
Changed in oq-engine:
status: Confirmed → In Progress
assignee: nobody → Daniele Viganò (daniele-vigano)
assignee: Daniele Viganò (daniele-vigano) → nobody
assignee: nobody → Michele Simionato (michele-simionato)
Michele Simionato (michele-simionato) wrote :

This should not appear anymore. After the change in https://github.com/gem/oq-engine/pull/1390 only the relevant portions of the disaggregation matrix are stored in the database, which means only a few kilobytes, well under the 1GB BYTEA limit.
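
For illustration only, the idea is roughly the following: instead of pickling the full 6-dimensional matrix, collapse it down to the requested projections and store only those. The axis order and the 1 - prod(1 - poe) aggregation in the sketch below are assumptions about how such an extraction typically looks for a matrix of bin PoEs, not the code of PR #1390.

```python
# Rough sketch of extracting only the small projections of the full
# disaggregation matrix (assumed axis order; not the code of PR #1390).
import numpy

AXES = ('mag', 'dist', 'lon', 'lat', 'eps', 'trt')


def extract_pmfs(matrix, kinds=('Mag', 'Dist', 'Mag_Dist', 'TRT')):
    """Collapse the 6D matrix of bin PoEs into the requested projections."""
    out = {}
    for kind in kinds:
        keep = set(part.lower() for part in kind.split('_'))
        drop = tuple(i for i, name in enumerate(AXES) if name not in keep)
        # probability that at least one of the collapsed bins contributes
        out[kind] = 1. - numpy.prod(1. - matrix, axis=drop)
    return out
```

A Mag_Dist projection over a few dozen magnitude and distance bins is only hundreds of floats, i.e. kilobytes, instead of the hundreds of megabytes of the full matrix.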

Changed in oq-engine:
status: In Progress → Fix Committed
Changed in oq-engine:
status: Fix Committed → Fix Released