nhlibi seems to be using 2 db connections per worker

Bug #1068022 reported by Muharem Hrnjadovic
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
OpenQuake (deprecated) | Won't Fix | High | Muharem Hrnjadovic |

Bug Description

<paul_henshaw> al-maisan: regarding postgres max_connections- how many connections are used per worker process right now?
<al-maisan> IIRC only 1
<al-maisan> paul_henshaw: on gemcontrol we have "max_connections = 256" for approx. 190 workers
<paul_henshaw> I think this changed with nhlib however
<al-maisan> Oh?
<paul_henshaw> I am pretty sure we are using more than 1 per process
<al-maisan> is it using a connection as well?
<paul_henshaw> On ktulu I am using 3 worker processes per node in order to avoid too many connections
<al-maisan> paul_henshaw: but bear in mind that not all the hazard calculators use the nhlib
<paul_henshaw> Indeed
<paul_henshaw> I am also pretty sure that nhlib itself does not use the DB
<al-maisan> interesting
<al-maisan> so, why are we using more than 1 connection per process then?
<paul_henshaw> I don't remember
* al-maisan looks at the MF cluster
<paul_henshaw> but I have 30 worker nodes, 3 processes per node and max_connections = 200
<paul_henshaw> Which works
<paul_henshaw> but with 4 processes per node, I exceed the 200 limit
<paul_henshaw> =>
<paul_henshaw> at least 2 connections per process
<al-maisan> $ ssh <email address hidden> lsof -i tcp | grep postg | wc -l
<al-maisan> 188
<al-maisan> it looks like 1 pg connection per worker on the MF cluster
<paul_henshaw> al-maisan: but this is still running the master OQ engine
<al-maisan> yes
<al-maisan> you are running the nhlibi branch?
<paul_henshaw> I need to configure a server for use with both
<paul_henshaw> Sometimes we use master
<paul_henshaw> sometimes nhlib
<paul_henshaw> Running from git
<paul_henshaw> and using a different DB
<al-maisan> paul_henshaw: did you hit the limit when using the nhlibi branch?
<paul_henshaw> Exactly
<al-maisan> so, the increase in pg connections is particular to nhlibi
<al-maisan> we need to find why that is and fix it if at all possible
<al-maisan> IIRC the nhlib does not go anywhere near the pg database
<al-maisan> that's why it introduced its own domain classes
<paul_henshaw> Yes, I am sure you are right - the nhlib itself does not use the DB directly

Changed in openquake:
importance: Undecided → High
milestone: none → 0.8.4
assignee: nobody → Lars Butler (lars-butler)
Muharem Hrnjadovic (al-maisan) wrote :

We need to confirm whether the nhlibi code indeed uses 2 db connections per worker. If yes, this should be fixed.

tags: added: nhlibi
tags: added: database
Changed in openquake:
assignee: Lars Butler (lars-butler) → Muharem Hrnjadovic (al-maisan)
Changed in openquake:
status: New → In Progress
Lars Butler (lars-butler) wrote :

We had a discussion about this a few months back in IRC. Here's an excerpt from the log:

 http://pastebin.ubuntu.com/1297138/

It is expected for each worker _process_ to maintain 2 DB connections: 1 for job_init and 1 for reslt_writer.

The connection for `reslt_writer` is used for writing calculation results directly to the DB. The connection for `job_init` is used to check job status prior to task execution.
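
As a sanity check, the figures quoted in the bug description (30 worker nodes, a limit of max_connections = 200, 3 processes per node working and 4 processes per node failing) can be tested against different per-process connection counts. The sketch below is only illustrative: the function names are made up for this example, and it assumes that the worker processes are the only clients of the database, which may not hold on a real cluster.

```python
# Figures taken from the IRC log in the bug description.
NODES = 30
MAX_CONNECTIONS = 200


def total_connections(nodes, procs_per_node, conns_per_proc):
    """Connections opened if every worker process holds conns_per_proc."""
    return nodes * procs_per_node * conns_per_proc


def consistent_with_report(conns_per_proc):
    """True if 3 procs/node fits under the limit and 4 procs/node does not."""
    works_with_3 = total_connections(NODES, 3, conns_per_proc) <= MAX_CONNECTIONS
    fails_with_4 = total_connections(NODES, 4, conns_per_proc) > MAX_CONNECTIONS
    return works_with_3 and fails_with_4


# Assumption: only worker processes connect to the database.
candidates = [c for c in range(1, 5) if consistent_with_report(c)]
print(candidates)  # [2]
```

Under that assumption, 2 connections per process is the only whole number that matches both observations: 30 * 3 * 2 = 180 fits under 200, while 30 * 4 * 2 = 240 does not.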

Lars Butler (lars-butler) wrote :

Something we could consider to "bullet-proof" oq-engine:

- When a job runs, query the postgres server for the max number of available connections (see http://stackoverflow.com/questions/8288823/query-a-parameter-postgresql-conf-setting-like-max-connections)
- Use the Celery Inspect API to check the total number of worker processes
- If the number of worker procs * 2 > max number of DB connections, abort the job and display an error.
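
The steps above could be sketched roughly as follows. This is not oq-engine code: `count_worker_processes`, `check_connection_budget` and `ConnectionBudgetError` are hypothetical names. The sketch assumes the worker statistics have the shape returned by Celery's `app.control.inspect().stats()` for a prefork pool (a dict keyed by worker name, each with a `pool` entry listing its `processes`), and that the connection limit has already been read from postgres, for example with `SHOW max_connections`. Both values are passed in as plain data so that the check itself needs neither a broker nor a database.

```python
class ConnectionBudgetError(RuntimeError):
    """Raised when the workers would need more DB connections than allowed."""


def count_worker_processes(stats):
    """Sum the pool processes over all workers.

    `stats` is assumed to look like the result of Celery's
    app.control.inspect().stats(); it may be None if no worker replied.
    """
    total = 0
    for worker_stats in (stats or {}).values():
        pool = worker_stats.get("pool", {})
        total += len(pool.get("processes", []))
    return total


def check_connection_budget(worker_procs, max_connections, conns_per_proc=2):
    """Return the number of connections needed, or raise if over the limit.

    conns_per_proc defaults to 2 (one for job_init, one for reslt_writer).
    """
    needed = worker_procs * conns_per_proc
    if needed > max_connections:
        raise ConnectionBudgetError(
            "%d worker processes need %d DB connections, but "
            "max_connections is %d" % (worker_procs, needed, max_connections)
        )
    return needed


# Example with made-up worker names and PIDs: 2 workers with 3 processes each.
example_stats = {
    "worker-a": {"pool": {"processes": [101, 102, 103]}},
    "worker-b": {"pool": {"processes": [201, 202, 203]}},
}
procs = count_worker_processes(example_stats)
print(check_connection_budget(procs, max_connections=200))  # 12
```

A real check would probably also want to leave headroom for other clients of the database and for connections that postgres reserves for superusers, which this sketch ignores.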

Muharem Hrnjadovic (al-maisan) wrote :

Why do we have only one connection per worker in the master branch and two in the nhlib-integration branch? Because the celery workers in the latter branch calculate results *and* write them to the postgres database whereas the workers in the master branch only write results to redis.

Changed in openquake:
status: In Progress → Won't Fix