Locks in workers when using multiprocessing workers
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| OpenERP Connector | New | Undecided | Unassigned | |
Bug Description
Sometimes we observe stalled jobs when using the multiprocessing workers.
From what I could observe, the alive check cannot update the "date_alive" field because of a database lock. The workers are therefore considered dead after 5 minutes and should be deleted, but the delete query is blocked by the alive check, which is itself still stuck on its lock.
As a result, the jobs stay assigned to these workers and cannot be assigned to new workers.
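To make the interaction concrete, here is a rough sketch of the kind of statements that end up waiting on each other (illustrative SQL only; the exact statements emitted by the connector may differ, and the worker ids are taken from the output below):

```sql
-- Illustrative sketch only, not the connector's actual code.
-- Periodic alive check: each worker refreshes its heartbeat.
UPDATE queue_worker SET date_alive = now() WHERE id = 8873;

-- Cleanup pass: workers whose heartbeat is older than 5 minutes are deleted.
DELETE FROM queue_worker WHERE id IN (8873, 8874);

-- If another session already holds locks on these queue_worker rows and then
-- stays "<IDLE> in transaction", both statements above block behind it.
```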
Using this query on my database:
select bl.pid as blocked_pid, a.usename as blocked_user,
       kl.pid as blocking_pid, ka.usename as blocking_user, a.current_query as blocked_statement
from pg_catalog.pg_locks bl
join pg_catalog.pg_stat_activity a on bl.pid = a.procpid
join pg_catalog.pg_locks kl
     join pg_catalog.pg_stat_activity ka on kl.pid = ka.procpid
     on bl.transactionid = kl.transactionid and bl.pid != kl.pid
where not bl.granted;
I obtained:
 blocked_pid |   blocked_user    | blocking_pid |   blocking_user   |                blocked_statement
-------------+-------------------+--------------+-------------------+----------------------------------------------------
       23234 | openerp_prod_snip |        23291 | openerp_prod_snip | update queue_worker set "date_alive"
       23249 | openerp_prod_snip |        23347 | openerp_prod_snip | delete from queue_worker where id IN (8873, 8874)
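Side note: on PostgreSQL 9.2 and later, pg_stat_activity.procpid and current_query were renamed to pid and query, so the blocking query above would need the column names adjusted, roughly like this:

```sql
-- Same blocking-lock query, adjusted for PostgreSQL >= 9.2 column names.
select bl.pid as blocked_pid, a.usename as blocked_user,
       kl.pid as blocking_pid, ka.usename as blocking_user, a.query as blocked_statement
from pg_catalog.pg_locks bl
join pg_catalog.pg_stat_activity a on bl.pid = a.pid
join pg_catalog.pg_locks kl
     join pg_catalog.pg_stat_activity ka on kl.pid = ka.pid
     on bl.transactionid = kl.transactionid and bl.pid != kl.pid
where not bl.granted;
```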
And if I inspect the blocked and blocking pids:
SELECT datname, usename, procpid, client_addr, waiting, query_start, current_query
  FROM pg_stat_activity ...

 datname | usename | procpid | client_addr | waiting | query_start | current_query
---------+---------+---------+-------------+---------+-------------+---------------
 openerp_
 openerp_

SELECT datname, usename, procpid, client_addr, waiting, query_start, current_query
  FROM pg_stat_activity ...

 datname | usename | procpid | client_addr | waiting | query_start | current_query
---------+---------+---------+-------------+---------+-------------+---------------
 openerp_
 openerp_
The pid 23291 is the transaction blocking all the others, but it is <IDLE> in transaction.
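To see how long that session has been sitting idle in its open transaction, a query along these lines can help (pre-9.2 column names, matching this server; xact_start is when the open transaction began):

```sql
-- Sessions stuck "<IDLE> in transaction", oldest open transaction first.
SELECT procpid, usename, client_addr, xact_start, query_start
  FROM pg_stat_activity
 WHERE current_query = '<IDLE> in transaction'
 ORDER BY xact_start;
```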
Details of the locks held by pid 23291:
relname | locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted
-----
res_users_pkey | relation | 221638 | 221799 | | | | | | | | 16/3926746 | 23291 | AccessShareLock | t
...
res_groups_pkey | relation | 221638 | 221811 | | | | | | | | 16/3926746 | 23291 | AccessShareLock | t
...
name_uniq | relation | 221638 | 222018 | | | | | | | | 16/3926746 | 23291 | AccessShareLock | t
res_partner_pkey | relation | 221638 | 222050 | | | | | | | | 16/3926746 | 23291 | AccessShareLock | t
queue_job_pkey | relation | 221638 | 225820 | | | | | | | | 16/3926746 | 23291 | RowExclusiveLock | t
queue_job_pkey | relation | 221638 | 225820 | | | | | | | | 16/3926746 | 23291 | AccessShareLock | t
...
queue_worker | relation | 221638 | 225828 | | | | | | | | 16/3926746 | 23291 | RowShareLock | t
queue_worker | relation | 221638 | 225828 | | | | | | | | 16/3926746 | 23291 | AccessShareLock | t
queue_job | relation | 221638 | 225816 | | | | | | | | 16/3926746 | 23291 | RowExclusiveLock | t
queue_job | relation | 221638 | 225816 | | | | | | | | 16/3926746 | 23291 | RowShareLock | t
...
wkf_instance | relation | 221638 | 221895 | | | | | | | | 16/3926746 | 23291 | AccessShareLock | t
ir_module_module | relation | 221638 | 221989 | | | | | | | | 16/3926746 | 23291 | AccessShareLock | t
res_groups | relation | 221638 | 221807 | | | | | | | | 16/3926746 | 23291 | AccessShareLock | t
res_users | relation | 221638 | 221793 | | | | | | | | 16/3926746 | 23291 | AccessShareLock | t
mail_alias_pkey | relation | 221638 | 225157 | | | | | | | | 16/3926746 | 23291 | AccessShareLock | t
...
mail_alias | relation | 221638 | 225153 | | | | | | | | 16/3926746 | 23291 | AccessShareLock | t
res_partner | relation | 221638 | 222046 | | | | | | | | 16/3926746 | 23291 | AccessShareLock | t
(44 rows)
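For reference, a listing like the one above can be produced with something along these lines (a sketch; the column order matches the output above, i.e. pg_locks prefixed with the relation name):

```sql
-- Per-backend lock listing: pg_locks rows for pid 23291, with the relation name resolved.
SELECT c.relname, l.locktype, l.database, l.relation, l.page, l.tuple,
       l.virtualxid, l.transactionid, l.classid, l.objid, l.objsubid,
       l.virtualtransaction, l.pid, l.mode, l.granted
  FROM pg_locks l
  LEFT JOIN pg_class c ON c.oid = l.relation
 WHERE l.pid = 23291
 ORDER BY c.relname;
```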
I still have to find out why this transaction stays in this stalled state.
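In the meantime, a possible stop-gap, assuming superuser access and accepting that the idle transaction is rolled back, is to terminate the offending backend so the blocked workers can move on:

```sql
-- Kill the "<IDLE> in transaction" backend holding the locks (here pid 23291).
-- Its open transaction is rolled back and the locks on queue_worker are released.
SELECT pg_terminate_backend(23291);
```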
Better display for the long lines of the tables: http://