Stalling jobs when launching the connector with multiple workers

Bug #1193293 reported by Guewen Baconnier @ Camptocamp
This bug affects 3 people
Affects: OpenERP Connector
Status: Fix Released
Importance: High
Assigned to: Guewen Baconnier @ Camptocamp

Bug Description

Copy of a reported bug:

"
We are seeing a situation when importing many records from Magento (sometimes hundreds, sometimes tens of thousands) where the connector runs well for a while, then stalls and stops importing records. Usually restarting the server and re-queuing the jobs gets it going again, but this is not ideal.

I have a couple of theories about what might be causing it.
1) We have two OpenERP instances running on the same server on different ports for testing. Both have access to the same PostgreSQL databases. Is it possible that workers from both instances are trying to execute import jobs on the same database at the same time and are conflicting with each other to cause this?
2) We have been doing some testing with multiprocessing workers using the "workers" option in the config file. I have run into issues in the past (specifically running a full database upgrade in threaded mode) where the worker would reach its execution time limit and be killed before it could complete its task. Is it possible that enumerating large numbers of customers (~50k) for the initial import takes too long and causes that process to die?
"

Reproducible only when using multiprocessing with the `--workers` option in OpenERP.

Revision history for this message
Guewen Baconnier @ Camptocamp (gbaconnier-c2c) wrote :

I'll start by giving some insight into the way workers and jobs work and how they are supposed to be resilient.
I hope we'll find a flaw in the concept or the implementation, because this is very concerning and may be hard to reproduce.
I won't dive too much into the implementation though, to keep things clear at the design level.

The following text could serve as a basis to complete for inclusion in the documentation.

Concepts:
========

    Worker:
      A worker contains a queue of jobs. Jobs are enqueued in the queue and the worker executes them. A worker is represented in the database by a row in `queue_worker`.
      Basically, it holds a queue in memory, waits for jobs, and pops them out of the queue one after the other.
      The connector framework launches one worker per process, in a thread.
      When using multiprocessing, only the workers of the "Cron" processes execute jobs, not those of the "HTTP" processes.

    Jobs:
      A job is a pending task to be run by a worker. A job is stored in the database by a row in `queue_job`.
      When a job is created, its state is "pending".
      When a job is assigned to a worker, its state is still "pending" and `queue_job.worker_id` contains the id of the assigned worker.
      When a job is enqueued, its state is "enqueued" and `queue_job.worker_id` contains the id of the assigned worker. When a job is enqueued, it is pushed in the Queue (in memory) of the Worker.
      When a job is started, its state is "started" and when it is finished, its state is "done".

   WorkerWatcher:
     A worker watcher is responsible for creating a Worker when the OpenERP registry is loaded and for signaling the aliveness of the process.
     One WorkerWatcher is launched per process, in a thread. (A simplified sketch of these concepts follows this list.)
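
    To make these concepts concrete, here is a deliberately simplified Python sketch (illustrative classes only, not the connector's actual Worker/Job implementation): a job moves through the states pending → enqueued → started → done, and the worker pops jobs from its in-memory queue one after the other.

        import threading
        from Queue import Queue   # Python 2 stdlib, as used by OpenERP 7


        class Job(object):
            """A pending task, stored as a row in `queue_job` (simplified)."""
            def __init__(self, uuid):
                self.uuid = uuid
                self.state = 'pending'
                self.worker_id = None    # set when the job is assigned to a worker


        class Worker(threading.Thread):
            """Holds an in-memory queue of jobs and runs them one after the
            other (one worker per process, in a thread)."""
            def __init__(self, worker_id):
                super(Worker, self).__init__()
                self.daemon = True
                self.id = worker_id      # row in `queue_worker`
                self.queue = Queue()

            def enqueue(self, job):
                job.worker_id = self.id
                job.state = 'enqueued'
                self.queue.put(job)

            def run(self):
                while True:
                    job = self.queue.get()   # blocks until a job is pushed
                    job.state = 'started'
                    # ... execute the job's task here ...
                    job.state = 'done'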

Enqueuing process:
===============

    The jobs are assigned and enqueued (cf. Concepts) in a worker by a 'Scheduled Action'.
    This is a two-step operation with a commit after each step (so each step is atomic):

    Assign step:
      1. SELECT all the jobs which are not assigned to a worker (with a FOR UPDATE so it reserves the jobs)
          If the SELECT fails (because another Scheduled Action has already reserved the jobs, for instance), the enqueuing is stopped and retried the next time.
      2. Update the jobs with the ID of the worker of the current process in `worker_id` and change their state to pending.
      3. Commit the transaction

   Enqueue step:
      4. SELECT all the jobs assigned to the worker of the current process
      5. For each job:
          6. Change the state's job to "enqueued"
          7. Commit the transaction
          8. Push the job in the worker's queue (this is very unlikely to fail, as it is only an in-memory push)
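
    As a rough illustration of the two steps above, reusing the simplified Worker sketch from the Concepts section (the SQL text, the NOWAIT clause, and the use of the `cr` cursor are assumptions made for this sketch, not the connector's exact queries):

        def assign_jobs(cr, worker):
            # Assign step (1-3): reserve the unassigned jobs, assign them
            # to the worker of the current process, then commit.
            cr.execute("SELECT id FROM queue_job "
                       "WHERE worker_id IS NULL "
                       "FOR UPDATE NOWAIT")   # assumed: fails instead of waiting
            job_ids = [row[0] for row in cr.fetchall()]
            if job_ids:
                cr.execute("UPDATE queue_job SET worker_id = %s "
                           "WHERE id IN %s", (worker.id, tuple(job_ids)))
            cr.commit()

        def enqueue_jobs(cr, worker):
            # Enqueue step (4-8): mark each assigned job as enqueued, with
            # one commit per job, then push it into the in-memory queue.
            cr.execute("SELECT id FROM queue_job "
                       "WHERE worker_id = %s AND state = 'pending'",
                       (worker.id,))
            for (job_id,) in cr.fetchall():
                cr.execute("UPDATE queue_job SET state = 'enqueued' "
                           "WHERE id = %s", (job_id,))
                cr.commit()
                worker.queue.put(job_id)   # push into the worker's queue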

Lifecycle of a Worker:
=================

    Beginning of the life

    1. OpenERP starts
    2. The WorkerWatcher thread is created; it polls every 10 seconds and waits for the OpenERP registry to be ready.
    3. Once the registry is ready, the WorkerWatcher creates a Worker. Jobs start to be assigned and enqueued into it.
    4. Every 10 seconds, the WorkerWatcher checks whether the Worker is still alive; if so, it writes the 'Last Alive Check' date in the `queue_worker` record (and logs Worker 2f183d82-10a3-45ff-8c95-00d2fb...

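    As a hedged sketch of the keep-alive loop described in steps 2-4 (illustrative only: the `mark_alive` callback and the `is_alive()` check are assumptions for this sketch, not the connector's actual API), every 10 seconds the watcher checks the worker of its process and records the check date:

        import threading
        import time


        class WorkerWatcher(threading.Thread):
            """Polls every `interval` seconds and records that the worker
            of this process is still alive (simplified sketch)."""
            def __init__(self, worker, mark_alive, interval=10):
                super(WorkerWatcher, self).__init__()
                self.daemon = True
                self.worker = worker          # the Worker of the current process
                self.mark_alive = mark_alive  # e.g. writes the 'Last Alive Check'
                                              # date in the `queue_worker` record
                self.interval = interval

            def run(self):
                while True:
                    if self.worker.is_alive():   # Thread.is_alive(), as in the
                                                 # Worker sketch above
                        self.mark_alive(self.worker)
                    time.sleep(self.interval)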

Revision history for this message
Guewen Baconnier @ Camptocamp (gbaconnier-c2c) wrote :

This happens solely when multiprocessing is used.
The workers run in the 'CronWorker' processes and die because the registry is deleted and the database connection is closed after each scheduled action.

Here is an alternative branch which resolves this problem by using a standalone script for the job workers:
https://code.launchpad.net/~openerp-connector-core-editors/openerp-connector/7.0-connector-worker-rework

description: updated
description: updated
Changed in openerp-connector:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Guewen Baconnier @ Camptocamp (gbaconnier-c2c)
status: Confirmed → Incomplete
status: Incomplete → Fix Committed
description: updated
summary: - Stalling jobs when importing thousands of customers
+ Stalling jobs when launching the connector with multiple workers
Revision history for this message
Guewen Baconnier @ Camptocamp (gbaconnier-c2c) wrote :
Changed in openerp-connector:
status: Fix Committed → Fix Released
information type: Embargoed → Public