Stalling jobs when launching the connector with multiple workers

Bug #1193293 reported by Guewen Baconnier @ Camptocamp
This bug affects 3 people
Affects: OpenERP Connector
Status: Fix Released
Importance: High
Assigned to: Guewen Baconnier @ Camptocamp

Bug Description

Copy of a reported bug:

"
We are seeing a situation when importing many records from Magento (sometimes hundreds, sometimes tens of thousands) where the connector runs well for a while, then stalls and stops importing records. Usually restarting the server and re-queuing the jobs gets it going again, but this is not ideal.

I have a couple of theories about what might be causing it.
1) We have two OpenERP instances running on the same server on different ports for testing. Both have access to the same PostgreSQL databases. Is it possible that workers from both instances are trying to execute import jobs on the same database at the same time and are conflicting with each other to cause this?
2) We have been doing some testing with multiprocessing workers using the "workers" option in the config file. I have run into issues in the past (specifically running a full database upgrade in threaded mode) where the worker would reach its execution time limit and be killed before it could complete its task. Is it possible that enumerating large numbers of customers (~50k) for the initial import takes too long and causes that process to die?
"

Reproducible only when using multiprocessing with the `--workers` option in OpenERP.

Revision history for this message
Guewen Baconnier @ Camptocamp (gbaconnier-c2c) wrote :

I'll start by giving some insight into the way workers and jobs work and how they are supposed to be resilient.
I hope we'll find a flaw in the concept or the implementation, because this is very concerning and may be hard to reproduce.
I won't dive too much into the implementation though, to keep things clear at the design level.

The following text could serve as a basis to complete for inclusion in the documentation.

Concepts:
========

    Worker:
      A worker contains a queue of jobs. Jobs are enqueued in the queue and the worker executes them. A worker is represented in the database by a row in `queue_worker`.
      Basically, it holds a queue in memory, waits for jobs, and pops them out of the queue one after the other.
      The connector framework launches one worker per process, in a thread.
      When using multiprocessing, only the workers of the "Cron" processes execute jobs, not those of the "HTTP" processes.

    Jobs:
      A job is a pending task to be run by a worker. A job is stored in the database by a row in `queue_job`.
      When a job is created, its state is "pending".
      When a job is assigned to a worker, its state is still "pending" and `queue_job.worker_id` contains the id of the assigned worker.
      When a job is enqueued, its state is "enqueued" and `queue_job.worker_id` contains the id of the assigned worker. When a job is enqueued, it is pushed in the Queue (in memory) of the Worker.
      When a job is started, its state is "started" and when it is finished, its state is "done".

   WorkerWatcher:
     A worker watcher is responsible for creating a Worker when the OpenERP registry is loaded and for signaling the aliveness of the process.
     One WorkerWatcher is launched per process, in a thread. (A simplified sketch of these concepts follows this list.)
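
    To make these concepts concrete, here is a deliberately simplified Python sketch (illustrative classes only, not the connector's actual Worker/Job implementation): a job moves through the states pending → enqueued → started → done, and the worker pops jobs from its in-memory queue one after the other.

        import threading
        from Queue import Queue   # Python 2 stdlib, as used by OpenERP 7


        class Job(object):
            """A pending task, stored as a row in `queue_job` (simplified)."""
            def __init__(self, uuid):
                self.uuid = uuid
                self.state = 'pending'
                self.worker_id = None    # set when the job is assigned to a worker


        class Worker(threading.Thread):
            """Holds an in-memory queue of jobs and runs them one after the
            other (one worker per process, in a thread)."""
            def __init__(self, worker_id):
                super(Worker, self).__init__()
                self.daemon = True
                self.id = worker_id      # row in `queue_worker`
                self.queue = Queue()

            def enqueue(self, job):
                job.worker_id = self.id
                job.state = 'enqueued'
                self.queue.put(job)

            def run(self):
                while True:
                    job = self.queue.get()   # blocks until a job is pushed
                    job.state = 'started'
                    # ... execute the job's task here ...
                    job.state = 'done'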

Enqueuing process:
===============

    The jobs are assigned and enqueued (cf. Concepts) in a worker by a 'Scheduled Action'.
    This is a two-step operation with a commit after each step (so each step is atomic):

    Assign step:
      1. SELECT all the jobs which are not assigned to a worker (with a FOR UPDATE so it reserves the jobs)
          If the SELECT fails (because another Scheduled Action has already reserved the jobs, for instance), the enqueuing is stopped and retried the next time.
      2. Update the jobs with the ID of the worker of the current process in `worker_id` and change their state to pending.
      3. Commit the transaction

   Enqueue step:
      4. SELECT all the jobs assigned to the worker of the current process
      5. For each job:
          6. Change the state's job to "enqueued"
          7. Commit the transaction
          8. Push the job in the worker's queue (this is very unlikely to fail, as it is only an in-memory push)
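
    As a rough illustration of the two steps above, reusing the simplified Worker sketch from the Concepts section (the SQL text, the NOWAIT clause, and the use of the `cr` cursor are assumptions made for this sketch, not the connector's exact queries):

        def assign_jobs(cr, worker):
            # Assign step (1-3): reserve the unassigned jobs, assign them
            # to the worker of the current process, then commit.
            cr.execute("SELECT id FROM queue_job "
                       "WHERE worker_id IS NULL "
                       "FOR UPDATE NOWAIT")   # assumed: fails instead of waiting
            job_ids = [row[0] for row in cr.fetchall()]
            if job_ids:
                cr.execute("UPDATE queue_job SET worker_id = %s "
                           "WHERE id IN %s", (worker.id, tuple(job_ids)))
            cr.commit()

        def enqueue_jobs(cr, worker):
            # Enqueue step (4-8): mark each assigned job as enqueued, with
            # one commit per job, then push it into the in-memory queue.
            cr.execute("SELECT id FROM queue_job "
                       "WHERE worker_id = %s AND state = 'pending'",
                       (worker.id,))
            for (job_id,) in cr.fetchall():
                cr.execute("UPDATE queue_job SET state = 'enqueued' "
                           "WHERE id = %s", (job_id,))
                cr.commit()
                worker.queue.put(job_id)   # push into the worker's queue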

Lifecycle of a Worker:
=================

    Beginning of the life

    1. OpenERP starts
    2. The WorkerWatcher thread is created; it polls every 10 seconds and waits for the OpenERP registry to be ready.
    3. Once the registry is ready, the WorkerWatcher creates a Worker. Jobs start to be assigned and enqueued into it.
    4. Every 10 seconds, the WorkerWatcher checks whether the Worker is still alive; if so, it writes the 'Last Alive Check' date in the `queue_worker` record (and logs Worker 2f183d82-10a3-45ff-8c95-00d2fb...

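    As a hedged sketch of the keep-alive loop described in steps 2-4 (illustrative only: the `mark_alive` callback and the `is_alive()` check are assumptions for this sketch, not the connector's actual API), every 10 seconds the watcher checks the worker of its process and records the check date:

        import threading
        import time


        class WorkerWatcher(threading.Thread):
            """Polls every `interval` seconds and records that the worker
            of this process is still alive (simplified sketch)."""
            def __init__(self, worker, mark_alive, interval=10):
                super(WorkerWatcher, self).__init__()
                self.daemon = True
                self.worker = worker          # the Worker of the current process
                self.mark_alive = mark_alive  # e.g. writes the 'Last Alive Check'
                                              # date in the `queue_worker` record
                self.interval = interval

            def run(self):
                while True:
                    if self.worker.is_alive():   # Thread.is_alive(), as in the
                                                 # Worker sketch above
                        self.mark_alive(self.worker)
                    time.sleep(self.interval)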

Revision history for this message
Guewen Baconnier @ Camptocamp (gbaconnier-c2c) wrote :

This happens solely when multiprocessing is used.
The workers run in the 'CronWorker' processes and die because the registry is deleted and the database connection is closed after each scheduled action.

Here is an alternative branch which resolves this problem by using a standalone script for the job workers:
https://code.launchpad.net/~openerp-connector-core-editors/openerp-connector/7.0-connector-worker-rework

description: updated
description: updated
Changed in openerp-connector:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Guewen Baconnier @ Camptocamp (gbaconnier-c2c)
status: Confirmed → Incomplete
status: Incomplete → Fix Committed
description: updated
summary: - Stalling jobs when importing thousands of customers
+ Stalling jobs when launching the connector with multiple workers
Revision history for this message
Guewen Baconnier @ Camptocamp (gbaconnier-c2c) wrote :
Changed in openerp-connector:
status: Fix Committed → Fix Released
information type: Embargoed → Public