masakari-engine runs recovery twice for one notification when disconnected from rabbitmq

Bug #1773132 reported by takahara.kengo
This bug affects 1 person
Affects: masakari
Importance: Undecided
Assigned to: Tushar Patil

Bug Description

[Environment info]
- masakari-api and rabbitmq are connected correctly.
- masakari-engine and rabbitmq are not connected.

[Bug]
When masakari-api receives a notification and publishes a message to rabbitmq, masakari-engine cannot start recovery because it is not connected to rabbitmq.
Instead, the periodic_task finds a 'new' status record in the DB and starts recovery.

After that, if the connection between masakari-engine and rabbitmq is restored, masakari-engine consumes the message from the queue and starts recovery again.
As a result, masakari-engine runs recovery twice for one notification.

Even if the first recovery process was successful, the second one may fail and rewrite the DB record from 'finished' to 'error'.

Since the purpose of the periodic_task is to process unfinished notifications, I think it is correct for periodic_task to process the 'new' notification.
Therefore, when masakari-engine consumes a message and the DB record is no longer 'new', I think masakari-engine should skip recovery, since periodic_task is already handling it.

Note:
If the periodic_task and the main process pick up the same notification at the same time, the recovery process could still run twice, so this case may need to be handled as well.
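
To make the proposal concrete, here is a minimal sketch of the skip check. It is not masakari code: the dict `db`, the functions `run_recovery` and `handle`, and the ID `n-1` are all hypothetical stand-ins for the notifications table and the engine's handler.

```python
# Hypothetical in-memory stand-in for the notifications table.
db = {"n-1": {"status": "new"}}
recovery_runs = []


def run_recovery(notification_id):
    """Pretend recovery: record the run and mark the record finished."""
    recovery_runs.append(notification_id)
    db[notification_id]["status"] = "finished"


def handle(notification_id, received_status):
    """Run recovery only if the stored record is still 'new'."""
    stored_status = db[notification_id]["status"]
    if received_status == "new" and stored_status != "new":
        # Someone else (e.g. the periodic task) already picked it up.
        return False
    run_recovery(notification_id)
    return True


# 1. The periodic task finds the 'new' record while rabbitmq is unreachable.
first = handle("n-1", "new")
# 2. The queued message, still carrying status 'new', arrives after reconnect.
second = handle("n-1", "new")
```

In this sketch the second call is skipped, so the 'finished' record is never overwritten. The check alone does not cover the simultaneous case described in the note above; that needs some form of locking around the check and the recovery.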

Tushar Patil (tpatil)
Changed in masakari:
status: New → Confirmed
Tushar Patil (tpatil) wrote :

As processing of notifications is synchronized on the source host UUID, it is guaranteed that either the main process or the periodic task, but not both at once, will run the recovery process for a given notification.

In the _process_notification method, we will need to fetch the notification from the DB and compare its status with that of the current one to decide whether to skip processing.
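
A rough sketch of how the per-host synchronization and the status comparison could fit together is shown below. It is only an illustration under assumptions: the `Engine` class, the `_lock_for` helper, and the dict-based `db` are hypothetical, and plain `threading` locks stand in for whatever synchronization mechanism masakari actually uses.

```python
import threading
from collections import defaultdict

# Hypothetical lock registry: one lock per source host UUID.
_host_locks = defaultdict(threading.Lock)
_registry_lock = threading.Lock()


def _lock_for(host_uuid):
    """Return the lock for a host, creating it safely on first use."""
    with _registry_lock:
        return _host_locks[host_uuid]


class Engine:
    def __init__(self, db):
        self.db = db      # notification uuid -> stored record (dict)
        self.runs = 0     # number of recoveries actually executed

    def process_notification(self, notification):
        # Check and recovery happen under the same per-host lock, so a
        # concurrent caller always sees the updated status.
        with _lock_for(notification["source_host_uuid"]):
            stored = self.db[notification["uuid"]]
            if notification["status"] == "new" and stored["status"] != "new":
                return "skipped"
            stored["status"] = "running"
            self.runs += 1  # placeholder for the real recovery workflow
            stored["status"] = "finished"
            return "processed"


engine = Engine({"n-1": {"status": "new"}})
results = []


def worker():
    # Each caller holds its own copy of the notification, still marked 'new'.
    message = {"uuid": "n-1", "source_host_uuid": "host-a", "status": "new"}
    results.append(engine.process_notification(message))


# One thread plays the main process, the other the periodic task.
threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Whichever thread takes the lock first runs recovery; the other then finds a stored status other than 'new' and skips. Holding the lock across both the comparison and the recovery is what closes the window between "check" and "act".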

Tushar Patil (tpatil)
Changed in masakari:
assignee: nobody → Tushar Patil (tpatil)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to masakari (master)

Fix proposed to branch: master
Review: https://review.openstack.org/576042

Changed in masakari:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to masakari (master)

Reviewed: https://review.openstack.org/576042
Committed: https://git.openstack.org/cgit/openstack/masakari/commit/?id=bed5d7781248e411b970887cefe8e73898de4f55
Submitter: Zuul
Branch: master

commit bed5d7781248e411b970887cefe8e73898de4f55
Author: tpatil <email address hidden>
Date: Mon Jun 18 00:11:06 2018 -0700

    Avoid recovery from failure twice

    This patch checks if the notification status is New and the one
    from DB is not New in order to decide whether to run recovery
    from failure or not.

    Change-Id: I7975d1a464e8b0644b4964da4976b1a9eecc29a9
    Closes-Bug: #1773132

Changed in masakari:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/masakari 6.0.0.0b3

This issue was fixed in the openstack/masakari 6.0.0.0b3 development milestone.
