On create instance the instance status alternates between BUILD->ACTIVE->BUILD->ACTIVE

Bug #1482795 reported by Edmond Kotowski
This bug affects 2 people
Affects: OpenStack DBaaS (Trove)
Status: Fix Released
Importance: High
Assigned to: Peter Stachowski

Bug Description

When creating a Trove instance with any datastore, the instance starts in BUILD and then goes to a fake ACTIVE state. After the fake ACTIVE state it goes back to BUILD and then finally reaches the real ACTIVE state. This happens because the datastore service is started first and sends a heartbeat, which moves the instance into the fake ACTIVE state; the prepare method on the Trove guest then sets the status back to BUILD and stops the datastore service. When prepare finishes, it sets the status to the final, real ACTIVE state.

Fix: add an optional param force to set_status:
def set_status(self, status, force=False)
The flag is set to True for end_install_or_restart, which is called at the end of prepare for all datastores, either directly or through app.complete_install_or_restart().

Changed in trove:
assignee: nobody → Edmond Kotowski (ekotowski)
importance: Undecided → High
description: updated
Revision history for this message
Matthew Van Dijk (mvandijk) wrote :

I do not think the fix will work for all datastores. The problem here is that the guest and heartbeats are running before prepare has completed. The heartbeats check the status of the database process, which can be misleading during prepare. We need a way to block the heartbeat until prepare has finished and set the status to active.

Revision history for this message
Petr Malik (pmalik) wrote :

I am not sure if the proposed fix targets the real issue here.

The real issue is that the guest starts before prepare and sends a HB too early. This behavior already causes all sorts of trouble in various datastores (e.g. the need to wait for the DB to start up in order to stop it immediately at the beginning of prepare, and the need to clean up generated files and settings).

I believe a proper fix would be to prevent it from sending the erroneous HB (or any HB) until the prepare finishes.

Also, are we sure the data directory will always exist?
What about in-memory databases (e.g. Redis)?

P.

Revision history for this message
Edmond Kotowski (ekotowski) wrote :

Yes, the var lib data fix doesn't work, and it only exists on MySQL right now. The fix I have working for all datastores is this:

Fix: add an optional param force to set_status:
def set_status(self, status, force=False)
The flag is set to True for end_install_or_restart, which is called at the end of prepare for all datastores, either directly or through app.complete_install_or_restart().

Will send the patch set out with it this week.

description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to trove (master)

Fix proposed to branch: master
Review: https://review.openstack.org/215402

Changed in trove:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to trove (master)

Reviewed: https://review.openstack.org/215402
Committed: https://git.openstack.org/cgit/openstack/trove/commit/?id=2b963fa4b1d6895a03508e47cffc3fac03abd1c2
Submitter: Jenkins
Branch: master

commit 2b963fa4b1d6895a03508e47cffc3fac03abd1c2
Author: Edmond Kotowski <email address hidden>
Date: Thu Aug 20 19:07:04 2015 -0700

    Fix instance from alternating status on create

    Instances will currently alternate between
    BUILD->ACTIVE->BUILD->ACTIVE on create. This was
    happening because a race condition existed between
    the datastore sending heartbeats that the service is
    ACTIVE before and the prepare call manually stopping
    the service to bring the status back to BUILD.

    The fix is to add an optional param force to set_status
    that is going to be set to True for end_install_or_restart
    which is called at the end of prepare for all datastores either
    directly or through app.complete_install_or_restart(). When
    set_status is called it will now check if the instance status
    was currently BUILDING and if the force flag is False. This
    means that the prepare call has not yet finished and to skip
    the heartbeat from updating the status. Once prepare is finished
    it will call complete_install_or_restart which will in turn
    call end_install_or_restart and force the status to be updated
    from BUILD to ACTIVE. If setting status to FAILED or
    BUILD_PENDING it will never skip the heartbeat.

    For mongodb I added complete_install_or_restart to service.py
    to be called at the end of prepare for single instance mongo
    instead of calling set_status directly to RUNNING.

    Cleaned up BaseDbStatusTests and added new tests covering new
    logic.

    Change-Id: I7cbd5667e27608edef9755280a8f072495839e1d
    Closes-Bug: 1482795

Changed in trove:
status: In Progress → Fix Committed
Changed in trove:
milestone: none → liberty-3
Changed in trove:
status: Fix Committed → Fix Released
Revision history for this message
Edmond Kotowski (ekotowski) wrote :

Reopening: I noticed the status is still toggling, because I am not checking whether self.status is None on the first set_status call from the guest agent.

Changed in trove:
status: Fix Released → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to trove (master)

Fix proposed to branch: master
Review: https://review.openstack.org/223892

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on trove (master)

Change abandoned by Edmond Kotowski (<email address hidden>) on branch: master
Review: https://review.openstack.org/223892
Reason: Need to go back and think about this more. We will look into a fix for the M release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to trove (master)

Fix proposed to branch: master
Review: https://review.openstack.org/234461

Changed in trove:
assignee: Edmond Kotowski (ekotowski) → Peter Stachowski (peterstac)
Thierry Carrez (ttx)
Changed in trove:
milestone: liberty-3 → 4.0.0
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to trove (master)

Reviewed: https://review.openstack.org/234461
Committed: https://git.openstack.org/cgit/openstack/trove/commit/?id=1faa4d427d5fec782e740a2095dc6629bb8a33d7
Submitter: Jenkins
Branch: master

commit 1faa4d427d5fec782e740a2095dc6629bb8a33d7
Author: Peter Stachowski <email address hidden>
Date: Fri Sep 18 18:15:01 2015 -0400

    Refactor the datastore manager classes

    There is a large amount of boiler-plate code in each datastore manager.
    As more managers are added, the time involved in maintaining all this
    code will continue to grow. To alleviate this, a base manager class
    has been added where all common code can reside.

    The initial refactoring just moved some of the obvious code (such as
    rpc_ping and update_status) into the base class, along with defining
    properties that can be used to further abstract functionality
    going forward.

    The issue of having instances move in-and-out of ACTIVE state has
    also been fixed by adding a flag file that is written by the base
    class once prepare has finished successfully.

    Closes-Bug: #1482795
    Closes-Bug: #1487233
    Partially Implements: blueprint datastore-manager-refactor
    Change-Id: I603cf2ddebab1d7a5c874cd66431f803aaee2d42
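
The flag-file idea from this commit message might look something like the sketch below. It is an assumption-laden illustration rather than the actual base manager: the PrepareGate class, its method names and the flag file path are all hypothetical, and the choice to suppress heartbeats entirely before prepare completes is one possible reading of the fix.

```python
import os


class PrepareGate:
    """Hypothetical helper: gates heartbeats on a 'prepare finished' flag file."""

    def __init__(self, flag_path):
        # flag_path is an assumed location; the report does not name the file.
        self.flag_path = flag_path

    @property
    def prepare_completed(self):
        """True once the flag file has been written."""
        return os.path.isfile(self.flag_path)

    def mark_prepare_completed(self):
        """Write the (empty) flag file after prepare finishes successfully."""
        with open(self.flag_path, "w"):
            pass

    def status_to_report(self, observed_status):
        """Return the status to send in a heartbeat, or None to send nothing."""
        if not self.prepare_completed:
            # Prepare is still running, so the observed state is misleading.
            return None
        return observed_status
```

Because the flag lives on disk, it survives a guest agent restart, so an instance that finished prepare earlier keeps reporting its real status afterwards.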

Changed in trove:
status: In Progress → Fix Committed
Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/trove 5.0.0.0b1

This issue was fixed in the openstack/trove 5.0.0.0b1 development milestone.

Changed in trove:
status: Fix Committed → Fix Released