Taskmanager resize/migration actions Exception does not properly handle failures.

Bug #1102523 reported by Joe Cruz
This bug affects 1 person
Affects: OpenStack DBaaS (Trove)
Status: Fix Released
Importance: Undecided
Assigned to: Joe Cruz

Bug Description

* First, in case of a failure, the Taskmanager should call the guest's restart MySQL methods only if the Nova server is ACTIVE.

When resizing, Reddwarf checks that the Nova status switches from ACTIVE to VERIFY_RESIZE. If anything goes wrong, it restarts MySQL regardless, on the reasoning that if the MySQL app can be restarted, why not?

However, we need to add code to first check that the Nova server status is ACTIVE and only then make the call to restart.

Why?
Let's say a customer is resizing their database, and triggers a migration. As the migration runs, the network is cut, causing the migration to error out. Nova sets the status to ERROR.

The Reddwarf task manager code sees this, and sends a message to restart MySQL.

If the server isn't running, the guest doesn't pick up the message until the server is turned back on. What happens then depends on the state of the system at that time.

If the original server is still running, the guest may pick up the message. In theory, if the volume was disconnected during the Nova migration but the Nova server is otherwise OK, MySQL may start up and create new databases over the old ones. In this theoretical scenario it might look as if data had been deleted. Fixing this would require ops to stop MySQL, delete the new databases, reattach the volume, and then start MySQL.

We can avoid this theoretical scenario by checking that the server is in ACTIVE status before restarting the guest.
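The check described above could look roughly like the following. This is only a minimal sketch, not the actual Reddwarf code: `FakeGuest` and `restart_mysql_if_active` are hypothetical names used for illustration, and the Nova status is passed in as a plain string.

```python
class FakeGuest:
    """Hypothetical stand-in for the guest agent client (not the Reddwarf API)."""

    def __init__(self):
        self.restart_calls = 0

    def restart(self):
        # A real guest client would send a restart message to the instance.
        self.restart_calls += 1


def restart_mysql_if_active(server_status, guest):
    """Ask the guest to restart MySQL only when Nova reports ACTIVE.

    Returns True if a restart was requested, False if it was skipped.
    """
    if server_status != "ACTIVE":
        # For example ERROR after a failed migration: the volume may be
        # detached, so restarting MySQL could create fresh databases.
        return False
    guest.restart()
    return True


guest = FakeGuest()
restart_mysql_if_active("ERROR", guest)   # skipped, no message sent
restart_mysql_if_active("ACTIVE", guest)  # restart requested
```

With this guard, a server left in ERROR by a failed migration never receives a queued restart message that it could act on later in an unknown state.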

* Second, the revert barrier should be right after VERIFY_RESIZE is confirmed, but before confirming the flavor for a resize action.

Currently, if there is a resize failure where the Nova server is in VERIFY_RESIZE but still has the old flavor id, the _perform_nova_action exception block will not revert the server, since it has not passed the revert barrier.
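To illustrate the proposed barrier placement, here is a rough sketch under stated assumptions. `FakeServer` and `perform_resize` are hypothetical stand-ins, not the real `_perform_nova_action`, and the barrier is modelled as a simple boolean flag.

```python
class FakeServer:
    """Hypothetical stand-in for a Nova server record."""

    def __init__(self, status, flavor_id):
        self.status = status
        self.flavor_id = flavor_id
        self.reverted = False

    def revert_resize(self):
        # A real implementation would ask Nova to revert the resize.
        self.reverted = True
        self.status = "ACTIVE"


def perform_resize(server, new_flavor_id):
    """Sketch of a resize with the revert barrier set before the flavor check.

    Returns True on success, False on failure (reverting when possible).
    """
    past_revert_barrier = False
    try:
        if server.status != "VERIFY_RESIZE":
            raise RuntimeError("server never reached VERIFY_RESIZE")
        # Barrier: VERIFY_RESIZE is confirmed, so a revert is now possible.
        past_revert_barrier = True
        if server.flavor_id != new_flavor_id:
            raise RuntimeError("flavor id was not updated")
        return True
    except RuntimeError:
        if past_revert_barrier:
            server.revert_resize()
        return False


stuck = FakeServer(status="VERIFY_RESIZE", flavor_id="old")
perform_resize(stuck, "new")  # fails the flavor check, then reverts
```

Because the flag is set before the flavor comparison, a server stuck in VERIFY_RESIZE with the old flavor id is reverted instead of being left as-is.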

Joe Cruz (jcruz7)
Changed in reddwarf:
assignee: nobody → Joe Cruz (jcruz7)
Joe Cruz (jcruz7)
summary: - Taskmanager should call the restart MySQL guest methods only if the Nova
- server is ACTIVE
+ Taskmanager resize/migration actions Exception does not properly handle
+ failures.
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to reddwarf (master)

Fix proposed to branch: master
Review: https://review.openstack.org/20351

Changed in reddwarf:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to reddwarf (master)

Reviewed: https://review.openstack.org/20351
Committed: http://github.com/stackforge/reddwarf/commit/d9c9a91642f6a00482283645e30dc9c184f43370
Submitter: Jenkins
Branch: master

commit d9c9a91642f6a00482283645e30dc9c184f43370
Author: Joe Cruz <email address hidden>
Date: Fri Jan 18 08:55:09 2013 -0600

    Negative Taskmanager Resize/Migration fixes.

    * Revert barrier should be right after Verify_RESIZE is confirmed, but
      before confirming flavor for a resize action.
    * Verify nova server status before restarting mysql.
      If anything goes wrong during resize/migrate actions, taskmanager
      first checks that the Nova server status is ACTIVE and only then make
      the call to restart MySQL.

    Fixes: bug #1102523

    Change-Id: Ibca436d7fdcdef9f1afcec111da84891cd15353c

Changed in reddwarf:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in trove:
milestone: none → havana-2
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in trove:
milestone: havana-2 → 2013.2