Increase replication timeouts for snapshot/restore

Bug #1362310 reported by Morgan Jones
This bug affects 1 person
Affects: OpenStack DBaaS (Trove)
Status: Fix Released
Importance: High
Assigned to: Nikhil Manchanda
Milestone: 2014.2

Bug Description

In the current implementation, replication snapshots and creating a new slave from a snapshot may fail due to timeouts waiting for large amounts of data to be backed up or restored.

A potential solution is to monitor heartbeats from the guestagent, so that the operation can take as much time as it needs without a failed guestagent locking out the taskmanager. However, such a solution is beyond the scope of implementation for Juno.
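To make the heartbeat idea concrete, here is a minimal sketch of a wait loop that has no overall deadline but gives up once the guest goes quiet. This is not Trove code: `wait_with_heartbeat`, `GuestUnresponsive`, and all of the parameters are hypothetical names chosen for illustration, and how the taskmanager would actually obtain the last heartbeat time is left open.

```python
import time
from typing import Callable


class GuestUnresponsive(Exception):
    """Raised when the guest stops sending heartbeats mid-operation."""


def wait_with_heartbeat(
    is_done: Callable[[], bool],
    last_heartbeat: Callable[[], float],
    heartbeat_grace: float,
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Block until is_done() returns True, with no overall deadline.

    The wait is abandoned only when the most recent heartbeat is older
    than heartbeat_grace seconds, so a dead guest cannot block the
    caller forever. last_heartbeat() must use the same time base as
    clock().
    """
    while not is_done():
        if clock() - last_heartbeat() > heartbeat_grace:
            raise GuestUnresponsive("no heartbeat within grace period")
        sleep(poll_interval)
```

With this shape, a long backup or restore is bounded by guest liveness rather than by data size, which is the property a fixed timeout cannot provide.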

As a temporary solution, change the timeouts on the snapshot backup calls and the instance restore calls to effectively be "timeout = maxint".

Morgan Jones (6-morgan)
Changed in trove:
assignee: nobody → Morgan Jones (6-morgan)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to trove (master)

Fix proposed to branch: master
Review: https://review.openstack.org/121938

Changed in trove:
status: New → In Progress
Amrith Kumar (amrith) wrote :

This is a replication bug fix that was targeted for Juno during the mid-cycle.

Changed in trove:
milestone: none → juno-rc1
Changed in trove:
importance: Undecided → High
Changed in trove:
assignee: Morgan Jones (6-morgan) → Nikhil Manchanda (slicknik)
Changed in trove:
assignee: Nikhil Manchanda (slicknik) → Morgan Jones (6-morgan)
Changed in trove:
assignee: Morgan Jones (6-morgan) → Nikhil Manchanda (slicknik)
OpenStack Infra (hudson-openstack) wrote : Fix merged to trove (master)

Reviewed: https://review.openstack.org/121938
Committed: https://git.openstack.org/cgit/openstack/trove/commit/?id=0fe9c9dd59cf4bc084a9875ce789f0b2943c799a
Submitter: Jenkins
Branch: master

commit 0fe9c9dd59cf4bc084a9875ce789f0b2943c799a
Author: Morgan Jones <email address hidden>
Date: Wed Sep 10 10:31:00 2014 -0700

    Make the replication snapshot timeout configurable

    There is no way to tell how long the snapshot for replication
    will take, and we have no good way to poll for the slave state.
    Eventually, we will need to have an intelligent poll (perhaps
    based on guest heartbeats), but in the meantime we will have
    the snapshot use a configurable timeout which can be set
    as needed, and independently of the agent_call timeouts.

    Co-Authored-By: Nikhil Manchanda <email address hidden>
    Change-Id: I6316d748e91d1ec3eebe25a14bb43fbfe10db669
    Closes-bug: 1362310
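
The approach in the commit message, a snapshot timeout that is set independently of the general agent call timeout, could look roughly like the sketch below. It is an illustration only and not the merged change: the class, the function names, the option names, and the default values are all assumptions, and no real RPC is performed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentTimeouts:
    # Hypothetical option names and placeholder defaults, in seconds.
    agent_call_timeout: int = 60
    replication_snapshot_timeout: int = 36000


def call_guest(method: str, timeouts: AgentTimeouts,
               timeout: Optional[int] = None) -> dict:
    """Describe a guest call; a per-call timeout overrides the default."""
    effective = timeouts.agent_call_timeout if timeout is None else timeout
    return {"method": method, "timeout": effective}


def get_replication_snapshot(timeouts: AgentTimeouts) -> dict:
    """Use the dedicated snapshot timeout instead of the generic one."""
    return call_guest(
        "get_replication_snapshot",
        timeouts,
        timeout=timeouts.replication_snapshot_timeout,
    )
```

Keeping the two values separate lets an operator raise the snapshot timeout for large datasets without also lengthening every short guest call.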

Changed in trove:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in trove:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in trove:
milestone: juno-rc1 → 2014.2