Fix race condition on configuration-detach

Bug #1468488 reported by Petr Malik
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack DBaaS (Trove)
Fix Released
High
Petr Malik

Bug Description

Fix race condition on configuration-detach

Configuration detach (and attach) are async calls from the task manager.

A Trove instance is loaded from the database [in instance/service:287],
an async call is made to detach the configuration
group [in instance/service:274] and the database gets updated with the
new configuration id (None) [in instance/service:276].

The current detach implementation [in taskmanager/models:1297]
assumes the instance has a configuration id throughout the execution of
the async call.
This causes a failure in the task manager if the database update
happens before the async call finishes (update_db removes the
configuration id and re-saves the instance in the Trove database).

There is another related issue. If the detach call fails anywhere in
the taskmanager or the guestagent the Trove database still gets updated.
The instance then behaves like it has (not) a configuration group
attached although it is not true.

Notes:

The configuration id is needed on detach because the code tries to
reset the changed values to their defaults dynamically and to do that it
needs to load the configuration group.
If it cannot reset all configuration properties dynamically it puts
the instance to RESTART_REQUIRED state. This is however overridden
in instance/models:1037 where we always set the RESTART_REQUIRED flag on
configuration detach.

Another problem is in update_overrides [in taskmanager/models:1270]
where we currently *always* apply the changed values even if some
of them require instance reset or their default values are not available
(on detach). This potentially puts the instance into a state with only
some changes applied and other waiting for restart.

Proposed fixes:

  - The preferred solution is to move the current configuration casts
    from the task manager to the API and make them into blocking calls
    (configuration change is a quick operation).
    That way we could wait for the guestagent operation to succeed
    before we start updating the Trove record. If the call fails we
    leave the configuration record intact so that it reflects the real
    state of the instance and the user can retry to detach it again.
    This would also allow us to emit a meaningful error message to the
    client.

  - An alternative solution (quick fix) would be to remove the
    code responsible for resetting the configuration values on detach
    (we currently always put the instance to RESTART_REQUIRED state
    anyways). That way the configuration id would no longer be required.
    This should mitigate the race condition but it does not solve the
    other issue - the record in the Trove database gets updated even if
    the guestagent call fails.

Changed in trove:
status: New → In Progress
Changed in trove:
milestone: liberty-2 → liberty-3
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to trove (master)

Reviewed: https://review.openstack.org/198891
Committed: https://git.openstack.org/cgit/openstack/trove/commit/?id=6a6b3dae4fcbec068a6c9eafc9eaa21819b8a94e
Submitter: Jenkins
Branch: master

commit 6a6b3dae4fcbec068a6c9eafc9eaa21819b8a94e
Author: Petr Malik <email address hidden>
Date: Sun Jul 5 14:08:42 2015 -0400

    Fix race conditions in config overrides tasks

        * Convert configuration casts into blocking calls and move them
          from the task manager to the API. Update the Trove records
          only after a successful update on the guestagent.

        * Change the way how we apply the configuration changes.

           - On attach: Apply the values dynamically only if all of them
                        can be applied at once. Put the instance into
                        the 'RESTART_REQUIRED' state otherwise.
           - On detach: Apply the values dynamically only if:
                        a) Default values for all of them are
                           available in the configuration template.
                        b) All values can be applied at once.
                       Put the instance into the 'RESTART_REQUIRED'
                       state otherwise.

        * Remove override template resolution. Pass override values in a
          Python dict (like it is for 'apply_overrides').
          The user-provided configuration values do not get resolved and
          are applied as supplied by the user - hence no need for a template.
          It also avoids the need to double-parse the overrides in
          guestagents that do not support configuration imports.

        * Moved MySQL-specific value conversions (K, M, G suffixes)
          down to the MySQL guestagent.

        * Update the MySQL 'update_overrides' methods to accept both
          Python dicts and rendered strings. This is for backwards
          compatibility with older task managers that send overrides as
          a string.

        * Remove deprecated methods from the taskmanager.

    Closes-Bug: 1468488
    Change-Id: Ie125131945ad82afe6663e6c5a5d70c6e949c60b

Changed in trove:
status: In Progress → Fix Committed
Changed in trove:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in trove:
milestone: liberty-3 → 4.0.0
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.