Ejecting a valid master is sometimes allowed

Bug #1622014 reported by Peter Stachowski
This bug affects 1 person
Affects: OpenStack DBaaS (Trove)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

There is a test to eject a valid master that is failing more and more often in the scenario tests. Basically the eject will only work if the heartbeat from the current master is more than a minute old. This shouldn't happen in the scenario test run, but it does - quite often.

One theory is that clock drift is involved, and that the two timestamps being compared need to be synchronized (i.e. both timestamps need to originate from the same machine). Typically the database is entrusted with adding timestamps to records to make sure they're all correct relative to each other; however, this may or may not be possible in Trove's circumstances, since network or queue delays could mask the 'real' time the heartbeat was generated. It could be that this is not an issue, in which case the heartbeat time delta could be extracted directly from the db in the form of a time diff in seconds.
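
To illustrate the clock-drift theory, here is a minimal sketch in Python. It is not Trove's actual code: the names `HEARTBEAT_EXPIRY` and `master_is_ejectable` are hypothetical, and the 60-second threshold is taken only from the "more than a minute old" rule described above. It shows how a fresh heartbeat can look stale when its timestamp comes from a machine whose clock lags behind the one doing the comparison.

```python
from datetime import datetime, timedelta, timezone

# Assumption: threshold mirrors the "more than a minute old" rule in the report.
HEARTBEAT_EXPIRY = timedelta(seconds=60)


def master_is_ejectable(heartbeat_at, now, expiry=HEARTBEAT_EXPIRY):
    """Return True if the heartbeat looks older than the expiry window.

    The result is only meaningful when both timestamps were taken
    from the same clock.
    """
    return (now - heartbeat_at) > expiry


# "now" as seen by the machine doing the comparison (e.g. the database host).
db_now = datetime(2016, 9, 10, 12, 0, 0, tzinfo=timezone.utc)

# A heartbeat that really happened 5 seconds ago.
true_heartbeat = db_now - timedelta(seconds=5)

# Hypothetical drift: the reporting machine's clock runs 90 seconds behind,
# so it stamps the same heartbeat 90 seconds too early.
drift = timedelta(seconds=90)
stamped_by_drifting_clock = true_heartbeat - drift

# Same clock on both sides: the master is correctly seen as alive.
same_clock_result = master_is_ejectable(true_heartbeat, db_now)

# Mixed clocks: the live master wrongly appears stale and becomes ejectable.
mixed_clock_result = master_is_ejectable(stamped_by_drifting_clock, db_now)

print(same_clock_result, mixed_clock_result)
```

If both the heartbeat timestamp and the comparison time came from the database, the drift term would cancel out, although queue or network delay would still add to the apparent age.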

The current 'eject valid master' test will be turned off until the fix for this lands.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to trove (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/368230

OpenStack Infra (hudson-openstack) wrote : Related fix merged to trove (master)

Reviewed: https://review.openstack.org/368230
Committed: https://git.openstack.org/cgit/openstack/trove/commit/?id=7d8d743d8e168f64fa893c975d7301507d4985d8
Submitter: Jenkins
Branch: master

commit 7d8d743d8e168f64fa893c975d7301507d4985d8
Author: Amrith Kumar <email address hidden>
Date: Sat Sep 10 12:33:42 2016 -0400

    Skip 'eject valid master' replication test

    There is a test to eject a valid master during replication that is
    failing more and more often in the scenario tests. Basically the
    eject will only work if the heartbeat from the current master is
    more than a minute old. This shouldn't happen in the scenario
    test run, but it does - quite often.

    Since this is consuming a large amount of gate resources, and the
    bug isn't that onerous (but is probably hard to fix), the current
    'eject valid master' test will be turned off until the fix lands.

    The scenario test will print out the bug number during each run
    as a reminder (using the new SkipKnownBug method).

    Co-Authored-By: Peter Stachowski <email address hidden>
    Co-Authored-By: Amrith Kumar <email address hidden>
    Author: Peter Stachowski <email address hidden>
    Change-Id: Ia543da551ad4394d4964541f9db474e0792b9337
    Related-Bug: #1622014
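
The idea of skipping a test while printing the bug number as a reminder can be sketched with the standard library's `unittest`. This is only an illustration and not the SkipKnownBug code from the commit: the `KnownBugSkip` class and the `ReplicationTests` test case are hypothetical names, and only the bug number comes from this report.

```python
import unittest


class KnownBugSkip(unittest.SkipTest):
    """Hypothetical skip exception that carries bug references."""

    def __init__(self, *bugs):
        refs = "; ".join("#%s" % bug for bug in bugs)
        super().__init__("Known bug: %s" % refs)


class ReplicationTests(unittest.TestCase):
    def test_eject_valid_master(self):
        # Skip instead of failing, so the run reports the bug number.
        raise KnownBugSkip(1622014)


# Run the test case and collect the skip reasons.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ReplicationTests)
result = unittest.TestResult()
suite.run(result)
skip_reasons = [reason for _test, reason in result.skipped]
print(skip_reasons)
```

Because the skip reason is recorded in the test result, every run shows the bug reference without counting the test as a failure.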

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to trove (stable/liberty)

Related fix proposed to branch: stable/liberty
Review: https://review.openstack.org/368433

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to trove (stable/mitaka)

Related fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/368435

OpenStack Infra (hudson-openstack) wrote : Related fix merged to trove (stable/mitaka)

Reviewed: https://review.openstack.org/368435
Committed: https://git.openstack.org/cgit/openstack/trove/commit/?id=b5a426243f85353568c07178c897de65556eacb8
Submitter: Jenkins
Branch: stable/mitaka

commit b5a426243f85353568c07178c897de65556eacb8
Author: Amrith Kumar <email address hidden>
Date: Sat Sep 10 12:33:42 2016 -0400

    Skip 'eject valid master' replication test

    There is a test to eject a valid master during replication that is
    failing more and more often in the scenario tests. Basically the
    eject will only work if the heartbeat from the current master is
    more than a minute old. This shouldn't happen in the scenario
    test run, but it does - quite often.

    Since this is consuming a large amount of gate resources, and the
    bug isn't that onerous (but is probably hard to fix), the current
    'eject valid master' test will be turned off until the fix lands.

    The scenario test will print out the bug number during each run
    as a reminder (using the new SkipKnownBug method).

    Co-Authored-By: Peter Stachowski <email address hidden>
    Co-Authored-By: Amrith Kumar <email address hidden>
    Author: Peter Stachowski <email address hidden>
    Change-Id: Ia543da551ad4394d4964541f9db474e0792b9337
    Related-Bug: #1622014
    (cherry picked from commit 7d8d743d8e168f64fa893c975d7301507d4985d8)
    Conflicts:
     trove/tests/scenario/runners/replication_runners.py
     trove/tests/scenario/runners/test_runners.py

tags: added: in-stable-mitaka
tags: added: in-stable-liberty
OpenStack Infra (hudson-openstack) wrote : Related fix merged to trove (stable/liberty)

Reviewed: https://review.openstack.org/368433
Committed: https://git.openstack.org/cgit/openstack/trove/commit/?id=c9b3c49cf1994fa2d728a99793ac6b82888de2ae
Submitter: Jenkins
Branch: stable/liberty

commit c9b3c49cf1994fa2d728a99793ac6b82888de2ae
Author: Amrith Kumar <email address hidden>
Date: Sat Sep 10 12:33:42 2016 -0400

    Skip 'eject valid master' replication test

    There is a test to eject a valid master during replication that is
    failing more and more often in the scenario tests. Basically the
    eject will only work if the heartbeat from the current master is
    more than a minute old. This shouldn't happen in the scenario
    test run, but it does - quite often.

    Since this is consuming a large amount of gate resources, and the
    bug isn't that onerous (but is probably hard to fix), the current
    'eject valid master' test will be turned off until the fix lands.

    The scenario test will print out the bug number during each run
    as a reminder (using the new SkipKnownBug method).

    Co-Authored-By: Peter Stachowski <email address hidden>
    Co-Authored-By: Amrith Kumar <email address hidden>
    Author: Peter Stachowski <email address hidden>
    Change-Id: Ia543da551ad4394d4964541f9db474e0792b9337
    Related-Bug: #1622014
    (cherry picked from commit 7d8d743d8e168f64fa893c975d7301507d4985d8)
    Conflicts:
     trove/tests/scenario/runners/replication_runners.py
     trove/tests/scenario/runners/test_runners.py

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to trove (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/374249

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to trove (stable/newton)

Related fix proposed to branch: stable/newton
Review: https://review.openstack.org/374320

OpenStack Infra (hudson-openstack) wrote : Change abandoned on trove (stable/newton)

Change abandoned by Peter Stachowski (<email address hidden>) on branch: stable/newton
Review: https://review.openstack.org/374320

OpenStack Infra (hudson-openstack) wrote : Related fix merged to trove (master)

Reviewed: https://review.openstack.org/374249
Committed: https://git.openstack.org/cgit/openstack/trove/commit/?id=e89dfbd51d4cad01a2ad701e9300ccfe6e7eac5a
Submitter: Jenkins
Branch: master

commit e89dfbd51d4cad01a2ad701e9300ccfe6e7eac5a
Author: Peter Stachowski <email address hidden>
Date: Wed Sep 21 15:04:28 2016 +0000

    Skip 'eject valid master' replication test

    This test was skipped in the scenario runs however it was
    not disabled in the original api tests. Although it
    doesn't fail as often there it still happens, so the
    same change has been applied to it.

    See: https://review.openstack.org/#/c/368230

    The test will also print out the bug number during each run
    as a reminder (using the new SkipKnownBug method).

    Change-Id: I931d6d72a70cc93dcd8248d9840eadf1160b9bab
    Related-Bug: #1622014

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to trove (stable/newton)

Related fix proposed to branch: stable/newton
Review: https://review.openstack.org/375087

OpenStack Infra (hudson-openstack) wrote : Change abandoned on trove (stable/newton)

Change abandoned by Davanum Srinivas (dims) (<email address hidden>) on branch: stable/newton
Review: https://review.openstack.org/375087
Reason: Looks like Peter got to it first https://review.openstack.org/#/c/374320/

OpenStack Infra (hudson-openstack) wrote : Related fix merged to trove (stable/newton)

Reviewed: https://review.openstack.org/374320
Committed: https://git.openstack.org/cgit/openstack/trove/commit/?id=345a7eb196ab0b9fcfd55c2c5b6f573d9b4e2caa
Submitter: Jenkins
Branch: stable/newton

commit 345a7eb196ab0b9fcfd55c2c5b6f573d9b4e2caa
Author: Peter Stachowski <email address hidden>
Date: Wed Sep 21 15:04:28 2016 +0000

    Skip 'eject valid master' replication test

    This test was skipped in the scenario runs however it was
    not disabled in the original api tests. Although it
    doesn't fail as often there it still happens, so the
    same change has been applied to it.

    See: https://review.openstack.org/#/c/368230

    The test will also print out the bug number during each run
    as a reminder (using the new SkipKnownBug method).

    Change-Id: I931d6d72a70cc93dcd8248d9840eadf1160b9bab
    Related-Bug: #1622014
    (cherry picked from commit e89dfbd51d4cad01a2ad701e9300ccfe6e7eac5a)

tags: added: in-stable-newton