intermittent functional test failures related to scrubber

Bug #1768077 reported by Brian Rosmaita
This bug affects 1 person
Affects: Glance
Status: Fix Released
Importance: High
Assigned to: Brian Rosmaita
Milestone: (none listed)

Bug Description

I'm not completely sure that the scrubber tests are the culprit, but they seem to be a common factor.

We're seeing different failures in the functional (py27) tests and the functional-py35 tests.

(1) py27
"AssertionError: unexpected error occurred in glance-scrubber" happening in wait_for_scrubber_shutdown

Observed on:
- https://review.openstack.org/#/c/564077
- https://review.openstack.org/#/c/564883

See http://paste.openstack.org/show/720153/ (not sure how long the logs will be kept).

(2) py35
This one is trickier because the failure is causing the subunit parser to crash during test runs, so there are several tests failing, but it looks like the scrubber tests are the constant factor.

Observed on:
- https://review.openstack.org/#/c/564649/
- https://review.openstack.org/#/c/564077/
- https://review.openstack.org/#/c/554174/

See http://paste.openstack.org/show/720155/ for list of tests failing when this happens.
See http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2018-04-27.log.html#t2018-04-27T17:20:55 for discussion with clarkb about these failures.

Tags: testing
Brian Rosmaita (brian-rosmaita) wrote :

Have not assigned an importance until we see how often these occur.

Brian Rosmaita (brian-rosmaita) wrote :

This seems to be happening a lot.

Changed in glance:
importance: Undecided → High
Brian Rosmaita (brian-rosmaita) wrote :

Here's a patch that skips the tests that call wait_for_scrubber_shutdown. Ran it a bunch of times and the functional and functional-py35 tests consistently passed. So it looks like that function is a problem.

https://review.openstack.org/#/c/566117/

Brian Rosmaita (brian-rosmaita) wrote :

Abhishek has a patch taking a stab at a fix: https://review.openstack.org/#/c/566262/

Brian Rosmaita (brian-rosmaita) wrote :

I think the problem here is that in order to test the scrubber restore, there can only be one instance of the scrubber running, and in the current functional test environment, with tests running in parallel, you can't guarantee that. So wait_for_scrubber_shutdown just keeps retrying the test until the command goes through.

I think a better way to do this would be to run these tests in serial mode. I think you can do this by putting the tests that can't be run in a concurrent environment into their own test class. Then we'd run the functional tests normally, but pass an argument to stestr blacklisting the serial tests. After the "regular" functional tests have run, a second command would run only the serial class. From the docs, it looks like there's a way to tell stestr to combine the output from the two runs into a single result.

Matt Treinish is the author of stestr, so maybe we can reach out to him for help.
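
To make the two-pass idea above concrete, here is a minimal Python sketch. It is not taken from any of the patches linked in this report. The package path in SERIAL_PATTERN and the test ids in the usage note are hypothetical, and the flag names (--exclude-regex, --serial, --combine) are the spellings used by recent stestr releases; older releases used --black-regex for the exclusion, so check them against the installed version.

```python
import re

# Hypothetical package path for the tests that must not run concurrently.
SERIAL_PATTERN = r"glance\.tests\.functional\.serial"


def split_tests(test_ids, serial_pattern=SERIAL_PATTERN):
    """Partition test ids into (parallel, serial) lists by regex match."""
    matcher = re.compile(serial_pattern)
    parallel, serial = [], []
    for test_id in test_ids:
        if matcher.search(test_id):
            serial.append(test_id)
        else:
            parallel.append(test_id)
    return parallel, serial


def stestr_commands(serial_pattern=SERIAL_PATTERN):
    """Build the two stestr invocations as argument lists (nothing is run).

    The first pass runs everything except the serial tests with the
    default parallel workers. The second pass runs only the serial
    tests, one at a time, and asks stestr to combine its results with
    the previous run.
    """
    parallel_run = ["stestr", "run", "--exclude-regex", serial_pattern]
    serial_run = ["stestr", "run", "--serial", "--combine", serial_pattern]
    return parallel_run, serial_run
```

Calling split_tests on a list containing one id under the assumed serial package and one id elsewhere returns them in separate lists, which is the same partition the two command lines from stestr_commands would apply to the real test suite.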

Abhishek Kekane (abhishek-kekane) wrote :

wangxiyuan has proposed a patch that takes the serial approach: https://review.openstack.org/#/c/566206/

Brian Rosmaita (brian-rosmaita) wrote :

I have a patch up, see what you think: https://review.openstack.org/#/c/566681/

Changed in glance:
assignee: nobody → Brian Rosmaita (brian-rosmaita)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on glance (master)

Change abandoned by Brian Rosmaita (<email address hidden>) on branch: master
Review: https://review.openstack.org/566117
Reason: using https://review.openstack.org/#/c/566681/ instead

OpenStack Infra (hudson-openstack) wrote : Fix merged to glance (master)

Reviewed: https://review.openstack.org/566681
Committed: https://git.openstack.org/cgit/openstack/glance/commit/?id=189ca47598b62478256cf8f0dbc6631ffb0d11d2
Submitter: Zuul
Branch: master

commit 189ca47598b62478256cf8f0dbc6631ffb0d11d2
Author: Brian Rosmaita <email address hidden>
Date: Mon May 7 12:26:03 2018 -0400

    Run scrubber functional tests in serial mode

    The current scrubber functional tests seem to be confusing the subunit parser.
    This patch modifies the functional test definition in tox.ini to run serial
    tests separately from the "regular" functional tests and moves the current
    scrubber tests to the 'serial' directory

    Change-Id: I041c90aa8854bca30f9ea7b0c9d81e41f79cb81e
    Partial-bug: #1768077

Brian Rosmaita (brian-rosmaita) wrote :

Here's the follow-up patch that refactors the wait_for_scrubber_shutdown function: https://review.openstack.org/#/c/566947/

I ran it through the functional test gates 10 times trying to see if an intermittent failure would occur, and it passed every time (though that may be due to the other patch that's now running the scrubber tests in serial mode).

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to glance (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/571286

OpenStack Infra (hudson-openstack) wrote : Fix merged to glance (master)

Reviewed: https://review.openstack.org/566947
Committed: https://git.openstack.org/cgit/openstack/glance/commit/?id=d501960d7e8474927ebd6d6abc52350a0f23275a
Submitter: Zuul
Branch: master

commit d501960d7e8474927ebd6d6abc52350a0f23275a
Author: Brian Rosmaita <email address hidden>
Date: Tue May 8 12:54:24 2018 -0400

    Refactor wait_for_scrubber_shutdown function

    The wait_for_scrubber_shutdown function catches AssertionErrors
    and retries. Refactor it so that it only retries for a specific
    error (namely, when glance scrubber is running in daemon mode when
    the test is being executed) by changing the function so that it
    returns values instead of relying on test assertions in the
    function passed to it. Also refactor the tests that call
    wait_for_scrubber_shutdown so that they check the returned results
    instead of using assertions in the function they pass to
    wait_for_scrubber_shutdown.

    This is a follow-up to https://review.openstack.org/#/c/566681/

    Change-Id: I7108179e0d96e09638ff783b029a8216f0938c3b
    Closes-bug: #1768077
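
The refactoring described in the commit message above can be sketched roughly as follows. This is an illustration of the pattern only, not the merged code: the marker text in NOT_DOWN_MSG, the attempts, delay and sleep parameters, and the FakeScrubber helper are all assumptions made for the example. The point is that the wrapped function returns (exitcode, out, err) and the wrapper retries only when that result says a scrubber daemon is still running; any other result, including other failures, goes straight back to the caller to be checked there.

```python
import time

# Assumed marker text; the message the real scrubber prints may differ.
NOT_DOWN_MSG = "glance-scrubber is already running"


def wait_for_scrubber_shutdown(func, attempts=15, delay=1.0, sleep=time.sleep):
    """Call func() until it stops reporting a running scrubber daemon.

    func must return an (exitcode, out, err) tuple. Only the
    "daemon still running" result triggers a retry.
    """
    for _ in range(attempts):
        exitcode, out, err = func()
        if exitcode == 1 and NOT_DOWN_MSG in str(err):
            sleep(delay)
            continue
        return exitcode, out, err
    raise TimeoutError(
        "scrubber daemon still running after %d attempts" % attempts)


class FakeScrubber:
    """Hypothetical stand-in for the scrubber command, for illustration.

    Reports a running daemon for the first busy_times calls, then succeeds.
    """

    def __init__(self, busy_times):
        self.busy_times = busy_times
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.busy_times:
            return 1, "", NOT_DOWN_MSG
        return 0, "restored", ""
```

A test would then call wait_for_scrubber_shutdown(FakeScrubber(2)) and assert on the returned tuple itself, instead of placing assertions inside the function it passes in.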

Changed in glance:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/glance 17.0.0.0b3

This issue was fixed in the openstack/glance 17.0.0.0b3 development milestone.
