test_create_ebs_image_and_check_boot is race failing in cells v1 job

Bug #1489581 reported by Matt Riedemann on 2015-08-27
22
This bug affects 4 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Medium
Unassigned
tempest
High
Andrey Pavlov
Changed in tempest:
assignee: nobody → Andrey Pavlov (apavlov-e)
status: New → In Progress
Matt Riedemann (mriedem) wrote :

Adding nova as a low priority since in the attach of the snapshot bdm here:

http://git.openstack.org/cgit/openstack/nova/tree/nova/virt/block_device.py#n329

nova could be checking the status of the snapshot and waiting for it to be available (or timeout), like it does with the wait_fun in that method for checking when the volume it created is available before returning from the attach method.

Changed in tempest:
importance: Undecided → High
Changed in nova:
importance: Undecided → Low
status: New → Triaged
tags: added: volumes

Reviewed: https://review.openstack.org/217804
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=d35957b02b81536d807da68448b7b9e0a6e5a6fa
Submitter: Jenkins
Branch: master

commit d35957b02b81536d807da68448b7b9e0a6e5a6fa
Author: Andrey Pavlov <email address hidden>
Date: Thu Aug 27 20:12:15 2015 +0300

    add waiting for snapshot available status

    In review https://review.openstack.org/#/c/213621/13
    I've made a bug for some CI systems.

    Code creates an image, but after, it doesn't wait that the volume
    snapshot is in status "available". If the snapshot is in state "creating"
    for let's say 30sec, the server created L200 will fail to boot with message:
    InvalidInput: Invalid input received: Invalid snapshot:
     Originating snapshot status must be one of'available' values (HTTP 400)

    Change-Id: I409eb77475ec3662f395fe0b52c5e03308c48973
    Closes-Bug: #1489581

Changed in tempest:
status: In Progress → Fix Released
Sean Dague (sdague) on 2015-08-31
Changed in nova:
importance: Low → Critical

Reviewed: https://review.openstack.org/218871
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=ebaf245369c28f6571dc317ec60272022cfba761
Submitter: Jenkins
Branch: master

commit ebaf245369c28f6571dc317ec60272022cfba761
Author: Matt Riedemann <email address hidden>
Date: Mon Aug 31 07:12:51 2015 -0700

    Skip ebs scenario test due to bug 1489581

    test_create_ebs_image_and_check_boot race fails frequently in
    the cells and ceph jobs, so we need to skip it for now until
    people can dig into what the failures are.

    Change-Id: Id077a68a9cd274e5b68a36c8eb1032632ceaedc5
    Related-Bug: #1489581

Matt,
As I understand this bug has some inconsistencies.
Title related to https://review.openstack.org/#/c/218450/
But description and fix in tempest related to this bug.

I see that mega query doesn't show errors now. But I don't know -
errors disapper from skip in tempest or after Andrew's review has merged.

What we can do with your 'skip' in tempest right now?

Jordan Pittier (jordan-pittier) wrote :

Can we close this bug and unskip the test now ? I think we should.

Matt Riedemann (mriedem) on 2015-09-10
Changed in nova:
importance: Critical → Medium

Reviewed: https://review.openstack.org/230937
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=35fd828aad02456f9d95f95123c1fcd3d54c1f56
Submitter: Jenkins
Branch: master

commit 35fd828aad02456f9d95f95123c1fcd3d54c1f56
Author: Andrey Pavlov <email address hidden>
Date: Mon Oct 5 13:32:13 2015 +0300

    Remove skip decorator for volume scenario test

    I think that this skip is not actual now. Main issue of this skip
    was fixed in another bug.

    Removing this skip decrator allow to check additional scenario
    of server's booting.

    Change-Id: I1fadc46a022e698eb96b88594325e4d9b86fe901
    Closes-Bug: #1489581

Related fix proposed to branch: master
Review: https://review.openstack.org/241270

Change abandoned by Sylvain Bauza (<email address hidden>) on branch: master
Review: https://review.openstack.org/241258
Reason: Given the cells failures rate, I think it's okay to not skip the tests, but rather recheck instead

When this fails, there are duplicate bdm entries in the image metadata, with 2 entries that have boot_index=0:

http://logs.openstack.org/87/188487/39/check/gate-tempest-dsvm-cells/6a75b7f/logs/tempest.txt.gz#_2015-11-03_14_18_29_118

2015-11-03 14:18:29.118 29774 DEBUG tempest_lib.common.rest_client [req-f7d6b621-e091-4208-8e50-3f2a80eeec94 ] Request - Headers: {'X-Auth-Token': '<omitted>', 'Accept': 'application/json', 'Content-Type': 'application/json'}
        Body: None
    Response - Headers: {'x-image-meta-min_disk': '0', 'connection': 'close', 'x-image-meta-property-block_device_mapping': '[{"guest_format": null, "boot_index": 0, "delete_on_termination": true, "no_device": null, "snapshot_id": "5b7fdd96-17eb-42bd-9e06-71816c44d858", "device_name": "/dev/vda", "disk_bus": "virtio", "image_id": null, "source_type": "snapshot", "device_type": "disk", "volume_id": null, "destination_type": "volume", "volume_size": 1}, {"guest_format": null, "boot_index": 0, "delete_on_termination": true, "no_device": null, "snapshot_id": "6855de29-05a8-4fd0-875c-456cbeac86aa", "device_name": "/dev/vda", "disk_bus": "virtio", "image_id": null, "source_type": "snapshot", "device_type": "disk", "volume_id": null, "destination_type": "volume", "volume_size": 1}]', 'x-image-meta-is_public': 'False', 'x-image-meta-created_at': '2015-11-03T14:18:29.000000', 'x-image-meta-size': '0', 'x-image-meta-owner': 'd7a6c1dd8f544a6d988c690457746387', 'x-image-meta-min_ram': '0', 'x-image-meta-protected': 'False', 'date': 'Tue, 03 Nov 2015 14:18:29 GMT', 'x-image-meta-property-kernel_id': '45097725-7d8c-4592-b7a3-1fcf09c828ab', 'x-image-meta-property-bdm_v2': 'True', 'x-image-meta-id': '7d649ce7-8c34-4ab2-9aa9-8f28a93677ea', 'content-length': '0', 'x-openstack-request-id': 'req-f7d6b621-e091-4208-8e50-3f2a80eeec94', 'content-type': 'text/html; charset=UTF-8', 'x-image-meta-name': 'tempest-image-2038782105', 'status': '200', 'x-image-meta-property-root_device_name': '/dev/vda', 'x-image-meta-updated_at': '2015-11-03T14:18:29.000000', 'x-image-meta-deleted': 'False', 'content-location': 'http://127.0.0.1:9292/v1/images/7d649ce7-8c34-4ab2-9aa9-8f28a93677ea', 'x-image-meta-status': 'active', 'x-image-meta-property-ramdisk_id': 'd920dfcd-0ebe-48d4-b084-f787b644d85f'}
        Body: _log_request_full /opt/stack/new/tempest/.tox/all/local/lib/python2.7/site-packages/tempest_lib/common/rest_client.py:414

Matt Riedemann (mriedem) on 2015-11-03
tags: added: cells
Changed in nova:
status: Triaged → Confirmed

Related fix proposed to branch: master
Review: https://review.openstack.org/241390

Related fix proposed to branch: master
Review: https://review.openstack.org/241397

Matt Riedemann (mriedem) on 2015-11-03
Changed in nova:
importance: Medium → High

Reviewed: https://review.openstack.org/241270
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a084dbf9f07de2ea903ccd8aa84c60013aaa4193
Submitter: Jenkins
Branch: master

commit a084dbf9f07de2ea903ccd8aa84c60013aaa4193
Author: Matt Riedemann <email address hidden>
Date: Tue Nov 3 08:08:07 2015 -0800

    Add debug logging for when boot sequence is invalid in _validate_bdm

    The cells job is failing to boot from volume with a bdm v2 request like:

    "block_device_mapping_v2": [{
        "destination_type": "volume",
        "boot_index": 0,
        "uuid": "553083ac-40f2-4225-ac4d-a1d021eb7fb1",
        "source_type": "volume",
        "delete_on_termination": true
    }]

    Given it seems there should only be one BDM in the list and it's boot
    index is 0, we shouldn't fail on the boot sequence validation, so add
    some debug logging when we hit that failure.

    Also copy some wording out of the 'Block Device Mapping in Nova' section
    of the devref to add as code comments for the validation happening so we
    have context.

    TODO(mriedem): track down ndipanov to figure out why _subsequent_list
    omits the last element of the list that it's processing, we need to doc
    that logic in the code for maintainability.

    Change-Id: I8adc94f3c93c149689fbef424665fb44ac573819
    Related-Bug: #1489581

Download full text (4.3 KiB)

Confirmed that the issue is when nova is asked to create a snapshot of the instance with the volume attached, it's actually creating 2 volume snapshots, and those are the bdms that go into the image snapshot metadata. Then later when we boot the instance from that image, it has 2 bdms in the image metadata both with boot_index=0 and that fails.

So we're probably hitting cells.messaging.bdm_update_or_create_at_top twice and creating a new bdm each time, even though we should be updating the second time, not doing a create.

Notes from logs:

first server id: 2c9cecc1-c3db-4057-81bd-98e488c45ac2

snapshot 1 request to cinder:

http://logs.openstack.org/66/241366/1/check/gate-tempest-dsvm-cells/1d7551e/logs/screen-c-api.txt.gz#_2015-11-03_21_45_08_411

2015-11-03 21:45:08.411 INFO cinder.volume.api [req-db073638-35db-46c5-a358-d832113fc3ef None] Snapshot(cgsnapshot=<?>,cgsnapshot_id=None,created_at=2015-11-03T21:45:08Z,deleted=False,deleted_at=None,display_description=None,display_name='snapshot for tempest-image-1751811273',encryption_key_id=None,id=9b4fe063-6427-4000-a286-7369a7677a17,metadata={},progress='0%',project_id=a86353192f7b42539f247f284249e0bf,provider_auth=None,provider_id=None,provider_location=None,status='creating',updated_at=None,user_id=2e821b0f84a9402d91463d8a7a2d4de7,volume=<?>,volume_id=35909d21-81b8-4fda-82b6-d3d75be61238,volume_size=1,volume_type_id=09c3a4dd-e7b1-4cd1-a03d-809216ba9b44)Snapshot force create request issued successfully.

snapshot 2 request to cinder (same volume_id):

http://logs.openstack.org/66/241366/1/check/gate-tempest-dsvm-cells/1d7551e/logs/screen-c-api.txt.gz#_2015-11-03_21_45_08_925

2015-11-03 21:45:08.925 INFO cinder.volume.api [req-eebfe355-197d-4e01-a946-41f05f733884 None] Snapshot(cgsnapshot=<?>,cgsnapshot_id=None,created_at=2015-11-03T21:45:08Z,deleted=False,deleted_at=None,display_description=None,display_name='snapshot for tempest-image-1751811273',encryption_key_id=None,id=f01c94bc-ddac-4687-8620-41eeacf57b0f,metadata={},progress='0%',project_id=a86353192f7b42539f247f284249e0bf,provider_auth=None,provider_id=None,provider_location=None,status='creating',updated_at=None,user_id=2e821b0f84a9402d91463d8a7a2d4de7,volume=<?>,volume_id=35909d21-81b8-4fda-82b6-d3d75be61238,volume_size=1,volume_type_id=09c3a4dd-e7b1-4cd1-a03d-809216ba9b44)Snapshot force create request issued successfully.

--

image snapshot get response:

http://logs.openstack.org/66/241366/1/check/gate-tempest-dsvm-cells/1d7551e/console.html#_2015-11-03_21_50_38_475

'[{"guest_format": null, "boot_index": 0, "delete_on_termination": true, "no_device": null, "snapshot_id": "9b4fe063-6427-4000-a286-7369a7677a17", "device_name": "/dev/vda", "disk_bus": "virtio", "image_id": null, "source_type": "snapshot", "device_type": "disk", "volume_id": null, "destination_type": "volume", "volume_size": 1}, {"guest_format": null, "boot_index": 0, "delete_on_termination": true, "no_device": null, "snapshot_id": "f01c94bc-ddac-4687-8620-41eeacf57b0f", "device_name": "/dev/vda", "disk_bus": "virtio", "image_id": null, "source_type": "snapshot", "device_type": "disk", "volume_id": null, "destination_type": "volume", "volume_size": 1}]'

400 on in...

Read more...

summary: - test_create_ebs_image_and_check_boot is race failing
+ test_create_ebs_image_and_check_boot is race failing in cells job

Reviewed: https://review.openstack.org/241366
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=86fe90f7056432416ea3c2335ea8c2ad6e16b79a
Submitter: Jenkins
Branch: master

commit 86fe90f7056432416ea3c2335ea8c2ad6e16b79a
Author: Matt Riedemann <email address hidden>
Date: Tue Nov 3 12:58:17 2015 -0800

    cells: add debug logging to bdm_update_or_create_at_top

    The method is racy and we need to know which conditions we're hitting
    when we fail, which is happening sometimes with the
    test_create_ebs_image_and_check_boot test in Tempest.

    We fail because the image snapshot metadata has duplicate BDM entries in
    it so that when we boot an instance from that image, there are multiple
    BDMs with the same boot_index and nova.compute.api.API._validate_bdm
    throws up.

    Related-Bug: #1489581

    Change-Id: I8058e0c16e0fd9b54d4a46e86e0fdede3da55777

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/241397
Reason: This isn't going anywhere.

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Confirmed → In Progress

Fix proposed to branch: master
Review: https://review.openstack.org/242447

Changed in nova:
assignee: Matt Riedemann (mriedem) → Sylvain Bauza (sylvain-bauza)
Changed in nova:
assignee: Sylvain Bauza (sylvain-bauza) → Dan Smith (danms)

Fix proposed to branch: master
Review: https://review.openstack.org/242649

Changed in nova:
assignee: Dan Smith (danms) → Matt Riedemann (mriedem)
Changed in nova:
assignee: Matt Riedemann (mriedem) → Dan Smith (danms)

Reviewed: https://review.openstack.org/241258
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=fa203c3165d4ac61e1c191e6d3c6db54778c3ed7
Submitter: Jenkins
Branch: master

commit fa203c3165d4ac61e1c191e6d3c6db54778c3ed7
Author: Sylvain Bauza <email address hidden>
Date: Tue Nov 3 16:45:17 2015 +0100

    Exclude all BDM checks for cells

    Since there are some race conditions with the EBS image checking test,
    it's better to remove the entire Tempest class tests just until we
    figure out what's wrong with the boot from volume stuff

    Change-Id: Ice6a718b6d7b27505696a3c55fc02d44d4b187a7
    Related-Bug: #1489581

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/241742
Reason: Focusing on this change now:

https://review.openstack.org/#/c/242649/

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/242517
Reason: This is not really helping, focusing on this now:

https://review.openstack.org/#/c/242649/

Related fix proposed to branch: master
Review: https://review.openstack.org/242698

Changed in nova:
assignee: Dan Smith (danms) → Matt Riedemann (mriedem)

Reviewed: https://review.openstack.org/242756
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=cc19164064ab5ccefd5532a357e2ebd09f22fd4f
Submitter: Jenkins
Branch: stable/liberty

commit cc19164064ab5ccefd5532a357e2ebd09f22fd4f
Author: Sylvain Bauza <email address hidden>
Date: Tue Nov 3 16:45:17 2015 +0100

    Exclude all BDM checks for cells

    Since there are some race conditions with the EBS image checking test,
    it's better to remove the entire Tempest class tests just until we
    figure out what's wrong with the boot from volume stuff

    Change-Id: Ice6a718b6d7b27505696a3c55fc02d44d4b187a7
    Related-Bug: #1489581
    (cherry picked from commit fa203c3165d4ac61e1c191e6d3c6db54778c3ed7)

tags: added: in-stable-liberty
Changed in nova:
assignee: Matt Riedemann (mriedem) → Dan Smith (danms)
Changed in nova:
assignee: Dan Smith (danms) → Matt Riedemann (mriedem)
Changed in nova:
assignee: Matt Riedemann (mriedem) → Dan Smith (danms)
Changed in nova:
assignee: Dan Smith (danms) → Matt Riedemann (mriedem)
Changed in nova:
assignee: Matt Riedemann (mriedem) → Dan Smith (danms)
Changed in nova:
assignee: Dan Smith (danms) → Matt Riedemann (mriedem)

Reviewed: https://review.openstack.org/241739
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=982bc73f4a4f18ab1a0917e12218d697f9b665a1
Submitter: Jenkins
Branch: master

commit 982bc73f4a4f18ab1a0917e12218d697f9b665a1
Author: Matt Riedemann <email address hidden>
Date: Wed Nov 4 11:48:30 2015 -0800

    Add logging to snapshot_volume_backed method

    The cells job is failing on a race where an image is created
    from an instance with a single volume attached, but we have
    multiple (duplicate) BDMs for the instance so we end up with
    multiple volume snapshots for the image, which later fails a
    boot request using the snapshot image because the image has
    duplicate BDMs with the same boot_index, which is invalid in
    the request.

    This just adds some debug logging when we actually call off to
    cinder to create a volume snapshot so we have the request
    context and instance uuid in the logs for tracing.

    Change-Id: Iba8ad7cf40ba076e90abc8604187700166795c5d
    Related-Bug: #1489581

Change abandoned by Sylvain Bauza (<email address hidden>) on branch: master
Review: https://review.openstack.org/242447
Reason: Well all of that is f**c'd up

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/242698
Reason: We decided in the nova meeting a few weeks ago not to try and address latent bugs in cells v1 exposed by new tests, so we'll drop this and just keep skipping the test.

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/242649
Reason: We decided in the nova meeting a few weeks ago not to try and address latent bugs in cells v1 exposed by new tests, so we'll drop this and just keep skipping the test.

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/241390
Reason: meh, don't think we care about this much anymore

Marking this as fix released for nova since we just skipped the test for the cells job and agreed that it's going to be a latent bug for cells v1 that we aren't going to fix.

Changed in nova:
status: In Progress → Fix Released
status: Fix Released → Won't Fix

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/242603
Reason: This patch has been idle for a long time, so I am abandoning it to keep the review clean sane. If you're interested in still working on this patch, then please unabandon it and upload a new patchset.

Change abandoned by Dan Smith (<email address hidden>) on branch: master
Review: https://review.openstack.org/242602

Changed in nova:
assignee: Matt Riedemann (mriedem) → Lee Yarwood (lyarwood)
status: Won't Fix → In Progress

Patch in merge conflict, not in progress

Changed in nova:
importance: High → Low
status: In Progress → Confirmed
assignee: Lee Yarwood (lyarwood) → nobody
importance: Low → Medium
Changed in nova:
assignee: nobody → Jay Pipes (jaypipes)
status: Confirmed → In Progress
Changed in nova:
assignee: Jay Pipes (jaypipes) → Lee Yarwood (lyarwood)

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/242603
Reason: This review is > 4 weeks without comment, and is not mergable in it's current state. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/242602
Reason: This review is > 4 weeks without comment, and is not mergable in it's current state. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

There are no currently open reviews on this bug, changing the status back to the previous state and unassigning. If there are active reviews related to this bug, please include links in comments.

Changed in nova:
status: In Progress → Confirmed
assignee: Lee Yarwood (lyarwood) → nobody
Matt Riedemann (mriedem) wrote :

This is considered a latent bug in cells v1 message processing and given the state of cells v1 and how it's basically only getting critical fixes at this point, I'm just going to mark this as won't fix.

Changed in nova:
status: Confirmed → Won't Fix
summary: - test_create_ebs_image_and_check_boot is race failing in cells job
+ test_create_ebs_image_and_check_boot is race failing in cells v1 job
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers