Bug #1884596 “image import copy-image will start multiple import...” : Bugs : Glance

Revision history for this message

Abhishek Kekane (abhishek-kekane) wrote on 2020-06-23:

#1

Download full text (6.7 KiB)

There is definitely race condition problem here, ran two different scenario and below is the observation;

Note; Image size 3 GB
Available stores; ceph:rbd,slow:file,fast:file
Scenario 1: Copy same image in two different stores using two different commands (may be 2 different users running this operation)
Steps to reproduce:
1. Create image in store fast
   $ glance image-create-via-import --container-format ami --disk-format ami --name copy-scenario-1 --file <image-file> --store fast
2. Ensure that image is active
   $ glance image-show <image-id-from-step-1> | grep status
3. Copy image in store slow
   $ glance image-import <image-id-from-step-1> --import-method copy-image --stores slow
   This will send immidiate 20X response to user and 'os_glance_importing_to_stores' will show 'slow'
   $ glance image-show <image-id-from-step-1> | grep os_glance_importing_to_stores
4. Copy image in ceph (Task 4d6e443e-4829-4e5d-9016-71a39fbcef5e)
   $ glance image-import <image-id-from-step-1> --import-method copy-image --stores ceph
     This will send immidiate 20X response to user and now 'os_glance_importing_to_stores' will overwrite 'slow' with 'ceph'

Observations with the help of g-api logs https://etherpad.opendev.org/p/glance-copy-to-store-race-scenario-logs

Step 3 Task Id is eefbd9c8-be47-4ba5-b0e2-9b44407d3234
Step 4 Task Id is 4d6e443e-4829-4e5d-9016-71a39fbcef5e

1st copy operation (copy image in slow store) will start copying file into staging area.
2nd copy operation (copy image in ceph store) will exit this step as image is already present in staging area.
2nd operation will start importing image in ceph (rbd) backend
For 1st operation, After importing image in slow store while removing it from 'os_glance_importing_to_stores' it will log below debug message;
Store slow not found in property os_glance_importing_to_stores.
At this moment image data is imported completely in slow store and it will delete the staging data from staging store (line #412), but location metadata is not updated yet.
Moment after Import task of 2nd operation of copying image-data to ceph store will also complete (line #899) and this operation will fail while deleting image data from staging area (line #920)
This will trigger revert task for 2nd operation and image-data will be deleted from ceph store only.
Image data will remain in 'slow' store but it doesn't show in image locations anymore.

Final Output:
Original image created in step 1, remains active and availble in store 'fast'
Staging area is clean
Data remians as orphan in 'slow' store (1st copy operation)

Note; Image size 1.5 GB
Available stores; ceph:rbd,slow:file,fast:file,common:file,cheap:file,reliable:file
Scenario 2: User 1 imports image in all stores with allow-failure as True and user 2 tries to copy that image in other store
Steps to reproduce:
1. Create image in all stores with allow-failure as True
   $ glance image-create-via-import --container-format ami --disk-format ami --name copy-scenario-1 --file <image-file> --all-stores True --allow-failure True
   As allow-failure is True, image status will be set to active as soon as it is imported to one of the store, say ceph
   At this moment, ...

There is definitely race condition problem here, ran two different scenario and below is the observation;

Note; Image size 3 GB
Available stores; ceph:rbd,slow:file,fast:file
Scenario 1: Copy same image in two different stores using two different commands (may be 2 different users running this operation)
Steps to reproduce:
1. Create image in store fast
   $ glance image-create-via-import --container-format ami --disk-format ami --name copy-scenario-1 --file <image-file> --store fast
2. Ensure that image is active
   $ glance image-show <image-id-from-step-1> | grep status
3. Copy image in store slow
   $ glance image-import <image-id-from-step-1> --import-method copy-image --stores slow
   This will send immidiate 20X response to user and 'os_glance_importing_to_stores' will show 'slow'
   $ glance image-show <image-id-from-step-1> | grep os_glance_importing_to_stores
4. Copy image in ceph (Task 4d6e443e-4829-4e5d-9016-71a39fbcef5e)
   $ glance image-import <image-id-from-step-1> --import-method copy-image --stores ceph
     This will send immidiate 20X response to user and now 'os_glance_importing_to_stores' will overwrite 'slow' with 'ceph'

Observations with the help of g-api logs https://etherpad.opendev.org/p/glance-copy-to-store-race-scenario-logs

Step 3 Task Id is eefbd9c8-be47-4ba5-b0e2-9b44407d3234
Step 4 Task Id is 4d6e443e-4829-4e5d-9016-71a39fbcef5e

1st copy operation (copy image in slow store) will start copying file into staging area.
2nd copy operation (copy image in ceph store) will exit this step as image is already present in staging area.
2nd operation will start importing image in ceph (rbd) backend
For 1st operation, After importing image in slow store while removing it from 'os_glance_importing_to_stores' it will log below debug message;
Store slow not found in property os_glance_importing_to_stores.
At this moment image data is imported completely in slow store and it will delete the staging data from staging store (line #412), but location metadata is not updated yet.
Moment after Import task of 2nd operation of copying image-data to ceph store will also complete (line #899) and this operation will fail while deleting image data from staging area (line #920)
This will trigger revert task for 2nd operation and image-data will be deleted from ceph store only.
Image data will remain in 'slow' store but it doesn't show in image locations anymore.

Final Output:
Original image created in step 1, remains active and availble in store 'fast'
Staging area is clean
Data remians as orphan in 'slow' store (1st copy operation)

Note; Image size 1.5 GB
Available stores; ceph:rbd,slow:file,fast:file,common:file,cheap:file,reliable:file
Scenario 2: User 1 imports image in all stores with allow-failure as True and user 2 tries to copy that image in other store 
Steps to reproduce:
1. Create image in all stores with allow-failure as True
   $ glance image-create-via-import --container-format ami --disk-format ami --name copy-scenario-1 --file <image-file> --all-stores True --allow-failure True
   As allow-failure is True, image status will be set to active as soon as it is imported to one of the store, say ceph
   At this moment, os_glance_importing_to_stores is showing slow,fast,common,cheap,reliable
2. Now another user sees that one image is active in ceph and he tries to copy it in common (he fails to observe that image import operation is still in progress)
3. Copy image in reliable
   $ glance image-import <image-id-from-step-1> --import-method copy-image --stores common

Observations with the help of g-api logs https://etherpad.opendev.org/p/glance-import-copy-to-store-race-scenario-logs

Step 1 Task Id is 5f9e841b-4a25-4303-ace1-d8387b42590b 
Step 3 Task Id is 8c5e4726-00fc-4856-a381-07fcd0c75824

1st import operation will complete importing image in ceph store (see line #465) and will start importing image in next store in queue (fast, see line #466)
2nd copy operation will exit staging as image is aleady present in staging and will start importing/copying data in common store (see line #517 to #520)
2nd Operation will delete the data from the staging area (see line #536)
Now as 1st operation is already running it will throw error as it does not found data in staging area to import in new store (see line #562),
but still allow-failure is true it will continue to next store and fails as well (line #564)
At line #598 from 1st operation Import task is marked as success and it goes to delete image from staging are task (see line #599)
as staging area is already cleared by 2nd operation (line #536) this task is marked as failed, task goes into revert state and data is deleted from all stores.

Final Output:
Image is active with no checksum and location (STRANGE)
os_glance_failed_import shows list of failed stores
Data remains orphan in one of the store (mostly in the store which we have initiated copy operation, i.e. common in our case)

Abhishek Kekane (abhishek-kekane) on 2020-06-23

Changed in glance:
assignee:	nobody → Abhishek Kekane (abhishek-kekane)
importance:	Undecided → Critical

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-06-24: Related fix proposed to glance (master)

#2

Related fix proposed to branch: master
Review: https://review.opendev.org/737868

OpenStack Infra (hudson-openstack) on 2020-06-30

Changed in glance:
status:	New → In Progress

Abhishek Kekane (abhishek-kekane) on 2020-07-02

summary:

- image import copy-to-store will start multiple importing threads due to
+ image import copy-image will start multiple importing threads due to
race condition

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-07-22: Related fix merged to glance (master)

#3

Reviewed: https://review.opendev.org/737868
Committed: https://git.openstack.org/cgit/openstack/glance/commit/?id=2a51843138e27071bf84269f6b2a601b3ba9978f
Submitter: Zuul
Branch: master

commit 2a51843138e27071bf84269f6b2a601b3ba9978f
Author: Dan Smith <email address hidden>
Date: Wed Jun 24 13:06:38 2020 -0700

Add image_set_property_atomic() helper

    This adds a new DB API method to atomically create a property on an image
    in a way that we can be sure it is created once and only once for the
    purposes of exclusion of multiple threads.

Change-Id: Ifdb711cb241ef13eccaa5ae29a234f2fe4a52eb8
Related-Bug: #1884596

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-08-21: Change abandoned on glance (master)

#4

Change abandoned by Abhishek Kekane (<email address hidden>) on branch: master
Review: https://review.opendev.org/737596
Reason: Abandoning against https://review.opendev.org/743597

Abhishek Kekane (abhishek-kekane) on 2020-08-24

Changed in glance:
assignee:	Abhishek Kekane (abhishek-kekane) → Dan Smith (danms)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-08-25: Fix merged to glance (master)

#5

Reviewed: https://review.opendev.org/743597
Committed: https://git.openstack.org/cgit/openstack/glance/commit/?id=3f6e349d0853a9746d0d744bc3eb0b2baa1ddff9
Submitter: Zuul
Branch: master

commit 3f6e349d0853a9746d0d744bc3eb0b2baa1ddff9
Author: Dan Smith <email address hidden>
Date: Tue Jul 28 09:02:13 2020 -0700

Implement time-limited import locking

    This attempts to provide a time-based import lock that is dependent
    on the task actually making progress. While the task is copying
    data, the task message is updated, which in turn touches the task
    updated_at time. The API will break any lock after 30 minutes of
    no activity on a stalled or dead task. The import taskflow will
    check to see if it has lost the lock at any point, and/or if its
    task status has changed and abort if so.

The logic in more detail:

    1. API locks the image by task-id before we start the task thread, but
       before we return
    2. Import thread will check the task-id lock on the image every time it
       tries to modify the image, and if it has changed, will abort
    3. The data pipeline will heartbeat the task every minute by updating
       the task.message (bonus: we get some status)
    4. If the data pipeline heartbeat ever finds the task state to be changed
       from the expected 'processing' it will abort
    5. On task revert or completion, we drop the task-id lock from the image
    6. If something ever gets stuck or dies, the heartbeating will stop
    7. If the API gets a request for an import where the lock is held, it
       will grab the task by id (in the lock) and check the state and age.
       If the age is sufficiently old (no heartbeating) and the state is
       either 'processing' or terminal, it will mark the task as failed,
       steal the lock, and proceed.

Lots of logging throughout any time we encounter unexpected situations.

Closes-Bug: #1884596
Change-Id: Icb3c1d27e9a514d96fca7c1d824fd2183f69d8b3

Reviewed:  https://review.opendev.org/743597
Committed: https://git.openstack.org/cgit/openstack/glance/commit/?id=3f6e349d0853a9746d0d744bc3eb0b2baa1ddff9
Submitter: Zuul
Branch:    master

commit 3f6e349d0853a9746d0d744bc3eb0b2baa1ddff9
Author: Dan Smith <dansmith@redhat.com>
Date:   Tue Jul 28 09:02:13 2020 -0700

Implement time-limited import locking
    
    This attempts to provide a time-based import lock that is dependent
    on the task actually making progress. While the task is copying
    data, the task message is updated, which in turn touches the task
    updated_at time. The API will break any lock after 30 minutes of
    no activity on a stalled or dead task. The import taskflow will
    check to see if it has lost the lock at any point, and/or if its
    task status has changed and abort if so.
    
    The logic in more detail:
    
    1. API locks the image by task-id before we start the task thread, but
       before we return
    2. Import thread will check the task-id lock on the image every time it
       tries to modify the image, and if it has changed, will abort
    3. The data pipeline will heartbeat the task every minute by updating
       the task.message (bonus: we get some status)
    4. If the data pipeline heartbeat ever finds the task state to be changed
       from the expected 'processing' it will abort
    5. On task revert or completion, we drop the task-id lock from the image
    6. If something ever gets stuck or dies, the heartbeating will stop
    7. If the API gets a request for an import where the lock is held, it
       will grab the task by id (in the lock) and check the state and age.
       If the age is sufficiently old (no heartbeating) and the state is
       either 'processing' or terminal, it will mark the task as failed,
       steal the lock, and proceed.
    
    Lots of logging throughout any time we encounter unexpected situations.
    
    Closes-Bug: #1884596
    Change-Id: Icb3c1d27e9a514d96fca7c1d824fd2183f69d8b3

Changed in glance:
status:	In Progress → Fix Released

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-08-25: Related fix proposed to glance (stable/ussuri)

#6

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/748007

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-08-25: Fix proposed to glance (stable/ussuri)

#7

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/748014

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-08-27: Related fix merged to glance (stable/ussuri)

#8

Reviewed: https://review.opendev.org/748007
Committed: https://git.openstack.org/cgit/openstack/glance/commit/?id=56d01f25442a77aadc168e9fa943a5cb809f4ceb
Submitter: Zuul
Branch: stable/ussuri

commit 56d01f25442a77aadc168e9fa943a5cb809f4ceb
Author: Dan Smith <email address hidden>
Date: Wed Jun 24 13:06:38 2020 -0700

Add image_set_property_atomic() helper

    This adds a new DB API method to atomically create a property on an image
    in a way that we can be sure it is created once and only once for the
    purposes of exclusion of multiple threads.

    Change-Id: Ifdb711cb241ef13eccaa5ae29a234f2fe4a52eb8
    Related-Bug: #1884596
    (cherry picked from commit 2a51843138e27071bf84269f6b2a601b3ba9978f)

tags:

added: in-stable-ussuri

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-08-28: Fix merged to glance (stable/ussuri)

#9

Reviewed: https://review.opendev.org/748014
Committed: https://git.openstack.org/cgit/openstack/glance/commit/?id=055e5e790272cf6a636b07a5ff0c4c2b59351fe7
Submitter: Zuul
Branch: stable/ussuri

commit 055e5e790272cf6a636b07a5ff0c4c2b59351fe7
Author: Dan Smith <email address hidden>
Date: Tue Jul 28 09:02:13 2020 -0700

Implement time-limited import locking

    This attempts to provide a time-based import lock that is dependent
    on the task actually making progress. While the task is copying
    data, the task message is updated, which in turn touches the task
    updated_at time. The API will break any lock after 30 minutes of
    no activity on a stalled or dead task. The import taskflow will
    check to see if it has lost the lock at any point, and/or if its
    task status has changed and abort if so.

The logic in more detail:

    1. API locks the image by task-id before we start the task thread, but
       before we return
    2. Import thread will check the task-id lock on the image every time it
       tries to modify the image, and if it has changed, will abort
    3. The data pipeline will heartbeat the task every minute by updating
       the task.message (bonus: we get some status)
    4. If the data pipeline heartbeat ever finds the task state to be changed
       from the expected 'processing' it will abort
    5. On task revert or completion, we drop the task-id lock from the image
    6. If something ever gets stuck or dies, the heartbeating will stop
    7. If the API gets a request for an import where the lock is held, it
       will grab the task by id (in the lock) and check the state and age.
       If the age is sufficiently old (no heartbeating) and the state is
       either 'processing' or terminal, it will mark the task as failed,
       steal the lock, and proceed.

Lots of logging throughout any time we encounter unexpected situations.

Conflicts:
- Changes due to policy check being missing from ussuri

    Closes-Bug: #1884596
    Change-Id: Icb3c1d27e9a514d96fca7c1d824fd2183f69d8b3
    (cherry picked from commit 3f6e349d0853a9746d0d744bc3eb0b2baa1ddff9)

Reviewed:  https://review.opendev.org/748014
Committed: https://git.openstack.org/cgit/openstack/glance/commit/?id=055e5e790272cf6a636b07a5ff0c4c2b59351fe7
Submitter: Zuul
Branch:    stable/ussuri

commit 055e5e790272cf6a636b07a5ff0c4c2b59351fe7
Author: Dan Smith <dansmith@redhat.com>
Date:   Tue Jul 28 09:02:13 2020 -0700

Implement time-limited import locking
    
    This attempts to provide a time-based import lock that is dependent
    on the task actually making progress. While the task is copying
    data, the task message is updated, which in turn touches the task
    updated_at time. The API will break any lock after 30 minutes of
    no activity on a stalled or dead task. The import taskflow will
    check to see if it has lost the lock at any point, and/or if its
    task status has changed and abort if so.
    
    The logic in more detail:
    
    1. API locks the image by task-id before we start the task thread, but
       before we return
    2. Import thread will check the task-id lock on the image every time it
       tries to modify the image, and if it has changed, will abort
    3. The data pipeline will heartbeat the task every minute by updating
       the task.message (bonus: we get some status)
    4. If the data pipeline heartbeat ever finds the task state to be changed
       from the expected 'processing' it will abort
    5. On task revert or completion, we drop the task-id lock from the image
    6. If something ever gets stuck or dies, the heartbeating will stop
    7. If the API gets a request for an import where the lock is held, it
       will grab the task by id (in the lock) and check the state and age.
       If the age is sufficiently old (no heartbeating) and the state is
       either 'processing' or terminal, it will mark the task as failed,
       steal the lock, and proceed.
    
    Lots of logging throughout any time we encounter unexpected situations.
    
    Conflicts:
     - Changes due to policy check being missing from ussuri
    
    Closes-Bug: #1884596
    Change-Id: Icb3c1d27e9a514d96fca7c1d824fd2183f69d8b3
    (cherry picked from commit 3f6e349d0853a9746d0d744bc3eb0b2baa1ddff9)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-09-01: Related fix merged to glance (master)

#10

Reviewed: https://review.opendev.org/749069
Committed: https://git.openstack.org/cgit/openstack/glance/commit/?id=bb7774c99b263c4f54d0d126ff4ee769bb3da24c
Submitter: Zuul
Branch: master

commit bb7774c99b263c4f54d0d126ff4ee769bb3da24c
Author: Dan Smith <email address hidden>
Date: Mon Aug 31 07:35:47 2020 -0700

Add a release note about import locking

This adds a release note detailing the new locking behavior and criteria
for stealing the lock.

Related-Bug: #1884596
Change-Id: I19c713c91794694f990f1372fda61cc2e20fac54

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-09-02: Related fix proposed to glance (stable/ussuri)

#11

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/749514

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-09-18: Related fix merged to glance (stable/ussuri)

#12

Reviewed: https://review.opendev.org/749514
Committed: https://git.openstack.org/cgit/openstack/glance/commit/?id=3880298ac932e273ab9921eb09b729bb75c8a02c
Submitter: Zuul
Branch: stable/ussuri

commit 3880298ac932e273ab9921eb09b729bb75c8a02c
Author: Dan Smith <email address hidden>
Date: Mon Aug 31 07:35:47 2020 -0700

Add a release note about import locking

This adds a release note detailing the new locking behavior and criteria
for stealing the lock.

    Related-Bug: #1884596
    Change-Id: I19c713c91794694f990f1372fda61cc2e20fac54
    (cherry picked from commit bb7774c99b263c4f54d0d126ff4ee769bb3da24c)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-08-02: Fix included in openstack/glance 20.1.0

#13

This issue was fixed in the openstack/glance 20.1.0 release.

Glance

image import copy-image will start multiple importing threads due to race condition

Bug Description

Other bug subscribers

Remote bug watches