Cinder-backup service reports as down during backup of large volumes

Bug #1692775 reported by zheng yin
This bug affects 2 people
| Affects | Status | Importance | Assigned to | Milestone |
| Cinder | Fix Released | Undecided | Gorka Eguileor | |

Bug Description

Description of problem:
When performing a backup of a large Cinder volume, the cinder-backup service shows as down in 'cinder service-list', even though the backup itself succeeds. This causes monitoring software to report a problem that does not exist.

Version-Release number of selected component (if applicable):
openstack-cinder-7.0.1-8.el7ost.noarch

How reproducible:
Every time a backup is made of a large volume that takes more than a few minutes to complete

Steps to Reproduce:
1. cinder backup-create <uuid> --name test --force
2. watch cinder service-list

Actual results:
Cinder backup service reports as down until the backup is complete

Expected results:
Cinder backup service will remain up throughout the backup task

Revision history for this message
zheng yin (yin-zheng) wrote :
Changed in openstack-vmwareapi-team:
assignee: nobody → zheng yin (yin-zheng)
affects: openstack-vmwareapi-team → cinder
Changed in cinder:
assignee: zheng yin (yin-zheng) → nobody
assignee: nobody → zheng yin (yin-zheng)
Changed in cinder:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/507510

Changed in cinder:
assignee: zheng yin (yin-zheng) → Gorka Eguileor (gorka)
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.openstack.org/507510
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=af0f00bc52f79d9395adfe0575b0dbe353e18bbe
Submitter: Zuul
Branch: master

commit af0f00bc52f79d9395adfe0575b0dbe353e18bbe
Author: Gorka Eguileor <email address hidden>
Date: Wed Sep 13 19:46:17 2017 +0200

    Run backup compression on native thread

    Backup data compression is a CPU bound operation that will not yield to
    other greenthreads, so given enough simultaneous backup operations they
    will lead to other threads' starvation.

    This is really problematic for DB connections, since starvation will
    lead to connections getting dropped with errors such as "Lost connection
    to MySQL server during query".

    Detailed information on why these connections get dropped can be found
    in comment "[31 Aug 2007 9:21] Magnus Blåudd" on this MySQL bug [1].

    These DB issues may result in backups unnecessary ending in an "error"
    state.

    This patch fixes this by moving the compression to a native thread so
    the cooperative multitasking in Cinder Backup can continue switching
    threads.

    [1] https://bugs.mysql.com/bug.php?id=28359

    Closes-Bug: #1692775
    Closes-Bug: #1719580
    Change-Id: I1946dc0ad9cb7a68072a39816fa9fa224c2eb6a5
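
The effect described in this commit message can be illustrated without eventlet. The following is a minimal sketch, not the Cinder code: it swaps eventlet greenthreads for asyncio coroutines (both are cooperative schedulers) and uses `loop.run_in_executor` as the stand-in for `eventlet.tpool.execute`. The names `heartbeat` and `backup_chunk` are hypothetical. The heartbeat coroutine plays the role of the periodic service status report that stalls when compression never yields.

```python
import asyncio
import zlib


async def heartbeat(stop: asyncio.Event, ticks: list) -> None:
    """Stand-in for the periodic status report of the backup service."""
    while not stop.is_set():
        ticks.append(asyncio.get_running_loop().time())
        await asyncio.sleep(0.01)  # yields control, like a greenthread sleep


async def backup_chunk(data: bytes):
    """Compress one chunk in a native thread while the heartbeat keeps running."""
    loop = asyncio.get_running_loop()
    stop = asyncio.Event()
    ticks = []
    reporter = asyncio.create_task(heartbeat(stop, ticks))

    # Calling zlib.compress(data) directly here would hold the event loop for
    # the whole compression. Handing it to a worker thread lets the scheduler
    # keep switching to the heartbeat in the meantime.
    compressed = await loop.run_in_executor(None, zlib.compress, data)

    stop.set()
    await reporter
    return compressed, len(ticks)
```

Running `asyncio.run(backup_chunk(payload))` returns the compressed bytes together with the number of heartbeats recorded while the worker thread was busy. How many heartbeats occur depends on the payload size and the machine.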

Changed in cinder:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/513016

OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (stable/pike)

Reviewed: https://review.openstack.org/513016
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=439f90da8e4c1cfa1acfe36b5ea8a6c5bae6139f
Submitter: Zuul
Branch: stable/pike

commit 439f90da8e4c1cfa1acfe36b5ea8a6c5bae6139f
Author: Gorka Eguileor <email address hidden>
Date: Wed Sep 13 19:46:17 2017 +0200

    Run backup compression on native thread

    Backup data compression is a CPU bound operation that will not yield to
    other greenthreads, so given enough simultaneous backup operations they
    will lead to other threads' starvation.

    This is really problematic for DB connections, since starvation will
    lead to connections getting dropped with errors such as "Lost connection
    to MySQL server during query".

    Detailed information on why these connections get dropped can be found
    in comment "[31 Aug 2007 9:21] Magnus Blåudd" on this MySQL bug [1].

    These DB issues may result in backups unnecessary ending in an "error"
    state.

    This patch fixes this by moving the compression to a native thread so
    the cooperative multitasking in Cinder Backup can continue switching
    threads.

    [1] https://bugs.mysql.com/bug.php?id=28359

    Closes-Bug: #1692775
    Closes-Bug: #1719580
    Change-Id: I1946dc0ad9cb7a68072a39816fa9fa224c2eb6a5
    (cherry picked from commit af0f00bc52f79d9395adfe0575b0dbe353e18bbe)

tags: added: in-stable-pike
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 12.0.0.0b1

This issue was fixed in the openstack/cinder 12.0.0.0b1 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 11.0.1

This issue was fixed in the openstack/cinder 11.0.1 release.

Chhavi Agarwal (chhagarw) wrote :

I am still hitting the same issue even with the given fix applied:
    def _prepare_output_data(self, data):
        if self.compressor is None:
            return 'none', data
        data_size_bytes = len(data)
        # Execute compression in native thread so it doesn't prevent
        # cooperative greenthread switching.
        compressed_data = eventlet.tpool.execute(self.compressor.compress,
                                                 data)

Started the backup of a 50 GB volume:
[root@pvc180 cinder]# cinder --service-type volume backup-list
+--------------------------------------+--------------------------------------+----------+-------------+------+--------------+-------------------------------------------+
| ID | Volume ID | Status | Name | Size | Object Count | Container |
+--------------------------------------+--------------------------------------+----------+-------------+------+--------------+-------------------------------------------+
| 4d2a21e4-63e7-4018-a5ba-5c79bdcbfee1 | b0f5eec3-1e25-4b2e-8b91-f947b9361dfc | creating | chec-backup | 50 | 0 | 4d/2a/4d2a21e4-63e7-4018a5ba-5c79bdcbfee1 |
+--------------------------------------+--------------------------------------+----------+-------------+------+--------------+-------------------------------------------+

While the backup is running, cinder service-list shows cinder-backup as down:
+------------------+----------------------------+------+---------+-------+----------------------------+-----------------+
| Binary | Host | Zone | Status | State | Updated_at | Disabled Reason |
+------------------+----------------------------+------+---------+-------+----------------------------+-----------------+
| cinder-backup | pvc180.rch.stglabs.ibm.com | nova | enabled | down | 2017-11-07T09:38:59.000000 | - |
| cinder-conductor | pvc180.rch.stglabs.ibm.com | nova | enabled | up | 2017-11-07T09:41:00.000000 | - |
| cinder-health | pvc180.rch.stglabs.ibm.com | nova | enabled | up | 2017-11-07T09:41:16.000000 | - |
| cinder-scheduler | pvc180.rch.stglabs.ibm.com | nova | enabled | up | 2017-11-07T09:41:00.000000 | - |
| cinder-volume | evtds8870 | nova | enabled | up | 2017-11-07T09:41:25.000000 | - |
| cinder-volume | y0121v3700b | nova | enabled | up | 2017-11-07T09:41:07.000000 | - |
+------------------+----------------------------+------+---------+-------+----------------------------+-----------------+
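
The table above shows why the service is flagged: the cinder-backup row last checked in at 09:38:59, while every other service reported after 09:41:00. A minimal sketch of that staleness check follows. It assumes a 60-second threshold (which I believe matches the default of Cinder's `service_down_time` option, but treat it as an assumption) and uses the newest `Updated_at` in the table as a stand-in for the current time. `parse_rows` and `stale_services` are hypothetical helper names, not part of any Cinder client.

```python
from datetime import datetime, timedelta

# Assumption: a service counts as down once its last report is older than this.
SERVICE_DOWN_TIME = timedelta(seconds=60)

SERVICE_LIST = """\
+------------------+
| Binary | Host | Zone | Status | State | Updated_at | Disabled Reason |
+------------------+
| cinder-backup | pvc180.rch.stglabs.ibm.com | nova | enabled | down | 2017-11-07T09:38:59.000000 | - |
| cinder-conductor | pvc180.rch.stglabs.ibm.com | nova | enabled | up | 2017-11-07T09:41:00.000000 | - |
| cinder-health | pvc180.rch.stglabs.ibm.com | nova | enabled | up | 2017-11-07T09:41:16.000000 | - |
| cinder-scheduler | pvc180.rch.stglabs.ibm.com | nova | enabled | up | 2017-11-07T09:41:00.000000 | - |
| cinder-volume | evtds8870 | nova | enabled | up | 2017-11-07T09:41:25.000000 | - |
| cinder-volume | y0121v3700b | nova | enabled | up | 2017-11-07T09:41:07.000000 | - |
+------------------+
"""


def parse_rows(table: str) -> list:
    """Turn the pipe-delimited CLI table into a list of dicts keyed by header."""
    header, rows = None, []
    for line in table.splitlines():
        if not line.startswith("|"):
            continue  # skip the +----+ border lines and blanks
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def stale_services(rows: list, threshold: timedelta = SERVICE_DOWN_TIME) -> list:
    """Return (binary, host, seconds behind newest report) for stale services."""
    stamps = [datetime.strptime(r["Updated_at"], "%Y-%m-%dT%H:%M:%S.%f") for r in rows]
    newest = max(stamps)  # proxy for "now"; the real check uses the current time
    return [
        (r["Binary"], r["Host"], (newest - ts).total_seconds())
        for r, ts in zip(rows, stamps)
        if newest - ts > threshold
    ]


print(stale_services(parse_rows(SERVICE_LIST)))
# → [('cinder-backup', 'pvc180.rch.stglabs.ibm.com', 146.0)]
```

Only cinder-backup exceeds the threshold. Its last report is 146 seconds behind the newest one, which is consistent with its status updates being starved while the backup runs.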

Once the backup has completed and is available, the cinder-backup service is back up.
[root@pvc180 cinder]# cinder --service-type volume backup-list
+--------------------------------------+--------------------------------------+-----------+-------------+------+--------------+--------------------------------------------+
| ID | Volume ID | Status | Name | Size | Object Count | Container |
+-------------------------------...

OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (driverfixes/ocata)

Fix proposed to branch: driverfixes/ocata
Review: https://review.openstack.org/518306

OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (driverfixes/mitaka)

Fix proposed to branch: driverfixes/mitaka
Review: https://review.openstack.org/518309

OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (driverfixes/newton)

Fix proposed to branch: driverfixes/newton
Review: https://review.openstack.org/518311

OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/518316

OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (driverfixes/newton)

Reviewed: https://review.openstack.org/518311
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=41754fd57f27bba646d1d6e26388c7f5f4c2dc4e
Submitter: Zuul
Branch: driverfixes/newton

commit 41754fd57f27bba646d1d6e26388c7f5f4c2dc4e
Author: Gorka Eguileor <email address hidden>
Date: Wed Sep 13 19:46:17 2017 +0200

    Run backup compression on native thread

    Backup data compression is a CPU bound operation that will not yield to
    other greenthreads, so given enough simultaneous backup operations they
    will lead to other threads' starvation.

    This is really problematic for DB connections, since starvation will
    lead to connections getting dropped with errors such as "Lost connection
    to MySQL server during query".

    Detailed information on why these connections get dropped can be found
    in comment "[31 Aug 2007 9:21] Magnus Blåudd" on this MySQL bug [1].

    These DB issues may result in backups unnecessary ending in an "error"
    state.

    This patch fixes this by moving the compression to a native thread so
    the cooperative multitasking in Cinder Backup can continue switching
    threads.

    [1] https://bugs.mysql.com/bug.php?id=28359

    Closes-Bug: #1692775
    Closes-Bug: #1719580
    Change-Id: I1946dc0ad9cb7a68072a39816fa9fa224c2eb6a5
    (cherry picked from commit af0f00bc52f79d9395adfe0575b0dbe353e18bbe)
    (cherry picked from commit 439f90da8e4c1cfa1acfe36b5ea8a6c5bae6139f)
    (cherry picked from commit b241f93267646a6501e3a5fb14521f05e38a2ae9)

tags: added: in-driverfixes-newton
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.openstack.org/518316
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=dd556fa755adca195e7df82477ae6400f693af14
Submitter: Zuul
Branch: master

commit dd556fa755adca195e7df82477ae6400f693af14
Author: Chhavi Agarwal <email address hidden>
Date: Tue Nov 7 07:05:49 2017 -0500

    Run backup-restore operations on native thread

    During huge backup file read write operations holds the CPU which
    leads to thread starvation, and cause cinder backup service to
    report down, as DB operations are impacted.
    Proposed changes are to run CPU and file sensitive operations like
    read, write, compress, decompress on a native thread.

    Change-Id: I1f1d9c0d6e3f04f1ecd5ef7c5d813005ee116409
    Closes-Bug: #1692775
    Co-Authored-By: Gorka Eguileor <email address hidden>
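
The broader change described in this commit (running read, write, compress and decompress on a native thread) can be sketched in the same spirit. This is an illustration under stated assumptions, not the patch itself: it uses `asyncio.to_thread` (Python 3.9+) from the standard library in place of eventlet's thread pool, and the length-prefixed chunk framing, the chunk size, and the names `backup_stream` and `restore_stream` are all invented for the example.

```python
import asyncio
import io
import zlib

CHUNK_SIZE = 64 * 1024  # hypothetical chunk size for this sketch


async def backup_stream(src, dst, chunk_size: int = CHUNK_SIZE) -> int:
    """Copy src to dst as compressed chunks; every blocking step runs in a thread."""
    chunks = 0
    while True:
        data = await asyncio.to_thread(src.read, chunk_size)
        if not data:
            break
        packed = await asyncio.to_thread(zlib.compress, data)
        # Prefix each chunk with its 4-byte length so restore can find boundaries.
        frame = len(packed).to_bytes(4, "big") + packed
        await asyncio.to_thread(dst.write, frame)
        chunks += 1
    return chunks


async def restore_stream(src, dst) -> None:
    """Reverse of backup_stream: read each frame, decompress it, write it out."""
    while True:
        header = await asyncio.to_thread(src.read, 4)
        if not header:
            break
        packed = await asyncio.to_thread(src.read, int.from_bytes(header, "big"))
        data = await asyncio.to_thread(zlib.decompress, packed)
        await asyncio.to_thread(dst.write, data)
```

Between any two awaited steps the scheduler is free to run other tasks, such as a periodic status report or a database keepalive. The cost is a thread hand-off per call, so very small chunk sizes would spend more time switching than working.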

OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/537003

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 12.0.0.0b3

This issue was fixed in the openstack/cinder 12.0.0.0b3 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (stable/pike)

Reviewed: https://review.openstack.org/537003
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=60bd878c3761b69311ef2732d37aaeeb3929a8e3
Submitter: Zuul
Branch: stable/pike

commit 60bd878c3761b69311ef2732d37aaeeb3929a8e3
Author: Chhavi Agarwal <email address hidden>
Date: Tue Nov 7 07:05:49 2017 -0500

    Run backup-restore operations on native thread

    During huge backup file read write operations holds the CPU which
    leads to thread starvation, and cause cinder backup service to
    report down, as DB operations are impacted.
    Proposed changes are to run CPU and file sensitive operations like
    read, write, compress, decompress on a native thread.

    Change-Id: I1f1d9c0d6e3f04f1ecd5ef7c5d813005ee116409
    Closes-Bug: #1692775
    Co-Authored-By: Gorka Eguileor <email address hidden>
    (cherry picked from commit dd556fa755adca195e7df82477ae6400f693af14)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 11.1.0

This issue was fixed in the openstack/cinder 11.1.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (driverfixes/ocata)

Reviewed: https://review.openstack.org/518306
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=b241f93267646a6501e3a5fb14521f05e38a2ae9
Submitter: Zuul
Branch: driverfixes/ocata

commit b241f93267646a6501e3a5fb14521f05e38a2ae9
Author: Gorka Eguileor <email address hidden>
Date: Wed Sep 13 19:46:17 2017 +0200

    Run backup compression on native thread

    Backup data compression is a CPU bound operation that will not yield to
    other greenthreads, so given enough simultaneous backup operations they
    will lead to other threads' starvation.

    This is really problematic for DB connections, since starvation will
    lead to connections getting dropped with errors such as "Lost connection
    to MySQL server during query".

    Detailed information on why these connections get dropped can be found
    in comment "[31 Aug 2007 9:21] Magnus Blåudd" on this MySQL bug [1].

    These DB issues may result in backups unnecessary ending in an "error"
    state.

    This patch fixes this by moving the compression to a native thread so
    the cooperative multitasking in Cinder Backup can continue switching
    threads.

    [1] https://bugs.mysql.com/bug.php?id=28359

    Closes-Bug: #1692775
    Closes-Bug: #1719580
    Change-Id: I1946dc0ad9cb7a68072a39816fa9fa224c2eb6a5
    (cherry picked from commit af0f00bc52f79d9395adfe0575b0dbe353e18bbe)
    (cherry picked from commit 439f90da8e4c1cfa1acfe36b5ea8a6c5bae6139f)

tags: added: in-driverfixes-ocata
OpenStack Infra (hudson-openstack) wrote : Change abandoned on cinder (master)

Change abandoned by Eric Harney (<email address hidden>) on branch: master
Review: https://review.openstack.org/466607

OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/560605

OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (driverfixes/mitaka)

Reviewed: https://review.openstack.org/518309
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=0137bc6b0c0e87504253918d5a4572c6ac14e31c
Submitter: Zuul
Branch: driverfixes/mitaka

commit 0137bc6b0c0e87504253918d5a4572c6ac14e31c
Author: Gorka Eguileor <email address hidden>
Date: Wed Sep 13 19:46:17 2017 +0200

    Run backup compression on native thread

    Backup data compression is a CPU bound operation that will not yield to
    other greenthreads, so given enough simultaneous backup operations they
    will lead to other threads' starvation.

    This is really problematic for DB connections, since starvation will
    lead to connections getting dropped with errors such as "Lost connection
    to MySQL server during query".

    Detailed information on why these connections get dropped can be found
    in comment "[31 Aug 2007 9:21] Magnus Blåudd" on this MySQL bug [1].

    These DB issues may result in backups unnecessary ending in an "error"
    state.

    This patch fixes this by moving the compression to a native thread so
    the cooperative multitasking in Cinder Backup can continue switching
    threads.

    [1] https://bugs.mysql.com/bug.php?id=28359

    Closes-Bug: #1692775
    Closes-Bug: #1719580
    Change-Id: I1946dc0ad9cb7a68072a39816fa9fa224c2eb6a5
    (cherry picked from commit af0f00bc52f79d9395adfe0575b0dbe353e18bbe)
    (cherry picked from commit 439f90da8e4c1cfa1acfe36b5ea8a6c5bae6139f)
    (cherry picked from commit b241f93267646a6501e3a5fb14521f05e38a2ae9)
    (cherry picked from commit 41754fd57f27bba646d1d6e26388c7f5f4c2dc4e)

tags: added: in-driverfixes-mitaka
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (stable/ocata)

Reviewed: https://review.openstack.org/560605
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=173d4d0a4686a409f9578538018aa808bb437ca9
Submitter: Zuul
Branch: stable/ocata

commit 173d4d0a4686a409f9578538018aa808bb437ca9
Author: Gorka Eguileor <email address hidden>
Date: Wed Sep 13 19:46:17 2017 +0200

    Run backup compression on native thread

    Backup data compression is a CPU bound operation that will not yield to
    other greenthreads, so given enough simultaneous backup operations they
    will lead to other threads' starvation.

    This is really problematic for DB connections, since starvation will
    lead to connections getting dropped with errors such as "Lost connection
    to MySQL server during query".

    Detailed information on why these connections get dropped can be found
    in comment "[31 Aug 2007 9:21] Magnus Blåudd" on this MySQL bug [1].

    These DB issues may result in backups unnecessary ending in an "error"
    state.

    This patch fixes this by moving the compression to a native thread so
    the cooperative multitasking in Cinder Backup can continue switching
    threads.

    [1] https://bugs.mysql.com/bug.php?id=28359

    Closes-Bug: #1692775
    Closes-Bug: #1719580
    Change-Id: I1946dc0ad9cb7a68072a39816fa9fa224c2eb6a5
    (cherry picked from commit af0f00bc52f79d9395adfe0575b0dbe353e18bbe)
    (cherry picked from commit 439f90da8e4c1cfa1acfe36b5ea8a6c5bae6139f)

tags: added: in-stable-ocata
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 10.0.7

This issue was fixed in the openstack/cinder 10.0.7 release.
