VolumesBackupsV2Test:test_volume_backup_create_get_detailed_list_restore_delete times out waiting to restore volume

Bug #1483434 reported by Matt Riedemann
This bug affects 2 people
Affects    Status      Importance   Assigned to
Cinder     Confirmed   Medium       Ivan Kolodyazhny
devstack   Invalid     Undecided    Xing Yang

Bug Description

http://logs.openstack.org/82/202782/15/gate/gate-tempest-dsvm-neutron-full/372a95f/console.html#_2015-08-10_22_07_57_879

2015-08-10 22:07:57.879 | tempest.api.volume.admin.test_volumes_backup.VolumesBackupsV2Test.test_volume_backup_create_get_detailed_list_restore_delete[id-a66eb488-8ee1-47d4-8e9f-575a095728c6]
2015-08-10 22:07:57.879 | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
2015-08-10 22:07:57.879 |
2015-08-10 22:07:57.880 | Captured traceback:
2015-08-10 22:07:57.880 | ~~~~~~~~~~~~~~~~~~~
2015-08-10 22:07:57.880 | Traceback (most recent call last):
2015-08-10 22:07:57.880 | File "tempest/api/volume/admin/test_volumes_backup.py", line 79, in test_volume_backup_create_get_detailed_list_restore_delete
2015-08-10 22:07:57.880 | restore['volume_id'], 'available')
2015-08-10 22:07:57.880 | File "tempest/services/volume/json/volumes_client.py", line 202, in wait_for_volume_status
2015-08-10 22:07:57.880 | waiters.wait_for_volume_status(self, volume_id, status)
2015-08-10 22:07:57.880 | File "tempest/common/waiters.py", line 153, in wait_for_volume_status
2015-08-10 22:07:57.880 | raise exceptions.TimeoutException(message)
2015-08-10 22:07:57.880 | tempest.exceptions.TimeoutException: Request timed out
2015-08-10 22:07:57.880 | Details: Volume bd776a72-66c0-43a6-8d48-39f541634226 failed to reach available status (current error_restoring) within the required time (196 s).

From the tempest logs it looks like the backup is created successfully and reaches 'available', but the volume goes into 'error_restoring' status and never comes out of it. Tempest should probably fail immediately at that point rather than waiting for the timeout, since the volume is in an error state and will not recover.
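
Below is a minimal sketch of that fail-fast behaviour in a waiter, assuming a hypothetical show_volume() client call and exception names (the real waiter lives in tempest/common/waiters.py, which in later tempest versions does raise a VolumeRestoreErrorException, as a comment further down shows):

    import time


    class VolumeRestoreErrorException(Exception):
        """Raised when the volume enters an error state instead of the target status."""


    def wait_for_volume_status(client, volume_id, status, timeout=196, interval=1):
        # Poll the volume and bail out as soon as it hits a terminal error
        # state instead of burning the whole timeout.
        start = time.time()
        while time.time() - start < timeout:
            current = client.show_volume(volume_id)['volume']['status']
            if current == status:
                return
            if current in ('error', 'error_restoring'):
                raise VolumeRestoreErrorException(
                    'Volume %s went to %s while waiting for %s'
                    % (volume_id, current, status))
            time.sleep(interval)
        raise Exception('Volume %s failed to reach %s status within %s s'
                        % (volume_id, status, timeout))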

Revision history for this message
Matt Riedemann (mriedem) wrote :

There are mysql errors in the c-bak logs when the volume goes into 'error_restoring' status:

http://logs.openstack.org/82/202782/15/gate/gate-tempest-dsvm-neutron-full/372a95f/logs/screen-c-bak.txt.gz?level=TRACE

2015-08-10 21:36:37.420 ERROR oslo_messaging.rpc.dispatcher [req-1abbfe4f-97f2-4184-b598-522a1e0e2c19 None] Exception during message handling: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
2015-08-10 21:36:37.420 9136 ERROR oslo_messaging.rpc.dispatcher Traceback (most recent call last):
2015-08-10 21:36:37.420 9136 ERROR oslo_messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply
2015-08-10 21:36:37.420 9136 ERROR oslo_messaging.rpc.dispatcher executor_callback))
2015-08-10 21:36:37.420 9136 ERROR oslo_messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch
2015-08-10 21:36:37.420 9136 ERROR oslo_messaging.rpc.dispatcher executor_callback)
2015-08-10 21:36:37.420 9136 ERROR oslo_messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 129, in _do_dispatch
2015-08-10 21:36:37.420 9136 ERROR oslo_messaging.rpc.dispatcher result = func(ctxt, **new_args)
2015-08-10 21:36:37.420 9136 ERROR oslo_messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/osprofiler/profiler.py", line 105, in wrapper
2015-08-10 21:36:37.420 9136 ERROR oslo_messaging.rpc.dispatcher return f(*args, **kwargs)
2015-08-10 21:36:37.420 9136 ERROR oslo_messaging.rpc.dispatcher File "/opt/stack/new/cinder/cinder/backup/manager.py", line 437, in restore_backup
2015-08-10 21:36:37.420 9136 ERROR oslo_messaging.rpc.dispatcher backup.save()
2015-08-10 21:36:37.420 9136 ERROR oslo_messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 119, in __exit__
2015-08-10 21:36:37.420 9136 ERROR oslo_messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2015-08-10 21:36:37.420 9136 ERROR oslo_messaging.rpc.dispatcher File "/opt/stack/new/cinder/cinder/backup/manager.py", line 431, in restore_backup
2015-08-10 21:36:37.420 9136 ERROR oslo_messaging.rpc.dispatcher backup_service)
2015-08-10 21:36:37.420 9136 ERROR oslo_messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/osprofiler/profiler.py", line 105, in wrapper
2015-08-10 21:36:37.420 9136 ERROR oslo_messaging.rpc.dispatcher return f(*args, **kwargs)
2015-08-10 21:36:37.420 9136 ERROR oslo_messaging.rpc.dispatcher File "/opt/stack/new/cinder/cinder/volume/drivers/lvm.py", line 500, in restore_backup
2015-08-10 21:36:37.420 9136 ERROR oslo_messaging.rpc.dispatcher backup_service.restore(backup, volume['id'], volume_file)
2015-08-10 21:36:37.420 9136 ERROR oslo_messaging.rpc.dispatcher File "/opt/stack/new/cinder/cinder/backup/chunkeddriver.py", line 684, in restore
2015-08-10 21:36:37.420 9136 ERROR oslo_messaging.rpc.dispatcher self.put_metadata(volume_id, volume_meta)
2015-08-10 21:36:37.420 9136 ERROR oslo_messaging.rpc.dispatcher ...
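
The save that blows up here (backup.save() in cinder/backup/manager.py) fails on a transient lost-connection error. As a purely illustrative mitigation sketch, not the actual Cinder fix (in the real stack the pymysql error is wrapped by SQLAlchemy/oslo.db, so the exception type to catch would differ), a save could be retried a few times before giving up:

    import time

    from pymysql import err as pymysql_err


    def save_with_retry(obj, attempts=3, delay=1):
        # Retry the object save when the DB connection drops mid-query;
        # re-raise after the last attempt so the failure is still surfaced.
        for attempt in range(1, attempts + 1):
            try:
                return obj.save()
            except pymysql_err.OperationalError:
                if attempt == attempts:
                    raise
                time.sleep(delay * attempt)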

Changed in cinder:
status: New → Confirmed
importance: Undecided → Medium
tags: added: backup-service db
Mike Perez (thingee)
tags: added: gate-failure
Revision history for this message
Xing Yang (xing-yang) wrote :

I looked at the logs. Four restores were started in the cinder backup logs; three of them finished and one did not. Two of the four started at almost the same time: one finished, the other did not. The unfinished one did complete the data restore and started restoring the metadata, but at that point there were connection problems to MySQL, so the metadata restore never completed and the request eventually timed out.

From the cinder backup log -
http://logs.openstack.org/82/202782/15/gate/gate-tempest-dsvm-neutron-full/372a95f/logs/screen-c-bak.txt.gz#_2015-08-10_21_36_37_525

We can see that the restore of the backup started:
2015-08-10 21:34:27.541 INFO cinder.backup.manager [req-139ab0ae-d0ed-43d5-a254-19e895c9ff95 tempest-VolumesBackupsV2Test-998958099] Restore backup started, backup: bd79c0a3-5a76-4dc3-a774-d8ef515c46b1 volume: bd776a72-66c0-43a6-8d48-39f541634226.

2015-08-10 21:34:27.825 DEBUG cinder.backup.chunkeddriver [req-0a4871ba-550a-4435-8e15-07421d4f9eef tempest-VolumesBackupsV1Test-1030415370] starting restore of backup volume_8b22ff43-58ed-4148-9824-20db1efd21d2/20150810213315/az_nova_backup_bd79c0a3-5a76-4dc3-a774-d8ef515c46b1 container: volumebackups, to volume bd776a72-66c0-43a6-8d48-39f541634226, backup: bd79c0a3-5a76-4dc3-a774-d8ef515c46b1. restore /opt/stack/new/cinder/cinder/backup/chunkeddriver.py:648

Now the restore of data finished:
2015-08-10 21:36:08.861 DEBUG cinder.backup.chunkeddriver [req-3629e4d4-a3b0-4f6f-9f30-7e6140e95e78 None] v1 volume backup restore of bd79c0a3-5a76-4dc3-a774-d8ef515c46b1 finished. _restore_v1 /opt/stack/new/cinder/cinder/backup/chunkeddriver.py:634

Now it started to restore metadata:
2015-08-10 21:36:08.863 DEBUG cinder.backup.driver [req-3629e4d4-a3b0-4f6f-9f30-7e6140e95e78 None] Restoring volume base metadata _restore_vol_base_meta /opt/stack/new/cinder/cinder/backup/driver.py:168

If the restore had finished, we would see "Restore backup finished, backup bd79c0a3-5a76-4dc3-a774-d8ef515c46b1 restored to volume: bd776a72-66c0-43a6-8d48-39f541634226." However, that message never appeared.

Instead, there is a 'Lost connection to MySQL server' failure at that time:
2015-08-10 21:36:37.420 ERROR oslo_messaging.rpc.dispatcher [req-1abbfe4f-97f2-4184-b598-522a1e0e2c19 None] Exception during message handling: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')

Then after a few minutes, tempest timed out the restore request and sent a delete backup request:
2015-08-10 21:39:55.018 INFO cinder.backup.manager [req-c6888349-c6f3-4a1d-842f-a34df2161aa6 tempest-VolumesBackupsV2Test-998958099] Delete backup started, backup: bd79c0a3-5a76-4dc3-a774-d8ef515c46b1.

In the cinder volume log, there is a 'Connection pool is full, discarding connection' error. This matches the 'Lost connection to MySQL' error in the cinder backup log:

http://logs.openstack.org/82/202782/15/gate/gate-tempest-dsvm-neutron-full/372a95f/logs/screen-c-vol.txt.gz

2015-08-10 21:36:52.474 WARNING requests.packages.urllib3.connectionpool [req-bd798c22-6983-45e4-b0e7-aa5a8e1dffe4 tempest-VolumesV2ActionsTest-1525165338] Connection pool is full, discarding connection: 127.0.0.1

From the followin...


Xing Yang (xing-yang)
Changed in devstack:
assignee: nobody → Xing Yang (xing-yang)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to devstack (master)

Fix proposed to branch: master
Review: https://review.openstack.org/211735

Changed in devstack:
status: New → In Progress
Revision history for this message
Sean Dague (sdague) wrote :

2015-08-10 21:36:52.474 WARNING requests.packages.urllib3.connectionpool [req-bd798c22-6983-45e4-b0e7-aa5a8e1dffe4 tempest-VolumesV2ActionsTest-1525165338] Connection pool is full, discarding connection: 127.0.0.1

is from urllib3; it has nothing to do with the database.

Revision history for this message
Xing Yang (xing-yang) wrote :

Found a similar bug on the urllib3.connectionpool "pool is full" error, but the fix was abandoned. Cinder backup also uses swift, so this could be a duplicate of that bug:

https://bugs.launchpad.net/python-swiftclient/+bug/1295812
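
For context, the "Connection pool is full, discarding connection" message is emitted by requests/urllib3 pool management when more concurrent connections are opened to a host than the pool keeps alive. A generic sketch of mounting a larger pool on a requests session (illustrative only; not the devstack change or the swiftclient fix):

    import requests
    from requests.adapters import HTTPAdapter

    session = requests.Session()
    # Keep more persistent connections per host so concurrent backup/restore
    # requests are reused instead of being discarded with a warning.
    adapter = HTTPAdapter(pool_connections=10, pool_maxsize=50)
    session.mount('http://', adapter)
    session.mount('https://', adapter)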

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on devstack (master)

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/211735

Revision history for this message
Paul Vinciguerra (pvinci) wrote :

2016-08-29 16:48:44.299734 | ==============================
2016-08-29 16:48:44.299753 | Failed 1 tests - output below:
2016-08-29 16:48:44.299770 | ==============================
2016-08-29 16:48:44.299780 |
2016-08-29 16:48:44.299825 | tempest.api.volume.test_volumes_backup.VolumesBackupsV2Test.test_volume_backup_create_get_detailed_list_restore_delete[id-a66eb488-8ee1-47d4-8e9f-575a095728c6]
2016-08-29 16:48:44.299882 | ---------------------------------------------------------------------------------------------------------------------------------------------------------------
2016-08-29 16:48:44.299892 |
2016-08-29 16:48:44.299905 | Captured traceback:
2016-08-29 16:48:44.299932 | ~~~~~~~~~~~~~~~~~~~
2016-08-29 16:48:44.299949 | Traceback (most recent call last):
2016-08-29 16:48:44.299983 | File "tempest/api/volume/test_volumes_backup.py", line 77, in test_volume_backup_create_get_detailed_list_restore_delete
2016-08-29 16:48:44.300001 | restore['volume_id'], 'available')
2016-08-29 16:48:44.300025 | File "tempest/common/waiters.py", line 182, in wait_for_volume_status
2016-08-29 16:48:44.300062 | raise exceptions.VolumeRestoreErrorException(volume_id=volume_id)
2016-08-29 16:48:44.300127 | tempest.exceptions.VolumeRestoreErrorException: Volume c3e0ef56-27b7-4b49-8a9a-b90de74a4f07 failed to restore and is in ERROR status
2016-08-29 16:48:44.300140 |

http://logs.openstack.org/59/358359/14/check/gate-oslo.messaging-src-dsvm-full-zmq/ca714b0/

Revision history for this message
Sean Dague (sdague) wrote :

This devstack bug was last updated over 180 days ago. As devstack is a fast-moving project and we'd like to get the tracker down to currently actionable bugs, this is being marked as Invalid. If the issue still exists, please feel free to reopen it.

Changed in devstack:
status: In Progress → Invalid
Revision history for this message
Matt Riedemann (mriedem) wrote :

Different test, similar failure here:

http://logs.openstack.org/55/569055/7/gate/nova-next/2e23975/job-output.txt.gz#_2018-11-13_12_06_04_534310

2018-11-13 12:06:04.534310 | primary | {3} tempest.api.volume.admin.test_volumes_backup.VolumesBackupsAdminTest.test_volume_backup_export_import [267.653158s] ... FAILED
2018-11-13 12:06:04.534400 | primary |
2018-11-13 12:06:04.534442 | primary | Captured traceback:
2018-11-13 12:06:04.534474 | primary | ~~~~~~~~~~~~~~~~~~~
2018-11-13 12:06:04.534534 | primary | Traceback (most recent call last):
2018-11-13 12:06:04.534609 | primary | File "tempest/api/volume/admin/test_volumes_backup.py", line 115, in test_volume_backup_export_import
2018-11-13 12:06:04.534666 | primary | self.backups_client, backup['id'], 'available')
2018-11-13 12:06:04.534725 | primary | File "tempest/common/waiters.py", line 211, in wait_for_volume_resource_status
2018-11-13 12:06:04.534783 | primary | raise lib_exc.TimeoutException(message)
2018-11-13 12:06:04.534838 | primary | tempest.lib.exceptions.TimeoutException: Request timed out
2018-11-13 12:06:04.534939 | primary | Details: backup afcc4b27-ae1a-4de2-a254-2e4ea3debd52 failed to reach available status (current restoring) within the required time (196 s).

Revision history for this message
Matt Riedemann (mriedem) wrote :

I see this in the cinder-backup logs:

http://logs.openstack.org/55/569055/7/gate/nova-next/2e23975/logs/screen-c-bak.txt.gz?level=TRACE#_Nov_13_12_06_03_142764

Nov 13 12:06:03.142764 ubuntu-xenial-ovh-bhs1-0000491609 cinder-backup[2685]: WARNING cinder.backup.chunkeddriver [None req-e974fe79-fb70-4574-8632-c82372602c7f tempest-VolumesBackupsAdminTest-1344673063 None] Error while listing objects, continuing with delete.: ClientException: Container GET failed: https://158.69.71.222:8080/v1/AUTH_5f06ef15ce9246f0b9ec797944c6796a/volumebackups?format=json&prefix=volume_aa76fd3c-e9c8-409e-8cfc-ae1a9342f5f3/20181113120156/az_nova_backup_afcc4b27-ae1a-4de2-a254-2e4ea3debd52 404 Not Found [first 60 chars of response] <html><h1>Not Found</h1><p>The resource could not be found.<

But from logstash it looks like that's normal:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Error%20while%20listing%20objects%2C%20continuing%20with%20delete.%3A%20ClientException%3A%20Container%20GET%20failed%3A%5C%22%20AND%20message%3A%5C%22404%20Not%20Found%5C%22%20AND%20tags%3A%5C%22screen-c-bak.txt%5C%22&from=7d
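
That warning corresponds to the common pattern of listing a backup's objects before deleting them and carrying on if the container or prefix is already gone. A rough python-swiftclient sketch of the pattern (illustrative; not the chunkeddriver code itself):

    from swiftclient.exceptions import ClientException


    def list_objects_for_delete(conn, container, prefix):
        # conn is a swiftclient.client.Connection. If the listing 404s, the
        # objects are already gone, so return an empty list and let the
        # delete continue instead of failing the whole operation.
        try:
            _, objects = conn.get_container(container, prefix=prefix)
            return [obj['name'] for obj in objects]
        except ClientException as exc:
            if exc.http_status == 404:
                return []
            raise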

Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
Ghanshyam Mann (ghanshyammann) wrote :

This is failing very frequently now: 19 failures in the last 24 hours / 137 failures in the last 10 days.

Until this bug is fixed, let's skip the test, which is affecting many projects' gates right now.
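
For reference, tempest has a standard way to do that; a sketch of such a skip, assuming the usual tempest.lib decorator (which produces the "Skipped until bug: ... is resolved." message quoted in a later comment):

    from tempest.api.volume import base
    from tempest.lib import decorators


    class VolumesBackupsTest(base.BaseVolumeTest):

        @decorators.skip_because(bug='1483434')
        @decorators.idempotent_id('a66eb488-8ee1-47d4-8e9f-575a095728c6')
        def test_volume_backup_create_get_detailed_list_restore_delete(self):
            ...  # test body unchanged; the decorator skips it until the bug is closed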

Revision history for this message
Matt Riedemann (mriedem) wrote :

Looking at a recent failure, there is a very obvious stack trace in the cinder-backup logs:

http://logs.openstack.org/04/649804/1/check/tempest-full/e2ad258/controller/logs/screen-c-bak.txt.gz?level=TRACE#_Apr_11_08_42_32_094879

Apr 11 08:42:32.094879 ubuntu-bionic-ovh-gra1-0004997343 cinder-backup[14242]: ERROR cinder.backup.manager [None req-1726f3a1-3b67-49d5-926c-646d59c39710 tempest-VolumesBackupsTest-391112369 None] Restoring backup 87ecac88-4dbb-4f58-b0de-3d79e4525008 to volume 883e3950-d0eb-41ba-985f-e9928c5efe38 failed.: OSError: [Errno 5] Input/output error
Apr 11 08:42:32.094879 ubuntu-bionic-ovh-gra1-0004997343 cinder-backup[14242]: ERROR cinder.backup.manager Traceback (most recent call last):
Apr 11 08:42:32.094879 ubuntu-bionic-ovh-gra1-0004997343 cinder-backup[14242]: ERROR cinder.backup.manager File "/opt/stack/cinder/cinder/backup/manager.py", line 644, in _run_restore
Apr 11 08:42:32.094879 ubuntu-bionic-ovh-gra1-0004997343 cinder-backup[14242]: ERROR cinder.backup.manager tpool.Proxy(device_file))
Apr 11 08:42:32.094879 ubuntu-bionic-ovh-gra1-0004997343 cinder-backup[14242]: ERROR cinder.backup.manager File "/opt/stack/cinder/cinder/backup/chunkeddriver.py", line 788, in restore
Apr 11 08:42:32.094879 ubuntu-bionic-ovh-gra1-0004997343 cinder-backup[14242]: ERROR cinder.backup.manager restore_func(backup1, volume_id, metadata, volume_file, backup)
Apr 11 08:42:32.094879 ubuntu-bionic-ovh-gra1-0004997343 cinder-backup[14242]: ERROR cinder.backup.manager File "/opt/stack/cinder/cinder/backup/chunkeddriver.py", line 733, in _restore_v1
Apr 11 08:42:32.094879 ubuntu-bionic-ovh-gra1-0004997343 cinder-backup[14242]: ERROR cinder.backup.manager os.fsync(fileno)
Apr 11 08:42:32.094879 ubuntu-bionic-ovh-gra1-0004997343 cinder-backup[14242]: ERROR cinder.backup.manager OSError: [Errno 5] Input/output error
Apr 11 08:42:32.094879 ubuntu-bionic-ovh-gra1-0004997343 cinder-backup[14242]: ERROR cinder.backup.manager
Apr 11 08:43:32.619862 ubuntu-bionic-ovh-gra1-0004997343 cinder-backup[14242]: ERROR oslo_messaging.rpc.server [None req-1726f3a1-3b67-49d5-926c-646d59c39710 tempest-VolumesBackupsTest-391112369 None] Exception during message handling: OSError: [Errno 5] Input/output error
Apr 11 08:43:32.619862 ubuntu-bionic-ovh-gra1-0004997343 cinder-backup[14242]: ERROR oslo_messaging.rpc.server Traceback (most recent call last):
Apr 11 08:43:32.619862 ubuntu-bionic-ovh-gra1-0004997343 cinder-backup[14242]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 166, in _process_incoming
Apr 11 08:43:32.619862 ubuntu-bionic-ovh-gra1-0004997343 cinder-backup[14242]: ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
Apr 11 08:43:32.619862 ubuntu-bionic-ovh-gra1-0004997343 cinder-backup[14242]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 265, in dispatch
Apr 11 08:43:32.619862 ubuntu-bionic-ovh-gra1-0004997343 cinder-backup[14242]: ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
Apr 11 08:43:32.619862 ubuntu-bionic-ovh-gra1-0...
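
The restore fails at os.fsync() with EIO, i.e. the kernel could not flush the restored data to the backing device. A stripped-down sketch of that write-and-sync step (hypothetical helper, not the chunked driver itself), just to show where the error surfaces:

    import os


    def flush_restored_volume(volume_file):
        # After all backup chunks have been written, force them out to stable
        # storage. On an unhealthy or overloaded device this is where
        # OSError (errno 5, EIO) surfaces and the restore must be marked failed.
        volume_file.flush()
        os.fsync(volume_file.fileno())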

Revision history for this message
Matt Riedemann (mriedem) wrote :

The error in comment 14 looks like what was reported in bug 1813217.

Revision history for this message
B ChandraSekhar Reddy (chandrasekharbora) wrote :

tempest.api.volume.test_volumes_backup.VolumesBackupsTest.test_volume_backup_create_get_detailed_list_restore_delete

Reason: Skipped until bug: https://launchpad.net/bugs/1483434 is resolved.

This test is getting skipped in our test environment; could someone expedite a fix?

Ivan Kolodyazhny (e0ne)
Changed in cinder:
assignee: nobody → Ivan Kolodyazhny (e0ne)