cinder-backup restore process ends unexpectedly

Bug #1700843 reported by Danil Akhmetov on 2017-06-27
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Mirantis OpenStack
Medium
MOS Maintenance

Bug Description

Performing volume restore from backup, cinder-backup unexpectedly got this:

>Exception during message handling: IncompleteRead(2571384 bytes read, 1669010 more expected)

with the following traceback:

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 137, in _dispatch_and_reply
    incoming.message))
  File "/usr/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 180, in _dispatch
    return self._do_dispatch(endpoint, method, ctxt, args)
  File "/usr/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 126, in _do_dispatch
    result = getattr(endpoint, method)(ctxt, **new_args)
  File "/usr/lib/python2.7/dist-packages/osprofiler/profiler.py", line 105, in wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/cinder/backup/manager.py", line 371, in restore_backup
    {'status': 'available'})
  File "/usr/lib/python2.7/dist-packages/cinder/openstack/common/excutils.py", line 82, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/cinder/backup/manager.py", line 365, in restore_backup
    backup_service)
  File "/usr/lib/python2.7/dist-packages/osprofiler/profiler.py", line 105, in wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/cinder/volume/driver.py", line 578, in restore_backup
    backup_service.restore(backup, volume['id'], volume_file)
  File "/usr/lib/python2.7/dist-packages/cinder/backup/drivers/swift.py", line 490, in restore
    restore_func(backup, volume_id, metadata, volume_file)
  File "/usr/lib/python2.7/dist-packages/cinder/backup/drivers/swift.py", line 431, in _restore_v1
    (resp, body) = self.conn.get_object(container, object_name)
  File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 1341, in get_object
    response_dict=response_dict, headers=headers)
  File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 1243, in _retry
    rv = func(self.url, self.token, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 872, in get_object
    object_body = resp.read()
  File "/usr/lib/python2.7/dist-packages/urllib3/response.py", line 170, in read
    data = self._fp.read()
  File "/usr/lib/python2.7/httplib.py", line 557, in read

Swift acts as cinder-backup backend, but to my surprise I didn't find anything related to errors or connection drop in swift-object-server and swift-proxy server service log. Network flaps were not noticed also, and no other services experienced connectivity issues.

If it's not cinder-backup's fault, I'd like to see as many details as possible from cinder in this situation.

Changed in mos:
status: New → Incomplete
Denis Meltsaykin (dmeltsaykin) wrote :

Danil, could you please provide more data? Steps to reproduce, logs, anything? So for it doesn't look solvable.

Danil Akhmetov (dinobot) wrote :

Denis, sure.

I faced this issue on one of controllers nodes in MOS 7 (Kilo) environment with Cinder providing 2 backends: netapp and SolidFire. This issue has been successfully resolved.

According to cinder-backup debug log, the problem looked like unexpectedly interrupted HTTP session while cinder-backup getting next chunk's data from swift-object-server via swift-proxy.

I found bunch of issues on swift's side, but I was surprised that I run into original problem again after I resolved problems of swift's side.

Then I found out that the problem is connected only to one of the Cinder backends, and after we cleaned up stuck SolidFire mounts the problem gone.

Finally, it seems to me like cinder-backup multithreading consistency and logging issue: when iSCSI operations failed during the volume restore, process ended up in urllib3 library exception due to terminated http which is very unclear and may send engineer who troubleshooting it on a wrong path.

Denis Meltsaykin (dmeltsaykin) wrote :

Basing on the above explanation I believe it's more of a cosmetic issue so I'm setting the Medium importance.

Changed in mos:
importance: Undecided → Medium
status: Incomplete → Confirmed
assignee: nobody → MOS Maintenance (mos-maintenance)
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers