HA. Couldn't create volume after cluster restart

Bug #1357416 reported by Kirill Omelchenko
This bug affects 1 person
Affects             Status  Importance  Assigned to      Milestone
Fuel for OpenStack  New     High        Vladimir Kuklin

Bug Description

http://jenkins-product.srt.mirantis.net:8080/view/0_master_swarm/job/master_fuelmain.system_test.centos.thread_3/136/testReport/%28root%29/ceph_ha_restart/ceph_ha_restart/

Env: HA, CentOS, Nova FlatDHCP, Ceph for images and volumes
- 3x Controller + Ceph, 2x Compute + Ceph, 1x Ceph OSD

Scenario:
1. Deploy the cluster
2. Shut down the Ceph OSD node, run OSTF
3. Shut down one of the Compute nodes, run OSTF
4. Restart the whole cluster
5. Run OSTF

Expected result:
OSTF passes successfully

Actual:
The step 'Create volume and attach it to instance' failed.
In the Horizon UI, the volume is in the error state.

/var/log/cinder-all.log contains the following error:

<156>Aug 15 10:17:57 node-1 cinder-api 2014-08-15 10:17:57.140 3273 AUDIT cinder.api.v1.volumes [req-8ac461c2-21af-43b9-976d-cd892351b3d3 30ef4e80b8604f92abb951e8a5e3d419 4096c02ba7f6432ca7a2fd9dae9db077 - - -] vol={'migration_status': None, 'availability_zone': 'nova', 'terminated_at': None, 'reservations': ['dd6e4c74-7b27-43f9-aaa8-88721245274f', '7ea69297-4a97-48a0-a57e-fe6acb2bb8bb'], 'updated_at': None, 'provider_geometry': None, 'snapshot_id': None, 'ec2_id': None, 'mountpoint': None, 'deleted_at': None, 'id': '3aa07043-5323-4238-b9a5-c0c84c3f00bc', 'size': 1, 'user_id': u'30ef4e80b8604f92abb951e8a5e3d419', 'attach_time': None, 'attached_host': None, 'display_description': None, 'volume_admin_metadata': [], 'encryption_key_id': None, 'project_id': u'4096c02ba7f6432ca7a2fd9dae9db077', 'launched_at': None, 'scheduled_at': None, 'status': 'creating', 'volume_type_id': None, 'deleted': False, 'provider_location': None, 'host': None, 'source_volid': None, 'provider_auth': None, 'display_name': u'ost1_test-volume393412007', 'instance_uuid': None, 'bootable': False, 'created_at': datetime.datetime(2014, 8, 15, 10, 15, 50, 41601), 'attach_status': 'detached', 'volume_type': None, '_name_id': None, 'volume_metadata': [], 'metadata': {}}
<158>Aug 15 10:17:57 node-1 cinder-api 2014-08-15 10:17:57.141 3273 INFO cinder.api.openstack.wsgi [req-8ac461c2-21af-43b9-976d-cd892351b3d3 30ef4e80b8604f92abb951e8a5e3d419 4096c02ba7f6432ca7a2fd9dae9db077 - - -] http://10.108.21.2:8776/v1/4096c02ba7f6432ca7a2fd9dae9db077/volumes returned with HTTP 200
<159>Aug 15 10:17:57 node-1 cinder-scheduler 2014-08-15 10:17:57.145 2837 DEBUG stevedore.extension [req-8ac461c2-21af-43b9-976d-cd892351b3d3 30ef4e80b8604f92abb951e8a5e3d419 4096c02ba7f6432ca7a2fd9dae9db077 - - -] found extension EntryPoint.parse('default = taskflow.engines.action_engine.engine:SingleThreadedActionEngine') _load_plugins /usr/lib/python2.6/site-packages/stevedore/extension.py:156
<158>Aug 15 10:17:57 node-1 cinder-api 2014-08-15 10:17:57.183 3273 INFO eventlet.wsgi.server [req-8ac461c2-21af-43b9-976d-cd892351b3d3 30ef4e80b8604f92abb951e8a5e3d419 4096c02ba7f6432ca7a2fd9dae9db077 - - -] Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/eventlet/wsgi.py", line 412, in handle_one_response
    write(''.join(towrite))
  File "/usr/lib/python2.6/site-packages/eventlet/wsgi.py", line 354, in write
    _writelines(towrite)
  File "/usr/lib64/python2.6/socket.py", line 334, in writelines
    self.flush()
  File "/usr/lib64/python2.6/socket.py", line 303, in flush
    self._sock.sendall(buffer(data, write_offset, buffer_size))
  File "/usr/lib/python2.6/site-packages/eventlet/greenio.py", line 309, in sendall
    tail = self.send(data, flags)
  File "/usr/lib/python2.6/site-packages/eventlet/greenio.py", line 295, in send
    total_sent += fd.send(data[total_sent:], flags)
error: [Errno 104] Connection reset by peer
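
To check whether the same socket error shows up for other requests in cinder-all.log, the log can be scanned for traceback tails of this form. The following is only a minimal sketch: the function name find_socket_errors and both regular expressions are assumptions based on the excerpt above, not part of Cinder, Fuel, or OSTF.

```python
import re

# Request ID as it appears in square brackets in the excerpt above.
REQ_ID = re.compile(r"\[(req-[0-9a-f-]+)")
# Final line of a socket error traceback, e.g. "error: [Errno 104] ...".
ERRNO_LINE = re.compile(r"^error: \[Errno (\d+)\] (.*)$")


def find_socket_errors(lines):
    """Return (request_id, errno, message) for each socket error line.

    Each error is attributed to the most recent request ID seen before it,
    which is a heuristic and may be wrong if log entries interleave.
    """
    results = []
    last_req_id = None
    for line in lines:
        line = line.rstrip("\n")
        req = REQ_ID.search(line)
        if req:
            last_req_id = req.group(1)
        err = ERRNO_LINE.match(line)
        if err:
            results.append((last_req_id, int(err.group(1)), err.group(2)))
    return results
```

It accepts any iterable of lines, so an open file object for /var/log/cinder-all.log can be passed directly.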

Kirill Omelchenko (komelchenko) wrote :
Changed in fuel:
status: New → Confirmed
assignee: nobody → Dmitry Borodaenko (dborodaenko)
Dmitry Borodaenko (angdraug) wrote :

There are no errors or stack traces related to Ceph in the attached logs. I see two possibilities here:

1) MySQL/Galera or RabbitMQ cluster failed to reassemble after turning the cluster back on.

2) The test simply didn't wait long enough to allow the cluster to reassemble.
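
Possibility 2 could be checked by having the test poll for cluster readiness with an explicit timeout instead of relying on a fixed delay. A minimal sketch follows; wait_until and its parameters are hypothetical names, not part of the Fuel system tests or OSTF, and the readiness predicate (for example, a check that the Galera and RabbitMQ clusters have reassembled) would have to be supplied by the test.

```python
import time


def wait_until(predicate, timeout=600.0, interval=10.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll predicate() until it returns a true value or the timeout expires.

    Returns True if the predicate succeeded, False on timeout.
    Exceptions raised by the predicate are treated as "not ready yet".
    """
    deadline = clock() + timeout
    while True:
        try:
            if predicate():
                return True
        except Exception:
            pass  # service may still be unreachable; retry
        if clock() >= deadline:
            return False
        sleep(interval)
```

If the predicate only becomes true well after the restart, the test was too impatient; if it never becomes true within a generous timeout, possibility 1 is the more likely cause.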

QA: Please provide a full diagnostic snapshot instead of just an archive of docker-logs. It would include a lot of additional data relevant for diagnostics, including ceph and ceph-deploy logs, config files from the target nodes, etc.

summary: - HA. Couldn't create volume after cluster restart (Ceph)
+ HA. Couldn't create volume after cluster restart
Changed in fuel:
status: Confirmed → Incomplete
assignee: Dmitry Borodaenko (dborodaenko) → Vladimir Kuklin (vkuklin)
Kirill Omelchenko (komelchenko) wrote :

Here's the diagnostic snapshot.

Andrew Woodward (xarses)
Changed in fuel:
status: Incomplete → New
Vladimir Kuklin (vkuklin) wrote :

Also, this could be related to https://bugs.launchpad.net/fuel/+bug/1361747 and to the oslo.messaging bug, which is still not completely fixed in master.

Serg Melikyan (smelikyan) wrote :

I suggest re-verifying this issue in light of Vladimir's comment, because the preliminary oslo.messaging fix has been merged.
