http://jenkins-product.srt.mirantis.net:8080/view/0_master_swarm/job/master_fuelmain.system_test.centos.thread_3/136/testReport/%28root%29/ceph_ha_restart/ceph_ha_restart/
Env: HA, Centos, Nova FlatDHCP, ceph for images and volumes
- 3x Controller + Ceph, 2x Compute + Ceph, 1x Ceph OSD
Scenario:
1. Deploy Cluster
2. Shutdown Ceph OSD node, run OSTF
3. Shutdown one of Compute nodes, run OSTF
4. Restart the whole cluster
5. Run OSTF
Expected result:
OSTF passes successfully
Actual:
The OSTF step 'Create volume and attach it to instance' has failed.
In the Horizon UI the volume is in the error state.
/var/log/cinder-all.log contains the following error:
<156>Aug 15 10:17:57 node-1 cinder-api 2014-08-15 10:17:57.140 3273 AUDIT cinder.api.v1.volumes [req-8ac461c2-21af-43b9-976d-cd892351b3d3 30ef4e80b8604f92abb951e8a5e3d419 4096c02ba7f6432ca7a2fd9dae9db077 - - -] vol={'migration_status': None, 'availability_zone': 'nova', 'terminated_at': None, 'reservations': ['dd6e4c74-7b27-43f9-aaa8-88721245274f', '7ea69297-4a97-48a0-a57e-fe6acb2bb8bb'], 'updated_at': None, 'provider_geometry': None, 'snapshot_id': None, 'ec2_id': None, 'mountpoint': None, 'deleted_at': None, 'id': '3aa07043-5323-4238-b9a5-c0c84c3f00bc', 'size': 1, 'user_id': u'30ef4e80b8604f92abb951e8a5e3d419', 'attach_time': None, 'attached_host': None, 'display_description': None, 'volume_admin_metadata': [], 'encryption_key_id': None, 'project_id': u'4096c02ba7f6432ca7a2fd9dae9db077', 'launched_at': None, 'scheduled_at': None, 'status': 'creating', 'volume_type_id': None, 'deleted': False, 'provider_location': None, 'host': None, 'source_volid': None, 'provider_auth': None, 'display_name': u'ost1_test-volume393412007', 'instance_uuid': None, 'bootable': False, 'created_at': datetime.datetime(2014, 8, 15, 10, 15, 50, 41601), 'attach_status': 'detached', 'volume_type': None, '_name_id': None, 'volume_metadata': [], 'metadata': {}}
<158>Aug 15 10:17:57 node-1 cinder-api 2014-08-15 10:17:57.141 3273 INFO cinder.api.openstack.wsgi [req-8ac461c2-21af-43b9-976d-cd892351b3d3 30ef4e80b8604f92abb951e8a5e3d419 4096c02ba7f6432ca7a2fd9dae9db077 - - -] http://10.108.21.2:8776/v1/4096c02ba7f6432ca7a2fd9dae9db077/volumes returned with HTTP 200
<159>Aug 15 10:17:57 node-1 cinder-scheduler 2014-08-15 10:17:57.145 2837 DEBUG stevedore.extension [req-8ac461c2-21af-43b9-976d-cd892351b3d3 30ef4e80b8604f92abb951e8a5e3d419 4096c02ba7f6432ca7a2fd9dae9db077 - - -] found extension EntryPoint.parse('default = taskflow.engines.action_engine.engine:SingleThreadedActionEngine') _load_plugins /usr/lib/python2.6/site-packages/stevedore/extension.py:156
<158>Aug 15 10:17:57 node-1 cinder-api 2014-08-15 10:17:57.183 3273 INFO eventlet.wsgi.server [req-8ac461c2-21af-43b9-976d-cd892351b3d3 30ef4e80b8604f92abb951e8a5e3d419 4096c02ba7f6432ca7a2fd9dae9db077 - - -] Traceback (most recent call last):
File "/usr/lib/python2.6/site-packages/eventlet/wsgi.py", line 412, in handle_one_response
write(''.join(towrite))
File "/usr/lib/python2.6/site-packages/eventlet/wsgi.py", line 354, in write
_writelines(towrite)
File "/usr/lib64/python2.6/socket.py", line 334, in writelines
self.flush()
File "/usr/lib64/python2.6/socket.py", line 303, in flush
self._sock.sendall(buffer(data, write_offset, buffer_size))
File "/usr/lib/python2.6/site-packages/eventlet/greenio.py", line 309, in sendall
tail = self.send(data, flags)
File "/usr/lib/python2.6/site-packages/eventlet/greenio.py", line 295, in send
total_sent += fd.send(data[total_sent:], flags)
error: [Errno 104] Connection reset by peer
There are no errors or stack traces related to Ceph in the attached logs. I see two possibilities here:
1) MySQL/Galera or RabbitMQ cluster failed to reassemble after turning the cluster back on.
2) The test simply didn't wait long enough to allow the cluster to reassemble.
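If the second possibility is the cause, the fix on the test side is an explicit wait loop rather than a single check right after the restart. A minimal sketch of such a helper is below; the `check` callable, timeout, and interval are illustrative assumptions, not the actual OSTF code. In practice `check` would verify things like Galera's `wsrep_cluster_size` being back to 3 and `rabbitmqctl cluster_status` listing all controllers.

```python
import time

def wait_for_cluster(check, timeout=600, interval=10):
    """Poll `check` until it returns True or `timeout` seconds elapse.

    `check` is any zero-argument callable that returns True once the
    cluster is considered healthy (hypothetical; the real OSTF health
    checks would go here). Returns True on success, False on timeout.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False

# Example with a stub check that reports healthy on the third poll:
attempts = {'n': 0}

def stub_check():
    attempts['n'] += 1
    return attempts['n'] >= 3

print(wait_for_cluster(stub_check, timeout=5, interval=0.1))  # True
```

A loop like this would distinguish the two possibilities: if the cluster never becomes healthy within a generous timeout, the problem is the reassembly itself (possibility 1), not the test's impatience.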
QA: Please provide a full diagnostic snapshot instead of just an archive of docker logs. The snapshot includes a lot of additional data relevant for diagnostics: Ceph and ceph-deploy logs, config files from the target nodes, etc.