Fuel for OpenStack

HA. Couldn't create volume after cluster restart

Bug #1357416 reported by Kirill Omelchenko on 2014-08-15

This bug report is a duplicate of: Bug #1361747: Not all slaves joined to the rabbit cluster after failover.

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Fuel for OpenStack	New	High	Vladimir Kuklin	Fuel for OpenStack 5.1

Bug Description

http://jenkins-product.srt.mirantis.net:8080/view/0_master_swarm/job/master_fuelmain.system_test.centos.thread_3/136/testReport/%28root%29/ceph_ha_restart/ceph_ha_restart/

Env: HA, CentOS, Nova FlatDHCP, Ceph for images and volumes
- 3x Controller + Ceph, 2x Compute + Ceph, 1x Ceph OSD

Scenario:
1. Deploy the cluster
2. Shut down the Ceph OSD node, run OSTF
3. Shut down one of the Compute nodes, run OSTF
4. Restart the whole cluster
5. Run OSTF

Expected result:
OSTF passes successfully

Actual:
The step 'Create volume and attach it to instance' failed.
In the Horizon UI, the volume is in the error state.

/var/log/cinder-all.log contains the following error:

<156>Aug 15 10:17:57 node-1 cinder-api 2014-08-15 10:17:57.140 3273 AUDIT cinder.api.v1.volumes [req-8ac461c2-21af-43b9-976d-cd892351b3d3 30ef4e80b8604f92abb951e8a5e3d419 4096c02ba7f6432ca7a2fd9dae9db077 - - -] vol={'migration_status': None, 'availability_zone': 'nova', 'terminated_at': None, 'reservations': ['dd6e4c74-7b27-43f9-aaa8-88721245274f', '7ea69297-4a97-48a0-a57e-fe6acb2bb8bb'], 'updated_at': None, 'provider_geometry': None, 'snapshot_id': None, 'ec2_id': None, 'mountpoint': None, 'deleted_at': None, 'id': '3aa07043-5323-4238-b9a5-c0c84c3f00bc', 'size': 1, 'user_id': u'30ef4e80b8604f92abb951e8a5e3d419', 'attach_time': None, 'attached_host': None, 'display_description': None, 'volume_admin_metadata': [], 'encryption_key_id': None, 'project_id': u'4096c02ba7f6432ca7a2fd9dae9db077', 'launched_at': None, 'scheduled_at': None, 'status': 'creating', 'volume_type_id': None, 'deleted': False, 'provider_location': None, 'host': None, 'source_volid': None, 'provider_auth': None, 'display_name': u'ost1_test-volume393412007', 'instance_uuid': None, 'bootable': False, 'created_at': datetime.datetime(2014, 8, 15, 10, 15, 50, 41601), 'attach_status': 'detached', 'volume_type': None, '_name_id': None, 'volume_metadata': [], 'metadata': {}}
<158>Aug 15 10:17:57 node-1 cinder-api 2014-08-15 10:17:57.141 3273 INFO cinder.api.openstack.wsgi [req-8ac461c2-21af-43b9-976d-cd892351b3d3 30ef4e80b8604f92abb951e8a5e3d419 4096c02ba7f6432ca7a2fd9dae9db077 - - -] http://10.108.21.2:8776/v1/4096c02ba7f6432ca7a2fd9dae9db077/volumes returned with HTTP 200
<159>Aug 15 10:17:57 node-1 cinder-scheduler 2014-08-15 10:17:57.145 2837 DEBUG stevedore.extension [req-8ac461c2-21af-43b9-976d-cd892351b3d3 30ef4e80b8604f92abb951e8a5e3d419 4096c02ba7f6432ca7a2fd9dae9db077 - - -] found extension EntryPoint.parse('default = taskflow.engines.action_engine.engine:SingleThreadedActionEngine') _load_plugins /usr/lib/python2.6/site-packages/stevedore/extension.py:156
<158>Aug 15 10:17:57 node-1 cinder-api 2014-08-15 10:17:57.183 3273 INFO eventlet.wsgi.server [req-8ac461c2-21af-43b9-976d-cd892351b3d3 30ef4e80b8604f92abb951e8a5e3d419 4096c02ba7f6432ca7a2fd9dae9db077 - - -] Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/eventlet/wsgi.py", line 412, in handle_one_response
    write(''.join(towrite))
  File "/usr/lib/python2.6/site-packages/eventlet/wsgi.py", line 354, in write
    _writelines(towrite)
  File "/usr/lib64/python2.6/socket.py", line 334, in writelines
    self.flush()
  File "/usr/lib64/python2.6/socket.py", line 303, in flush
    self._sock.sendall(buffer(data, write_offset, buffer_size))
  File "/usr/lib/python2.6/site-packages/eventlet/greenio.py", line 309, in sendall
    tail = self.send(data, flags)
  File "/usr/lib/python2.6/site-packages/eventlet/greenio.py", line 295, in send
    total_sent += fd.send(data[total_sent:], flags)
error: [Errno 104] Connection reset by peer
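
The cinder-api and cinder-scheduler entries above all carry the same request ID (req-8ac461c2-...), which makes it possible to follow one volume-create call across services. A minimal sketch of that correlation, assuming the syslog prefix format seen in this log; the helper name `group_by_request` and both regular expressions are illustrative and not part of Fuel or Cinder:

```python
import re
from collections import defaultdict

# Request IDs look like req-8ac461c2-21af-43b9-976d-cd892351b3d3 and open
# the bracketed context field of each log line.
REQ_ID = re.compile(r"\[(req-[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})")

# Assumed prefix: "<pri>Mon DD HH:MM:SS host service ..."
SERVICE = re.compile(r"^<\d+>\w{3}\s+\d+ [\d:]+ \S+ (\S+) ")


def group_by_request(lines):
    """Map each request ID to the services that logged it, in order."""
    groups = defaultdict(list)
    for line in lines:
        req = REQ_ID.search(line)
        if req is None:
            continue  # e.g. traceback continuation lines carry no request ID
        svc = SERVICE.match(line)
        groups[req.group(1)].append(svc.group(1) if svc else "unknown")
    return dict(groups)


# Shortened copies of lines from the log above (payloads trimmed).
SAMPLE = [
    "<156>Aug 15 10:17:57 node-1 cinder-api 2014-08-15 10:17:57.140 3273 "
    "AUDIT cinder.api.v1.volumes [req-8ac461c2-21af-43b9-976d-cd892351b3d3 - - -] vol=",
    "<159>Aug 15 10:17:57 node-1 cinder-scheduler 2014-08-15 10:17:57.145 2837 "
    "DEBUG stevedore.extension [req-8ac461c2-21af-43b9-976d-cd892351b3d3 - - -] found extension",
    '  File "/usr/lib/python2.6/site-packages/eventlet/wsgi.py", line 354, in write',
]
```

Calling `group_by_request(SAMPLE)` groups the two service entries under the single request ID and skips the traceback line.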


Kirill Omelchenko (komelchenko) wrote on 2014-08-15:

fuel-master_diganostic_ceph_ha_restart.tgz (8.0 MiB, application/x-tar)

Vladimir Kuklin (vkuklin) on 2014-08-15

Changed in fuel:
status:	New → Confirmed
assignee:	nobody → Dmitry Borodaenko (dborodaenko)


Dmitry Borodaenko (angdraug) wrote on 2014-08-18:

There are no errors or stack traces related to Ceph in the attached logs. I see 2 possibilities here:

1) MySQL/Galera or RabbitMQ cluster failed to reassemble after turning the cluster back on.

2) The test simply didn't wait long enough to allow the cluster to reassemble.
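
Possibility 2 could be checked by polling for cluster readiness with an explicit timeout before running OSTF. A minimal sketch, assuming a hypothetical `is_ready` callable supplied by the test (for example, one that checks Galera and RabbitMQ cluster membership); `wait_until` and its default values are illustrative and not taken from the Fuel system tests:

```python
import time


def wait_until(is_ready, timeout=600.0, interval=10.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready() until it returns True or timeout seconds elapse.

    Returns True if the condition was met, False on timeout.
    clock and sleep are injectable so the helper can be tested quickly.
    """
    deadline = clock() + timeout
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A test would call something like `wait_until(cluster_is_assembled, timeout=900)` after the restart step and fail with a clear timeout message if it returns False, which separates "did not wait long enough" from "never reassembled".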

QA: Please provide a full diagnostic snapshot instead of just an archive of docker-logs. It would include a lot of additional data relevant for diagnostics, including ceph and ceph-deploy logs, config files from the target nodes, etc.

summary:	- HA. Couldn't create volume after cluster restart (Ceph)
	+ HA. Couldn't create volume after cluster restart
Changed in fuel:
status:	Confirmed → Incomplete
assignee:	Dmitry Borodaenko (dborodaenko) → Vladimir Kuklin (vkuklin)


Kirill Omelchenko (komelchenko) wrote on 2014-08-21:

fail_error_ceph_ha_restart-2014_08_15__10_21_33.tar.gz (9.3 MiB, application/x-tar)

Here's the diagnostic snapshot.

Andrew Woodward (xarses) on 2014-08-26

Changed in fuel:
status:	Incomplete → New


Vladimir Kuklin (vkuklin) wrote on 2014-08-27:

Also, this could be related to https://bugs.launchpad.net/fuel/+bug/1361747 and to the oslo.messaging bug, which is still not completely fixed in master.