Sahara cluster was failed during "preparing" state wit err "Operation' timed out after 120 second" "AMQP server closed the connection"

Bug #1497926 reported by Leontii Istomin
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Mirantis OpenStack
Invalid
Medium
MOS Sahara
7.0.x
Invalid
Medium
MOS Maintenance
8.0.x
Invalid
Medium
MOS Sahara

Bug Description

Sahara cluster
root@node-34:~# sahara --os-tenant-id 282f7a68e2ba46868653b65d8e92ac2c cluster-list; echo `nova list --all-tenant | grep -c ACTIVE` are in ACTIVE state from `nova list --all-tenant | grep -c sahara-cluster`
+-------------------------------------+--------------------------------------+-----------+------------+
| name | id | status | node_count |
+-------------------------------------+--------------------------------------+-----------+------------+
| sahara-cluster-4XpNr2hAKsPr21CuQApE | 28627e7e-ad94-4d5b-b708-03ebffc7a2b3 | Preparing | 236 |
+-------------------------------------+--------------------------------------+-----------+------------+
236 are in ACTIVE state from 236

was failed during "preparing" state with the follwoing error:
2015-09-18 19:13:16.075 20326 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server 192.168.0.3:5673 closed the connection. Check login credentials: Socket closed
2015-09-18 19:13:44.387 20326 DEBUG sahara.openstack.common.loopingcall [-] Dynamic looping call <bound method SaharaPeriodicTasks.run_periodic_tasks of <sahara.service.periodic.SaharaPeriodicTasks object at 0x7f0742b57290>> sleeping for 45.00 seconds _inner /usr/lib/python2.7/dist-packages/sahara/openstack/common/loopingcall.py:132
2015-09-18 19:13:52.347 20326 DEBUG amqp [-] Start from server, version: 0.9, properties: {u'information': u'Licensed under the MPL. See http://www.rabbitmq.com/', u'product': u'RabbitMQ', u'copyright': u'Copyright (C) 2007-2015 Pivotal Software, Inc.', u'capabilities': {u'exchange_exchange_bindings': True, u'connection.blocked': True, u'authentication_failure_close': True, u'basic.nack': True, u'per_consumer_qos': True, u'consumer_priorities': True, u'consumer_cancel_notify': True, u'publisher_confirms': True}, u'cluster_name': <email address hidden>', u'platform': u'Erlang/OTP', u'version': u'3.5.4'}, mechanisms: [u'AMQPLAIN', u'PLAIN'], locales: [u'en_US'] _start /usr/lib/python2.7/dist-packages/amqp/connection.py:751
2015-09-18 19:13:55.691 20326 ERROR sahara.service.ops [-] Error during operating on cluster sahara-cluster-4XpNr2hAKsPr21CuQApE (reason: An error occurred in thread 'configure-instance-sahara-cluster-4xpnr2hakspr21cuqape-worker-ng-159': 'Operation' timed out after 120 second(s)
Error ID: 305c8040-d797-4303-8833-18b8bc566403
Error ID: ea8eed7e-d6ec-4b8b-b953-ebbb04a44265)

rabbitmq cluster was alive at the time:
root@node-35:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-35' ...
[{nodes,[{disc,['rabbit@node-28','rabbit@node-34','rabbit@node-35']}]},
 {running_nodes,['rabbit@node-28','rabbit@node-34','rabbit@node-35']},
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]}]

Cluster Configuration:
Baremetal,Ubuntu,IBP,HA, Neutron-vxlan,Ceph-all,Nova-debug,Nova-quotas,Sahara,7.0-296
Controllers:3 Computes+Ceph:47

api: '1.0'
astute_sha: 6c5b73f93e24cc781c809db9159927655ced5012
auth_required: true
build_id: '296'
build_number: '296'
feature_groups:
- mirantis
fuel-agent_sha: 082a47bf014002e515001be05f99040437281a2d
fuel-library_sha: f2eef7717b15c6c0a3e76ef98ad4c7c4532d56f9
fuel-nailgun-agent_sha: d7027952870a35db8dc52f185bb1158cdd3d1ebd
fuel-ostf_sha: 1f08e6e71021179b9881a824d9c999957fcc7045
fuelmain_sha: 6b83d6a6a75bf7bca3177fcf63b2eebbf1ad0a85
nailgun_sha: 16a39d40120dd4257698795f12de4ae8200b1778
openstack_version: 2015.1.0-7.0
production: docker
python-fuelclient_sha: 2864459e27b0510a0f7aedac6cdf27901ef5c481
release: '7.0'

Diagnostic Snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-09-18_19-28-07.tar.xz

Tags: scale
Dina Belova (dbelova)
Changed in mos:
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Leontii Istomin (listomin) wrote :

from rabbitmq log on node-34:
=ERROR REPORT==== 18-Sep-2015::19:13:54 ===
closing AMQP connection <0.30957.219> (192.168.0.3:45164 -> 192.168.0.26:5673):
{handshake_timeout,frame_header}

192.168.0.26 is node-34
192.168.0.3 is node-28

sahara-engine on node-28 managed the cluster creation.

Revision history for this message
Sergey Reshetnyak (sreshetniak) wrote :

Looks like problem in RabbitMQ. Need more info about RabbitMQ.

Changed in mos:
status: Confirmed → Incomplete
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Hi,

Leonty, could you please describe detailed steps to reproduce for the issue?

Sergey , could you please decribe what information you need to get about RabbitMQ?

Thank you!

Revision history for this message
Leontii Istomin (listomin) wrote :

Steps to reproduce the issue:
1. deploy cluster with Sahara
2. enable batching (https://bugs.launchpad.net/mos/+bug/1498003)
3. increase token expiration time (https://bugs.launchpad.net/mos/+bug/1497323)
4. increase db pool connection parameters (https://bugs.launchpad.net/mos/+bug/1497365)
5. Launch sahara cluster with 236 hadoop nodes

Revision history for this message
Leontii Istomin (listomin) wrote :

Sergey, you can find needed info from Diagnostic Snapshot (from description). There is some useful rabbitmq statustic info in /var/log/rabbitmq folder on controller nodes.

Changed in mos:
status: Incomplete → Confirmed
Revision history for this message
Sergey Reshetnyak (sreshetniak) wrote :

Should be recheck on scale lab for MOS 8.0

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

A month without repro and/or feedback. Closing as Invalid. Feel free to re-open, if it's reproduced again.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.