building a pxc cluster should fail faster if an instance goes to an ERROR state

Bug #1525104 reported by Craig Vyvial
This bug affects 1 person
Affects: OpenStack DBaaS (Trove)
Status: Fix Released
Importance: Medium
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Currently, if you build a pxc cluster (and possibly clusters of other datastores), the taskmanager waits for the instances in the cluster to become active. If one of them never reaches the active state and instead goes into an ERROR state, the taskmanager continues to poll until the timeout. The cluster should fail fast if an instance errors during cluster creation.
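
The fail-fast behavior requested here could look roughly like the sketch below. It is not Trove code: `wait_for_cluster`, `InstanceFailed`, `PollTimeout`, the `get_status` callback, and the "ACTIVE" status string are all hypothetical names chosen for illustration. The only point is that the loop checks for ERROR on every pass instead of waiting for the timeout.

```python
import time


class InstanceFailed(Exception):
    """Hypothetical error: at least one cluster member went to ERROR."""

    def __init__(self, failed_ids):
        super().__init__("instances in ERROR state: %s" % ", ".join(failed_ids))
        self.failed_ids = failed_ids


class PollTimeout(Exception):
    """Hypothetical error: members did not become active in time."""


def wait_for_cluster(instance_ids, get_status, sleep_time=3, time_out=3600,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll until every instance is ACTIVE, but stop at the first ERROR.

    get_status is an assumed callback mapping an instance id to a status
    string such as "BUILD", "ACTIVE" or "ERROR".
    """
    deadline = clock() + time_out
    while True:
        statuses = {i: get_status(i) for i in instance_ids}

        # Fail fast: an errored member will not recover on its own,
        # so there is no reason to keep polling until the timeout.
        failed = sorted(i for i, s in statuses.items() if s == "ERROR")
        if failed:
            raise InstanceFailed(failed)

        if all(s == "ACTIVE" for s in statuses.values()):
            return statuses

        if clock() >= deadline:
            raise PollTimeout("timed out waiting for cluster instances")
        sleep(sleep_time)
```

A caller would presumably catch both exceptions and move the cluster out of its building state in either case, so that a failed build is visible to the user.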

taskmanager logs

2015-12-11 07:36:04.116 INFO trove.taskmanager.models [-] Created instance 8f457133-afaf-48a9-b4ce-822ed21438ce successfully.
2015-12-11 07:36:04.440 INFO trove.taskmanager.models [-] Created instance 33385f84-69a3-4e5e-8a5b-5cc946eee203 successfully.
2015-12-11 07:36:14.478 INFO trove.taskmanager.models [-] Created instance 8aeeb130-7158-450b-b416-06fb84c99cfd successfully.
2015-12-11 07:36:14.578 INFO trove.taskmanager.models [-] Created instance 82c8d676-82ec-462c-8ca7-16965a883cd4 successfully.
2015-12-11 07:36:16.634 INFO trove.taskmanager.models [-] Created instance 33167a34-5526-46c1-91f0-c22eae3d2f86 successfully.
2015-12-11 07:36:19.594 DEBUG trove.taskmanager.models [-] Checking service status of instance ids: [u'142e2e3b-d8e8-4fa4-9caf-8743e13d072c', u'33167a34-5526-46c1-91f0-c22eae3d2f86', u'33385f84-69a3-4e5e-8a5b-5cc946eee203', u'82c8d676-82ec-462c-8ca7-16965a883cd4', u'8aeeb130-7158-450b-b416-06fb84c99cfd', u'8f457133-afaf-48a9-b4ce-822ed21438ce', u'd39b7f8b-592b-42cd-8688-791d66e0f3c6'] from (pid=26327) _all_status_ready /opt/stack/trove/trove/taskmanager/models.py:207

...
continues to poll...

ubuntu@devstack2:~$ trove list --in
+--------------------------------------+---------------------------+-----------+-------------------+--------+-----------+------+
| ID | Name | Datastore | Datastore Version | Status | Flavor ID | Size |
+--------------------------------------+---------------------------+-----------+-------------------+--------+-----------+------+
| 142e2e3b-d8e8-4fa4-9caf-8743e13d072c | mongo-cluster-configsvr-3 | mongodb | 3.0 | ERROR | 7 | 2 |
| 33167a34-5526-46c1-91f0-c22eae3d2f86 | mongo-cluster-rs1-1 | mongodb | 3.0 | BUILD | 7 | 2 |
| 33385f84-69a3-4e5e-8a5b-5cc946eee203 | mongo-cluster-configsvr-2 | mongodb | 3.0 | BUILD | 7 | 2 |
| 82c8d676-82ec-462c-8ca7-16965a883cd4 | mongo-cluster-rs1-2 | mongodb | 3.0 | BUILD | 7 | 2 |
| 8aeeb130-7158-450b-b416-06fb84c99cfd | mongo-cluster-configsvr-1 | mongodb | 3.0 | BUILD | 7 | 2 |
| 8f457133-afaf-48a9-b4ce-822ed21438ce | mongo-cluster-rs1-3 | mongodb | 3.0 | BUILD | 7 | 2 |
| d39b7f8b-592b-42cd-8688-791d66e0f3c6 | mongo-cluster-mongos-1 | mongodb | 3.0 | ERROR | 7 | 2 |
+--------------------------------------+---------------------------+-----------+-------------------+--------+-----------+------+
ubuntu@devstack2:~$ trove cluster-list
+--------------------------------------+---------------+-----------+-------------------+-----------+
| ID | Name | Datastore | Datastore Version | Task Name |
+--------------------------------------+---------------+-----------+-------------------+-----------+
| 9d1a64a0-910f-4734-b39b-a861e36d584f | mongo-cluster | mongodb | 3.0 | BUILDING |
+--------------------------------------+---------------+-----------+-------------------+-----------+

Craig Vyvial (cp16net) wrote :

Related to this bug: when all the instances you build in the cluster go to the ERROR state, the cluster times out waiting for them and never gets out of the BUILDING state.

LOGS:
2015-12-16 18:01:28.506 ERROR oslo.service.loopingcall [-] Fixed interval looping call 'trove.common.utils.poll_and_check' failed
2015-12-16 18:01:28.506 TRACE oslo.service.loopingcall Traceback (most recent call last):
2015-12-16 18:01:28.506 TRACE oslo.service.loopingcall File "/usr/local/lib/python2.7/dist-packages/oslo_service/loopingcall.py", line 135, in _run_loop
2015-12-16 18:01:28.506 TRACE oslo.service.loopingcall result = func(*self.args, **self.kw)
2015-12-16 18:01:28.506 TRACE oslo.service.loopingcall File "/opt/stack/trove/trove/common/utils.py", line 192, in poll_and_check
2015-12-16 18:01:28.506 TRACE oslo.service.loopingcall raise exception.PollTimeOut
2015-12-16 18:01:28.506 TRACE oslo.service.loopingcall PollTimeOut: Polling request timed out.
2015-12-16 18:01:28.506 TRACE oslo.service.loopingcall
2015-12-16 18:01:28.508 ERROR trove.taskmanager.models [-] Timeout for all instance service statuses to become ready.
2015-12-16 18:01:28.508 TRACE trove.taskmanager.models Traceback (most recent call last):
2015-12-16 18:01:28.508 TRACE trove.taskmanager.models File "/opt/stack/trove/trove/taskmanager/models.py", line 244, in _all_instances_ready
2015-12-16 18:01:28.508 TRACE trove.taskmanager.models time_out=CONF.usage_timeout)
2015-12-16 18:01:28.508 TRACE trove.taskmanager.models File "/opt/stack/trove/trove/common/utils.py", line 208, in poll_until
2015-12-16 18:01:28.508 TRACE trove.taskmanager.models sleep_time=sleep_time, time_out=time_out).wait()
2015-12-16 18:01:28.508 TRACE trove.taskmanager.models File "/usr/local/lib/python2.7/dist-packages/eventlet/event.py", line 121, in wait
2015-12-16 18:01:28.508 TRACE trove.taskmanager.models return hubs.get_hub().switch()
2015-12-16 18:01:28.508 TRACE trove.taskmanager.models File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 294, in switch
2015-12-16 18:01:28.508 TRACE trove.taskmanager.models return self.greenlet.switch()
2015-12-16 18:01:28.508 TRACE trove.taskmanager.models File "/usr/local/lib/python2.7/dist-packages/oslo_service/loopingcall.py", line 135, in _run_loop
2015-12-16 18:01:28.508 TRACE trove.taskmanager.models result = func(*self.args, **self.kw)
2015-12-16 18:01:28.508 TRACE trove.taskmanager.models File "/opt/stack/trove/trove/common/utils.py", line 192, in poll_and_check
2015-12-16 18:01:28.508 TRACE trove.taskmanager.models raise exception.PollTimeOut
2015-12-16 18:01:28.508 TRACE trove.taskmanager.models PollTimeOut: Polling request timed out.
2015-12-16 18:01:28.508 TRACE trove.taskmanager.models
2015-12-16 18:01:28.514 DEBUG trove.db.models [-] Saving DBInstance: {u'cluster_id': u'565a9eea-6d67-471a-b810-4d3b353189ad', u'shard_id': None, u'deleted_at': None, u'id': u'00b91f8e-edc5-4248-8c56-010b95879417', u'datastore_version_id': u'93962cb1-9566-44f8-8187-ef37f351c0ef', 'errors': {}, u'hostname': None, u'server_status': None, u'task_description': 'Build error: Server.', u'volume_size': 1, u'typ...

Craig Vyvial (cp16net) wrote :

Here is a paste of the log from the last comment, since it's quite hard to read.

http://paste.openstack.org/show/482162/

Craig Vyvial (cp16net) wrote :

https://github.com/openstack/trove/blob/master/trove/taskmanager/models.py#L203-L258

This logic needs to handle the case where the trove instance status goes to ERROR and the InstanceServiceStatus never gets updated because the guest never comes online. Currently this code depends on the guest coming online, and sometimes it never does.
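
To illustrate the point about two separate status sources, here is a minimal sketch, not the code at the link above. It assumes each cluster member is described by a pair: the instance status reported by the control plane and the service status reported by the guest, which is `None` if the guest never came online. The function names and the "RUNNING" and "FAILED" service status strings are hypothetical.

```python
FAILED, READY, PENDING = "failed", "ready", "pending"

# Assumed guest-side status meaning the datastore service is up.
READY_SERVICE_STATUSES = {"RUNNING"}


def classify_member(instance_status, service_status):
    """Classify one member from both status sources.

    The instance status is checked first, so a member whose server
    errored is treated as failed even when the guest never reported
    anything (service_status is None).
    """
    if instance_status == "ERROR":
        return FAILED
    if service_status == "FAILED":
        return FAILED
    if service_status in READY_SERVICE_STATUSES:
        return READY
    return PENDING


def cluster_outcome(members):
    """members is an iterable of (instance_status, service_status) pairs."""
    states = [classify_member(i, s) for i, s in members]
    if FAILED in states:
        return FAILED
    if states and all(state == READY for state in states):
        return READY
    return PENDING
```

With this shape, a member stuck with no guest heartbeat still resolves to a failure as soon as its instance status is ERROR, which is the case the existing check misses.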

Changed in trove:
importance: Undecided → Medium
Amrith Kumar (amrith)
tags: added: delete-instance-force
tianhui (tianhui) wrote :

I ran into the same problem in Mitaka
http://paste.openstack.org/show/482162/

Zhao Chao (zhaochao1984) wrote :

This should be a duplicate of https://bugs.launchpad.net/trove/+bug/1516763, and that one was already fixed.

Changed in trove:
status: New → Fix Released