p_heat-engine cannot be started by pacemaker once heat-engine returned an error

Bug #1398462 reported by Dennis Dmitriev
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
Invalid
High
Bogdan Dobrelya

Bug Description

http://jenkins-product.srt.mirantis.net:8080/view/6.0_swarm/job/6.0_fuelmain.system_test.ubuntu.thread_4/40/

        Scenario:
            1. Create cluster
            2. Add 1 controller node
            3. Deploy the cluster
            4. Add 2 controller nodes
            5. Deploy changes
            6. Check crm status

Expected result: all resources are functional.

Actual result: 'p_heat-engine' failed to start on the primary controller.

According to the logs, on the steps 2 and 3, 'p_heat-engine' started successfully.

But after the step 5, heat-engine service was trying to start when 'rabbitmq' was not started:
================ (primary controller) node-2: /var/log/heat/heat-engine.log
2014-12-02 09:29:51.853 3187 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 127.0.0.1:5673 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
...

Since several minutes 'rabbitmq' was started, but pacemaker couldn't start the 'heat-engine' :
=============== (primary controller) node-2 $ crm status
...
Failed actions:
    p_heat-engine_start_0 (node=node-2, call=47, rc=1, status=Timed Out, last-rc-change=Tue Dec 2 09:29:49 2014
, queued=60001ms, exec=0ms
): unknown error
...

================ (primary controller) node-2: /var/log/pacemaker.log
<29>Dec 2 09:29:50 node-2 pengine[2657]: notice: LogActions: Start p_heat-engine:0 (node-2)
<28>Dec 2 09:30:49 node-2 lrmd[2655]: warning: child_timeout_callback: p_heat-engine_start_0 process (PID 3063) timed out
<28>Dec 2 09:30:49 node-2 lrmd[2655]: warning: operation_finished: p_heat-engine_start_0:3063 - timed out after 60000ms
<28>Dec 2 09:30:49 node-2 pengine[2657]: warning: unpack_rsc_op: Processing failed op start for p_heat-engine:0 on node-2: unknown error (1)
...
<28>Dec 2 09:33:20 node-2 pengine[2657]: warning: unpack_rsc_op: Processing failed op start for p_heat-engine:0 on node-2: unknown error (1)
<28>Dec 2 09:34:55 node-2 pengine[2657]: warning: unpack_rsc_op: Processing failed op start for p_heat-engine:0 on node-2: unknown error (1)
<28>Dec 2 09:34:55 node-2 pengine[2657]: warning: unpack_rsc_op: Processing failed op start for p_heat-engine:0 on node-2: unknown error (1)
<28>Dec 2 09:35:32 node-2 pengine[2657]: warning: unpack_rsc_op: Processing failed op start for p_heat-engine:0 on node-2: unknown error (1)
<28>Dec 2 09:37:34 node-2 pengine[2657]: warning: unpack_rsc_op: Processing failed op start for p_heat-engine:0 on node-2: unknown error (1)
<28>Dec 2 09:37:34 node-2 pengine[2657]: warning: unpack_rsc_op: Processing failed op start for p_heat-engine:0 on node-2: unknown error (1)

These errors are continued until I run:
crm_resource --resource p_heat-engine --cleanup

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :
Changed in fuel:
importance: Undecided → High
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

According to the logs snapshot, the final state of heat engine and rabbit was ok:
Last updated: Tue Dec 2 10:00:05 2014
Last change: Tue Dec 2 09:59:37 2014 via crm_attribute on node-2
...
 Clone Set: clone_p_heat-engine [p_heat-engine]
     Started: [ node-2 node-4 node-5 ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-2 ]
     Slaves: [ node-4 node-5 ]

Are you sure this bag is valid?

Changed in fuel:
status: New → Invalid
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

There are a messages in node-2's lrmd.log:
2014-12-02T09:28:30.848774+00:00 info: INFO: p_rabbitmq-server: demote: action begin.
2014-12-02T09:30:49.511099+00:00 warning: warning: child_timeout_callback: p_heat-engine_start_0 process (PID 3063) timed out
2014-12-02T09:30:49.511099+00:00 warning: warning: operation_finished: p_heat-engine_start_0:3063 - timed out after 60000ms
2014-12-02T09:30:58.275673+00:00 info: INFO: p_rabbitmq-server: get_monitor(): rabbit app is running and is member of healthy cluster

But the final state was ok, as it shows the latest pcs status taken 10:00:05 2014

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Looks like the bug is invalid and the issue was in system test. Thanks for helping in investigation!

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.