Swarm bays are failing after service failure logic added

Bug #1502329 reported by Andrew Melton
This bug affects 3 people
Affects: Magnum
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: mitaka-1

Bug Description

Logic was implemented to report back to heat when bay services failed to start. This logic performed a 'systemctl status X' for each service at the end of cloud-init. This will almost always fail because we have to start the bay services asynchronously and it is very likely they won't be started by the time the services are checked.
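
A rough sketch of the kind of check described above, purely for illustration (not the exact code that was merged): a final cloud-init step that inspects each bay service immediately and signals FAILURE if any of them is not active yet. Because the services start asynchronously, this check races with their start-up and often fails spuriously. The $WAIT_HANDLE substitution and the unit names are assumptions.

#cloud-config
runcmd:
  # Hypothetical last cloud-init step: docker may still be pulling the
  # swarm image at this point, so these checks can fail even though the
  # services would come up fine a minute later.
  - |
    for svc in swarm-agent swarm-manager; do
      if ! systemctl status "$svc" > /dev/null 2>&1; then
        curl -sf -X PUT -H 'Content-Type: application/json' \
          --data-binary "{\"Status\": \"FAILURE\", \"Reason\": \"$svc failed to start.\", \"Data\": \"FAILED\", \"UniqueId\": \"00000\"}" \
          "$WAIT_HANDLE"
      fi
    done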

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to magnum (master)

Reviewed: https://review.openstack.org/230639
Committed: https://git.openstack.org/cgit/openstack/magnum/commit/?id=156e315e98dfa7a0ebe759bb78639c66eb0d8b77
Submitter: Jenkins
Branch: master

commit 156e315e98dfa7a0ebe759bb78639c66eb0d8b77
Author: Andrew Melton <email address hidden>
Date: Fri Oct 2 20:49:30 2015 +0000

    Fix swarm bay failure reporting

    The old method of detecting failures was very likely to fail in
    many cases because it relied on all bay services being started by
    the time cloud-init finished. This is a problem because the bay
    services are started asynchronously and can take quite a while to
    start.

    The new method relies on systemd's OnFailure directive to kick off
    specific service units when a failure is detected. Both the swarm
    agent and manager have their own failure service so that we are
    not overloading a single wait condition with multiple potential
    failures.

    Change-Id: I7ce4be567517fe948dde0ac7225996967196c9e8
    Closes-bug: #1502329
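
For illustration, a minimal sketch of the OnFailure pattern the commit message describes, written in the same cloud-init write_files style as the existing swarm template fragments. The unit names, file paths, UniqueId values and the $WAIT_HANDLE substitution are placeholders rather than the exact content of the merged fragments, and the docker run arguments are elided:

#cloud-config
write_files:
  # Hypothetical failure-reporting unit: systemd starts it only when the
  # unit naming it in OnFailure= enters the failed state, and it then
  # signals the wait condition handle with status FAILURE.
  - path: /etc/systemd/system/swarm-manager-failure.service
    permissions: '0644'
    content: |
      [Unit]
      Description=Notify Heat that swarm-manager failed to start

      [Service]
      Type=oneshot
      ExecStart=/usr/bin/curl -sf -X PUT -H 'Content-Type: application/json' \
        --data-binary '{"Status": "FAILURE", "Reason": "swarm-manager service failed to start.", "Data": "FAILED", "UniqueId": "10000"}' \
        "$WAIT_HANDLE"

  # The watched service names its failure unit via OnFailure=, so the
  # FAILURE signal is sent whenever the failure actually happens instead
  # of being checked once at the end of cloud-init.
  - path: /etc/systemd/system/swarm-manager.service
    permissions: '0644'
    content: |
      [Unit]
      Description=Swarm Manager
      After=docker.service
      Requires=docker.service
      OnFailure=swarm-manager-failure.service

      [Service]
      ExecStart=/usr/bin/docker run --name swarm-manager swarm manage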

Changed in magnum:
status: New → Fix Committed
Revision history for this message
Eli Qiao (taget-9) wrote :

hi Andrew, thanks for your patch.
I tested it, but the bay still ends up in CREATE_COMPLETE status even though the swarm-manager and swarm-agent services both failed to start.

taget@taget-ThinkStation-P300:~/devstack$ magnum bay-list
+--------------------------------------+-----------+------------+--------------+-----------------+
| uuid                                 | name      | node_count | master_count | status          |
+--------------------------------------+-----------+------------+--------------+-----------------+
| 1e6feb12-0f81-453b-9599-1d0735c4cafa | swarmbay5 | 1          | 1            | CREATE_COMPLETE |
+--------------------------------------+-----------+------------+--------------+-----------------+

bash-4.3# systemctl | grep swarm
● swarm-agent.service loaded failed failed Swarm Agent
● swarm-manager.service loaded failed failed Swarm Manager

checking stack events

taget@taget-ThinkStation-P300:~/devstack$ heat event-list swarmbay5-p5r2buwwtnlc | grep handle
| master_wait_handle | 54f4b3d8-a121-4105-b56b-b913b6486305 | state changed | CREATE_IN_PROGRESS | 2015-10-08T02:46:38 |
| agent_wait_handle | 8e3efabc-7946-4043-bfd0-31bc73806e78 | state changed | CREATE_IN_PROGRESS | 2015-10-08T02:46:39 |
| cloud_init_wait_handle | c7887670-b1e3-41bc-988c-7e05627cbe55 | state changed | CREATE_IN_PROGRESS | 2015-10-08T02:46:41 |
| agent_wait_handle | 359f3951-424c-4f82-b470-598761f6cda6 | state changed | CREATE_COMPLETE | 2015-10-08T02:46:42 |
| master_wait_handle | 9aebf7d2-9125-4e77-8f3b-8618f85dfc41 | state changed | CREATE_COMPLETE | 2015-10-08T02:46:43 |
| cloud_init_wait_handle | a5bb4aa7-b3ab-4712-8a0c-88b9594f2492 | state changed | CREATE_COMPLETE | 2015-10-08T02:46:44 |
| cloud_init_wait_handle | 511a8d27-965a-4a37-a608-928fd60b411f | Signal: status:SUCCESS reason:Setup complete | SIGNAL_COMPLETE | 2015-10-08T02:49:31 |
| agent_wait_handle | 8cdeee96-3d64-46f9-9d4e-5de6cf4e7d0c | Signal: status:SUCCESS reason:Setup complete | SIGNAL_COMPLETE | 2015-10-08T02:50:42 |
| master_wait_handle | 707679b9-a378-4564-bb78-678d714ce24a | Signal: status:SUCCESS reason:Setup complete | SIGNAL_COMPLETE | 2015-10-08T02:50:42 |
| master_wait_handle | 2d6ac162-f14f-45d5-804f-1f6f129d7e2e | Signal: status:FAILURE reason:swarm-manager service failed to start. | SIGNAL_COMPLETE | 2015-10-08T02:50:47 |
| agent_wait_handle | ed25d9c5-894b-49f7-959c-f079306e6f10 | Signal: status:FAILURE reason:swarm-agent service failed to start. | SIGNAL_COMPLETE | 2015-10-08T02:50:56 |

Changed in magnum:
status: Fix Committed → Confirmed
Revision history for this message
Eli Qiao (taget-9) wrote :

| master_wait_handle | 707679b9-a378-4564-bb78-678d714ce24a | Signal: status:SUCCESS reason:Setup complete | SIGNAL_COMPLETE | 2015-10-08T02:50:42 |
| master_wait_handle | 2d6ac162-f14f-45d5-804f-1f6f129d7e2e | Signal: status:FAILURE reason:swarm-manager service failed to start. | SIGNAL_COMPLETE | 2015-10-08T02:50:47 |

As you can see, master_wait_handle was signaled 2 times.

The first signal is status:SUCCESS. That is because ExecStartPost runs right after /usr/bin/docker run --name swarm-manager is launched:

ExecStartPost=/usr/bin/curl -sf -X PUT -H 'Content-Type: application/json' \
  --data-binary '{"Status": "SUCCESS", "Reason": "Setup complete", "Data": "OK", "UniqueId": "00000"}' \
  "$WAIT_HANDLE"

ExecStartPost is executed even if docker run fails (there is no status check at all).

The second signal is status:FAILURE. It is triggered by master-failure.service, but the stack will ignore it, right?

Then how can we know whether the bay is usable or not? master-failure.service does nothing to help.

Revision history for this message
Eli Qiao (taget-9) wrote :

master_wait_handle only accepts 1 signal, so the second one is ignored.

I talked to the Heat developers, and Heat templates don't support asynchronous failure.

see logs:

(17:28:00) eliqiao: hello heaters, I have a question about heat templates: if a wait_handle gets triggered 2 times (first time status:SUCCESS, second time status:FAILURE), I expect the stack creation to fail, but actually it is created successfully. Is this the correct behavior?
(17:28:57) shardy: eliqiao: No, if you have a wait condition with an expected count of 2
(17:29:15) shardy: eliqiao: what count are you using? It will complete OK if the count is 1
(17:29:53) eliqiao: shardy: let me check.
(17:30:13) eliqiao: shardy: not specified, so the default will be 1?
(17:30:32) shardy: eliqiao: yes, you need to set it to 2 if you expect 2 signals before declaring success
(17:30:46) eliqiao: shardy: hmm.. but I am not sure 2 is the right number.
(17:31:25) eliqiao: shardy: I would expect that if I get the 2nd signal, the stack is faulty.
(17:31:41) shardy: https://github.com/openstack/heat-templates/blob/master/hot/native_waitcondition.yaml#L32
(17:31:52) vgridnev [~vgridnev@91.207.132.76] has joined the channel.
(17:32:13) shardy: eliqiao: Unfortunately you can't define an asynchronous failure like that, or heat would never be able to say the stack as a whole is complete
(17:32:29) shardy: you have to decide how many signals mean complete, then set that number
(17:33:08) eliqiao: shardy: ah, too bad..
(17:33:26) shardy: eliqiao: perhaps you can wrap whatever is generating the signals in a script, which waits and only sends success if the expected things happen
(17:35:00) shardy: eliqiao: if you really do need an async alarming mechanism, a ceilometer alarm is probably better than a wait condition
(17:35:05) eliqiao: shardy: hmm... I think it's the issue/problem of my scripts, I have 2 signals (the 2nd one will notify the stack that my service is not correctly set up)
(17:35:59) eliqiao: shardy: :( we don't use the ceilometer service.
(17:36:33) shardy: eliqiao: sounds like you just need to move the signal to after your script can make an accurate determination about if the service is working :)
(17:37:31) eliqiao: shardy: yeah, I will think about it more, thank you for your kind help, I appreciate it very much :)
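
To illustrate shardy's last suggestion, here is a rough sketch (an idea from the discussion, not the merged fix) of a guarded signal: instead of sending SUCCESS unconditionally from ExecStartPost, a small script polls the unit and sends one accurate signal once it can tell whether the service actually came up. The script path, polling interval and the $WAIT_HANDLE substitution are assumptions for illustration:

#cloud-config
write_files:
  - path: /usr/local/bin/notify-swarm-manager-state.sh
    permissions: '0755'
    content: |
      #!/bin/sh
      # Poll the service for up to ~5 minutes before deciding what to signal.
      for i in $(seq 1 30); do
        if systemctl is-active --quiet swarm-manager.service; then
          curl -sf -X PUT -H 'Content-Type: application/json' \
            --data-binary '{"Status": "SUCCESS", "Reason": "Setup complete", "Data": "OK", "UniqueId": "00000"}' \
            "$WAIT_HANDLE"
          exit 0
        fi
        sleep 10
      done
      # Only one signal is ever sent, so the wait condition's default
      # count of 1 reflects the real outcome.
      curl -sf -X PUT -H 'Content-Type: application/json' \
        --data-binary '{"Status": "FAILURE", "Reason": "swarm-manager did not become active.", "Data": "FAILED", "UniqueId": "00000"}' \
        "$WAIT_HANDLE"
      exit 1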

Revision history for this message
Eli Qiao (taget-9) wrote :

hi Andrew:

I don't understand what you mean by "This will almost always fail because we have to start the bay services asynchronously and it is very likely they won't be started by the time the services are checked."

Won't cfn_signal be triggered last?
Will enable_services be called before cfn_signal is triggered?
When cfn_signal does its systemctl status checks, won't all the services already be running?

  swarm_master_init:
    type: "OS::Heat::MultipartMime"
    properties:
      parts:
        - config: {get_resource: disable_selinux}
        - config: {get_resource: remove_docker_key}
        - config: {get_resource: write_heat_params}
        - config: {get_resource: make_cert}
        - config: {get_resource: write_swarm_agent_failure_service}
        - config: {get_resource: write_swarm_manager_failure_service}
        - config: {get_resource: write_docker_service}
        - config: {get_resource: write_docker_socket}
        - config: {get_resource: write_swarm_agent_service}
        - config: {get_resource: write_swarm_master_service}
        - config: {get_resource: configure_swarm}
        - config: {get_resource: add_proxy}
        - config: {get_resource: enable_services}
        - config: {get_resource: cfn_signal}
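
A side note on the ordering being asked about above: the parts run in order, so enable_services does run before cfn_signal, but starting a unit does not mean it has finished coming up. A hypothetical cloud-config equivalent of an enable_services fragment is sketched below (the real fragment is a shell script; the exact unit list and flags are assumptions). Because the units are started without waiting for them to become active, the systemctl status checks in the final cfn_signal step can still find them activating or failed.

#cloud-config
runcmd:
  - systemctl daemon-reload
  - systemctl enable docker.service swarm-agent.service swarm-manager.service
  # --no-block returns as soon as the start job is queued, so docker may
  # still be pulling the swarm image when the next cloud-init part runs.
  - systemctl --no-block start docker.service swarm-agent.service swarm-manager.service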

Revision history for this message
Daneyon Hansen (danehans) wrote :

Eli,

Although I feel we can do a better job of validating swarm_master_init and swarm_agent_init before sending the cfn signal, the existing logic works in my testing. The failure service should only execute the curl command [1] if the swarm manager/agent service is unable to start [2]. Review my gist [3] for detailed output.

[1] https://github.com/openstack/magnum/blob/master/magnum/templates/docker-swarm/fragments/write-bay-failure-service.yaml#L14
[2] https://github.com/openstack/magnum/blob/master/magnum/templates/docker-swarm/fragments/write-swarm-agent-service.yaml#L12
[3] https://gist.github.com/danehans/c3affc88ed9fe56a1efe

Revision history for this message
Eli Qiao (taget-9) wrote :

hi Daneyon,
thanks for your reply.
Maybe your test scenario is different from mine. The root cause in my environment is that the cfn signal was sent 2 times.

First we get a cfn signal from [1]: ExecStart=/usr/bin/docker has started and runs for a while, then it fails and another cfn signal is sent from [2]. So we end up with 2 cfn signals.

| master_wait_handle | 707679b9-a378-4564-bb78-678d714ce24a | Signal: status:SUCCESS reason:Setup complete | SIGNAL_COMPLETE | 2015-10-08T02:50:42 |
| master_wait_handle | 2d6ac162-f14f-45d5-804f-1f6f129d7e2e | Signal: status:FAILURE reason:swarm-manager service failed to start. | SIGNAL_COMPLETE | 2015-10-08T02:50:47 |

I think the reason you got a CREATE_FAILED signal is that in your run the stack received [2] before [1].

[1] https://github.com/openstack/magnum/blob/master/magnum/templates/docker-swarm/fragments/write-swarm-agent-service.yaml#L22
[2] https://github.com/openstack/magnum/blob/master/magnum/templates/docker-swarm/fragments/write-bay-failure-service.yaml#L14

To test this scenario, you can use an atomic-5 image (this is a bad image with docker v1.7.1). Tango has also built a new atomic image at https://fedorapeople.org/groups/magnum/fedora-21-atomic-6-d181.qcow2

(01:49:17) Tango: eghobo_: ping
(01:49:37) eghobo_: Tango: what's up?
(01:50:31) Tango: eghobo_: Hi Egor, I built a new image yesterday with Docker 1.8.1, but my devstack has been acting up so I haven't been able to check the "docker run" problem
(01:50:47) Tango: eghobo_: Wonder if you might have some time to give it a try
(01:51:22) eghobo_: sure, I need image link
(01:51:33) Tango: eghobo_: https://fedorapeople.org/groups/magnum/fedora-21-atomic-6-d181.qcow2
(01:52:30) eghobo_: cannot promise now (baby-sitting today), but will do today
(01:53:01) Tango: eghobo_: ok, np, thanks Egor.

Revision history for this message
Egor Guz (eghobo) wrote :

Eli, actually I tested Andrew's code with an incorrect image; the service failed and the message was passed to Heat.

Adrian Otto (aotto)
Changed in magnum:
milestone: none → mitaka-1
Changed in magnum:
status: Confirmed → Invalid
Revision history for this message
Spyros Trigazis (strigazi) wrote :

Reopen if the issue occurs again.

Revision history for this message
Murali Allada (murali-allada) wrote :

please reopen this bug if you see this issue again.
