Deployment Failed: "MCollective agents 'X' didn't respond within the allotted time."

Bug #1359064 reported by Tyler Wilson

Affects: Fuel for OpenStack
Status: Confirmed
Importance: High
Assigned to: Unassigned

Bug Description

{"build_id": "2014-08-19_05-27-34", "ostf_sha": "d2a894d228c1f3c22595a77f04b1e00d09d8e463", "build_number": "81", "auth_required": true, "api": "1.0", "nailgun_sha": "36d27ff737b361f92093986d061bbfc1670bee45", "production": "docker", "fuelmain_sha": "a19c69d1b80abb6dfa5bc8caafb11b32551c2cf8", "astute_sha": "efe3cb3668b9079e68fb1534fd4649ac45a344e1", "feature_groups": ["experimental"], "release": "5.1", "fuellib_sha": "85bf0728a8cbb5385cd644b2103c878b4251ccca"}

1. Create new environment (Ubuntu, HA mode)
2. Choose GRE segmentation
3. Add controller x3 + Ceilometer
4. Add computes x3 + Ceph OSD

Deployment failed because it was unable to upload the TestVM image (a manual re-run of the upload is sketched after the log excerpt):

2014-08-20 07:20:15 ERR
[495] Error running RPC method deploy: Upload cirros "TestVM" image failed, trace: ["/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/post_deploy_actions/upload_cirros_image.rb:96:in `raise_cirros_error'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/post_deploy_actions/upload_cirros_image.rb:79:in `process'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/post_deploy_actions.rb:30:in `block in process'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/post_deploy_actions.rb:30:in `each'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/post_deploy_actions.rb:30:in `process'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/orchestrator.rb:45:in `deploy'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/dispatcher.rb:58:in `deploy'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:136:in `dispatch_message'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:99:in `block in dispatch'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:64:in `call'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:64:in `block in each'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:56:in `each'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:56:in `each'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:97:in `each_with_index'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:97:in `dispatch'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:85:in `block in perform_main_job'"]
2014-08-20 07:20:15 ERR
[495] 0c59870e-ec8b-4944-bf08-462f3f80f71e: Upload cirros "TestVM" image failed
2014-08-20 07:20:15 ERR
[495] 0c59870e-ec8b-4944-bf08-462f3f80f71e: cmd: . /root/openrc && /usr/bin/glance image-create --name 'TestVM' --is-public true --container-format='bare' --disk-format='qcow2' --min-ram=64 --property murano_image_info='{"title": "Murano Demo", "type": "cirros.demo"}' --file '/usr/share/cirros-testvm/cirros-x86_64-disk.img'
                                               mcollective error: 0c59870e-ec8b-4944-bf08-462f3f80f71e: MCollective agents '24' didn't respond within the allotted time.
2014-08-20 07:20:15 ERR
[495] MCollective agents '24' didn't respond within the allotted time.
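
As a sanity check, the same upload can be re-run by hand once the controllers respond again, to confirm the failure is transient rather than a glance problem. A minimal sketch, assuming the openrc file and cirros image path shown in the trace above are present on a controller:

# Run on a controller node (e.g. node-24); command copied from the Astute log above
. /root/openrc
glance image-list | grep TestVM || \
  glance image-create --name 'TestVM' --is-public true \
    --container-format='bare' --disk-format='qcow2' --min-ram=64 \
    --property murano_image_info='{"title": "Murano Demo", "type": "cirros.demo"}' \
    --file '/usr/share/cirros-testvm/cirros-x86_64-disk.img'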

'ceph -s' from the failed cluster:

root@node-24:~# ceph -s
    cluster aba34f9a-7b28-415c-b328-a12987338939
     health HEALTH_WARN pool .rgw.root pg_num 512 > pgp_num 8
     monmap e3: 3 mons at {node-24=192.168.0.3:6789/0,node-25=192.168.0.4:6789/0,node-26=192.168.0.5:6789/0}, election epoch 10, quorum 0,1,2 node-24,node-25,node-26
     osdmap e37: 9 osds: 9 up, 9 in
      pgmap v49: 2240 pgs, 7 pools, 0 bytes data, 0 objects
            18792 MB used, 4266 GB / 4284 GB avail
                2240 active+clean

After running 'ceph osd pool set .rgw.root pgp_num 512':

root@node-24:~# ceph -s
    cluster aba34f9a-7b28-415c-b328-a12987338939
     health HEALTH_OK
     monmap e3: 3 mons at {node-24=192.168.0.3:6789/0,node-25=192.168.0.4:6789/0,node-26=192.168.0.5:6789/0}, election epoch 10, quorum 0,1,2 node-24,node-25,node-26
     osdmap e39: 9 osds: 9 up, 9 in
      pgmap v51: 2240 pgs, 7 pools, 0 bytes data, 0 objects
            18792 MB used, 4266 GB / 4284 GB avail
                2240 active+clean

However, I have no way of doing this before the deployment. (A scripted version of the same workaround is sketched below.)
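
For reference, the same workaround can be scripted across all pools right after deployment. A minimal sketch, assuming the stock ceph/rados CLI on a monitor node and that raising pgp_num to match pg_num is the desired fix for every pool:

# Align pgp_num with pg_num for every pool where the two differ
for pool in $(rados lspools); do
    pg_num=$(ceph osd pool get "$pool" pg_num | awk '{print $2}')
    pgp_num=$(ceph osd pool get "$pool" pgp_num | awk '{print $2}')
    if [ "$pg_num" != "$pgp_num" ]; then
        echo "pool $pool: pg_num=$pg_num pgp_num=$pgp_num -> setting pgp_num=$pg_num"
        ceph osd pool set "$pool" pgp_num "$pg_num"
    fi
done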

Revision history for this message
Tyler Wilson (loth) wrote :

As instructed, I also issued 'mco ping' four times (a targeted per-node check is sketched after the output):

[root@fuel ~]# mco ping
27 time=20.08 ms
29 time=20.51 ms
master time=20.95 ms
28 time=22.45 ms
15 time=22.89 ms
31 time=23.24 ms
32 time=23.65 ms
25 time=23.99 ms
24 time=24.36 ms
26 time=24.78 ms

---- ping statistics ----
10 replies max: 24.78 min: 20.08 avg: 22.69
[root@fuel ~]# mco ping
master time=20.02 ms
27 time=20.52 ms
28 time=21.89 ms
15 time=22.34 ms
29 time=22.67 ms
31 time=23.03 ms
32 time=23.40 ms
25 time=23.77 ms
24 time=24.15 ms
26 time=24.54 ms

---- ping statistics ----
10 replies max: 24.54 min: 20.02 avg: 22.63
[root@fuel ~]# mco ping
master time=20.13 ms
15 time=20.50 ms
28 time=22.03 ms
29 time=22.46 ms
32 time=22.87 ms
31 time=23.26 ms
27 time=23.64 ms
26 time=24.04 ms
25 time=24.42 ms
24 time=24.82 ms

---- ping statistics ----
10 replies max: 24.82 min: 20.13 avg: 22.82
[root@fuel ~]# mco ping
28 time=19.79 ms
29 time=20.22 ms
master time=21.67 ms
27 time=22.13 ms
31 time=22.52 ms
32 time=22.96 ms
26 time=23.33 ms
24 time=23.74 ms
25 time=24.17 ms
15 time=32.55 ms

---- ping statistics ----
10 replies max: 32.55 min: 19.79 avg: 23.31
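
Since the error names agent '24' specifically, that node can also be checked directly. A minimal sketch, assuming the standard MCollective identity filter and inventory application on the Fuel master:

mco ping -I 24        # identity filter: ping only the node reported in the error
mco inventory 24      # list the agents, facts and classes that node is advertising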

Changed in fuel:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Dmitry Borodaenko (dborodaenko)
milestone: none → 5.1
Revision history for this message
Dima Shulyak (dshulyak) wrote :

Reposting here a way to reproduce.

I reproduced this behaviour with the following procedure (a virsh sketch follows the list):

1. Start the deployment.
2. Wait until the deployment has started on the slaves.
3. Find the corresponding VM ID on the host system and suspend it with virsh.
4. Wait some time, then look at the logs below.
5. Resume the VM with virsh.
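
A minimal sketch of steps 3-5 on the host running the slave VMs (the domain name 'slave-12' is hypothetical; use whatever 'virsh list' shows for the node being deployed):

virsh list --all          # find the domain that corresponds to the deploying slave
virsh suspend slave-12    # pause the VM mid-deployment
sleep 300                 # wait long enough for Astute's mcollective retries to be exhausted
virsh resume slave-12     # the node comes back, but the deployment is already marked as failed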

2014-08-20 10:24:34 INFO
[417] Casting message to Nailgun: {"method"=>"deploy_resp", "args"=>{"task_uuid"=>"8b0fe7d8-2e4f-4509-8eea-1f9af1f347ee", "nodes"=>[{"uid"=>"12", "status"=>"error", "error_type"=>"deploy", "role"=>"compute"}]}}
2014-08-20 10:24:34 DEBUG
[417] Nodes statuses: {"succeed"=>[], "error"=>[], "running"=>["13"]}
2014-08-20 10:24:34 DEBUG
[417] Data send by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"12", "status"=>"error", "error_type"=>"deploy", "role"=>"compute"}]}
2014-08-20 10:24:34 DEBUG
[417] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"12", "status"=>"error", "error_type"=>"deploy", "role"=>"compute"}]}
2014-08-20 10:23:52 DEBUG
[417] Retry #5 to run mcollective agent on nodes: '12'
2014-08-20 10:23:09 DEBUG
[417] Retry #4 to run mcollective agent on nodes: '12'
2014-08-20 10:22:27 DEBUG
[417] Retry #3 to run mcollective agent on nodes: '12'
2014-08-20 10:21:44 DEBUG
[417] Retry #2 to run mcollective agent on nodes: '12'
2014-08-20 10:21:02 DEBUG
[417] Retry #1 to run mcollective agent on nodes: '12'

Revision history for this message
Dima Shulyak (dshulyak) wrote :

My proposal would be to retry the last action for failed nodes.
That way, if a node failed only because of a reconnect issue, the retry will succeed and the deployment can continue.

Vladimir, what do you think about such a solution? (A rough shell-level illustration of the retry idea follows.)
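
Purely as an illustration of the retry-on-transient-failure idea (not the actual Astute change), a shell-level equivalent for the upload step from the original report might look like this; the murano property from the log is omitted for brevity, and the retry count and delay are arbitrary:

# Retry the failed action a few times before giving up
upload_testvm() {
    . /root/openrc && /usr/bin/glance image-create --name 'TestVM' --is-public true \
        --container-format='bare' --disk-format='qcow2' --min-ram=64 \
        --file '/usr/share/cirros-testvm/cirros-x86_64-disk.img'
}

for attempt in 1 2 3; do
    upload_testvm && break       # stop retrying as soon as the action succeeds
    echo "attempt $attempt failed, retrying in 60s"
    sleep 60
done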

Changed in fuel:
assignee: Dmitry Borodaenko (dborodaenko) → Fuel Astute Team (fuel-astute)