Deployment fails with Ceph enabled and single controller

Bug #1628614 reported by Scott Machtmes
Affects              Status      Importance   Assigned to          Milestone
Fuel for OpenStack   Won't Fix   Medium       Oleksiy Molchanov
Mitaka               Invalid     Medium       Oleksiy Molchanov

Bug Description

Summary: Deployment of the environment fails with
"Error Deployment has failed. All nodes are finished. Failed tasks: Task[enable_rados/1] Stopping the deployment process!"

Environment: MOS 9.0; 1 controller, 1 compute, Ceph enabled for storage

Reproducible: Yes

Excerpt from astute.log:

2016-09-26 22:15:06 DEBUG [9169] Node[1]: Node 1: task enable_rados, task status running
2016-09-26 22:15:06 WARNING [9169] Puppet agent 1 didn't respond within the allotted time
2016-09-26 22:15:06 DEBUG [9169] Task time summary: enable_rados with status failed on node 1 took 00:03:00
2016-09-26 22:15:06 DEBUG [9169] Node[1]: Decreasing node concurrency to: 0
2016-09-26 22:15:06 DEBUG [9169] Cluster[]: Count faild node 1 for group primary-controller
2016-09-26 22:15:06 DEBUG [9169] Cluster[]: Count faild node 1 for group ceph-osd
2016-09-26 22:15:06 WARNING [9169] Cluster[]: Fault tolerance exceeded the stop conditions [{"fault_tolerance"=>-1, "name"=>"primary-controller", "node_ids"=>[], "failed_node_ids"=>["1"]}]
2016-09-26 22:15:06 INFO [9169] Cluster[]: Stop deployment by internal reason
2016-09-26 22:15:06 DEBUG [9169] Cluster[]: Process node: Node[2]
2016-09-26 22:15:06 DEBUG [9169] Cluster[]: Process node: Node[master]
2016-09-26 22:15:06 DEBUG [9169] Cluster[]: Process node: Node[virtual_sync_node]
2016-09-26 22:15:06 WARNING [9169] Validation of node:
{"uid"=>nil,
 "status"=>"stopped",
 "progress"=>83,
 "deployment_graph_task_name"=>"post_deployment_start",
 "task_status"=>"skipped",
 "custom"=>{}}
 for report failed: Node uid is not provided
2016-09-26 22:15:06 INFO [9169] Casting message to Nailgun:
{"method"=>"deploy_resp",
 "args"=>
  {"task_uuid"=>"3da40f03-f371-417a-a5a3-fb44251da4e3",
   "nodes"=>
    [{"uid"=>"1",
      "status"=>"error",
      "progress"=>100,
      "deployment_graph_task_name"=>"enable_rados",
      "task_status"=>"error",
      "custom"=>
       {:time=>
         {"anchor"=>0.00028215,
          "config_retrieval"=>0.723341508,
          "file"=>0.010434416,
          "filebucket"=>7.5317e-05,
          "package"=>0.000582855,
          "pcmk_colocation"=>0.525652009,
          "pcmk_resource"=>0.452966184,
          "schedule"=>0.000541694,
          "service"=>7.676765051,
          "total"=>9.390641184000001,
          "last_run"=>1474927926},
        :resources=>
         {"changed_resources"=>
           "File[ocf_handler_ntp],File[/etc/ntp.conf],Pcmk_resource[p_ntp],Pcmk_colocation[ntp-with-vrouter-ns],Service[ntp]",
          "failed_resources"=>"",
          "failed"=>0,
          "changed"=>5,
          "total"=>15,
          "restarted"=>0,
          "out_of_sync"=>5,
          "failed_to_restart"=>0,
          "scheduled"=>0,
          "skipped"=>0},
        :changes=>{"total"=>5},
        :events=>{"failure"=>0, "success"=>5, "total"=>5},
        :version=>{"config"=>1474927915, "puppet"=>"3.8.5"},
        :raw_report=>nil,
        :status=>"running",
        :running=>1,
        :enabled=>1,
        :idling=>0,
        :stopped=>0,
        :lastrun=>1474927926,
        :runtime=>180,
        :output=>"Currently running; last completed run 180 seconds ago"},
      "error_type"=>"deploy"}]}}

2016-09-26 22:15:06 INFO [9169] Cluster[]: All nodes are finished. Failed tasks: Task[enable_rados/1] Stopping the deployment process!
2016-09-26 22:15:06 INFO [9169] Casting message to Nailgun:
{"method"=>"deploy_resp",
 "args"=>
  {"task_uuid"=>"3da40f03-f371-417a-a5a3-fb44251da4e3",
   "nodes"=>
    [{"uid"=>nil,
      "status"=>"stopped",
      "progress"=>83,
      "deployment_graph_task_name"=>"post_deployment_start",
      "task_status"=>"skipped",
      "custom"=>{}}]}}

2016-09-26 22:15:06 INFO [9169] Deployment summary: time was spent 00:34:10
2016-09-26 22:15:06 INFO [9169] Casting message to Nailgun:
{"method"=>"deploy_resp",
 "args"=>
  {"task_uuid"=>"3da40f03-f371-417a-a5a3-fb44251da4e3",
   "status"=>"error",
   "progress"=>100,
   "error"=>
    "All nodes are finished. Failed tasks: Task[enable_rados/1] Stopping the deployment process!"}}

Oleksiy Molchanov (omolchanov) wrote :

Marking as Incomplete; please attach a diagnostic snapshot for debugging.

Changed in fuel:
status: New → Incomplete
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
milestone: none → 9.2
tags: added: area-library
Scott Machtmes (smachtmes) wrote :

Uploading a snapshot. This run still failed, but in a different Ceph task (top-role-ceph-osd/2).

Scott Machtmes (smachtmes) wrote :
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
importance: Undecided → Medium
status: Incomplete → Confirmed
Oleksiy Molchanov (omolchanov) wrote :

The diagnostic snapshot doesn't contain logs from the controller. It looks like you have a network connectivity problem: the OSD node is not able to connect to the controller. Is your controller up?
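
If it is a connectivity problem, a quick check from the OSD node toward the controller's storage address can confirm it. A minimal sketch, assuming the controller's storage-network IP is the 172.17.2.129 address that appears later in this report and that netcat is installed on the node:

# On the OSD node: basic reachability of the controller over the storage network
ping -c 3 172.17.2.129

# TCP check of the Ceph monitor port (6789)
nc -zv 172.17.2.129 6789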

Changed in fuel:
status: Confirmed → Incomplete
Scott Machtmes (smachtmes) wrote :

Strange. I'll have to run the scenario again and repost the logs.

Scott Machtmes (smachtmes) wrote :

Here is a new diagnostic snapshot from another similar install that failed.

Changed in fuel:
status: Incomplete → New
Oleksiy Molchanov (omolchanov) wrote :

Sorry for such a late comment. The problem is related to connectivity between node-4 and node-3 on the 172.17.2.128/27 storage network.

[node-4][WARNING] 2016-11-04 17:13:17.969031 7fd58819e700 0 -- :/2756036339 >> 172.17.2.129:6789/0 pipe(0x7fd5780008c0 sd=6 :0 s=1 pgs=0 cs=0 l=1 c=0x7fd578016860).fault

I can see that the storage interfaces on both machines are up, and the firewall on node-3 is not blocking traffic to port 6789.

So I suggest you start 'ceph -w' on node-4 and use tcpdump to check the traffic leaving node-4 and arriving at node-3 on port 6789 (a sketch of these commands follows below).
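
A minimal sketch of that check, assuming the storage interface name on each node is known (replace STORAGE_IFACE; the addresses are the ones quoted above):

# On node-4, terminal 1: open a monitor session so traffic to 172.17.2.129:6789 is generated
ceph -w

# On node-4, terminal 2: watch that traffic leave the storage interface
tcpdump -ni STORAGE_IFACE 'tcp port 6789 and host 172.17.2.129'

# On node-3 (172.17.2.129): confirm the same packets arrive
tcpdump -ni STORAGE_IFACE 'tcp port 6789'

# Packets that leave node-4 but never show up on node-3 point at the network path
# (switch/VLAN/MTU) rather than at Ceph or the Fuel task itself.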

Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Scott Machtmes (smachtmes)
Changed in fuel:
status: New → Incomplete
Changed in fuel:
assignee: Scott Machtmes (smachtmes) → Oleksiy Molchanov (omolchanov)
Roman Vyalov (r0mikiam)
Changed in fuel:
status: Incomplete → Won't Fix