3 out of 4 Ceph OSD nodes failing to deploy

Bug #1473824 reported by Rob Neff
Affects             Status     Importance  Assigned to               Milestone
Fuel for OpenStack  Won't Fix  Medium      Fuel Python (Deprecated)
8.0.x               Won't Fix  Medium      Fuel Python (Deprecated)

Bug Description

1. Used Fuel 6.0 to create an OpenStack cluster with 1 controller & 4 compute nodes
2. Successfully deployed
---------------
3. Added 4 nodes as Ceph OSDs, configured networking & hard drives
4. Clicked Deploy

Result:
Error:
2015-07-11 02:00:19 ERR (/Stage[main]/Ceph::Conf/Exec[ceph-deploy config pull]/returns) change from notrun to 0 failed: ceph-deploy --overwrite-conf config pull node-2 returned 1 instead of one of [0]
2015-07-11 02:00:19 ERR /usr/bin/puppet:4
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/util/command_line.rb:91:in `execute'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/util/command_line.rb:137:in `run'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/application.rb:364:in `run'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/util.rb:478:in `exit_on_fail'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/application.rb:364:in `run'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/application.rb:470:in `plugin_hook'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/application.rb:364:in `run'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/application/apply.rb:146:in `run_command'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/application/apply.rb:218:in `main'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/application/apply.rb:268:in `apply_catalog'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/configurer.rb:192:in `run'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/configurer.rb:124:in `apply_catalog'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/util.rb:160:in `benchmark'
2015-07-11 02:00:19 ERR /usr/lib/ruby/1.8/benchmark.rb:308:in `realtime'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/util.rb:161:in `benchmark'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/configurer.rb:125:in `apply_catalog'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/resource/catalog.rb:163:in `apply'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/transaction/report.rb:108:in `as_logging_destination'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/util/log.rb:149:in `with_destination'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/resource/catalog.rb:164:in `apply'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/transaction.rb:108:in `evaluate'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/graph/relationship_graph.rb:118:in `traverse'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/transaction.rb:117:in `evaluate'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/util.rb:326:in `thinmark'
2015-07-11 02:00:19 ERR /usr/lib/ruby/1.8/benchmark.rb:308:in `realtime'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/util.rb:327:in `thinmark'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/transaction.rb:117:in `evaluate'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/transaction.rb:117:in `call'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/transaction.rb:187:in `eval_resource'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/transaction.rb:174:in `apply'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/transaction/resource_harness.rb:18:in `evaluate'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/transaction/resource_harness.rb:81:in `perform_changes'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/transaction/resource_harness.rb:81:in `each'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/transaction/resource_harness.rb:82:in `perform_changes'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/transaction/resource_harness.rb:130:in `sync_if_needed'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/transaction/resource_harness.rb:193:in `sync'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/type/exec.rb:120:in `sync'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/util/errors.rb:97:in `fail'
2015-07-11 02:00:19 ERR ceph-deploy --overwrite-conf config pull node-2 returned 1 instead of one of [0]

Revision history for this message
Rob Neff (rob-neff) wrote :

Every time I try to submit the logs (995 MB), Launchpad takes 8 minutes to upload, then after reaching 100%, it gives me this:

Timeout error
Sorry, something just went wrong in Launchpad.

We’ve recorded what happened, and we’ll fix it as soon as possible. Apologies for the inconvenience.

Trying again in a couple of minutes might work.

(Error ID: OOPS-c0dd22466f83815695539d8057071ee5)

Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

Sorry, I have to mark this as Incomplete. We need a diagnostic snapshot; you can upload it to a different host if you are not able to upload it here.

Changed in fuel:
milestone: none → 7.0
assignee: nobody → Oleksiy Molchanov (omolchanov)
importance: Undecided → High
status: New → Incomplete
Revision history for this message
Rob Neff (rob-neff) wrote :

Thanks for the suggestion. I have uploaded the Snapshot to Dropbox.

https://dl.dropboxusercontent.com/u/3516115/fuel-snapshot-2015-07-11_17-37-13.tgz

Revision history for this message
Rob Neff (rob-neff) wrote :

Changed back to New since I included the Fuel Snapshot.

Changed in fuel:
status: Incomplete → New
Revision history for this message
Rob Neff (rob-neff) wrote :
Dmitry Ilyin (idv1985)
Changed in fuel:
status: New → Confirmed
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

Rob, what exactly did you do in the "configured networking & hard drives" step? I mean the networking part.

Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
Rob Neff (rob-neff) wrote :

Network (all systems)
  1G  - Eth0 - PXE
  1G  - Eth1 - Public
  40G - Eth2 - Management, Storage, Private

Hard Drives
1U Compute Servers (no errors)
  1x 800GB RAID-0 Boot Drive

2U Ceph Servers (errors on 3 out of 4)
  1x  300GB Intel S3500 - Operating System
  20x 600GB 10k RPM - Ceph OSD
  1x  800GB Intel S3700 - Ceph Journal

Revision history for this message
Rob Neff (rob-neff) wrote :

Added new information

Changed in fuel:
status: Incomplete → New
Changed in fuel:
status: New → Confirmed
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

Deeper investigation shows that Ceph failed while gathering keys from the primary controller.

node-33:
http://paste.openstack.org/show/380324/
node-34:
http://paste.openstack.org/show/380325/
node-35:
http://paste.openstack.org/show/380326/

As you can see, node-33 successfully SSHed to node-2 and fetched the keys, but node-34 and node-35 failed with a "no route to host" error. The logs on the primary controller (node-2) state that its network was up and running, and everything else was fine too.

From my point of view, the problem was with networking in your environment, but I will try to reproduce it in my environment anyway and post the results here.
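
For anyone hitting the same "no route to host" symptom, a quick reachability check from each Ceph node toward the primary controller can narrow things down before redeploying. The following is only a minimal sketch, not part of Fuel: the helper name `can_reach` is made up for illustration, and it assumes SSH listens on the default TCP port 22.

```python
import socket


def can_reach(host, port=22, timeout=5.0):
    """Try to open a TCP connection to host:port.

    Returns (True, "") on success, or (False, reason) where reason is the
    text of the OS-level error (for example "No route to host").
    """
    try:
        # create_connection resolves the name and connects in one step.
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:
        # Covers refused connections, timeouts, unreachable hosts and
        # name-resolution failures, which are all OSError subclasses.
        return False, str(exc)
```

Running `can_reach("node-2")` on node-34 and node-35 would return the failure reason as a string, which is easier to act on than the exit status of the wrapped command.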

Revision history for this message
Rob Neff (rob-neff) wrote :

Thanks Oleksiy. We were able to redeploy, and the environment is now working without any physical or logical networking changes.

If you can just improve the error message, that would really help my confidence in Fuel. Often if something fails, it is very difficult to determine what went wrong and the final resolution from the customer side is "Fuel is buggy".

If you change:
err: ceph-deploy --overwrite-conf config pull node-2 returned 1 instead of one of [0]

to:
err: Could not successfully ssh into node-2 and fetch keys. Check the network connectivity from this node to node-2.

Then I would not have logged a bug, I would have double-checked my networking and retried the deployment.

Hopefully this feedback is useful. Thanks for investigating.
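
The kind of message translation suggested above could look roughly like the sketch below. It is an illustration only: Fuel 6.0 runs this command through a Puppet exec resource, not through Python, and the helper names `friendly_failure` and `run_with_hint` are hypothetical. A non-zero exit status can also have causes other than networking, so the hint is worded as something to check.

```python
import subprocess


def friendly_failure(command, returncode, target):
    """Build an actionable message from a bare non-zero exit status."""
    return (
        f"'{' '.join(command)}' returned {returncode}. "
        f"Could not ssh into {target} and fetch keys. "
        f"Check the network connectivity from this node to {target}."
    )


def run_with_hint(command, target):
    """Run command and return its stdout.

    On a non-zero exit status, raise RuntimeError carrying the
    connectivity hint instead of only the exit code.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(friendly_failure(command, result.returncode, target))
    return result.stdout
```

Called as `run_with_hint(["ceph-deploy", "--overwrite-conf", "config", "pull", "node-2"], "node-2")`, a failure would surface the original command and exit status together with the suggested next step.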

Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

Hi Rob,

To debug Fuel properly, first go to the Astute log in the UI and check which task failed first and on which node. Then go to the Puppet log on that node and check the ERROR messages; there you can see the root cause.

Anyway, I am passing this to the Python team to get their comments on your suggestion.
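
The second half of that procedure (finding the first error in a node's Puppet log) can be sketched in a few lines. This assumes the log lines look like the ones pasted in the bug description, that is `YYYY-MM-DD HH:MM:SS LEVEL message`; the on-disk log format may differ, and the function name `first_error` is made up for illustration.

```python
import re

# Timestamp, level and message, in the layout shown in the bug description.
LOG_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.*)$")


def first_error(lines):
    """Return (timestamp, message) for the first ERR line, or None."""
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match and match.group(2) == "ERR":
            return match.group(1), match.group(3)
    return None
```

The first ERR line is usually the one worth reading; in this report it names the failing `ceph-deploy` command, while the lines after it are only the Ruby backtrace.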

Changed in fuel:
assignee: Oleksiy Molchanov (omolchanov) → Fuel Python Team (fuel-python)
importance: High → Medium
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

The idea of showing a more user-friendly message is good, but this bug was reported against 6.0. We changed the deployment behavior in 6.1: you now get a message about the failed tasks instead of a big Puppet log, which should help you find the problem faster. The message itself is still a problem, though. It is better in 6.1+, but still not as helpful as it could be.

Moved to 8.0

Changed in fuel:
status: Confirmed → Won't Fix
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
milestone: 7.0 → 8.0
status: Won't Fix → Triaged
no longer affects: fuel/8.0.x
Dmitry Pyzhov (dpyzhov)
tags: added: area-python
Revision history for this message
Dmitry Belyaninov (dbelyaninov) wrote :

Reproduced:
[root@nailgun ~]# cat /etc/fuel/version.yaml
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "361"
  build_id: "361"
  fuel-nailgun_sha: "53c72a9600158bea873eec2af1322a716e079ea0"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "7463551bc74841d1049869aaee777634fb0e5149"
  fuel-nailgun-agent_sha: "92ebd5ade6fab60897761bfa084aefc320bff246"
  astute_sha: "c7ca63a49216744e0bfdfff5cb527556aad2e2a5"
  fuel-library_sha: "ba8063d34ff6419bddf2a82b1de1f37108d96082"
  fuel-ostf_sha: "889ddb0f1a4fa5f839fd4ea0c0017a3c181aa0c1"
  fuel-mirror_sha: "8adb10618bb72bb36bb018386d329b494b036573"
  fuelmenu_sha: "824f6d3ebdc10daf2f7195c82a8ca66da5abee99"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "9f0ba4577915ce1e77f5dc9c639a5ef66ca45896"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "07d5f1c3e1b352cb713852a3a96022ddb8fe2676"

Revision history for this message
Dmitry Belyaninov (dbelyaninov) wrote :

Scenario:
1. Create new environment
2. Choose Neutron VxLAN
3. Choose Ceph for images
4. Choose Ceph RadosGW for objects
5. Add 3 controller
6. Add 2 compute
7. Add 1 cinder
8. Add 3 ceph nodes
9. Replace the default DNS servers with any 2 public DNS servers in 'Host OS DNS Servers' on the Settings tab
10. Replace the default NTP servers with any 2 public NTP servers in 'Host OS NTP Servers' on the Settings tab
11. Verify networks
12. Deploy the environment

Revision history for this message
Vladimir Khlyunev (vkhlyunev) wrote :

Raised to High: we have to fix this bug in 8.0. New logs were provided by Dmitry Belyaninov.

Changed in fuel:
importance: Medium → High
Revision history for this message
Dmitry Tyzhnenko (dtyzhnenko) wrote :

Reproduced with another scenario:

Deployment with 3 controllers, NeutronVxLAN, both Ceph
Scenario:
1. Create new environment
2. Choose Neutron, VxLAN
3. Choose Ceph for volumes and Ceph for images
4. Change ceph replication factor to 3
5. Add 3 controller
6. Add 2 compute
7. Add 3 ceph
8. Change disk configuration for all Ceph nodes. Change 'Ceph' volume for vdc
9. Make management and storage networks untagged
10. Move each network to separate interface
11. Replace the default DNS servers with any 2 public DNS servers in 'Host OS DNS Servers' on the Settings tab
12. Replace the default NTP servers with any 2 public NTP servers in 'Host OS NTP Servers' on the Settings tab
13. Verify networks
14. Start deployment
15. Stop deployment on deployment stage
16. Change openstack username, password, tenant
17. Deploy cluster after stop
18. Verify networks
19. Run OSTF tests

Failed at step 15

Logs snapshot attached

Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Guys, this bug is about the message shown for network connectivity issues. It cannot break any deployments. Changing the priority back to Medium.

Changed in fuel:
importance: High → Medium
milestone: 8.0 → 9.0
Revision history for this message
Vladimir Khlyunev (vkhlyunev) wrote :

Marked as "Won't Fix" for 8.0 - medium priority

Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

We've rewritten our tasks engine in 9.0. I hope this should be enough.

Changed in fuel:
status: Triaged → Won't Fix