Fuel for OpenStack

3 out of 4 Ceph OSD nodes failing to deploy

Bug #1473824 reported by Rob Neff on 2015-07-13

This bug affects 1 person

Affects             Status     Importance  Assigned to               Milestone
Fuel for OpenStack  Won't Fix  Medium      Fuel Python (Deprecated)  Fuel for OpenStack 9.0
8.0.x               Won't Fix  Medium      Fuel Python (Deprecated)  Fuel for OpenStack 8.0

Bug Description

1. Used Fuel 6.0 to create an OpenStack cluster with 1 controller & 4 compute nodes
2. Successfully deployed
---------------
3. Added 4 nodes as Ceph OSDs, configured networking & hard drives
4. Clicked Deploy

Result:
Error:
2015-07-11 02:00:19 ERR (/Stage[main]/Ceph::Conf/Exec[ceph-deploy config pull]/returns) change from notrun to 0 failed: ceph-deploy --overwrite-conf config pull node-2 returned 1 instead of one of [0]
2015-07-11 02:00:19 ERR /usr/bin/puppet:4
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/util/command_line.rb:91:in `execute'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/util/command_line.rb:137:in `run'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/application.rb:364:in `run'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/util.rb:478:in `exit_on_fail'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/application.rb:364:in `run'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/application.rb:470:in `plugin_hook'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/application.rb:364:in `run'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/application/apply.rb:146:in `run_command'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/application/apply.rb:218:in `main'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/application/apply.rb:268:in `apply_catalog'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/configurer.rb:192:in `run'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/configurer.rb:124:in `apply_catalog'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/util.rb:160:in `benchmark'
2015-07-11 02:00:19 ERR /usr/lib/ruby/1.8/benchmark.rb:308:in `realtime'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/util.rb:161:in `benchmark'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/configurer.rb:125:in `apply_catalog'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/resource/catalog.rb:163:in `apply'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/transaction/report.rb:108:in `as_logging_destination'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/util/log.rb:149:in `with_destination'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/resource/catalog.rb:164:in `apply'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/transaction.rb:108:in `evaluate'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/graph/relationship_graph.rb:118:in `traverse'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/transaction.rb:117:in `evaluate'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/util.rb:326:in `thinmark'
2015-07-11 02:00:19 ERR /usr/lib/ruby/1.8/benchmark.rb:308:in `realtime'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/util.rb:327:in `thinmark'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/transaction.rb:117:in `evaluate'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/transaction.rb:117:in `call'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/transaction.rb:187:in `eval_resource'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/transaction.rb:174:in `apply'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/transaction/resource_harness.rb:18:in `evaluate'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/transaction/resource_harness.rb:81:in `perform_changes'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/transaction/resource_harness.rb:81:in `each'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/transaction/resource_harness.rb:82:in `perform_changes'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/transaction/resource_harness.rb:130:in `sync_if_needed'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/transaction/resource_harness.rb:193:in `sync'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/type/exec.rb:120:in `sync'
2015-07-11 02:00:19 ERR /usr/lib/ruby/site_ruby/1.8/puppet/util/errors.rb:97:in `fail'
2015-07-11 02:00:19 ERR ceph-deploy --overwrite-conf config pull node-2 returned 1 instead of one of [0]

Rob Neff (rob-neff) wrote on 2015-07-13:

Fuel 6.0 Ceph Deployment Error.png (128.0 KiB, image/png)

Every time I try to submit the logs (995 MB), Launchpad takes 8 minutes to upload; then, after reaching 100%, it gives me this:

Timeout error
Sorry, something just went wrong in Launchpad.

We’ve recorded what happened, and we’ll fix it as soon as possible. Apologies for the inconvenience.

Trying again in a couple of minutes might work.

(Error ID: OOPS-c0dd22466f83815695539d8057071ee5)

Oleksiy Molchanov (omolchanov) wrote on 2015-07-13:

Sorry, I must mark this as Incomplete. It needs a diagnostic snapshot; you can upload it to a different host if you are not able to upload it here.

Changed in fuel:
milestone:	none → 7.0
assignee:	nobody → Oleksiy Molchanov (omolchanov)
importance:	Undecided → High
status:	New → Incomplete

Rob Neff (rob-neff) wrote on 2015-07-13:

Thanks for the suggestion. I have uploaded the Snapshot to Dropbox.

https://dl.dropboxusercontent.com/u/3516115/fuel-snapshot-2015-07-11_17-37-13.tgz

Rob Neff (rob-neff) wrote on 2015-07-13:

Changed back to New since I included the Fuel Snapshot.

Changed in fuel:
status:	Incomplete → New

Rob Neff (rob-neff) wrote on 2015-07-13:

Fuel 6.0 Ceph Deployment Error.png (128.0 KiB, image/png)

Dmitry Ilyin (idv1985) on 2015-07-13

Changed in fuel:
status:	New → Confirmed

Oleksiy Molchanov (omolchanov) wrote on 2015-07-14:

Rob, what exactly did you do in the "configured networking & hard drives" step you mentioned? I mean the network part.

Oleksiy Molchanov (omolchanov) on 2015-07-15

Changed in fuel:
status:	Confirmed → Incomplete

Rob Neff (rob-neff) wrote on 2015-07-15:

Network (all systems)
1G - Eth0 - PXE
1G - Eth1 - Public
40G - Eth2 - Management, Storage, Private

Hard Drives
1U Compute Servers (No errors)
1x800GB Raid-0 Boot Drive

2U Ceph Servers (Errors 3 out of 4)
1x 300GB Intel S3500 Operating System
20x 600GB 10k RPM - Ceph OSD
1x 800GB Intel S3700 Ceph Journal

Rob Neff (rob-neff) wrote on 2015-07-15:

Added new information

Changed in fuel:
status:	Incomplete → New

Oleksiy Molchanov (omolchanov) on 2015-07-16

Changed in fuel:
status:	New → Confirmed

Oleksiy Molchanov (omolchanov) wrote on 2015-07-16:

Deeper investigation shows that Ceph failed while gathering keys from the primary controller:

node-33:
http://paste.openstack.org/show/380324/
node-34:
http://paste.openstack.org/show/380325/
node-35:
http://paste.openstack.org/show/380326/

As you can see, node-33 successfully ssh-ed to node-2 and fetched the keys, but node-34 and node-35 failed with a "no route to host" error. Logs on the primary controller (node-2) show that its network was up and running, and everything else looked fine too.

From my point of view, this was a networking problem in your environment, but I will try to reproduce it in my environment anyway and post the results here.
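
A "no route to host" failure like the one above can usually be confirmed from the affected node before redeploying. The following is only a minimal sketch, not part of Fuel: the helper name `can_reach` is hypothetical, and it assumes the keys are fetched over SSH on the default port 22.

```python
import socket


def can_reach(host, port=22, timeout=3.0):
    """Try a plain TCP connection to host:port and return (ok, detail).

    Port 22 is an assumption (default SSH port); adjust it if the
    environment uses a different one.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # OSError covers "No route to host", refused connections,
        # timeouts and name-resolution failures.
        return False, str(exc)


# Example usage on an OSD node (node-2 is the primary controller here):
#   ok, detail = can_reach("node-2")
#   print("node-2:22", "reachable" if ok else "unreachable: " + detail)
```

Running this on node-34 and node-35 while the deployment is failing would show whether the problem is at the network level rather than in ceph-deploy itself.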

Rob Neff (rob-neff) wrote on 2015-07-16:

#10

Thanks Oleksiy. We were able to redeploy, and the environment is now working without any physical or logical networking changes.

If you can just improve the error message, that would really help my confidence in Fuel. Often if something fails, it is very difficult to determine what went wrong and the final resolution from the customer side is "Fuel is buggy".

If you change:
err: ceph-deploy --overwrite-conf config pull node-2 returned 1 instead of one of [0]

to:
err: Could not successfully ssh into node-2 and fetch keys. Check the network connectivity from this node to node-2.

Then I would not have logged a bug, I would have double-checked my networking and retried the deployment.

Hopefully this feedback is useful. Thanks for investigating.
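
The message change suggested above can be sketched as a thin wrapper around the failing command. This is an illustration only: in Fuel 6.0 the command is run by a Puppet Exec resource, not by Python, and the helper name `run_with_hint` is hypothetical.

```python
import subprocess


def run_with_hint(cmd, hint):
    """Run cmd and return its output; on a non-zero exit, raise an
    error that leads with a human-readable hint.

    Note: if the executable itself is missing, subprocess.run raises
    FileNotFoundError, which this sketch does not translate.
    """
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    if proc.returncode != 0:
        raise RuntimeError(
            "%s\n(command '%s' returned exit code %d)\n%s"
            % (hint, " ".join(cmd), proc.returncode, proc.stdout.strip())
        )
    return proc.stdout


# Example usage with the command and wording from this report:
#   run_with_hint(
#       ["ceph-deploy", "--overwrite-conf", "config", "pull", "node-2"],
#       "Could not ssh into node-2 and fetch keys. "
#       "Check the network connectivity from this node to node-2.",
#   )
```

The original exit code and command output are kept after the hint, so nothing is lost for whoever debugs the failure further.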

Oleksiy Molchanov (omolchanov) wrote on 2015-07-20:

#11

Hi Rob,

To debug Fuel properly, first go to the Astute log in the UI and check which task failed first and on which node. Then go to the Puppet log on that node and check the ERROR messages; there you can see the root cause.
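
Finding the first error in a log can be sketched as below. This assumes each line has the "timestamp LEVEL message" shape seen in the excerpt in the bug description; the actual Astute and Puppet log files on a Fuel installation may be formatted differently, and the names `LINE_RE` and `first_error` are hypothetical.

```python
import re

# Assumed line shape: "YYYY-MM-DD HH:MM:SS LEVEL message"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.*)$")


def first_error(lines, level="ERR"):
    """Return (timestamp, message) of the first line at `level`,
    or None if no such line exists."""
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match and match.group(2) == level:
            return match.group(1), match.group(3)
    return None


# Two lines taken from the log excerpt in the bug description.
sample = [
    "2015-07-11 02:00:19 ERR (/Stage[main]/Ceph::Conf/Exec[ceph-deploy config pull]/returns) "
    "change from notrun to 0 failed: ceph-deploy --overwrite-conf config pull node-2 "
    "returned 1 instead of one of [0]",
    "2015-07-11 02:00:19 ERR /usr/bin/puppet:4",
]
print(first_error(sample))
```

The first matching line is usually the one worth reading; the Ruby stack trace that follows it adds little.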

Anyway, I am passing this to the Python team to get their comments on your suggestion.

Changed in fuel:
assignee:	Oleksiy Molchanov (omolchanov) → Fuel Python Team (fuel-python)
importance:	High → Medium

Vladimir Sharshov (vsharshov) wrote on 2015-08-05:

#12

The idea of showing a more user-friendly message is good, but this bug is against 6.0, and we changed the deployment behavior in 6.1. Now you get a message about the failed tasks instead of a big Puppet log, which should help you find the problem faster. We still have a problem with the message, though: it is better in 6.1+, but still not as helpful as it could be.

Moved to 8.0

Changed in fuel:
status:	Confirmed → Won't Fix

Dmitry Pyzhov (dpyzhov) on 2015-10-12

Changed in fuel:
milestone:	7.0 → 8.0
status:	Won't Fix → Triaged
no longer affects:	fuel/8.0.x

Dmitry Pyzhov (dpyzhov) on 2015-10-22

tags:	added: area-python

Dmitry Belyaninov (dbelyaninov) wrote on 2015-12-28:

#13

error_output (4.6 KiB, text/plain)

Reproduced:
[root@nailgun ~]# cat /etc/fuel/version.yaml
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "361"
  build_id: "361"
  fuel-nailgun_sha: "53c72a9600158bea873eec2af1322a716e079ea0"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "7463551bc74841d1049869aaee777634fb0e5149"
  fuel-nailgun-agent_sha: "92ebd5ade6fab60897761bfa084aefc320bff246"
  astute_sha: "c7ca63a49216744e0bfdfff5cb527556aad2e2a5"
  fuel-library_sha: "ba8063d34ff6419bddf2a82b1de1f37108d96082"
  fuel-ostf_sha: "889ddb0f1a4fa5f839fd4ea0c0017a3c181aa0c1"
  fuel-mirror_sha: "8adb10618bb72bb36bb018386d329b494b036573"
  fuelmenu_sha: "824f6d3ebdc10daf2f7195c82a8ca66da5abee99"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "9f0ba4577915ce1e77f5dc9c639a5ef66ca45896"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "07d5f1c3e1b352cb713852a3a96022ddb8fe2676"

Dmitry Belyaninov (dbelyaninov) wrote on 2015-12-28:

#14

Full logs

https://drive.google.com/a/mirantis.com/file/d/0B1CktchMwAXHbnE1U3kwZTBqc0U/view?usp=sharing

Dmitry Belyaninov (dbelyaninov) wrote on 2015-12-28:

#15

Scenario:
1. Create new environment
2. Choose Neutron VxLAN
3. Choose Ceph for images
4. Choose Ceph RadosGW for objects
5. Add 3 controllers
6. Add 2 compute nodes
7. Add 1 cinder node
8. Add 3 ceph nodes
9. Change the default DNS servers to any 2 public DNS servers in 'Host OS DNS Servers' on the Settings tab
10. Change the default NTP servers to any 2 public NTP servers in 'Host OS NTP Servers' on the Settings tab
11. Verify networks
12. Deploy the environment

Vladimir Khlyunev (vkhlyunev) wrote on 2015-12-28:

#16

Raised to High: we have to fix this bug in 8.0. New logs were provided by Dmitry Belyaninov.

Changed in fuel:
importance:	Medium → High

Dmitry Tyzhnenko (dtyzhnenko) wrote on 2015-12-28:

#17

fuel-snapshot-2015-12-28_15-54-49.tar.xz (81.5 MiB, application/octet-stream)

Reproduced on another scenario:

Deployment with 3 controllers, NeutronVxLAN, both Ceph
Scenario:
1. Create new environment
2. Choose Neutron, VxLAN
3. Choose Ceph for volumes and Ceph for images
4. Change ceph replication factor to 3
5. Add 3 controllers
6. Add 2 compute nodes
7. Add 3 ceph nodes
8. Change disk configuration for all Ceph nodes. Change 'Ceph' volume for vdc
9. Make management and storage networks untagged
10. Move each network to separate interface
11. Change the default DNS servers to any 2 public DNS servers in 'Host OS DNS Servers' on the Settings tab
12. Change the default NTP servers to any 2 public NTP servers in 'Host OS NTP Servers' on the Settings tab
13. Verify networks
14. Start deployment
15. Stop deployment on deployment stage
16. Change openstack username, password, tenant
17. Deploy cluster after stop
18. Verify networks
19. Run OSTF tests

Failed at step 15

Logs snapshot attached

Dmitry Pyzhov (dpyzhov) wrote on 2015-12-29:

#18

Guys, this bug is about the message shown for network connectivity issues. It cannot break any deployments. Changing priority back to Medium.

Changed in fuel:
importance:	High → Medium
milestone:	8.0 → 9.0

Vladimir Khlyunev (vkhlyunev) wrote on 2016-01-13:

#19

Marked as "Won't Fix" for 8.0 - medium priority

Dmitry Pyzhov (dpyzhov) wrote on 2016-03-02:

#20

We've rewritten our tasks engine in 9.0. I hope this should be enough.

Changed in fuel:
status:	Triaged → Won't Fix
