Deployment on 98 nodes with Ceilo on CentOS failed due to wrong ceph roles layout

Bug #1397367 reported by Sergey Galkin
This bug affects 1 person
Affects             Status   Importance  Assigned to               Milestone
Fuel for OpenStack  Invalid  High        Fuel Python (Deprecated)
5.1.x               Invalid  High        Fuel Python (Deprecated)

Bug Description

astute_sha: c15623d05ccdf7ac10873e7a90df954de8726280
auth_required: true
build_id: 2014-11-26_13-35-04
build_number: '10'
feature_groups:
- mirantis
fuellib_sha: 25eb629f3c6a6ff41cf187e260fe4ff456cfc4e4
fuelmain_sha: 465afb6479a0b3c677040fb978cc109dcf62f774
nailgun_sha: bf9ddb9f9d5dbb09c4b50201ce176635791d7d3e
ostf_sha: a35f516f1606b0d03d51ff63bfe3fbe23de4b622
production: docker
release: '6.0'

Steps to reproduce
1. Start deploying a cluster with 96 compute nodes and 3 controllers in HA mode on CentOS, with Cinder, Neutron GRE, and Ceilometer.
Deployment has failed with error:
 Deployment has failed. Method deploy. Upload cirros "TestVM" image failed.
Inspect Astute logs for the details

In the Astute logs:

2014-11-28 15:42:15 ERR
[417] Error running RPC method deploy: Upload cirros "TestVM" image failed, trace:
["/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/post_deployment_actions/upload_cirros_image.rb:109:in `raise_cirros_error'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/post_deployment_actions/upload_cirros_image.rb:92:in `process'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/deploy_actions.rb:25:in `block in process'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/deploy_actions.rb:25:in `each'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/deploy_actions.rb:25:in `process'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/deployment_engine.rb:95:in `deploy'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/orchestrator.rb:264:in `deploy_cluster'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/orchestrator.rb:35:in `deploy'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/dispatcher.rb:59:in `deploy'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:142:in `dispatch_message'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:103:in `block in dispatch'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:64:in `call'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:64:in `block in each'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:56:in `each'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:56:in `each'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:101:in `each_with_index'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:101:in `dispatch'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:85:in `block in perform_main_job'"]
2014-11-28 15:42:15 ERR
[417] d95e44a4-fc76-4389-a940-64742b97ab7b: Upload cirros "TestVM" image failed
2014-11-28 15:42:15 ERR
[417] d95e44a4-fc76-4389-a940-64742b97ab7b: cmd: . /root/openrc && /usr/bin/glance image-create --name 'TestVM' --is-public true --container-format='bare' --disk-format='qcow2' --min-ram=64 --property murano_image_info='{"title": "Murano Demo", "type": "cirros.demo"}' --file '/opt/vm/cirros-x86_64-disk.img'
                                               mcollective error: d95e44a4-fc76-4389-a940-64742b97ab7b: MCollective agents '32' didn't respond within the allotted time.
2014-11-28 15:42:15 ERR
[417] MCollective agents '32' didn't respond within the allotted time.

Łukasz Oleś (loles)
Changed in fuel:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Fuel Library Team (fuel-library)
milestone: none → 6.0
Bogdan Dobrelya (bogdando) wrote :

Please link the logs.

Changed in fuel:
status: Confirmed → Incomplete
Egor Kotko (ykotko) wrote :

I see a similar failure on Ubuntu:

http://paste.openstack.org/show/142420/

{"build_id": "2014-11-27_23-41-13", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "45", "auth_required": true, "api": "1.0", "nailgun_sha": "500e36d08a45dbb389bf2bd97673d9bff48ee84d", "production": "docker", "fuelmain_sha": "51e66db7750e9c856ba128f35cfb6724895bf479", "astute_sha": "ef8aa0fd0e3ce20709612906f1f0551b5682a6ce", "feature_groups": ["mirantis"], "release": "5.1.1", "release_versions": {"2014.1.3-5.1.1": {"VERSION": {"build_id": "2014-11-27_23-41-13", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "45", "api": "1.0", "nailgun_sha": "500e36d08a45dbb389bf2bd97673d9bff48ee84d", "production": "docker", "fuelmain_sha": "51e66db7750e9c856ba128f35cfb6724895bf479", "astute_sha": "ef8aa0fd0e3ce20709612906f1f0551b5682a6ce", "feature_groups": ["mirantis"], "release": "5.1.1", "fuellib_sha": "15a387462f7be50c4f87ad986d0c81535025c125"}}}, "fuellib_sha": "15a387462f7be50c4f87ad986d0c81535025c125"}

Egor Kotko (ykotko) wrote :
Changed in fuel:
status: Incomplete → Confirmed
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
Nastya Urlapova (aurlapova) wrote :

Guys, is it really confirmed for 5.1.1? As I remember, we certified 5.1.1 only on 20 nodes.

Bogdan Dobrelya (bogdando) wrote :

We cannot address high bugs in 5.1.1 anyway, due to HCF

Bogdan Dobrelya (bogdando) wrote :

@Nastya, but you're right: we do not support 100-node deployments in 5.1.x.

Bogdan Dobrelya (bogdando) wrote :

Well, the nodes *were* provisioned, though:
2014-11-28T16:27:46 info: [421] All nodes are provisioned

I will continue investigating the logs from #2.

Bogdan Dobrelya (bogdando) wrote :

The deployment logs from Astute are OK. I can also see the Glance report about successful image creation, but it came a bit later than Astute expected: http://paste.openstack.org/show/142646/

Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Fuel Astute Team (fuel-astute)
Vladimir Sharshov (vsharshov) wrote :

Yes, Bogdan is right. The Glance image upload should take no more than 60 seconds, and the image is only ~16 MB. If this operation took more than 60 seconds, we have a Cinder performance degradation.

This bug is similar to https://bugs.launchpad.net/fuel/+bug/1397254 (the same symptoms).

I believe this error is not in Astute and that we need to investigate Cinder performance.
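
The failure mode described here, an operation that eventually succeeds but is reported as failed because it exceeded a fixed time budget, can be sketched in a few lines of Python. This is not Astute's implementation (Astute is written in Ruby and dispatches commands through MCollective); the function name `run_with_timeout` and the stand-in commands are hypothetical and only illustrate the behaviour.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (ok, detail).

    A timeout is reported as a failure even if the command would have
    succeeded given more time.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False, "no response within {}s".format(timeout_s)
    return proc.returncode == 0, proc.stdout.strip()


# Hypothetical stand-ins for a fast and a slow image upload.
fast = [sys.executable, "-c", "print('uploaded')"]
slow = [sys.executable, "-c", "import time; time.sleep(2); print('uploaded')"]
```

With a budget shorter than the command's run time, `run_with_timeout(slow, 0.5)` reports a failure, although the same command succeeds when given enough time.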

Changed in fuel:
assignee: Fuel Astute Team (fuel-astute) → Fuel for Openstack (fuel)
Mike Scherbakov (mihgen) wrote :

MOS team, please analyse the OpenStack logs: we need to understand why it took so long to upload an image into Glance. This could be related to Glance, Keystone, the Glance backend...

Changed in fuel:
assignee: Fuel for Openstack (fuel) → MOS All (mos-all)
Changed in fuel:
assignee: MOS All (mos-all) → MOS Glance (mos-glance)
Alexander Tivelkov (ativelkov) wrote :

The message
INFO glance.registry.api.v1.images [911143ba-bf4b-4847-a852-77385abd308a 97bea994c8cd4e16b8701d1dd776272f 18d3f86359684daa94bc83fc6f8868eb - - -] Successfully created image 4dc69685-c45f-4193-bd50-6c2743c9dd4a
does not mean that the image was actually uploaded; it only says that the record has been created in the database (that is what the registry service does). The actual upload is done by the glance-api service, and its last log message seems to be

2014-11-28T17:18:46.437043+00:00 debug: 2014-11-28 17:18:46.450 26587 DEBUG glance.store.rbd [9ec256a4-29b6-4c14-a4dd-11c64a5d796f 97bea994c8cd4e16b8701d1dd776272f 18d3f86359684daa94bc83fc6f8868eb - - -] creating image 4dc69685-c45f-4193-bd50-6c2743c9dd4a with order 23 and size 13167616 add /usr/lib/python2.7/dist-packages/glance/store/rbd.py:332

I.e., Glance calls Ceph and attempts to put the data there, and there seems to be no response. We need somebody to look into the Ceph logs to find out what is going on.
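
The distinction between the two messages can be checked mechanically. The following is a rough sketch that, given log lines and an image ID, reports whether the registry record was created and whether the rbd store started writing the image. The matched substrings are taken from the two log lines quoted in this comment; the function name `upload_stages` is hypothetical, and no "upload finished" marker is assumed because none appears in the logs quoted here.

```python
def upload_stages(lines, image_id):
    """Return which stages of an image upload are visible in the log lines."""
    stages = {"registry_record_created": False, "rbd_write_started": False}
    for line in lines:
        if image_id not in line:
            continue
        # Registry service: only a database record, not the image data.
        if ("glance.registry.api.v1.images" in line
                and "Successfully created image" in line):
            stages["registry_record_created"] = True
        # glance-api rbd store: the write to Ceph has started.
        if "glance.store.rbd" in line and "creating image" in line:
            stages["rbd_write_started"] = True
    return stages
```

A result with both flags set but no later glance-api message for the same image matches the situation described above: the write to Ceph began and never reported back.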

Stanislav Makar (smakar)
Changed in fuel:
assignee: MOS Glance (mos-glance) → Stanislav Makar (smakar)
Stanislav Makar (smakar) wrote :

astute.yaml:
volumes_ceph: true
images_ceph: true

roles:
  - role: primary-controller
    name: node-1
  - role: controller
    name: node-2
  - role: controller
    name: node-3
  - role: compute
    name: node-4
  - role: compute
    name: node-5
  - role: cinder
    name: node-6
  - role: cinder
    name: node-7

There are no ceph-osd nodes

Stanislav Makar (smakar) wrote :

There are no ceph-osd nodes.

Changed in fuel:
status: Confirmed → Invalid
Mike Scherbakov (mihgen) wrote :

If comment #12 points in the wrong direction, then we have to investigate more. If there was no Ceph, then Swift was used (in HA we use Swift by default as the Glance backend).

Changed in fuel:
status: Invalid → Incomplete
assignee: Stanislav Makar (smakar) → MOS Glance (mos-glance)
Alexander Tivelkov (ativelkov) wrote :

Mike, according to the snapshot, Glance is configured to use Ceph: see glance-api.conf, which has "default_store = rbd".

Also, the logs state that it was indeed attempting to put the image into Ceph. If there is no Ceph deployed, then something is wrong with the configuration.
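
For anyone repeating this check on a diagnostic snapshot, a minimal sketch using Python's standard `configparser` is below. It assumes the option lives in the `[DEFAULT]` section of glance-api.conf, with a fallback to a `[glance_store]` section as an assumption about other releases; the function name `find_default_store` is hypothetical.

```python
import configparser


def find_default_store(conf_text):
    """Return the configured default_store value, or None if it is not set."""
    # interpolation=None: config values may contain literal '%' characters.
    # strict=False: tolerate duplicate options in hand-edited files.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    value = parser.defaults().get("default_store")
    if value is None and parser.has_section("glance_store"):
        # Assumption: some releases keep the option in [glance_store].
        value = parser.get("glance_store", "default_store", fallback=None)
    return value
```

A return value of `"rbd"` means Glance expects a Ceph backend, so the environment needs working Ceph OSDs for image uploads to complete.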

Bogdan Dobrelya (bogdando) wrote :

According to #13, the configuration does not look valid.

summary: - Deployment on 98 nodes with Ceilo on CentOS failed
+ Deployment on 98 nodes with Ceilo on CentOS failed due to wrong ceph
+ roles layout
Bogdan Dobrelya (bogdando) wrote :

We should check how it was possible to deploy an environment with Ceph for volumes and images without any ceph-osd roles assigned.
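
The kind of pre-deployment check implied here could look roughly like the sketch below. It is not Nailgun's actual validation code; the function name `ceph_layout_errors` and the input shapes (a settings dict with the `volumes_ceph` / `images_ceph` flags and a list of nodes with `role` and `name`, mirroring the astute.yaml excerpt in #13) are assumptions made for illustration.

```python
def ceph_layout_errors(settings, nodes):
    """Return a list of problems with the Ceph role layout (empty if none)."""
    # Which Ceph-backed features are switched on?
    needs_ceph = [
        key for key in ("volumes_ceph", "images_ceph") if settings.get(key)
    ]
    has_osd = any(node.get("role") == "ceph-osd" for node in nodes)
    if needs_ceph and not has_osd:
        return [
            "{} enabled but no node has the ceph-osd role".format(
                ", ".join(needs_ceph)
            )
        ]
    return []


# Layout reported in #13: Ceph enabled for volumes and images, no ceph-osd.
settings = {"volumes_ceph": True, "images_ceph": True}
nodes = [
    {"role": "primary-controller", "name": "node-1"},
    {"role": "controller", "name": "node-2"},
    {"role": "controller", "name": "node-3"},
    {"role": "compute", "name": "node-4"},
    {"role": "compute", "name": "node-5"},
    {"role": "cinder", "name": "node-6"},
    {"role": "cinder", "name": "node-7"},
]
```

Running `ceph_layout_errors(settings, nodes)` on this layout returns one error, which is the condition a deployment should be blocked on before it starts.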

Changed in fuel:
assignee: MOS Glance (mos-glance) → Fuel Python Team (fuel-python)
Bogdan Dobrelya (bogdando) wrote :

Well, all of this was actually related to https://bugs.launchpad.net/fuel/+bug/1397367/comments/2. The original bug report does not provide any logs.

Bogdan Dobrelya (bogdando) wrote :

I suggest resubmitting the logs from #2 and #3 in a separate bug about the ceph roles layout and closing this one as invalid due to missing logs.

Dmitry Pyzhov (dpyzhov)
Changed in fuel:
status: Incomplete → Invalid