[library] Deployment failed due to failure in uploading Cirros image

Bug #1330456 reported by Andrey Sledzinskiy
This bug affects 1 person
Affects: Fuel for OpenStack
Status: Invalid
Importance: Medium
Assigned to: Fuel Library (Deprecated)
Milestone:

Bug Description

http://jenkins-product.srt.mirantis.net:8080/view/0_0_swarm/job/master_fuelmain.system_test.centos.thread_2/91/testReport/junit/%28root%29/deploy_simple_flat/deploy_simple_flat/

Deployment failed while uploading the Cirros image:
[404] Error running RPC method deploy: Upload cirros "TestVM" image failed, trace: ["/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/post_deploy_actions/upload_cirros_image.rb:91:in `process'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/post_deploy_actions.rb:29:in `block in process'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/post_deploy_actions.rb:29:in `each'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/post_deploy_actions.rb:29:in `process'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/orchestrator.rb:46:in `deploy'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/dispatcher.rb:105:in `deploy'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:129:in `dispatch_message'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:92:in `block in dispatch'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:64:in `call'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:64:in `block in each'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:56:in `each'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:56:in `each'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:90:in `each_with_index'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:90:in `dispatch'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:78:in `block in perform_main_job'"]

Cluster configuration in test:
1. CentOS, Simple, KVM, Flat Nova-network, 1 controller, 1 compute
2. Run deployment
3. Deployment failed in RPC method deploy: Upload cirros "TestVM" image failed

Logs are attached

Tags: ceph
Andrey Sledzinskiy (asledzinskiy) wrote :
Changed in fuel:
milestone: 5.0.1 → 5.1
Dmitry Borodaenko (angdraug) wrote :

Please include date and time in the file name when uploading a logs bundle.

Dmitry Borodaenko (angdraug) wrote :

The attached logs bundle is incomplete: there are no logs from the remote nodes, not even the astute log.

Changed in fuel:
status: New → Incomplete
Artem Panchenko (apanchenko-8) wrote :

The same issue was reproduced on CI once again:

http://jenkins-product.srt.mirantis.net:8080/view/5.0_swarm/job/5.0_fuelmain.system_test.centos.thread_1/17/testReport/(root)/ceph_multinode_compact/ceph_multinode_compact/

I guess glance-api hung and mcollective couldn't execute the command (timeout error). Here is the output of the command I tried to run manually on the failed node (controller):

http://paste.openstack.org/show/84521/

After I restarted glance-api, the command executed successfully:

http://paste.openstack.org/show/84522/

You can find all the needed logs in the attached diagnostic snapshot.
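
For reference, a minimal Python 3 sketch of the workaround described above (not the astute implementation): run the same glance image-create command that astute issues and, if it hangs past a timeout, restart glance-api and retry once. The service name and timeout value are assumptions for a CentOS controller.

# Sketch of the manual workaround: upload with a timeout, restart glance-api on hang.
import subprocess

UPLOAD_CMD = [
    "/usr/bin/glance", "-N", "http://127.0.0.1:5000/v2.0/",
    "-T", "admin", "-I", "admin", "-K", "admin",
    "image-create", "--name", "TestVM", "--is-public", "true",
    "--container-format=bare", "--disk-format=qcow2", "--min-ram=64",
    "--property", 'murano_image_info={"title": "Murano Demo", "type": "cirros.demo"}',
    "--file", "/opt/vm/cirros-x86_64-disk.img",
]

def upload_with_restart(timeout=300):
    try:
        subprocess.run(UPLOAD_CMD, check=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # glance-api looks hung, as in the pastes above: restart it and retry once.
        subprocess.run(["service", "openstack-glance-api", "restart"], check=True)  # assumed service name
        subprocess.run(UPLOAD_CMD, check=True, timeout=timeout)

if __name__ == "__main__":
    upload_with_restart()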

Changed in fuel:
status: Incomplete → New
Changed in fuel:
status: New → Confirmed
Dmitry Borodaenko (angdraug) wrote :

ps on node-2 reports 2 glance client processes still running, trying to upload an image:
root 1400 0.4 1.6 93796 25260 ? Sl 18:48 0:02 /usr/bin/ruby /usr/sbin/mcollectived --pid=/var/run/mcollectived.pid --config=/etc/mcollective/server.cfg
root 24999 0.1 1.3 145100 20904 ? S 18:56 0:00 \_ /usr/bin/python /usr/bin/glance -N http://127.0.0.1:5000/v2.0/ -T admin -I admin -K admin image-create --name TestVM --is-public true --container-format=bare --disk-format=qcow2 --min-ram=64 --property murano_image_info={"title": "Murano Demo", "type": "cirros.demo"} --file /opt/vm/cirros-x86_64-disk.img
root 25224 0.1 1.3 145100 20904 ? S 18:57 0:00 \_ /usr/bin/python /usr/bin/glance -N http://127.0.0.1:5000/v2.0/ -T admin -I admin -K admin image-create --name TestVM --is-public true --container-format=bare --disk-format=qcow2 --min-ram=64 --property murano_image_info={"title": "Murano Demo", "type": "cirros.demo"} --file /opt/vm/cirros-x86_64-disk.img
The start times (18:56, 18:57) match the entries in astute.log for the first and second attempts to upload the image.

netstat shows that both processes are stuck trying to send data to glance:

tcp 0 251812 10.108.11.3:53190 10.108.11.3:9292 ESTABLISHED 25224/python
tcp 0 251044 10.108.11.3:53173 10.108.11.3:9292 ESTABLISHED 24999/python
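
The large Send-Q values are the telltale sign here: both clients have roughly 250 KB queued that glance-api never reads. A minimal sketch (assuming the usual netstat -tnp column layout; the 200 KB threshold is an arbitrary illustration) of flagging such stuck uploads to the glance-api port:

# Flag ESTABLISHED connections to glance-api (port 9292) with a large Send-Q.
import subprocess

GLANCE_PORT = ":9292"
SEND_Q_THRESHOLD = 200 * 1024  # bytes; arbitrary illustration

def stuck_glance_uploads():
    out = subprocess.check_output(["netstat", "-tnp"], text=True)
    stuck = []
    for line in out.splitlines():
        fields = line.split()
        # Expected layout: Proto Recv-Q Send-Q Local-Address Foreign-Address State PID/Program
        if len(fields) < 7 or fields[0] != "tcp":
            continue
        send_q, foreign, state = int(fields[2]), fields[4], fields[5]
        if state == "ESTABLISHED" and foreign.endswith(GLANCE_PORT) and send_q > SEND_Q_THRESHOLD:
            stuck.append(line)
    return stuck

if __name__ == "__main__":
    for line in stuck_glance_uploads():
        print("possibly stuck glance upload:", line)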

The glance log stops immediately after it starts creating the first image in rbd:

2014-06-18T18:56:24.580898+00:00 debug: 2014-06-18 18:56:19.503 19700 DEBUG glance.api.v1.images [66310b31-6ed7-4ce4-b04b-a51158c2c3fd 55d9669c8be144e1ac8ba0f95100aa6b 51386ebf7c964ee19a6bd9d3b1f8cdf7 - - -] Uploading image data for image 36a4f183-f80b-4f2f-a5d8-5e44b336e44a to rbd store _upload /usr/lib/python2.6/site-packages/glance/api/v1/images.py:595
2014-06-18T18:56:24.580898+00:00 debug: 2014-06-18 18:56:19.526 19700 DEBUG glance.store.rbd [66310b31-6ed7-4ce4-b04b-a51158c2c3fd 55d9669c8be144e1ac8ba0f95100aa6b 51386ebf7c964ee19a6bd9d3b1f8cdf7 - - -] creating image 36a4f183-f80b-4f2f-a5d8-5e44b336e44a with order 23 and size 14811136 add /usr/lib/python2.6/site-packages/glance/store/rbd.py:326

ceph.log reports that about a minute earlier 2 of the osds (from the controller itself) went down:

2014-06-18 18:55:16.309440 mon.0 10.108.12.3:6789/0 68 : [INF] osdmap e32: 6 osds: 6 up, 6 in
...
2014-06-18 18:55:37.532267 mon.0 10.108.12.3:6789/0 73 : [DBG] osd.0 10.108.12.3:6802/23830 reported failed by osd.4 10.108.12.2:6800/18041
2014-06-18 18:55:37.532449 mon.0 10.108.12.3:6789/0 74 : [DBG] osd.1 10.108.12.3:6800/23644 reported failed by osd.4 10.108.12.2:6800/18041
...
2014-06-18 18:56:00.592615 mon.0 10.108.12.3:6789/0 113 : [INF] osdmap e33: 6 osds: 4 up, 6 in

which is eventually followed by reports of slow requests, probably from the first stuck image upload:

2014-06-18 18:56:50.210213 osd.2 10.108.12.2:6803/18229 1 : [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.652801 secs
2014-06-18 18:56:50.210219 osd.2 10.108.12.2:6803/18229 2 : [WRN] slow request 30.652801 seconds old, received at 2014-06-18 18:56:19.557346: osd_op(client.4344.0:1 36a4f183-f80b-4f2f-a5d8-5e44b336e44a.rbd [sta...
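
Since the osdmap already showed 2 of the 6 OSDs down a minute before glance started writing to rbd, a pre-upload sanity check would have failed fast instead of leaving the client hanging. A minimal sketch, assuming the "N osds: M up, K in" format seen in ceph.log and in ceph osd stat output:

# Bail out of the image upload early if the ceph cluster has OSDs down.
import re
import subprocess

def osds_all_up():
    stat = subprocess.check_output(["ceph", "osd", "stat"], text=True)
    m = re.search(r"(\d+) osds: (\d+) up, (\d+) in", stat)
    if not m:
        return False  # unknown format; treat as unhealthy
    total, up, in_count = map(int, m.groups())
    return up == total == in_count

if __name__ == "__main__":
    health = subprocess.check_output(["ceph", "health"], text=True).strip()
    print("ceph health:", health)
    if not osds_all_up():
        raise SystemExit("some OSDs are down or out; skipping image upload")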

Dmitry Borodaenko (angdraug) wrote :

Correction: 10.108.12.2 (the node with the failed osds) is node-1, one of the computes, not the controller.

tags: added: ceph
Dmitry Ilyin (idv1985)
summary: - Deployment failed due to failure in uploading Cirros image
+ [library] Deployment failed due to failure in uploading Cirros image
Vladimir Kuklin (vkuklin) wrote :

Is there a reproducer? It has been more than a month since the report. Closing as Incomplete; reopen if reproduced.

Dmitry Ilyin (idv1985)
Changed in fuel:
status: Confirmed → Incomplete
Sergey Murashov (smurashov) wrote :

iso: {"build_id": "2014-08-25_00-01-17", "ostf_sha": "907f25f8fad39b177bf6a66fba9785afa7dd8008", "build_number": "476", "auth_required": true, "api": "1.0", "nailgun_sha": "44876ddae29823449e0cbc59428aafa466cbbbc2", "production": "docker", "fuelmain_sha": "5d44e4173f3b15b5991737269fbfae71275a63da", "astute_sha": "0c1b16721f7810b1c400f51937ecc6f058be8cff", "feature_groups": ["mirantis"], "release": "5.1", "fuellib_sha": "31debf95128afa44a9991491eaff169625cb49b1"}

Reproduced on HA, Ubuntu, 3 controllers, 1 compute + ceph-osd, Neutron GRE, Murano.

Sergey Murashov (smurashov) wrote :
Changed in fuel:
status: Incomplete → Confirmed
Vladimir Kuklin (vkuklin) wrote :

This error message is shown for every failed deployment. Your bug is related to an intermittent Glance DB connectivity issue. Here is the glance-registry log for node-3:

OperationalError: (OperationalError) (2013, "Lost connection to MySQL server at 'reading authorization packet', system error: 110") None None

Please file a separate bug with this particular snapshot attached.
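
As an aside, MySQL error 2013 ("Lost connection to MySQL server") is usually transient, so the common mitigation is to retry the failed operation with a short backoff. A generic sketch only (not how glance-registry actually handles it); the retry counts and error matching are assumptions:

# Retry a callable a few times if it fails with a transient "lost connection" error.
import time

def retry_on_lost_connection(fn, attempts=3, delay=2.0):
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:  # real code would catch the DB driver's OperationalError
            transient = "Lost connection to MySQL server" in str(exc)
            if not transient or attempt == attempts:
                raise
            time.sleep(delay * attempt)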

Changed in fuel:
status: Confirmed → Incomplete
Changed in fuel:
status: Incomplete → Invalid