[library] Deployment failed due to failure in uploading Cirros image

Bug #1330456 reported by Andrey Sledzinskiy
This bug affects 1 person
Affects: Fuel for OpenStack
Status: Invalid
Importance: Medium
Assigned to: Fuel Library (Deprecated)
Milestone:

Bug Description

http://jenkins-product.srt.mirantis.net:8080/view/0_0_swarm/job/master_fuelmain.system_test.centos.thread_2/91/testReport/junit/%28root%29/deploy_simple_flat/deploy_simple_flat/

Deployment failed while uploading the Cirros image:
[404] Error running RPC method deploy: Upload cirros "TestVM" image failed, trace: ["/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/post_deploy_actions/upload_cirros_image.rb:91:in `process'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/post_deploy_actions.rb:29:in `block in process'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/post_deploy_actions.rb:29:in `each'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/post_deploy_actions.rb:29:in `process'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/orchestrator.rb:46:in `deploy'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/dispatcher.rb:105:in `deploy'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:129:in `dispatch_message'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:92:in `block in dispatch'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:64:in `call'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:64:in `block in each'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:56:in `each'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:56:in `each'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:90:in `each_with_index'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:90:in `dispatch'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:78:in `block in perform_main_job'"]

Cluster configuration in test:
1. CentOS, Simple, KVM, Flat Nova-network, 1 controller, 1 compute
2. Run deployment
3. Deployment failed in RPC method deploy: Upload cirros "TestVM" image failed

Logs are attached

Tags: ceph
Andrey Sledzinskiy (asledzinskiy) wrote :
Changed in fuel:
milestone: 5.0.1 → 5.1
Dmitry Borodaenko (angdraug) wrote :

Please include date and time in the file name when uploading a logs bundle.

Dmitry Borodaenko (angdraug) wrote :

The attached logs bundle is incomplete: there are no logs from the remote nodes, not even the astute log.

Changed in fuel:
status: New → Incomplete
Artem Panchenko (apanchenko-8) wrote :

The same issue was reproduced on CI once again:

http://jenkins-product.srt.mirantis.net:8080/view/5.0_swarm/job/5.0_fuelmain.system_test.centos.thread_1/17/testReport/(root)/ceph_multinode_compact/ceph_multinode_compact/

I guess glance-api hung and mcollective couldn't execute the command (timeout error). Here is the output of the command I tried to run manually on the failed node (controller):

http://paste.openstack.org/show/84521/

After I restarted glance-api, the command executed successfully:

http://paste.openstack.org/show/84522/

You can find all the needed logs in the attached diagnostic snapshot.
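
For reference, a minimal Python 3 sketch of the workaround described above (not the astute implementation): run the same glance image-create command that astute issues and, if it hangs past a timeout, restart glance-api and retry once. The service name and timeout value are assumptions for a CentOS controller.

# Sketch of the manual workaround: upload with a timeout, restart glance-api on hang.
import subprocess

UPLOAD_CMD = [
    "/usr/bin/glance", "-N", "http://127.0.0.1:5000/v2.0/",
    "-T", "admin", "-I", "admin", "-K", "admin",
    "image-create", "--name", "TestVM", "--is-public", "true",
    "--container-format=bare", "--disk-format=qcow2", "--min-ram=64",
    "--property", 'murano_image_info={"title": "Murano Demo", "type": "cirros.demo"}',
    "--file", "/opt/vm/cirros-x86_64-disk.img",
]

def upload_with_restart(timeout=300):
    try:
        subprocess.run(UPLOAD_CMD, check=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # glance-api looks hung, as in the pastes above: restart it and retry once.
        subprocess.run(["service", "openstack-glance-api", "restart"], check=True)  # assumed service name
        subprocess.run(UPLOAD_CMD, check=True, timeout=timeout)

if __name__ == "__main__":
    upload_with_restart()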

Changed in fuel:
status: Incomplete → New
Changed in fuel:
status: New → Confirmed
Dmitry Borodaenko (angdraug) wrote :

ps on node-2 reports 2 glance client processes still running, trying to upload an image:
root 1400 0.4 1.6 93796 25260 ? Sl 18:48 0:02 /usr/bin/ruby /usr/sbin/mcollectived --pid=/var/run/mcollectived.pid --config=/etc/mcollective/server.cfg
root 24999 0.1 1.3 145100 20904 ? S 18:56 0:00 \_ /usr/bin/python /usr/bin/glance -N http://127.0.0.1:5000/v2.0/ -T admin -I admin -K admin image-create --name TestVM --is-public true --container-format=bare --disk-format=qcow2 --min-ram=64 --property murano_image_info={"title": "Murano Demo", "type": "cirros.demo"} --file /opt/vm/cirros-x86_64-disk.img
root 25224 0.1 1.3 145100 20904 ? S 18:57 0:00 \_ /usr/bin/python /usr/bin/glance -N http://127.0.0.1:5000/v2.0/ -T admin -I admin -K admin image-create --name TestVM --is-public true --container-format=bare --disk-format=qcow2 --min-ram=64 --property murano_image_info={"title": "Murano Demo", "type": "cirros.demo"} --file /opt/vm/cirros-x86_64-disk.img
The start times (18:56, 18:57) match the entries in astute.log for the first and second attempts to upload the image.

netstat shows that both processes are stuck trying to send data to glance:

tcp 0 251812 10.108.11.3:53190 10.108.11.3:9292 ESTABLISHED 25224/python
tcp 0 251044 10.108.11.3:53173 10.108.11.3:9292 ESTABLISHED 24999/python
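
The large Send-Q values are the telltale sign here: both clients have roughly 250 KB queued that glance-api never reads. A minimal sketch (assuming the usual netstat -tnp column layout; the 200 KB threshold is an arbitrary illustration) of flagging such stuck uploads to the glance-api port:

# Flag ESTABLISHED connections to glance-api (port 9292) with a large Send-Q.
import subprocess

GLANCE_PORT = ":9292"
SEND_Q_THRESHOLD = 200 * 1024  # bytes; arbitrary illustration

def stuck_glance_uploads():
    out = subprocess.check_output(["netstat", "-tnp"], text=True)
    stuck = []
    for line in out.splitlines():
        fields = line.split()
        # Expected layout: Proto Recv-Q Send-Q Local-Address Foreign-Address State PID/Program
        if len(fields) < 7 or fields[0] != "tcp":
            continue
        send_q, foreign, state = int(fields[2]), fields[4], fields[5]
        if state == "ESTABLISHED" and foreign.endswith(GLANCE_PORT) and send_q > SEND_Q_THRESHOLD:
            stuck.append(line)
    return stuck

if __name__ == "__main__":
    for line in stuck_glance_uploads():
        print("possibly stuck glance upload:", line)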

The glance log stops immediately after it starts creating the first image in rbd:

2014-06-18T18:56:24.580898+00:00 debug: 2014-06-18 18:56:19.503 19700 DEBUG glance.api.v1.images [66310b31-6ed7-4ce4-b04b-a51158c2c3fd 55d9669c8be144e1ac8ba0f95100aa6b 51386ebf7c964ee19a6bd9d3b1f8cdf7 - - -] Uploading image data for image 36a4f183-f80b-4f2f-a5d8-5e44b336e44a to rbd store _upload /usr/lib/python2.6/site-packages/glance/api/v1/images.py:595
2014-06-18T18:56:24.580898+00:00 debug: 2014-06-18 18:56:19.526 19700 DEBUG glance.store.rbd [66310b31-6ed7-4ce4-b04b-a51158c2c3fd 55d9669c8be144e1ac8ba0f95100aa6b 51386ebf7c964ee19a6bd9d3b1f8cdf7 - - -] creating image 36a4f183-f80b-4f2f-a5d8-5e44b336e44a with order 23 and size 14811136 add /usr/lib/python2.6/site-packages/glance/store/rbd.py:326

ceph.log reports that about a minute earlier 2 of the osds (from the controller itself) went down:

2014-06-18 18:55:16.309440 mon.0 10.108.12.3:6789/0 68 : [INF] osdmap e32: 6 osds: 6 up, 6 in
...
2014-06-18 18:55:37.532267 mon.0 10.108.12.3:6789/0 73 : [DBG] osd.0 10.108.12.3:6802/23830 reported failed by osd.4 10.108.12.2:6800/18041
2014-06-18 18:55:37.532449 mon.0 10.108.12.3:6789/0 74 : [DBG] osd.1 10.108.12.3:6800/23644 reported failed by osd.4 10.108.12.2:6800/18041
...
2014-06-18 18:56:00.592615 mon.0 10.108.12.3:6789/0 113 : [INF] osdmap e33: 6 osds: 4 up, 6 in

which is eventually followed by reports of slow requests, probably from the first stuck image upload:

2014-06-18 18:56:50.210213 osd.2 10.108.12.2:6803/18229 1 : [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.652801 secs
2014-06-18 18:56:50.210219 osd.2 10.108.12.2:6803/18229 2 : [WRN] slow request 30.652801 seconds old, received at 2014-06-18 18:56:19.557346: osd_op(client.4344.0:1 36a4f183-f80b-4f2f-a5d8-5e44b336e44a.rbd [sta...
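
Since the osdmap already showed 2 of the 6 OSDs down a minute before glance started writing to rbd, a pre-upload sanity check would have failed fast instead of leaving the client hanging. A minimal sketch, assuming the "N osds: M up, K in" format seen in ceph.log and in ceph osd stat output:

# Bail out of the image upload early if the ceph cluster has OSDs down.
import re
import subprocess

def osds_all_up():
    stat = subprocess.check_output(["ceph", "osd", "stat"], text=True)
    m = re.search(r"(\d+) osds: (\d+) up, (\d+) in", stat)
    if not m:
        return False  # unknown format; treat as unhealthy
    total, up, in_count = map(int, m.groups())
    return up == total == in_count

if __name__ == "__main__":
    health = subprocess.check_output(["ceph", "health"], text=True).strip()
    print("ceph health:", health)
    if not osds_all_up():
        raise SystemExit("some OSDs are down or out; skipping image upload")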

Dmitry Borodaenko (angdraug) wrote :

Correction: 10.108.12.2 (the node with the failed osds) is node-1, one of the computes, not the controller.

tags: added: ceph
Dmitry Ilyin (idv1985)
summary: - Deployment failed due to failure in uploading Cirros image
+ [library] Deployment failed due to failure in uploading Cirros image
Vladimir Kuklin (vkuklin) wrote :

Is there a reproducer? It has been more than a month since the report. Closing as Incomplete; reopen if reproduced.

Dmitry Ilyin (idv1985)
Changed in fuel:
status: Confirmed → Incomplete
Sergey Murashov (smurashov) wrote :

iso: {"build_id": "2014-08-25_00-01-17", "ostf_sha": "907f25f8fad39b177bf6a66fba9785afa7dd8008", "build_number": "476", "auth_required": true, "api": "1.0", "nailgun_sha": "44876ddae29823449e0cbc59428aafa466cbbbc2", "production": "docker", "fuelmain_sha": "5d44e4173f3b15b5991737269fbfae71275a63da", "astute_sha": "0c1b16721f7810b1c400f51937ecc6f058be8cff", "feature_groups": ["mirantis"], "release": "5.1", "fuellib_sha": "31debf95128afa44a9991491eaff169625cb49b1"}

Reproduced on HA, Ubuntu, 3 controllers, 1 compute + ceph-osd, Neutron GRE, Murano.

Sergey Murashov (smurashov) wrote :
Changed in fuel:
status: Incomplete → Confirmed
Vladimir Kuklin (vkuklin) wrote :

This error message is shown for every failed deployment. Your bug is related to an intermittent Glance DB connectivity issue. Here is the glance-registry log for node-3:

OperationalError: (OperationalError) (2013, "Lost connection to MySQL server at 'reading authorization packet', system error: 110") None None

Please file a separate bug with this particular snapshot attached.
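
As an aside, MySQL error 2013 ("Lost connection to MySQL server") is usually transient, so the common mitigation is to retry the failed operation with a short backoff. A generic sketch only (not how glance-registry actually handles it); the retry counts and error matching are assumptions:

# Retry a callable a few times if it fails with a transient "lost connection" error.
import time

def retry_on_lost_connection(fn, attempts=3, delay=2.0):
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:  # real code would catch the DB driver's OperationalError
            transient = "Lost connection to MySQL server" in str(exc)
            if not transient or attempt == attempts:
                raise
            time.sleep(delay * attempt)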

Changed in fuel:
status: Confirmed → Incomplete
Changed in fuel:
status: Incomplete → Invalid