[astute] Upload Cirros TestVM image failed with - import pkg_resources Error communicating with http://10.108.6.2:9292 [Errno 32] Broken pipe

Bug #1340593 reported by Andrey Sledzinskiy
This bug affects 2 people
Affects                    Status      Importance  Assigned to                 Milestone
Fuel for OpenStack         Confirmed   Medium      Fuel Library (Deprecated)
Fuel for OpenStack 5.0.x   Incomplete  Medium      Fuel Library (Deprecated)

Bug Description

{
    "build_id": "2014-07-10_00-39-56",
    "mirantis": "yes",
    "build_number": "112",
    "ostf_sha": "09b6bccf7d476771ac859bb3c76c9ebec9da9e1f",
    "nailgun_sha": "f5ff82558f99bb6ca7d5e1617eddddf7142fe857",
    "production": "docker",
    "api": "1.0",
    "fuelmain_sha": "293015843304222ead899270449495af91b06aed",
    "astute_sha": "5df009e8eab611750309a4c5b5c9b0f7b9d85806",
    "release": "5.0.1",
    "fuellib_sha": "364dee37435cbdc85d6b814a61f57800b83bf22d"
}
Steps:
1. Create a cluster with the following configuration: CentOS, HA, Flat nova-network, Ceph for volumes and for images; 3 controller+ceph nodes, 2 compute+ceph nodes, 1 ceph node
2. Deploy the cluster

Actual result: deployment failed with an error on controller-1 (node-1).

Running the image upload manually from the controller:

[root@node-1 ~]# /usr/bin/glance -N http://10.108.7.2:5000/v2.0/ -T admin -I admin -K admin image-create --name 'TestVM' --is-public true --container-format='bare' --disk-format='qcow2' --min-ram=64 --property murano_image_info='{"title": "Murano Demo", "type": "cirros.demo"}' --file '/opt/vm/cirros-x86_64-disk.img'
/usr/lib/python2.6/site-packages/eventlet/hubs/__init__.py:8: UserWarning: Module backports was already imported from /usr/lib64/python2.6/site-packages/backports/__init__.pyc, but /usr/lib/python2.6/site-packages is being added to sys.path
  import pkg_resources
Error communicating with http://10.108.6.2:9292 [Errno 32] Broken pipe
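
For reference, a minimal probe of the endpoint named in the error above (illustrative only; only the address and port come from the report, everything else is assumed and is not part of the deployment tooling). It checks whether the Glance API is accepting connections at all, which helps separate a service that is down or resetting connections from an upload that dies mid-transfer with EPIPE:

import socket

# Glance API endpoint taken from the "Broken pipe" error above.
HOST, PORT = '10.108.6.2', 9292

# Open a TCP connection and issue a bare HTTP request; Glance answers
# requests to "/" with its API version document, so any HTTP response
# at all means the endpoint is reachable and accepting connections.
sock = socket.create_connection((HOST, PORT), timeout=10)
sock.sendall('GET / HTTP/1.0\r\nHost: %s\r\n\r\n' % HOST)
print(sock.recv(4096))
sock.close()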

Errors in the astute log:

 [402] Error running RPC method deploy: Upload cirros "TestVM" image failed, trace: ["/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/post_deploy_actions/upload_cirros_image.rb:91:in `process'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/post_deploy_actions.rb:29:in `block in process'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/post_deploy_actions.rb:29:in `each'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/post_deploy_actions.rb:29:in `process'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/orchestrator.rb:45:in `deploy'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/dispatcher.rb:105:in `deploy'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:126:in `dispatch_message'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:89:in `block in dispatch'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:64:in `call'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:64:in `block in each'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:56:in `each'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:56:in `each'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:87:in `each_with_index'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:87:in `dispatch'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:72:in `block in perform_main_job'"]

2014-07-11 06:21:07 ERR

[402] 02353a31-fa07-42e5-9017-1c483e7a9838: Upload cirros "TestVM" image failed

2014-07-11 06:21:07 ERR

[402] 02353a31-fa07-42e5-9017-1c483e7a9838: cmd: /usr/bin/glance -N http://10.108.7.2:5000/v2.0/ -T admin -I admin -K admin image-create --name 'TestVM' --is-public true --container-format='bare' --disk-format='qcow2' --min-ram=64 --property murano_image_info='{"title": "Murano Demo", "type": "cirros.demo"}' --file '/opt/vm/cirros-x86_64-disk.img'
                                               mcollective error: 02353a31-fa07-42e5-9017-1c483e7a9838: MCollective agents '1' didn't respond within the allotted time.

2014-07-11 06:21:07 ERR

[402] MCollective agents '1' didn't respond within the allotted time.

Errors in the puppet log:

(/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd activate]/returns) change from notrun to 0 failed: ceph-deploy osd activate node-1:/dev/vdb4 node-1:/dev/vdc4 returned 1 instead of one of [0]

Logs are attached

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
tags: added: ceph
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

There are also errors in the puppet log from node-1 before the ones quoted above:
2014-07-11T05:40:08.063114 node-1 ./node-1.test.domain.local/puppet-apply.log:2014-07-11T05:40:08.063114+00:00 debug: (/Stage[main]/Ceph::Mon/Exec[ceph-deploy mon create]/unless) 2014-07-11 05:40:08.050066 7fc6f032f700 0 librados: client.admin initialization error (2) No such file or directory
2014-07-11T05:40:08.063882 node-1 ./node-1.test.domain.local/puppet-apply.log:2014-07-11T05:40:08.063882+00:00 debug: (/Stage[main]/Ceph::Mon/Exec[ceph-deploy mon create]/unless) Error connecting to cluster: ObjectNotFound

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

The "ObjectNotFound" error is red herring, it's supposed to happen: Exec[ceph-deploy mon create] runs 'ceph mon stat' first to see if monitor already exists.

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

The real problem seems to be that ceph-deploy has crashed with the following backtrace:

2014-07-11T06:15:31.573851+00:00 notice: (/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd activate]/returns) Traceback (most recent call last):
2014-07-11T06:15:31.574079+00:00 notice: (/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd activate]/returns) File "/usr/bin/ceph-deploy", line 21, in <module>
2014-07-11T06:15:31.574840+00:00 notice: (/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd activate]/returns) sys.exit(main())
2014-07-11T06:15:31.574962+00:00 notice: (/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd activate]/returns) File "/usr/lib/python2.6/site-packages/ceph_deploy/util/decorators.py", line 83, in newfunc
2014-07-11T06:15:31.574962+00:00 notice: (/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd activate]/returns) return f(*a, **kw)
2014-07-11T06:15:31.575167+00:00 notice: (/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd activate]/returns) File "/usr/lib/python2.6/site-packages/ceph_deploy/cli.py", line 150, in main
2014-07-11T06:15:31.576201+00:00 notice: (/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd activate]/returns) return args.func(args)
2014-07-11T06:15:31.576315+00:00 notice: (/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd activate]/returns) File "/usr/lib/python2.6/site-packages/ceph_deploy/osd.py", line 371, in osd
2014-07-11T06:15:31.576315+00:00 notice: (/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd activate]/returns) activate(args, cfg)
2014-07-11T06:15:31.576596+00:00 notice: (/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd activate]/returns) File "/usr/lib/python2.6/site-packages/ceph_deploy/osd.py", line 278, in activate
2014-07-11T06:15:31.577323+00:00 notice: (/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd activate]/returns) cmd=cmd, ret=ret, out=out, err=err)
2014-07-11T06:15:31.577323+00:00 notice: (/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd activate]/returns) NameError: global name 'ret' is not defined
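
The NameError itself looks like a secondary bug in ceph-deploy's error reporting. A minimal sketch of the pattern (hypothetical names, assumed for illustration; this is not the actual ceph_deploy/osd.py source): the reporting path references ret/out/err, which are only bound when the remote command actually ran, so an earlier failure surfaces as NameError and masks the real activation error.

# Sketch only: shows how a reporting path that assumes ret/out/err exist
# turns an earlier failure into "NameError: global name 'ret' is not defined".

def run_remote(cmd):
    # Stand-in for remote execution; here it fails before producing a status.
    raise IOError("connection to OSD host lost")

def activate(cmd):
    try:
        out, err = run_remote(cmd)
    except IOError:
        out = err = None          # the real error is swallowed here
    # Reporting path written as if the command had run; since `ret` is never
    # assigned in this scope, the line below raises NameError instead of
    # reporting what actually went wrong.
    raise RuntimeError("command %r failed: ret=%r out=%r err=%r"
                       % (cmd, ret, out, err))

activate(["ceph-disk-activate", "--mark-init", "sysvinit", "/dev/vdb4"])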

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Across the whole cluster, 2 OSDs out of 12 failed to activate:
2014-07-11 06:18:46.948432 mon.0 10.108.7.3:6789/0 122 : [INF] osdmap e61: 12 osds: 10 up, 12 in

At the end of ceph.log on node-1, there are a number of slow-request warnings:
2014-07-11 06:19:38.746562 osd.9 10.108.7.7:6800/24689 2 : [WRN] slow request 30.154477 seconds old, received at 2014-07-11 06:19:08.592017: osd_op(client.4777.0:1 00109b68-f1b2-4e8d-90e4-ebb4b5eeffe0.rbd [stat] 3.82587b08 e61) v4 currently reached pg
...
2014-07-11 06:21:09.288881 osd.10 10.108.7.8:6802/24827 4 : [WRN] slow request 60.062250 seconds old, received at 2014-07-11 06:20:09.226564: osd_op(client.4783.0:14 rbd_data.12af32850276.0000000000000000 [write 0~8388608] 3.29b77010 e61) v4 currently reached pg

It could be that the Glance image upload timed out due to a Ceph OSD cluster rebalance following the failure of 2 OSDs.
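
To check that hypothesis, the relevant evidence is the osdmap "up" count dropping below the total and the slow-request warnings overlapping the upload window. A small helper (hypothetical, not part of Fuel or Ceph tooling) that pulls both out of a ceph.log:

# Scan a ceph.log for osdmap lines where fewer OSDs are "up" than exist, and
# for slow-request warnings, to see whether OSD failures and the resulting
# rebalance overlap with the Glance upload.
import re
import sys

OSDMAP = re.compile(r'osdmap e\d+: (\d+) osds: (\d+) up, (\d+) in')
SLOW = re.compile(r'slow request ([\d.]+) seconds old')

def scan(path):
    with open(path) as log:
        for line in log:
            when = line[:19]          # "YYYY-MM-DD HH:MM:SS" prefix of each entry
            m = OSDMAP.search(line)
            if m:
                total, up = int(m.group(1)), int(m.group(2))
                if up < total:
                    print('%s  %d of %d OSDs down' % (when, total - up, total))
            m = SLOW.search(line)
            if m:
                print('%s  slow request, %ss old' % (when, m.group(1)))

if __name__ == '__main__':
    scan(sys.argv[1] if len(sys.argv) > 1 else 'ceph.log')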

Changed in fuel:
importance: High → Medium
status: New → Incomplete
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Marking as Incomplete pending confirmation that this is actually reproducible and not due to an overloaded virtual test environment.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Cannot reproduce on {"build_id": "2014-07-14_00-31-14", "mirantis": "yes", "build_number": "122", "ostf_sha": "09b6bccf7d476771ac859bb3c76c9ebec9da9e1f", "nailgun_sha": "4637ae61920b92f56154a03f7e717e3bd597525c", "production": "docker", "api": "1.0", "fuelmain_sha": "8a8faa68f7939d095dde60665f16b86d068f696d", "astute_sha": "5df009e8eab611750309a4c5b5c9b0f7b9d85806", "release": "5.0.1", "fuellib_sha": "2d1e1369c13bc9771e9473086cb064d257a21fc2"}
(KVM)

Dmitry Ilyin (idv1985)
summary: - Upload Cirros TestVM image failed with - import pkg_resources Error
- communicating with http://10.108.6.2:9292 [Errno 32] Broken pipe
+ [astute] Upload Cirros TestVM image failed with - import pkg_resources
+ Error communicating with http://10.108.6.2:9292 [Errno 32] Broken pipe
Changed in fuel:
milestone: 5.1 → 6.0
Egor Kotko (ykotko)
Changed in fuel:
status: Incomplete → Confirmed
Revision history for this message
Egor Kotko (ykotko) wrote :

{"build_id": "2014-09-08_13-06-43", "ostf_sha": "2969c1ad443b632e815bb1f01149c3800cd7aa3a", "build_number": "199", "api": "1.0", "nailgun_sha": "c18a21381843dffe807b254a4ff96eec259953cb", "production": "docker", "fuelmain_sha": "b1e46c8c3ac86a8eafac225507294226322160ea", "astute_sha": "6db5f5031b74e67b92fcac1f7998eaa296d68025", "feature_groups": ["mirantis"], "release": "5.0.2", "fuellib_sha": "c891490242d7d88f3c9582cf023e8f2db0b9d337"}

I have reproduced this issue in a Fuel CLI environment.

My steps to reproduce:
1. Deploy a lab via the Fuel CLI:
Ubuntu HA, Neutron VLAN, Ceph for volumes and images,
3 controllers, 2 computes, 2 Ceph nodes

Here is my env:
http://172.18.164.197:8000/

Revision history for this message
Egor Kotko (ykotko) wrote :

The correct ISO version for the previous comment:

{"build_id": "2014-09-12_00-01-11", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "4", "auth_required": true, "api": "1.0", "nailgun_sha": "d389bc6489fe296c9c210f7c65ac84e154a8b82b", "production": "docker", "fuelmain_sha": "d899675a5a393625f8166b29099d26f45d527035", "astute_sha": "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13", "feature_groups": ["experimental"], "release": "5.1", "release_versions": {"2014.1.1-5.1": {"VERSION": {"build_id": "2014-09-12_00-01-11", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "4", "api": "1.0", "nailgun_sha": "d389bc6489fe296c9c210f7c65ac84e154a8b82b", "production": "docker", "fuelmain_sha": "d899675a5a393625f8166b29099d26f45d527035", "astute_sha": "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13", "feature_groups": ["experimental"], "release": "5.1", "fuellib_sha": "395fd9d20a003603cc9ad26e16cb13c1c45e24e6"}}}, "fuellib_sha": "395fd9d20a003603cc9ad26e16cb13c1c45e24e6"}

Revision history for this message
Egor Kotko (ykotko) wrote :
tags: added: release-notes
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Dmitry Borodaenko (dborodaenko)
Changed in fuel:
assignee: Dmitry Borodaenko (dborodaenko) → Fuel Library Team (fuel-library)