mcollective failed to upload image but error was deployment timed out

Bug #1312443 reported by Andrew Woodward
This bug affects 5 people
Affects: Fuel for OpenStack
Status: Fix Committed
Importance: High
Assigned to: Vladimir Sharshov

Bug Description

{"build_id": "2014-04-24_19-11-02", "mirantis": "yes", "build_number": "138", "nailgun_sha": "4e231ac07474a2c01b72e0529b79bfd80809eba3", "production": "prod", "ostf_sha": "134765fcb5a07dce0cd1bb399b2290c988c3c63b", "fuelmain_sha": "a53a007a36fd0ff738fe6382bf8a9797aaebf789", "astute_sha": "6e8fa4cc12968d7b468fc590b2f06bb59bf74511", "release": "5.0", "fuellib_sha": "9c526a662d718a0225d02ff4267c132e46acba80"}

Ubuntu, Neutron VLAN
Ceph for images, volumes, Nova, and object storage

3 x controller, 2 x Ceph OSD + compute

ISO #138 has broken Ceph, which causes the glance image upload to fail.

The glance upload returned -1, but the reported error was "deployment timed out".

Expected result:
The error should be reported as a deployment error, and the mcollective log should clearly identify that the glance upload failed.
------------------------------------

2014-04-24 20:51:47 ERR
[5048] Timeout of deployment is exceeded.
2014-04-24 20:51:47 ERR
[5048] MCollective agents '1' didn't respond within the allotted time.
2014-04-24 20:50:47 DEBUG
[5048] Retry #1 to run mcollective agent on nodes: '1'
2014-04-24 20:49:47 DEBUG
[5048] c040b8e8-4a0e-417b-b94b-64e434675ea3: cmd: /usr/bin/glance -N http://192.168.0.2:5000/v2.0/ -T admin -I admin -K admin index && (/usr/bin/glance -N http://192.168.0.2:5000/v2.0/ -T admin -I admin -K admin index | grep TestVM)
                                               stdout: ID Name Disk Format Container Format Size
------------------------------------ ------------------------------ -------------------- -------------------- --------------
                                               stderr:
                                               exit code: 1
2014-04-24 20:49:47 DEBUG
[5048] c040b8e8-4a0e-417b-b94b-64e434675ea3: MC agent 'execute_shell_command', method 'execute', results: {:sender=>"1", :statuscode=>0, :statusmsg=>"OK", :data=>{:stdout=>"ID Name Disk Format Container Format Size \n------------------------------------ ------------------------------ -------------------- -------------------- --------------\n", :exit_code=>1, :stderr=>""}}
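
Note that in the result above, :statuscode => 0 / "OK" only means the MCollective RPC round-trip to the 'execute_shell_command' agent succeeded; the actual failure is the command's :exit_code => 1 (the grep for TestVM matched nothing in the empty image list). A minimal sketch of checking both fields, assuming a result hash shaped like the one logged above (illustrative only, not the actual Astute code):

# Illustrative sketch only, not the fuel-astute source: distinguish an
# MCollective transport problem from a failed shell command on the node.
def check_shell_result(result)
  # :statuscode reflects the MCollective RPC itself ("did the agent answer?").
  unless result[:statuscode] == 0
    raise "MCollective agent error on node #{result[:sender]}: #{result[:statusmsg]}"
  end

  # :exit_code reflects the command the agent ran ("did glance/grep succeed?").
  exit_code = result[:data][:exit_code]
  return if exit_code == 0

  raise "Shell command failed on node #{result[:sender]} " \
        "(exit code #{exit_code}): #{result[:data][:stderr]}"
end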

Tags: scale astute
Revision history for this message
Andrew Woodward (xarses) wrote :
Mike Scherbakov (mihgen)
Changed in fuel:
assignee: Fuel Astute Team (fuel-astute) → Vladimir Sharshov (vsharshov)
Changed in fuel:
status: New → Confirmed
Changed in fuel:
milestone: 5.0 → 5.1
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

This happens due to clock skew between Ceph nodes; it should be an environment or Ceph configuration issue.

Revision history for this message
Andrew Woodward (xarses) wrote :

The issue reported is not that glance failed to upload to Ceph; there were no ceph-osds due to another bug. The issue in this report is that mcollective failed with a "deployment timeout" error instead of the "task failed" message it was producing previously. This is additionally hard to see since the failed task output is at the DEBUG log level, which hides the true failure from the user.
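
To illustrate the visibility point only (the method and variable names below are hypothetical, not the actual Astute code): if the failed command's output is emitted solely at DEBUG, a user at the default verbosity sees nothing until the eventual timeout, whereas logging the failure itself at ERROR would surface it immediately.

# Hypothetical sketch of the logging point; names are illustrative, not
# taken from the actual Astute code.
require 'logger'

def report_command_result(logger, node_id, cmd, exit_code, stdout, stderr)
  # Details at DEBUG stay hidden at the default verbosity:
  logger.debug("cmd: #{cmd}, stdout: #{stdout}, stderr: #{stderr}, exit code: #{exit_code}")
  return if exit_code == 0
  # A visible failure, instead of only the eventual deployment timeout:
  logger.error("Shell command failed on node #{node_id} (exit code #{exit_code}): #{cmd}")
end

report_command_result(Logger.new($stdout), "1",
                      "glance index | grep TestVM", 1, "", "")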

Andrew Woodward (xarses)
description: updated
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Aleksey Kasatkin reproduced this on #198. Logs attached.

Revision history for this message
Aleksey Kasatkin (alekseyk-ru) wrote :

Regarding comment #4 from Vladimir: it was #199. Thanks.

Revision history for this message
Aleksey Kasatkin (alekseyk-ru) wrote :

Same for #203 (CentOS, HA, Neutron), but the deployment hung:

ERR [412] Error running provisioning: undefined method `warning' for #<Logger:0x00000001c4ec98>, trace: ["/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/rsyslogd.rb:31:in `rescue in send_sighup'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/rsyslogd.rb:23:in `send_sighup'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/orchestrator.rb:151:in `remove_nodes'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/orchestrator.rb:56:in `provision'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/dispatcher.rb:86:in `provision'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:126:in `dispatch_message'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:89:in `block in dispatch'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:64:in `call'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:64:in `block in each'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:56:in `each'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:56:in `each'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:87:in `each_with_index'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:87:in `dispatch'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:72:in `block in perform_main_job'"]
ERR [412] MCollective agents 'master' didn't respond within the allotted time.
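
The "undefined method `warning'" trace above is a separate defect in the rescue branch of rsyslogd.rb: Ruby's standard library Logger provides warn, not warning, so the error handler itself raises. A minimal reproduction with the stock Logger:

require 'logger'

logger = Logger.new($stdout)
logger.warn("rsyslogd restart failed")      # method the stdlib Logger actually defines
logger.warning("rsyslogd restart failed")   # NoMethodError, as in the trace above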

Changed in fuel:
milestone: 5.1 → 5.0
importance: Medium → High
Revision history for this message
Aleksey Kasatkin (alekseyk-ru) wrote :

#203 again. Same result with slightly different configuration (Ubuntu, HA, Neutron-GRE, 3 ctrl+cind, 1 comp).

Revision history for this message
Aleksey Kasatkin (alekseyk-ru) wrote :
Revision history for this message
Aleksey Kasatkin (alekseyk-ru) wrote :

#203 again. (CentOS, HA, Nova-VLAN, 3 ctrl+ceph, 1 comp)

2014-05-14 16:54:53 ERR
[402] Timeout of deployment is exceeded.
2014-05-14 16:54:53 ERR
[402] MCollective agents '1' didn't respond within the allotted time.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (master)

Fix proposed to branch: master
Review: https://review.openstack.org/93676

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Aleksey Kasatkin (alekseyk-ru) wrote :

#205.

2014-05-15 12:20:16 ERR
[412] Timeout of deployment is exceeded.
2014-05-15 12:20:16 ERR
[412] MCollective agents '6' didn't respond within the allotted time.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/93676
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=c0418e1739cceb864fd2c548a7c6b9ad6a46ed86
Submitter: Jenkins
Branch: master

commit c0418e1739cceb864fd2c548a7c6b9ad6a46ed86
Author: Vladimir Sharshov <email address hidden>
Date: Thu May 15 12:44:26 2014 +0400

    Avoid to report incorrect info about deployment timeout

    Now we will report about mcollective problem
    with shell command exectuion as error execution
    instead of timeout error.

    Change-Id: Ie56a378de673071a10d6465689c1fafad89699a8
    Closes-Bug: #1312443
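
In other words, per the commit message, a failure during shell command execution is now classified as an execution error at the point where it happens, rather than surfacing later as a generic deployment timeout. A rough sketch of that shape, with hypothetical names (this is not the merged patch):

# Hypothetical shape of the change, not the merged patch itself: a failed
# shell command is raised as an execution error right away instead of
# surfacing later as "Timeout of deployment is exceeded".
ExecutionError = Class.new(StandardError)   # illustrative error class

def check_execution!(results)
  failed = results.select { |r| r[:data][:exit_code] != 0 }
  return if failed.empty?
  nodes = failed.map { |r| r[:sender] }.join(', ')
  raise ExecutionError, "Shell command execution failed on nodes: #{nodes}"
end

# Example with the result shape seen earlier in this report:
check_execution!([{ :sender => "1", :statuscode => 0,
                    :data => { :exit_code => 1, :stderr => "" } }])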

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Changed in fuel:
status: Fix Committed → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (master)

Fix proposed to branch: master
Review: https://review.openstack.org/93960

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

I suggest closing this bug after https://review.openstack.org/93960 and discussing the MCollective problem here: https://bugs.launchpad.net/fuel/+bug/1320123

Revision history for this message
Aleksey Kasatkin (alekseyk-ru) wrote :

ISO #212.

2014-05-19 14:13:02 ERR
[400] Error running RPC method deploy: Upload cirros "TestVM" image failed, trace: ["/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/post_deploy_actions/upload_cirros_image.rb:91:in `process'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/post_deploy_actions.rb:29:in `block in process'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/post_deploy_actions.rb:29:in `each'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/post_deploy_actions.rb:29:in `process'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/orchestrator.rb:46:in `deploy'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/dispatcher.rb:105:in `deploy'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:126:in `dispatch_message'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:89:in `block in dispatch'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:64:in `call'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:64:in `block in each'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:56:in `each'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:56:in `each'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:87:in `each_with_index'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:87:in `dispatch'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:72:in `block in perform_main_job'"]
2014-05-19 14:13:02 ERR
[400] 01025403-65d8-49fb-86c7-68b5d186760a: Upload cirros "TestVM" image failed
2014-05-19 14:13:02 ERR
[400] MCollective agents '1' didn't respond within the allotted time.
2014-05-19 14:11:02 ERR
[400] MCollective agents '1' didn't respond within the allotted time.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/93960
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=eb09c664315b6d17024d7cdd9cdae70039a4884a
Submitter: Jenkins
Branch: master

commit eb09c664315b6d17024d7cdd9cdae70039a4884a
Author: Vladimir Sharshov <email address hidden>
Date: Fri May 16 20:50:39 2014 +0400

    Avoid to report incorrect info about deployment timeout(part 2)

    Now we will report about mcollective problem
    with upload file exectuion as error execution
    instead of timeout error.

    Change-Id: Ib7b3882946f60f92e27ee69830cab57a916a059f
    Closes-Bug: #1312443

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Leontii Istomin (listomin) wrote :

I have faced this issue on the scale lab with 6.0.1 #76

[root@fuel ~]# fuel --fuel-version
api: '1.0'
astute_sha: f7cda2171b0b677dfaeb59693d980a2d3ee4c3e0
auth_required: true
build_id: 2015-02-07_20-50-01
build_number: '76'
feature_groups:
- mirantis
fuellib_sha: 64f3ebe9fcbd18bf6c80a948e06061783a090347
fuelmain_sha: c799e3a6d88289e58db764a6be7910aab7da3149
nailgun_sha: 2ef819732a3ee7acf7b610e7d1c1a6da0434c1a0
ostf_sha: 3b57985d4d2155510894a1f6d03b478b201f7780
production: docker
release: 6.0.1
release_versions:
  2014.2-6.0.1:
    VERSION:
      api: '1.0'
      astute_sha: f7cda2171b0b677dfaeb59693d980a2d3ee4c3e0
      build_id: 2015-02-07_20-50-01
      build_number: '76'
      feature_groups:
      - mirantis
      fuellib_sha: 64f3ebe9fcbd18bf6c80a948e06061783a090347
      fuelmain_sha: c799e3a6d88289e58db764a6be7910aab7da3149
      nailgun_sha: 2ef819732a3ee7acf7b610e7d1c1a6da0434c1a0
      ostf_sha: 3b57985d4d2155510894a1f6d03b478b201f7780
      production: docker
      release: 6.0.1

Baremetal, Ubuntu, HA, Neutron-GRE, Ceph-all, Debug, 6.0.1_76 + NOVA_QUOTAS
Controllers:3 Computes:22

In astute log:
http://paste.openstack.org/show/173029/
I uploaded the image manually without errors.
Diagnostic snapshot is here: https://drive.google.com/a/mirantis.com/file/d/0Bx4ptZV1Jt7hM0VjTVFvLUNwZHM/view?usp=sharing

Changed in fuel:
status: Fix Committed → Confirmed
Dina Belova (dbelova)
tags: added: scale
Changed in fuel:
status: Confirmed → Fix Committed
no longer affects: fuel/6.0.x