[418] Error running provisioning: <type 'exceptions.OSError'>:[Errno 2] No such file or directory: '/var/www/cobbler/images/centos-x86_64/initrd.img', trace: /usr/lib64/ruby/2.1.0/xmlrpc/client.rb:271:in `call'

Bug #1340704 reported by Andrey Sledzinskiy
Affects                    Status         Importance   Assigned to          Milestone
Fuel for OpenStack         Fix Released   High         Vladimir Sharshov
Fuel for OpenStack 5.0.x   Fix Committed  High         Vladimir Sharshov

Bug Description

http://jenkins-product.srt.mirantis.net:8080/view/5.0_swarm/job/5.0_fuelmain.system_test.ubuntu.thread_2/39/testReport/junit/%28root%29/deploy_reset_on_ready/deploy_reset_on_ready/

{

    "build_id": "2014-07-10_00-39-56",
    "mirantis": "yes",
    "build_number": "112",
    "ostf_sha": "09b6bccf7d476771ac859bb3c76c9ebec9da9e1f",
    "nailgun_sha": "f5ff82558f99bb6ca7d5e1617eddddf7142fe857",
    "production": "docker",
    "api": "1.0",
    "fuelmain_sha": "293015843304222ead899270449495af91b06aed",
    "astute_sha": "5df009e8eab611750309a4c5b5c9b0f7b9d85806",
    "release": "5.0.1",
    "fuellib_sha": "364dee37435cbdc85d6b814a61f57800b83bf22d"

}

Steps:
1. Create a new cluster: Ubuntu, Simple mode, Nova VLAN network, 1 controller, 1 compute
2. Deploy the cluster
3. After successful deployment, reset the cluster
4. After the reset, deploy it again

Actual result: deployment failed with
[418] Error running provisioning: <type 'exceptions.OSError'>:[Errno 2] No such file or directory: '/var/www/cobbler/images/centos-x86_64/initrd.img', trace: /usr/lib64/ruby/2.1.0/xmlrpc/client.rb:271:in `call'
/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/cobbler.rb:108:in `sync'
/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/cobbler_manager.rb:108:in `sync'
/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/cobbler_manager.rb:46:in `add_nodes'
/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/orchestrator.rb:56:in `provision'
/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/dispatcher.rb:86:in `provision'
/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:126:in `dispatch_message'
/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:89:in `block in dispatch'
/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:64:in `call'
/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:64:in `block in each'
/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:56:in `each'
/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/task_queue.rb:56:in `each'
/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:87:in `each_with_index'
/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:87:in `dispatch'
/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/server.rb:72:in `block in perform_main_job'

Logs are attached

Tags: astute
Andrey Sledzinskiy (asledzinskiy) wrote :
description: updated
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Vladimir Sharshov (vsharshov)
Vladimir Sharshov (vsharshov) wrote :

It looks like the same task was performed several times in parallel:

2014-07-11 03:44:48 DEBUG
[399] Cobbler syncing
2014-07-11 03:44:48 DEBUG
[399] Trying to reboot node: node-3
2014-07-11 03:44:48 DEBUG
[399] Trying to reboot node: node-1
2014-07-11 03:44:41 DEBUG
[404] Cobbler syncing
2014-07-11 03:44:41 DEBUG
[407] Cobbler syncing
2014-07-11 03:44:41 DEBUG
[418] Cobbler syncing
2014-07-11 03:44:41 DEBUG
[415] Cobbler syncing
2014-07-11 03:44:41 DEBUG
[404] Saving system node-3
2014-07-11 03:44:41 DEBUG
[399] Cobbler syncing
2014-07-11 03:44:41 DEBUG
[407] Saving system node-3
2014-07-11 03:44:41 DEBUG
[418] Saving system node-3

OR

2014-07-11 03:44:38 INFO
[407] : Starting removing of nodes: ["1", "3"]
2014-07-11 03:44:38 INFO
[404] : Starting removing of nodes: ["1", "3"]
2014-07-11 03:44:38 INFO
[399] : Starting removing of nodes: ["1", "3"]
2014-07-11 03:44:38 INFO
[418] : Starting removing of nodes: ["1", "3"]
2014-07-11 03:44:38 INFO
[415] : Starting removing of nodes: ["1", "3"]

Evidence:

2014-07-11 03:44:32 DEBUG [415] Main worker task id is b5577130-20a9-452c-a25d-908e6f62e1f8
2014-07-11 03:44:32 DEBUG [399] Main worker task id is b5577130-20a9-452c-a25d-908e6f62e1f8
2014-07-11 03:44:32 DEBUG [407] Main worker task id is b5577130-20a9-452c-a25d-908e6f62e1f8
2014-07-11 03:44:30 DEBUG [404] Main worker task id is b5577130-20a9-452c-a25d-908e6f62e1f8
2014-07-11 03:44:32 DEBUG [418] Main worker task id is b5577130-20a9-452c-a25d-908e6f62e1f8

A potential reason why this happens:

2014-07-11 03:44:32 WARNING [418] Trying to reconnect to message broker. Retry 5 sec later...
2014-07-11 03:44:32 WARNING [415] Trying to reconnect to message broker. Retry 5 sec later...
2014-07-11 03:44:32 WARNING [399] Trying to reconnect to message broker. Retry 5 sec later...
2014-07-11 03:44:31 WARNING [404] Trying to reconnect to message broker. Retry 5 sec later...

There are 34 warnings about connection problems like the one above, and 7 warnings like 'TCP connection to AMQP failed'.

Timeline:

2014-07-11 03:44:30 DEBUG [404] Dispatching message: {"args"=>{"task_uuid"=>"b5577130-20a9-452c-a25d-908e6f62e1f8", ...
2014-07-11 03:44:31 WARNING [404] Trying to reconnect to message broker. Retry 5 sec later...
2014-07-11 03:44:32 DEBUG [407] Dispatching message: {"args"=>{"task_uuid"=>"b5577130-20a9-452c-a25d-908e6f62e1f8",
2014-07-11 03:44:32 WARNING [412] Trying to reconnect to message broker. Retry 5 sec later...
2014-07-11 03:44:32 DEBUG [399] Dispatching message: {"args"=>{"task_uuid"=>"b5577130-20a9-452c-a25d-908e6f62e1f8",
2014-07-11 03:44:32 WARNING [399] Trying to reconnect to message broker. Retry 5 sec later...
2014-07-11 03:44:32 WARNING [415] Trying to reconnect to message broker. Retry 5 sec later..
2014-07-11 03:44:32 WARNING [418] Trying to reconnect to message broker. Retry 5 sec later..
2014-07-11 03:44:32 DEBUG [415] Dispatching message: {"args"=>{"task_uuid"=>"b5577130-20a9-452c-a25d-908e6f62e1f8"
2014-07-11 03:44:32 DEBUG [418] Dispatching message: {"args"=>{"task_uuid"=>"b5577130-20a9-452c-a25d-908e6f62e1f8"

It looks like almost every time a worker reconnects, it gets a copy of the last message from the queue.
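
To illustrate: below is a minimal, self-contained sketch in plain Ruby (no AMQP library; FakeBroker, Worker and every other name here are hypothetical and not part of Astute) of the behaviour described above, plus a simple guard that skips redelivered copies of the same task UUID instead of starting them again.

# Sketch only: simulates a broker redelivering one unacked main-task
# message to several reconnecting workers, and a shared "seen UUIDs"
# set that prevents the duplicates from running.
require 'set'

class FakeBroker
  def initialize
    @queue = []
  end

  def publish(message)
    @queue << message
  end

  # Deliver the next message but keep it queued until acked,
  # mimicking manual acknowledgements.
  def deliver
    @queue.first
  end

  def ack
    @queue.shift
  end
end

class Worker
  def initialize(id, broker, seen_tasks)
    @id = id
    @broker = broker
    @seen_tasks = seen_tasks # shared Set of task UUIDs already started
  end

  def run_once
    message = @broker.deliver or return
    uuid = message[:task_uuid]
    if @seen_tasks.include?(uuid)
      # Guard against redelivered copies: the same main task
      # must not be started a second time.
      puts "[#{@id}] Skipping duplicate of task #{uuid}"
    else
      @seen_tasks << uuid
      puts "[#{@id}] Dispatching task #{uuid}"
    end
    @broker.ack
  end
end

broker = FakeBroker.new
seen = Set.new
uuid = 'b5577130-20a9-452c-a25d-908e6f62e1f8'
# Simulate the same unacked message being redelivered five times,
# once to each reconnecting worker process.
5.times { broker.publish(task_uuid: uuid) }
%w[399 404 407 415 418].each { |pid| Worker.new(pid, broker, seen).run_once }

Running it prints one "Dispatching" line for the first worker and "Skipping duplicate" for the remaining four, mirroring the five [399]/[404]/[407]/[415]/[418] workers in the logs above.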

tags: added: astute
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (master)

Fix proposed to branch: master
Review: https://review.openstack.org/106714

Changed in fuel:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/106714
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=18a6861ecf96e20ead17782ef4ea8d62e7449c14
Submitter: Jenkins
Branch: master

commit 18a6861ecf96e20ead17782ef4ea8d62e7449c14
Author: Vladimir Sharshov <email address hidden>
Date: Mon Jul 14 12:52:58 2014 +0400

    Prevent to requeue main task (run same main task in parallel).

    If worker have problem with AMQP connection it can sometimes
    do not get acknowledgements for message(main task) instantly
    and if worker disconect at this time, message will be requeued
    and we get several working tasks running in parallel.

    Change-Id: If038079e263feb811d019d08944ff1dacd11f30a
    Closes-Bug: #1340704
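
For reference, here is a small runnable sketch in plain Ruby of one way to realize the idea in the commit message: acknowledge the main-task message as soon as the dispatcher picks it up, before the long-running work starts, so a broken AMQP connection mid-task can no longer cause the broker to requeue it. All method and struct names below are hypothetical; the actual change is in review 106714 and may be implemented differently.

# Illustration only, not the real patch: early acknowledgement of the
# main-task message so it cannot be redelivered after a disconnect.
Delivery = Struct.new(:payload, :acked)

def acknowledge(delivery)
  delivery.acked = true
  puts "acked task #{delivery.payload[:task_uuid]}"
end

def run_main_task(payload)
  puts "provisioning for task #{payload[:task_uuid]}..."
end

def dispatch(delivery)
  acknowledge(delivery)          # ack first: no redelivery after this point
  run_main_task(delivery.payload)
rescue => e
  # A failed task is reported through normal error handling instead of
  # relying on broker redelivery, which is what started duplicate runs.
  warn "task failed: #{e.message}"
end

dispatch(Delivery.new({ task_uuid: 'b5577130-20a9-452c-a25d-908e6f62e1f8' }, false))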

Changed in fuel:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (stable/5.0)

Fix proposed to branch: stable/5.0
Review: https://review.openstack.org/106899

Vladimir Sharshov (vsharshov) wrote :

The QA team (Nastya Urlapova) voted yesterday not to merge these changes into 5.0.

Nastya Urlapova (aurlapova) wrote :

Okay, let's backport the fix to 5.0.

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (stable/5.0)

Reviewed: https://review.openstack.org/106899
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=9a74b788be9a7c5682f1c52a892df36e4766ce3f
Submitter: Jenkins
Branch: stable/5.0

commit 9a74b788be9a7c5682f1c52a892df36e4766ce3f
Author: Vladimir Sharshov <email address hidden>
Date: Mon Jul 14 12:52:58 2014 +0400

    Prevent to requeue main task (run same main task in parallel).

    If worker have problem with AMQP connection it can sometimes
    do not get acknowledgements for message(main task) instantly
    and if worker disconect at this time, message will be requeued
    and we get several working tasks running in parallel.

    Change-Id: If038079e263feb811d019d08944ff1dacd11f30a
    Closes-Bug: #1340704

Anastasia Palkina (apalkina) wrote :

Verified on ISO #393
"build_id": "2014-08-05_02-01-14",
"ostf_sha": "a3fa823ea0e4e03beb637ae07a91adea82c33182",
"build_number": "393",
"auth_required": true,
"api": "1.0",
"nailgun_sha": "5d1e8e38ea3f2cd88ceca8e0927d53e9770598bc",
"production": "docker",
"fuelmain_sha": "6b05280621c23ab158e9943f88b6f71a22fd4c69",
"astute_sha": "99a790ad1b7526cbbd5bf8add0cb2b4e503fccd4",
"feature_groups": ["mirantis"],
"release": "5.1",
"fuellib_sha": "25eb632c70fe0bb3b2c1053314b558a160e4af50"

Changed in fuel:
status: Fix Committed → Fix Released