[9.0][BVT] Too many nodes failed to provision

Bug #1653737 reported by Roman Podoliaka
This bug affects 2 people
Affects             Status         Importance  Assigned to        Milestone
Fuel for OpenStack  Fix Committed  High        Vladimir Sharshov
Mitaka              Fix Released   High        Vladimir Sharshov
Newton              Fix Released   High        Vladimir Sharshov
Ocata               Fix Committed  High        Vladimir Sharshov

Bug Description

BVT build https://product-ci.infra.mirantis.net/job/9.x.main.ubuntu.bvt_2/716 failed with:

AssertionError: Task 'deploy' has incorrect status. error != ready, 'Provision has failed. Too many nodes failed to provision'

astute.log contains the following error:

2017-01-03 03:47:50 ERROR [29198] Nodes ["2"] reached the response timeout
2017-01-03 03:47:50 WARNING [29198] Unable to get actual status of puppet on node 2. Reason: Nodes ["2"] reached the response timeout
2017-01-03 03:47:50 ERROR [29198] Node 2, task provision_2, manifest /etc/puppet/shell_manifests/provision_2_manifest.pp, status: undefined
2017-01-03 03:47:50 WARNING [29198] Error to get puppet status. Node 2, task provision_2, manifest /etc/puppet/shell_manifests/provision_2_manifest.pp.
2017-01-03 03:47:50 ERROR [29198] Node has failed to deploy. There is no more retries for puppet run. Node 2, task provision_2, manifest /etc/puppet/shell_manifests/provision_2_manifest.pp

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :
Changed in fuel:
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :
Changed in fuel:
status: Confirmed → New
milestone: none → 9.2
Revision history for this message
Vladimir Kozhukalov (kozhukalov) wrote :

We encounter the same issue here as we had in https://bugs.launchpad.net/fuel/+bug/1644618

There are errors in the mcollective log:

2017-01-03T03:47:12.862357+00:00 err: 03:47:12.614364 #1760] ERROR -- : rabbitmq.rb:50:in `on_hbread_fail' Heartbeat read failed from 'stomp://mcollective@10.109.21.2:61613': {"ticker_interval"=>29.5, "read_fail_count"=>0, "lock_fail"=>true, "lock_fail_count"=>1}

Revision history for this message
Georgy Kibardin (gkibardin) wrote :

It seems that it recovered from the problem above, and the reason is a bit different:

2017-01-03 03:47:52 ERROR [29198] 3c273858-0f41-470b-abf0-50fcafeeefa1: Provision command returned non zero exit code on node: 2
(from astute.log)

Revision history for this message
Georgy Kibardin (gkibardin) wrote :

A network problem definitely took place; however, it seems there were no retry attempts afterwards:

2017-01-03 03:47:50 ERROR [29198] Nodes ["2"] reached the response timeout
2017-01-03 03:47:50 WARNING [29198] Unable to get actual status of puppet on node 2. Reason: Nodes ["2"] reached the response timeout
2017-01-03 03:47:50 ERROR [29198] Node 2, task provision_2, manifest /etc/puppet/shell_manifests/provision_2_manifest.pp, status: undefined
2017-01-03 03:47:50 WARNING [29198] Error to get puppet status. Node 2, task provision_2, manifest /etc/puppet/shell_manifests/provision_2_manifest.pp.
2017-01-03 03:47:50 ERROR [29198] Node has failed to deploy. There is no more retries for puppet run. Node 2, task provision_2, manifest /etc/puppet/shell_manifests/provision_2_manifest.pp

and that is it; in the case of more retries I would expect to see more messages like "Nodes ["2"] reached the response timeout".

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :
Changed in fuel:
status: New → Confirmed
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Vladimir Sharshov (vsharshov)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/415860
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=7c0485eb1a8c75a57936eaecce935091659429e9
Submitter: Jenkins
Branch: master

commit 7c0485eb1a8c75a57936eaecce935091659429e9
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Fri Dec 30 14:11:36 2016 +0300

    Network problem tolerance for puppet status check

    The connection between a node and Astute can occasionally be
    lost, so we need more attempts to get information about the
    task status on the node.

    Two changes:

    - instead of 1 try, Astute will run 6 tries with a 10-second
      timeout for every attempt;
    - for puppet this behavior is handled by a separate retry
      setting: puppet_undefined_retries

    Unlike a full puppet retry, retrying the status check is safe
    because it is idempotent.

    Puppet undefined retries can be configured via the Astute config
    or by sending undefined_retries in the puppet task parameters,
    the same way as for the usual retries. Most importantly, the
    counter is reset to its original value every time Astute gets a
    defined answer.

    Change-Id: Ie86576a3400be5a6b11041c8e6acf89abf3bbd51
    Related-Bug: #1653210
    Closes-Bug: #1653737

Changed in fuel:
status: In Progress → Fix Committed
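
For clarity, the retry scheme described in the commit message above boils down to the following. This is a minimal Python sketch of the idea only: Astute itself is written in Ruby, and the function names and the default value for the undefined-retry budget here are illustrative, not Astute's actual code (the logs later in this bug suggest a budget of 3).

import time

POLL_RETRIES = 6        # tries per status check (was 1 before the fix)
POLL_TIMEOUT = 10       # seconds to wait between tries
UNDEFINED_RETRIES = 3   # illustrative value for puppet_undefined_retries

def check_status_once(node):
    # Placeholder for a single status request; in Astute this asks the
    # puppet agent on the node via mcollective and may get no answer at all.
    return None

def get_puppet_status(node):
    # Poll several times instead of once; None means the node never answered.
    for _ in range(POLL_RETRIES):
        status = check_status_once(node)
        if status is not None:
            return status
        time.sleep(POLL_TIMEOUT)
    return None

def wait_for_task(node):
    undefined_left = UNDEFINED_RETRIES
    while True:
        status = get_puppet_status(node)
        if status is None:
            # Node did not answer: spend one "undefined" retry instead of
            # failing the deployment immediately.
            if undefined_left == 0:
                raise RuntimeError("Node has failed to deploy: status undefined")
            undefined_left -= 1
            continue
        # A defined answer refreshes the undefined-retry budget.
        undefined_left = UNDEFINED_RETRIES
        if status in ("ready", "error"):
            return status

The overall task deadline (the "Start puppet with timeout 5400 sec" seen in the logs) is omitted here for brevity.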
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/417358

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/417359

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (stable/mitaka)

Reviewed: https://review.openstack.org/417359
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=1b2bf8eaeec647cb1810347ebd4bdf96b0055c8e
Submitter: Jenkins
Branch: stable/mitaka

commit 1b2bf8eaeec647cb1810347ebd4bdf96b0055c8e
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Fri Dec 30 14:11:36 2016 +0300

    Network problem tolerance for puppet status check

    The connection between a node and Astute can occasionally be
    lost, so we need more attempts to get information about the
    task status on the node.

    Two changes:

    - instead of 1 try, Astute will run 6 tries with a 10-second
      timeout for every attempt;
    - for puppet this behavior is handled by a separate retry
      setting: puppet_undefined_retries

    Unlike a full puppet retry, retrying the status check is safe
    because it is idempotent.

    Puppet undefined retries can be configured via the Astute config
    or by sending undefined_retries in the puppet task parameters,
    the same way as for the usual retries. Most importantly, the
    counter is reset to its original value every time Astute gets a
    defined answer.

    Change-Id: Ie86576a3400be5a6b11041c8e6acf89abf3bbd51
    Related-Bug: #1653210
    Closes-Bug: #1653737
    (cherry picked from commit 7c0485eb1a8c75a57936eaecce935091659429e9)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (stable/newton)

Reviewed: https://review.openstack.org/417358
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=0497a610d2deeee528ea3af68cd95b1b8165dd41
Submitter: Jenkins
Branch: stable/newton

commit 0497a610d2deeee528ea3af68cd95b1b8165dd41
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Fri Dec 30 14:11:36 2016 +0300

    Network problem tolerance for puppet status check

    The connection between a node and Astute can occasionally be
    lost, so we need more attempts to get information about the
    task status on the node.

    Two changes:

    - instead of 1 try, Astute will run 6 tries with a 10-second
      timeout for every attempt;
    - for puppet this behavior is handled by a separate retry
      setting: puppet_undefined_retries

    Unlike a full puppet retry, retrying the status check is safe
    because it is idempotent.

    Puppet undefined retries can be configured via the Astute config
    or by sending undefined_retries in the puppet task parameters,
    the same way as for the usual retries. Most importantly, the
    counter is reset to its original value every time Astute gets a
    defined answer.

    Change-Id: Ie86576a3400be5a6b11041c8e6acf89abf3bbd51
    Related-Bug: #1653210
    Closes-Bug: #1653737
    (cherry picked from commit 7c0485eb1a8c75a57936eaecce935091659429e9)

tags: added: on-verification
Revision history for this message
TatyanaGladysheva (tgladysheva) wrote :

Verified on 9.2 snapshot #741.

Currently the BVT build passes:
https://product-ci.infra.mirantis.net/job/9.x.main.ubuntu.bvt_2/744.

The described issue with failed provisioning has not been observed since BVT build #730.

tags: removed: on-verification
tags: added: swarm-blocker
Revision history for this message
Kyrylo Romanenko (kromanenko) wrote :

Reopening this bug for 9.2.

Steps:
1. Deployed 9.0 with additional parameters:
PUPPET_TIMEOUT=10000
WAIT_FOR_PROVISIONING_TIMEOUT=1800
IRONIC_ENABLED=true
IRONIC_NODES_COUNT=1

2. Upgraded to 9.2 following official instructions:
https://docs.mirantis.com/openstack/fuel/fuel-master/release-notes/update-product/update-prerequisites-9-2.html#update-prerequisites-9-2
https://docs.mirantis.com/openstack/fuel/fuel-master/release-notes/update-product/update-master-9-2.html#update-master-9-2

$ yum install -y http://mirror.fuel-infra.org/mos-repos/centos/mos9.0-centos7/9.2-updates/x86_64/Packages/mos-release-9.2-1.el7.x86_64.rpm
$ yum clean all
$ yum install -y mos-playbooks
$ ansible-playbook playbooks/mos9_prepare_fuel.yml
$ ansible-playbook playbooks/update_fuel.yml -e '{"rebuild_bootstrap":false}'
$ ansible-playbook playbooks/mos9_fuel_upgrade_kernel_4.4.yml

3. Set up a cluster with roles: 1 Controller, 1 Compute, 1 Cinder, 1 Ironic (Neutron VLAN).
4. Started deployment and got the same errors as described in the current issue:
http://paste.openstack.org/show/596852/

And this is what I can see in the Agent log of the Controller:
http://paste.openstack.org/show/596853/

Revision history for this message
Kyrylo Romanenko (kromanenko) wrote :
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

I've checked the log and see a serious connection problem between the master node and the node. Astute tries to connect and check the status of the node starting from:

2017-01-27 14:55:55 INFO [29965] Start puppet with timeout 5400 sec. Node 1, task provision_1, manifest /etc/puppet/shell_manifests/provision_1_manifest.pp
2017-01-27 14:56:06 DEBUG [29965] Retry #1 to run mcollective agent on nodes: '1'
2017-01-27 14:56:17 DEBUG [29965] Retry #2 to run mcollective agent on nodes: '1'
2017-01-27 14:56:28 DEBUG [29965] Retry #3 to run mcollective agent on nodes: '1'
2017-01-27 14:56:38 DEBUG [29965] Retry #4 to run mcollective agent on nodes: '1'
2017-01-27 14:56:48 DEBUG [29965] Retry #5 to run mcollective agent on nodes: '1'
2017-01-27 14:56:58 DEBUG [29965] Retry #6 to run mcollective agent on nodes: '1'

These tries are repeated 3 times, each containing the 6 tries above.

2017-01-27 15:02:09 DEBUG [29965] Puppet on node has undefined status. 2 retries remained. Node 1, task provision_1, manifest /etc/puppet/shell_manifests/provision_1_manifest.pp
2017-01-27 15:07:03 DEBUG [29965] Puppet on node has undefined status. 1 retries remained. Node 1, task provision_1, manifest /etc/puppet/shell_manifests/provision_1_manifest.pp
2017-01-27 15:11:57 DEBUG [29965] Puppet on node has undefined status. 0 retries remained. Node 1, task provision_1, manifest /etc/puppet/shell_manifests/provision_1_manifest.pp
2017-01-27 15:16:48 ERROR [29965] Node 1, task provision_1, manifest /etc/puppet/shell_manifests/provision_1_manifest.pp, status: undefined

Same behavior for all 4 nodes.

Resolution: Astute made 12 tries for each of the 4 nodes, and it took 21 minutes. The nodes did not answer.
It looks like we have a serious problem with the network/provision client/mcollective.
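
As a quick cross-check of the "21 minutes" figure, the per-node window in the log above (from the puppet start to the final undefined-status error for node 1) can be computed directly:

from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S"
start = datetime.strptime("2017-01-27 14:55:55", fmt)  # "Start puppet with timeout 5400 sec" on node 1
end = datetime.strptime("2017-01-27 15:16:48", fmt)    # final "status: undefined" error for node 1
print((end - start).total_seconds() / 60)              # ~20.9 minutes, consistent with "21 minutes"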

Interesting detail: we have an error in the mcollective log on the node:
2017-01-27T15:15:36.255295+00:00 debug: 15:15:35.807222 #1693] DEBUG -- : runner.rb:54:in `block in run' PLMC6: Message does not pass filters, ignoring

Looking at the mcollective code (https://github.com/puppetlabs/marionette-collective/blob/master/lib/mcollective/runner.rb#L196), it looks like we can get such an error if the node id somehow changed to an unexpected value.
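
To make that hypothesis concrete: an mcollective daemon drops any request whose filter (for example, an identity filter naming the target node) does not match its own identity, which is exactly the "Message does not pass filters, ignoring" line above. A simplified, purely illustrative check in Python (not mcollective's actual implementation):

def passes_identity_filter(request_filter, local_identity):
    # Accept the request only if it names this node, or names no node at all (broadcast).
    wanted = request_filter.get("identity", [])
    return not wanted or local_identity in wanted

# If the node's identity differs from what the master addressed (e.g. the node id
# changed unexpectedly), the request is silently ignored:
passes_identity_filter({"identity": ["1"]}, "node-1")   # False -> "does not pass filters"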

Can you provide access to env where error was reproduced?

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

The error was not reproduced. Moving the task back to its previous state.

We can create a new bug to investigate this error further, given the changes already made in this bug.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/fuel-astute 11.0.0.0rc1

This issue was fixed in the openstack/fuel-astute 11.0.0.0rc1 release candidate.

Revision history for this message
TatyanaGladysheva (tgladysheva) wrote :

Verified on 10.0 build #1571.

The described issue has not been observed in BVT runs for a long time.
https://product-ci.infra.mirantis.net/job/10.0.main.ubuntu.bvt_2/1485/
