[9.0][BVT] Too many nodes failed to provision

Bug #1653737 reported by Roman Podoliaka
This bug affects 2 people
Affects             Status         Importance  Assigned to        Milestone
Fuel for OpenStack  Fix Committed  High        Vladimir Sharshov
Mitaka              Fix Released   High        Vladimir Sharshov
Newton              Fix Released   High        Vladimir Sharshov
Ocata               Fix Committed  High        Vladimir Sharshov

Bug Description

BVT build https://product-ci.infra.mirantis.net/job/9.x.main.ubuntu.bvt_2/716 failed with:

AssertionError: Task 'deploy' has incorrect status. error != ready, 'Provision has failed. Too many nodes failed to provision'

astute.log contains the following error:

2017-01-03 03:47:50 ERROR [29198] Nodes ["2"] reached the response timeout
2017-01-03 03:47:50 WARNING [29198] Unable to get actual status of puppet on node 2. Reason: Nodes ["2"] reached the response timeout
2017-01-03 03:47:50 ERROR [29198] Node 2, task provision_2, manifest /etc/puppet/shell_manifests/provision_2_manifest.pp, status: undefined
2017-01-03 03:47:50 WARNING [29198] Error to get puppet status. Node 2, task provision_2, manifest /etc/puppet/shell_manifests/provision_2_manifest.pp.
2017-01-03 03:47:50 ERROR [29198] Node has failed to deploy. There is no more retries for puppet run. Node 2, task provision_2, manifest /etc/puppet/shell_manifests/provision_2_manifest.pp

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :
Changed in fuel:
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :
Changed in fuel:
status: Confirmed → New
milestone: none → 9.2
Revision history for this message
Vladimir Kozhukalov (kozhukalov) wrote :

We encounter the same issue here as we had in https://bugs.launchpad.net/fuel/+bug/1644618

There are errors in the mcollective log:

2017-01-03T03:47:12.862357+00:00 err: 03:47:12.614364 #1760] ERROR -- : rabbitmq.rb:50:in `on_hbread_fail' Heartbeat read failed from 'stomp://mcollective@10.109.21.2:61613': {"ticker_interval"=>29.5, "read_fail_count"=>0, "lock_fail"=>true, "lock_fail_count"=>1}

Revision history for this message
Georgy Kibardin (gkibardin) wrote :

It seems that it recovered from the problem above, and the reason is a bit different:

2017-01-03 03:47:52 ERROR [29198] 3c273858-0f41-470b-abf0-50fcafeeefa1: Provision command returned non zero exit code on node: 2
(from astute.log)

Revision history for this message
Georgy Kibardin (gkibardin) wrote :

A network problem definitely took place; however, it seems there were no retry attempts afterwards:

2017-01-03 03:47:50 ERROR [29198] Nodes ["2"] reached the response timeout
2017-01-03 03:47:50 WARNING [29198] Unable to get actual status of puppet on node 2. Reason: Nodes ["2"] reached the response timeout
2017-01-03 03:47:50 ERROR [29198] Node 2, task provision_2, manifest /etc/puppet/shell_manifests/provision_2_manifest.pp, status: undefined
2017-01-03 03:47:50 WARNING [29198] Error to get puppet status. Node 2, task provision_2, manifest /etc/puppet/shell_manifests/provision_2_manifest.pp.
2017-01-03 03:47:50 ERROR [29198] Node has failed to deploy. There is no more retries for puppet run. Node 2, task provision_2, manifest /etc/puppet/shell_manifests/provision_2_manifest.pp

and that is it; in the case of more retries I would expect to see more messages like "Nodes ["2"] reached the response timeout".

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :
Changed in fuel:
status: New → Confirmed
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Vladimir Sharshov (vsharshov)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/415860
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=7c0485eb1a8c75a57936eaecce935091659429e9
Submitter: Jenkins
Branch: master

commit 7c0485eb1a8c75a57936eaecce935091659429e9
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Fri Dec 30 14:11:36 2016 +0300

    Network problem tolerance for puppet status check

    The connection between a node and Astute can occasionally be
    lost, so we need more attempts to get information about the
    task status on the node.

    Two changes:

    - instead of 1 try, Astute will run 6 tries with a 10-second
      timeout for every attempt;
    - for puppet this behavior is handled by a separate retry
      setting: puppet_undefined_retries

    Unlike a full puppet retry, retrying the status check is safe
    because it is idempotent.

    Puppet undefined retries can be configured via the Astute config
    or by sending undefined_retries in the puppet task parameters,
    the same way as for the usual retries. Most importantly, the
    counter is reset to its original value every time Astute gets a
    defined answer.

    Change-Id: Ie86576a3400be5a6b11041c8e6acf89abf3bbd51
    Related-Bug: #1653210
    Closes-Bug: #1653737

Changed in fuel:
status: In Progress → Fix Committed
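
For clarity, the retry scheme described in the commit message above boils down to the following. This is a minimal Python sketch of the idea only: Astute itself is written in Ruby, and the function names and the default value for the undefined-retry budget here are illustrative, not Astute's actual code (the logs later in this bug suggest a budget of 3).

import time

POLL_RETRIES = 6        # tries per status check (was 1 before the fix)
POLL_TIMEOUT = 10       # seconds to wait between tries
UNDEFINED_RETRIES = 3   # illustrative value for puppet_undefined_retries

def check_status_once(node):
    # Placeholder for a single status request; in Astute this asks the
    # puppet agent on the node via mcollective and may get no answer at all.
    return None

def get_puppet_status(node):
    # Poll several times instead of once; None means the node never answered.
    for _ in range(POLL_RETRIES):
        status = check_status_once(node)
        if status is not None:
            return status
        time.sleep(POLL_TIMEOUT)
    return None

def wait_for_task(node):
    undefined_left = UNDEFINED_RETRIES
    while True:
        status = get_puppet_status(node)
        if status is None:
            # Node did not answer: spend one "undefined" retry instead of
            # failing the deployment immediately.
            if undefined_left == 0:
                raise RuntimeError("Node has failed to deploy: status undefined")
            undefined_left -= 1
            continue
        # A defined answer refreshes the undefined-retry budget.
        undefined_left = UNDEFINED_RETRIES
        if status in ("ready", "error"):
            return status

The overall task deadline (the "Start puppet with timeout 5400 sec" seen in the logs) is omitted here for brevity.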
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/417358

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/417359

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (stable/mitaka)

Reviewed: https://review.openstack.org/417359
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=1b2bf8eaeec647cb1810347ebd4bdf96b0055c8e
Submitter: Jenkins
Branch: stable/mitaka

commit 1b2bf8eaeec647cb1810347ebd4bdf96b0055c8e
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Fri Dec 30 14:11:36 2016 +0300

    Network problem tolerance for puppet status check

    The connection between a node and Astute can occasionally be
    lost, so we need more attempts to get information about the
    task status on the node.

    Two changes:

    - instead of 1 try, Astute will run 6 tries with a 10-second
      timeout for every attempt;
    - for puppet this behavior is handled by a separate retry
      setting: puppet_undefined_retries

    Unlike a full puppet retry, retrying the status check is safe
    because it is idempotent.

    Puppet undefined retries can be configured via the Astute config
    or by sending undefined_retries in the puppet task parameters,
    the same way as for the usual retries. Most importantly, the
    counter is reset to its original value every time Astute gets a
    defined answer.

    Change-Id: Ie86576a3400be5a6b11041c8e6acf89abf3bbd51
    Related-Bug: #1653210
    Closes-Bug: #1653737
    (cherry picked from commit 7c0485eb1a8c75a57936eaecce935091659429e9)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (stable/newton)

Reviewed: https://review.openstack.org/417358
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=0497a610d2deeee528ea3af68cd95b1b8165dd41
Submitter: Jenkins
Branch: stable/newton

commit 0497a610d2deeee528ea3af68cd95b1b8165dd41
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Fri Dec 30 14:11:36 2016 +0300

    Network problem tolerance for puppet status check

    The connection between a node and Astute can occasionally be
    lost, so we need more attempts to get information about the
    task status on the node.

    Two changes:

    - instead of 1 try, Astute will run 6 tries with a 10-second
      timeout for every attempt;
    - for puppet this behavior is handled by a separate retry
      setting: puppet_undefined_retries

    Unlike a full puppet retry, retrying the status check is safe
    because it is idempotent.

    Puppet undefined retries can be configured via the Astute config
    or by sending undefined_retries in the puppet task parameters,
    the same way as for the usual retries. Most importantly, the
    counter is reset to its original value every time Astute gets a
    defined answer.

    Change-Id: Ie86576a3400be5a6b11041c8e6acf89abf3bbd51
    Related-Bug: #1653210
    Closes-Bug: #1653737
    (cherry picked from commit 7c0485eb1a8c75a57936eaecce935091659429e9)

tags: added: on-verification
Revision history for this message
TatyanaGladysheva (tgladysheva) wrote :

Verified on 9.2 snapshot #741.

Currently the BVT build passes:
https://product-ci.infra.mirantis.net/job/9.x.main.ubuntu.bvt_2/744.

The described issue with failed provisioning has not been observed since BVT build #730.

tags: removed: on-verification
tags: added: swarm-blocker
Revision history for this message
Kyrylo Romanenko (kromanenko) wrote :

Reopening this bug for 9.2.

Steps:
1. Deployed 9.0 with additional parameters:
PUPPET_TIMEOUT=10000
WAIT_FOR_PROVISIONING_TIMEOUT=1800
IRONIC_ENABLED=true
IRONIC_NODES_COUNT=1

2. Upgraded to 9.2 following official instructions:
https://docs.mirantis.com/openstack/fuel/fuel-master/release-notes/update-product/update-prerequisites-9-2.html#update-prerequisites-9-2
https://docs.mirantis.com/openstack/fuel/fuel-master/release-notes/update-product/update-master-9-2.html#update-master-9-2

$ yum install -y http://mirror.fuel-infra.org/mos-repos/centos/mos9.0-centos7/9.2-updates/x86_64/Packages/mos-release-9.2-1.el7.x86_64.rpm
$ yum clean all
$ yum install -y mos-playbooks
$ ansible-playbook playbooks/mos9_prepare_fuel.yml
$ ansible-playbook playbooks/update_fuel.yml -e '{"rebuild_bootstrap":false}'
$ ansible-playbook playbooks/mos9_fuel_upgrade_kernel_4.4.yml

3. Set up a cluster with roles: 1 Controller, 1 Compute, 1 Cinder, 1 Ironic (Neutron VLAN).
4. Started deployment and got the same errors as described in the current issue:
http://paste.openstack.org/show/596852/

And this is what I can see in the Agent log of the Controller:
http://paste.openstack.org/show/596853/

Revision history for this message
Kyrylo Romanenko (kromanenko) wrote :
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

I've checked the log and see a serious connection problem between the master node and the node. Astute tries to connect and check the status of the node starting from:

2017-01-27 14:55:55 INFO [29965] Start puppet with timeout 5400 sec. Node 1, task provision_1, manifest /etc/puppet/shell_manifests/provision_1_manifest.pp
2017-01-27 14:56:06 DEBUG [29965] Retry #1 to run mcollective agent on nodes: '1'
2017-01-27 14:56:17 DEBUG [29965] Retry #2 to run mcollective agent on nodes: '1'
2017-01-27 14:56:28 DEBUG [29965] Retry #3 to run mcollective agent on nodes: '1'
2017-01-27 14:56:38 DEBUG [29965] Retry #4 to run mcollective agent on nodes: '1'
2017-01-27 14:56:48 DEBUG [29965] Retry #5 to run mcollective agent on nodes: '1'
2017-01-27 14:56:58 DEBUG [29965] Retry #6 to run mcollective agent on nodes: '1'

These tries are repeated 3 times, each containing the 6 tries above.

2017-01-27 15:02:09 DEBUG [29965] Puppet on node has undefined status. 2 retries remained. Node 1, task provision_1, manifest /etc/puppet/shell_manifests/provision_1_manifest.pp
2017-01-27 15:07:03 DEBUG [29965] Puppet on node has undefined status. 1 retries remained. Node 1, task provision_1, manifest /etc/puppet/shell_manifests/provision_1_manifest.pp
2017-01-27 15:11:57 DEBUG [29965] Puppet on node has undefined status. 0 retries remained. Node 1, task provision_1, manifest /etc/puppet/shell_manifests/provision_1_manifest.pp
2017-01-27 15:16:48 ERROR [29965] Node 1, task provision_1, manifest /etc/puppet/shell_manifests/provision_1_manifest.pp, status: undefined

Same behavior for all 4 nodes.

Resolution: Astute made 12 tries for each of the 4 nodes, and it took 21 minutes. The nodes did not answer.
It looks like we have a serious problem with the network/provision client/mcollective.
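
As a quick cross-check of the "21 minutes" figure, the per-node window in the log above (from the puppet start to the final undefined-status error for node 1) can be computed directly:

from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S"
start = datetime.strptime("2017-01-27 14:55:55", fmt)  # "Start puppet with timeout 5400 sec" on node 1
end = datetime.strptime("2017-01-27 15:16:48", fmt)    # final "status: undefined" error for node 1
print((end - start).total_seconds() / 60)              # ~20.9 minutes, consistent with "21 minutes"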

Interesting detail: we have an error in the mcollective log on the node:
2017-01-27T15:15:36.255295+00:00 debug: 15:15:35.807222 #1693] DEBUG -- : runner.rb:54:in `block in run' PLMC6: Message does not pass filters, ignoring

Looking at the mcollective code (https://github.com/puppetlabs/marionette-collective/blob/master/lib/mcollective/runner.rb#L196), it looks like we can get such an error if the node id somehow changed to an unexpected value.
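
To make that hypothesis concrete: an mcollective daemon drops any request whose filter (for example, an identity filter naming the target node) does not match its own identity, which is exactly the "Message does not pass filters, ignoring" line above. A simplified, purely illustrative check in Python (not mcollective's actual implementation):

def passes_identity_filter(request_filter, local_identity):
    # Accept the request only if it names this node, or names no node at all (broadcast).
    wanted = request_filter.get("identity", [])
    return not wanted or local_identity in wanted

# If the node's identity differs from what the master addressed (e.g. the node id
# changed unexpectedly), the request is silently ignored:
passes_identity_filter({"identity": ["1"]}, "node-1")   # False -> "does not pass filters"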

Can you provide access to env where error was reproduced?

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

The error was not reproduced. Moving the task back to its previous state.

We can create a new bug to investigate this error further, given the changes already made in this bug.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/fuel-astute 11.0.0.0rc1

This issue was fixed in the openstack/fuel-astute 11.0.0.0rc1 release candidate.

Revision history for this message
TatyanaGladysheva (tgladysheva) wrote :

Verified on 10.0 build #1571.

The described issue has not been observed in BVT runs for a long time.
https://product-ci.infra.mirantis.net/job/10.0.main.ubuntu.bvt_2/1485/
