Incorrect "failed" message in notification [9.2]

Bug #1657716 reported by Sergey Galkin
Affects: Fuel for OpenStack
Status: Fix Committed
Importance: High
Assigned to: Vladimir Sharshov

Bug Description

Steps to reproduce:
1. Install 9.0
2. Upgrade to 9.2 from http://mirror.fuel-infra.org/mos-repos/centos/mos9.0-centos7/snapshots/proposed-2017-01-13-184421/x86_64
3. Deploy big cluster (~300 nodes in my case)
4. Try to add 100 nodes to cluster.
5. During deployment, two nodes went offline

Deployment failed with the following messages:
09:45:00
Graph execution has been successfully completed. You can check deployment history for detailed information.
09:44:59
Node 'Untitled (b1:f0)' failed: Task sriov_iommu_check failed on node 2183
09:44:59
Node 'Untitled (b2:70)' failed: Task sriov_iommu_check failed on node 2123
09:44:45
Node 'Untitled (15:30)' failed: Task netconfig failed on node 1992
09:36:56
Node 'Untitled (c1:f0)' failed: Task netconfig failed on node 2362
09:32:02
Node 'Untitled (15:30)' has gone away
09:32:02
Node 'Untitled (c1:f0)' has gone away
09:12:39
Node 'Untitled (89:60)' is back online
09:12:38
Node 'Untitled (ab:e0)' is back online

At the same time, the logs on node-2183 show:

2017-01-18 09:36:58 +0000 Scope(Class[main]) (notice): MODULAR: sriov_iommu_check.pp
2017-01-18 09:36:58 +0000 Puppet (notice): Compiled catalog for node-2183.domain.tld in environment production in 0.02 seconds
2017-01-18 09:36:59 +0000 /Stage[main]/Main/Exec[sriov_iommu_check]/returns (notice): executed successfully
2017-01-18 09:36:59 +0000 Puppet (notice): Finished catalog run in 0.49 seconds

Revision history for this message
Sergey Galkin (sgalkin) wrote :
Changed in fuel:
milestone: none → 9.2
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
Changed in fuel:
milestone: 9.2 → 9.3
importance: Undecided → High
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Vladimir Sharshov (vsharshov)
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

This is expected behaviour due to fault tolerance settings

https://github.com/openstack/fuel-library/blob/stable/mitaka/deployment/puppet/deployment_groups/tasks.yaml

We run the deployment across all 300 nodes (even when we only add 100 nodes, we need to adjust settings on the other nodes), so we end up with a fault tolerance of 6 compute nodes.

If you want to change this for a particular cluster, copy the deployment groups definition from
https://github.com/openstack/fuel-library/blob/stable/mitaka/deployment/puppet/deployment_groups/tasks.yaml
into a file named, e.g., default_cluster_<cluster_id>.yaml, and then upload it to Nailgun with:

fuel2 graph upload -t default -e <cluster_id> -f default_cluster_<cluster_id>.yaml
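As a sketch, the steps above can be scripted on the master node. The cluster ID below is a placeholder, and the final `fuel2` command is only printed rather than executed, since it works only on a real Fuel master:

```shell
# Sketch of the per-cluster graph override workflow; CLUSTER_ID is a
# placeholder, and the file body is a stand-in, not the real tasks.yaml.
CLUSTER_ID=42
graph_file="default_cluster_${CLUSTER_ID}.yaml"

# Step 1: save a copy of the default deployment groups definition
# (on a real master, paste the contents of tasks.yaml from fuel-library here).
printf '# paste deployment groups definition from tasks.yaml here\n' > "$graph_file"

# Step 2: upload the per-cluster override to Nailgun (printed only).
echo "fuel2 graph upload -t default -e ${CLUSTER_ID} -f ${graph_file}"
```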

Changed in fuel:
status: New → Invalid
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Yep, Vladimir is absolutely right.

tags: added: feature-task-based module-astute
Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

The issue is that a task is reported as failed while it actually completed successfully.

Changed in fuel:
status: Invalid → New
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

This is expected, and the message is correct. The node had a temporary network problem which blocked the normal workflow between the master node and the node being deployed. So the task ran without any problem on the node, but Astute could not read its status from the node.

Proof from the Astute log:

2017-01-18 08:04:39 DEBUG [5506] Retry #1 to run mcollective agent on nodes: '2118'
2017-01-18 08:05:21 DEBUG [5506] Retry #2 to run mcollective agent on nodes: '2118'
2017-01-18 08:06:04 DEBUG [5506] Retry #3 to run mcollective agent on nodes: '2118'
2017-01-18 08:06:46 DEBUG [5506] Retry #4 to run mcollective agent on nodes: '2118'
2017-01-18 08:07:28 DEBUG [5506] Retry #5 to run mcollective agent on nodes: '2118'
2017-01-18 08:08:11 DEBUG [5506] Retry #6 to run mcollective agent on nodes: '2118'
2017-01-18 08:08:53 DEBUG [5506] Retry #7 to run mcollective agent on nodes: '2118'
2017-01-18 08:09:35 DEBUG [5506] Retry #8 to run mcollective agent on nodes: '2118'
2017-01-18 08:10:18 DEBUG [5506] Retry #9 to run mcollective agent on nodes: '2118'
2017-01-18 08:11:01 DEBUG [5506] Retry #10 to run mcollective agent on nodes: '2118'

We have already prepared a fix for this behavior; it was done for this bug: https://bugs.launchpad.net/fuel/+bug/1653737

It is present in the 9.3 and 10 releases.

If for some reason deployment still fails after this fix while Puppet on the nodes finished successfully, please edit /etc/astute/astuted.conf on the master node and add the parameter:

    puppet_undefined_retries: 10

By default this value is 3. Feel free to increase it to protect the system from temporary network problems.
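A minimal sketch of that edit, using a local stand-in file instead of the real /etc/astute/astuted.conf (the config is YAML, and the guard makes the edit safe to rerun):

```shell
conf=./astuted.conf                 # stand-in for /etc/astute/astuted.conf
printf -- '---\n' > "$conf"         # stand-in for the existing YAML content

# Append the parameter only if it is not already set, so reruns are safe.
grep -q '^puppet_undefined_retries:' "$conf" || \
  echo 'puppet_undefined_retries: 10' >> "$conf"

cat "$conf"
```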

Review: https://review.openstack.org/#/c/417359/

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

A small addition: you need to restart the Astute service after the config change for it to take effect. You can do it with this command:

      service astute restart

Please pay attention: do not restart the Astute service if any task is in the running state.
You can check with this command:

      fuel task | grep 'running'
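Putting the two commands together, a sketch of a safe-restart check. The `fuel` and `service` commands exist only on the master node, so the task list is emulated with a variable here and the restart is replaced by a marker file:

```shell
# Stand-in for the output of `fuel task`; on a real master node use:
#   tasks=$(fuel task)
tasks="1 | deployment | ready"

if printf '%s\n' "$tasks" | grep -q 'running'; then
  echo "a task is still running; postponing restart"
else
  echo "no running tasks; safe to restart"
  # service astute restart    # run this on the real master node
  touch restart_ok            # marker used here in place of the real restart
fi
```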

Changed in fuel:
status: New → Fix Committed