Incorrect "failed" message in notification [9.2]

Bug #1657716 reported by Sergey Galkin
Affects: Fuel for OpenStack
Status: Fix Committed
Importance: High
Assigned to: Vladimir Sharshov

Bug Description

Steps to reproduce:
1. Install 9.0
2. Upgrade to 9.2 from http://mirror.fuel-infra.org/mos-repos/centos/mos9.0-centos7/snapshots/proposed-2017-01-13-184421/x86_64
3. Deploy big cluster (~300 nodes in my case)
4. Try to add 100 nodes to cluster.
5. During deployment, two nodes went offline

Deployment failed with the following messages:
09:45:00
Graph execution has been successfully completed. You can check deployment history for detailed information.
09:44:59
Node 'Untitled (b1:f0)' failed: Task sriov_iommu_check failed on node 2183
09:44:59
Node 'Untitled (b2:70)' failed: Task sriov_iommu_check failed on node 2123
09:44:45
Node 'Untitled (15:30)' failed: Task netconfig failed on node 1992
09:36:56
Node 'Untitled (c1:f0)' failed: Task netconfig failed on node 2362
09:32:02
Node 'Untitled (15:30)' has gone away
09:32:02
Node 'Untitled (c1:f0)' has gone away
09:12:39
Node 'Untitled (89:60)' is back online
09:12:38
Node 'Untitled (ab:e0)' is back online

At the same time, the logs on node-2183 show:

2017-01-18 09:36:58 +0000 Scope(Class[main]) (notice): MODULAR: sriov_iommu_check.pp
2017-01-18 09:36:58 +0000 Puppet (notice): Compiled catalog for node-2183.domain.tld in environment production in 0.02 seconds
2017-01-18 09:36:59 +0000 /Stage[main]/Main/Exec[sriov_iommu_check]/returns (notice): executed successfully
2017-01-18 09:36:59 +0000 Puppet (notice): Finished catalog run in 0.49 seconds

Revision history for this message
Sergey Galkin (sgalkin) wrote :
Changed in fuel:
milestone: none → 9.2
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
Changed in fuel:
milestone: 9.2 → 9.3
importance: Undecided → High
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Vladimir Sharshov (vsharshov)
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

This is expected behaviour due to fault tolerance settings

https://github.com/openstack/fuel-library/blob/stable/mitaka/deployment/puppet/deployment_groups/tasks.yaml

We run the deployment across all 300 nodes (even when we only add 100 nodes, we need to adjust settings on the other nodes), so we end up with a fault tolerance of 6 compute nodes.

If you want to change this for a particular cluster, copy the deployment groups definition from
https://github.com/openstack/fuel-library/blob/stable/mitaka/deployment/puppet/deployment_groups/tasks.yaml
into a file named, e.g., default_cluster_<cluster_id>.yaml, and then upload it to Nailgun with:

fuel2 graph upload -t default -e <cluster_id> -f default_cluster_<cluster_id>.yaml
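As a sketch, the steps above can be scripted on the master node. The cluster ID below is a placeholder, and the final `fuel2` command is only printed rather than executed, since it works only on a real Fuel master:

```shell
# Sketch of the per-cluster graph override workflow; CLUSTER_ID is a
# placeholder, and the file body is a stand-in, not the real tasks.yaml.
CLUSTER_ID=42
graph_file="default_cluster_${CLUSTER_ID}.yaml"

# Step 1: save a copy of the default deployment groups definition
# (on a real master, paste the contents of tasks.yaml from fuel-library here).
printf '# paste deployment groups definition from tasks.yaml here\n' > "$graph_file"

# Step 2: upload the per-cluster override to Nailgun (printed only).
echo "fuel2 graph upload -t default -e ${CLUSTER_ID} -f ${graph_file}"
```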

Changed in fuel:
status: New → Invalid
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Yep, Vladimir is absolutely right.

tags: added: feature-task-based module-astute
Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

The issue is that a task is reported as failed while it actually completed successfully.

Changed in fuel:
status: Invalid → New
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

This is expected, and the message is correct. The node had a temporary network problem which blocked the normal workflow between the master node and the node being deployed. So the task ran without any problem on the node, but Astute could not read its status from the node.

Proof from the Astute log:

2017-01-18 08:04:39 DEBUG [5506] Retry #1 to run mcollective agent on nodes: '2118'
2017-01-18 08:05:21 DEBUG [5506] Retry #2 to run mcollective agent on nodes: '2118'
2017-01-18 08:06:04 DEBUG [5506] Retry #3 to run mcollective agent on nodes: '2118'
2017-01-18 08:06:46 DEBUG [5506] Retry #4 to run mcollective agent on nodes: '2118'
2017-01-18 08:07:28 DEBUG [5506] Retry #5 to run mcollective agent on nodes: '2118'
2017-01-18 08:08:11 DEBUG [5506] Retry #6 to run mcollective agent on nodes: '2118'
2017-01-18 08:08:53 DEBUG [5506] Retry #7 to run mcollective agent on nodes: '2118'
2017-01-18 08:09:35 DEBUG [5506] Retry #8 to run mcollective agent on nodes: '2118'
2017-01-18 08:10:18 DEBUG [5506] Retry #9 to run mcollective agent on nodes: '2118'
2017-01-18 08:11:01 DEBUG [5506] Retry #10 to run mcollective agent on nodes: '2118'

We have already prepared a fix for this behavior; it was done for this bug: https://bugs.launchpad.net/fuel/+bug/1653737

It is present in the 9.3 and 10 releases.

If for some reason deployment still fails after this fix while Puppet on the nodes finished successfully, please edit /etc/astute/astuted.conf on the master node and add the parameter:

    puppet_undefined_retries: 10

By default this value is 3. Feel free to increase it to protect the system from temporary network problems.
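A minimal sketch of that edit, using a local stand-in file instead of the real /etc/astute/astuted.conf (the config is YAML, and the guard makes the edit safe to rerun):

```shell
conf=./astuted.conf                 # stand-in for /etc/astute/astuted.conf
printf -- '---\n' > "$conf"         # stand-in for the existing YAML content

# Append the parameter only if it is not already set, so reruns are safe.
grep -q '^puppet_undefined_retries:' "$conf" || \
  echo 'puppet_undefined_retries: 10' >> "$conf"

cat "$conf"
```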

Review: https://review.openstack.org/#/c/417359/

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

A small addition: you need to restart the Astute service after the config change for it to take effect. You can do it with this command:

      service astute restart

Please pay attention: do not restart the Astute service if any task is in the running state.
You can check with this command:

      fuel task | grep 'running'
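Putting the two commands together, a sketch of a safe-restart check. The `fuel` and `service` commands exist only on the master node, so the task list is emulated with a variable here and the restart is replaced by a marker file:

```shell
# Stand-in for the output of `fuel task`; on a real master node use:
#   tasks=$(fuel task)
tasks="1 | deployment | ready"

if printf '%s\n' "$tasks" | grep -q 'running'; then
  echo "a task is still running; postponing restart"
else
  echo "no running tasks; safe to restart"
  # service astute restart    # run this on the real master node
  touch restart_ok            # marker used here in place of the real restart
fi
```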

Changed in fuel:
status: New → Fix Committed