Compute node in error state without any reasons

Bug #1389651 reported by Sergey Galkin
This bug affects 2 people
Affects: Fuel for OpenStack
Status: Fix Committed
Importance: Medium
Assigned to: Vladimir Sharshov
Milestone: 6.0

Bug Description

Snapshot attached to https://bugs.launchpad.net/fuel/+bug/1389640

api: '1.0'
astute_sha: c72dac7b31646fbedbfc56a2a87676c6d5713fcf
auth_required: true
build_id: 2014-11-02_21-27-58
build_number: '69'
feature_groups:
- mirantis
fuellib_sha: 45ad9b42666d7e3e14ab9af2911808e6c8806842
fuelmain_sha: ac3ba5f5c6073b7776ec69fc3cb4dd3c56df36c5
nailgun_sha: 35946b1f225c984f11915ba8e985584160f0b129
ostf_sha: 9c6fadca272427bb933bc459e14bb1bad7f614aa
production: docker
release: '6.0'

Steps to reproduce:
1. Try to install a 100-node cluster: 3 controllers in HA + 97 computes with Cinder iSCSI and Neutron GRE

Deployment failed with the message:
"Deployment has failed. Method deploy. Disabling the upload of disk image because glance was not installed properly.
Inspect Astute logs for the details"

The compute_65 node is in the error state without any reason given.

I have found only errors common to all compute nodes:

2014-11-05 10:37:58 ERR
ntpdate[1416]: no server suitable for synchronization found
2014-11-05 10:37:38 ERR
ntpdate[1414]: no server suitable for synchronization found
2014-11-05 10:37:28 ERR
ntpdate[1412]: no server suitable for synchronization found
2014-11-05 10:37:28 ERR
ntpdate[1411]: no server suitable for synchronization found
2014-11-05 10:37:28 ERR
kernel: mei_me 0000:00:16.0: initialization failed.

Screenshot attached

Tags: scale
Revision history for this message
Sergey Galkin (sgalkin) wrote :
Changed in fuel:
importance: Undecided → High
assignee: nobody → Fuel Library Team (fuel-library)
milestone: none → 6.0
status: New → Confirmed
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

We have to investigate why this node-100 (compute_65) failed provisioning.
Also, nodes that fail the provisioning task should *not* stop the deployment (per the design session meeting minutes) but should be skipped and added later as an additional 'scaling' step.

Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

What's clear is that it's not clear at all what broke here, causing the compute node to go into the error state. You saw that node-100 (compute_65) had an error, but none of the logs indicate why. All nodes provisioned without issues, and the Astute logs prove it. In fact, receiverd.log in nailgun has a message showing that the cluster was provisioned:
receiverd.log:2014-11-05 11:04:06.755 INFO [7f39aa25e700] (receiver) RPC method provision_resp received: {"status": "ready", "progress": 100, "task_uuid": "2801508b-ba51-4155-836b-e140dcf8d440", "nodes": [{"status": "provisioned", "progress": 100, "uid": "28"}]}

The deploy task did fail with a message about the glance image upload, but the root cause of the actual deploy failure is that the pacemaker service provider is broken:
err: (/Stage[main]/Rabbitmq::Service/Service[p_rabbitmq-server]) Provider pacemaker is not functional on this host

We really should reproduce this issue on a known good ISO where puppet failures can be ruled out.

Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

These lines are the most important in astute.log:
2014-11-05T11:05:20 info: [427] Casting message to Nailgun: {"method"=>"deploy_resp", "args"=>{"task_uuid"=>"bbd9275c-9fe7-4ec3-908a-0e21c9617d35", "nodes"=>[{"uid"=>"32", "status"=>"deploying", "role"=>"primary-controller", "progress"=>0}]}}
2014-11-05T11:14:06 info: [427] Casting message to Nailgun: {"method"=>"deploy_resp", "args"=>{"task_uuid"=>"bbd9275c-9fe7-4ec3-908a-0e21c9617d35", "nodes"=>[{"uid"=>"32", "status"=>"error", "error_type"=>"deploy", "role"=>"primary-controller"}]}}
2014-11-05T11:14:49 info: [427] Casting message to Nailgun: {"method"=>"deploy_resp", "args"=>{"task_uuid"=>"bbd9275c-9fe7-4ec3-908a-0e21c9617d35", "nodes"=>[{"uid"=>"100", "status"=>"deploying", "role"=>"compute", "progress"=>50}]}}
2014-11-05T11:14:49 info: [427] Casting message to Nailgun: {"method"=>"deploy_resp", "args"=>{"task_uuid"=>"bbd9275c-9fe7-4ec3-908a-0e21c9617d35", "status"=>"error", "error"=>"Method deploy. Disabling the upload of disk image because glance was not installed properly.\nInspect Astute logs for the details"}}

Astute should fail its task when the primary controller deployment fails, and not continue trying to deploy a compute node or attempt the glance image upload. In this case, it continues and creates unexpected errors in the UI.
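
A minimal Ruby sketch of the guard suggested above, using hypothetical node hashes modelled on the deploy_resp messages quoted earlier; this is not the actual Astute code:

# Hypothetical sketch: once a primary controller reports an error,
# skip the remaining compute deployment and the glance image upload.
def primary_controller_failed?(nodes)
  nodes.any? { |n| n['role'] == 'primary-controller' && n['status'] == 'error' }
end

controller_group = [{ 'uid' => '32',  'role' => 'primary-controller', 'status' => 'error' }]
compute_group    = [{ 'uid' => '100', 'role' => 'compute',            'status' => 'pending' }]

if primary_controller_failed?(controller_group)
  puts 'Primary controller failed: abort the task, skip computes and image upload'
else
  puts "Continue deploying #{compute_group.size} compute node(s)"
end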

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Vladimir Sharshov (vsharshov)
importance: High → Medium
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Looks like a duplicate of bug #1389308.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (master)

Fix proposed to branch: master
Review: https://review.openstack.org/133438

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/133438
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=ad1a56c4a872ac1f73d79bb2d54a04bed1bc3dc3
Submitter: Jenkins
Branch: master

commit ad1a56c4a872ac1f73d79bb2d54a04bed1bc3dc3
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Mon Nov 10 13:39:58 2014 +0300

    Correct reporting of the glance problem

    Instead of always reporting a glance image
    problem on every failed deployment, we now report
    according to the following rules:

    - report about the last node only if no controller is present in the
      task. This prevents an unexpected error report about a compute
      node.
    - report about the controller only if the controller is not already
      marked as an error. This prevents a misleading glance error when
      deployment on the controller has already failed.

    Change-Id: I5079bcc11b70d889e432fab57ece81ff956b7115
    Closes-Bug: #1389651
    Closes-Bug: #1389640
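
A minimal Ruby sketch of the two rules described in the commit message, with hypothetical names (report_glance_hint?, failed_uids); it is not the actual fuel-astute code:

# Hypothetical sketch: decide whether the "glance was not installed
# properly" hint should be reported for a failed deployment task.
def report_glance_hint?(task_nodes, failed_uids)
  controllers = task_nodes.select { |n| n['role'].to_s.include?('controller') }
  # No controller in the task: report on the last node as before.
  return true if controllers.empty?
  # Otherwise report only if no controller is already marked as an error;
  # if one is, the controller failure itself is the real problem.
  controllers.none? { |n| failed_uids.include?(n['uid']) }
end

nodes = [{ 'uid' => '32',  'role' => 'primary-controller' },
         { 'uid' => '100', 'role' => 'compute' }]
puts report_glance_hint?(nodes, ['32'])  # false: the controller already failed
puts report_glance_hint?(nodes, [])      # true: no controller error, the hint applies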

Changed in fuel:
status: In Progress → Fix Committed