Wrong node-discover accessibility check logic

Bug #1538055 reported by Aleksey Zvyagintsev
This bug affects 5 people
Affects               Status      Importance  Assigned to               Milestone
Fuel for OpenStack    Confirmed   High        Fuel Sustaining
  8.0.x               Won't Fix   High        Fuel Python (Deprecated)
  Mitaka              Won't Fix   High        Fuel Python (Deprecated)

Bug Description

Currently there is no way to check the mcollective status before reporting 'Node discovered and on-line' to the Fuel master.

This raises errors in the following flow:

1) Bootstrap node is started
2) Nailgun-agent reports that the node is discovered [1]
3) Fuel-qa/user adds the nodes to a cluster and presses 'Deploy'
4) Mcollective is still not running [0] for some reason (long start-up, etc.)
As a result, an error is raised:
[Method verify_networks. Network verification not avaliable because nodes ["1", "2", "3", "4", "5"] not avaliable via mcollective. ]

I propose to fix the issue in the simplest way:
add a lock/check for the mcollective status to nailgun-agent.cron before running nailgun-agent (see the sketch after the reference links below).

Another way: improve the agent to check mcollective itself
(this solution is more complex and may not be necessary - mcollective may be deprecated in a future Fuel release).

[0]
https://github.com/openstack/fuel-agent/blob/master/contrib/fuel_bootstrap/files/trusty/usr/bin/fix-configs-on-startup#L65

[1]
https://github.com/openstack/fuel-nailgun-agent/blob/master/nailgun-agent.cron
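
A minimal sketch of what such a cron guard could look like, assuming the Upstart 'status mcollective' command available on the trusty bootstrap image (the script itself, its exit-code convention, and all paths are illustrative, not existing code):

#!/usr/bin/env ruby
# Hypothetical guard for nailgun-agent.cron: exit 0 when the mcollective
# daemon is up, non-zero otherwise, so the cron entry can skip the agent
# run until the next tick.
out = `status mcollective 2>/dev/null`   # Upstart prints "... process <pid>"
pid = out[/process (\d+)/, 1]
exit 1 if pid.nil?
begin
  Process.kill(0, pid.to_i)              # signal 0 only probes the pid
rescue Errno::ESRCH, Errno::EPERM
  exit 1
end
exit 0

nailgun-agent.cron would then start the agent only when the guard exits 0, e.g. "/usr/bin/mco_guard && agent" (both paths here are hypothetical).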

tags: added: bootstrap module-nailgun-agent tech-debt
Changed in fuel:
status: New → Confirmed
milestone: none → 8.0
importance: Undecided → High
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Agree with additional checks in nailgun-agent.cron.

Revision history for this message
Aleksey Zvyagintsev (azvyagintsev) wrote :

Also, the problem should be investigated for already deployed systems - it looks like they are also affected.

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

They can't be affected, because their status has changed from 'discovered' to something else.

tags: added: area-astute area-python
Revision history for this message
Aleksey Zvyagintsev (azvyagintsev) wrote :

Nope, they can:
Node discovered => provisioned (not deployed) => nailgun-agent has run, but mcollective is still starting up.

Revision history for this message
slava valyavskiy (slava-val-al) wrote :

From my point of view, this piece of logic should be moved into the nailgun agent code.
Btw, we already have mcollective-related code in the nailgun agent.

Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

This bug cannot be tech-debt and critical. It must be one or the other.

tags: added: area-ruby
removed: area-astute area-python
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

@Slava, can you propose a solution in the nailgun?

Revision history for this message
slava valyavskiy (slava-val-al) wrote :

@Maksim, check the status of the mcollective service more correctly than https://github.com/openstack/fuel-nailgun-agent/blob/master/agent#L120 does.
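
For illustration, a slightly more careful check could tie the pid file to a live process instead of trusting a process listing alone (the helper name and the pid-file path below are assumptions, not the agent's current code):

# Illustrative helper: trust the pid file only if it points at a
# live mcollectived process.
def mco_running?
  pid = File.read('/var/run/mcollectived.pid').to_i
  Process.kill(0, pid)  # raises Errno::ESRCH if the pid is gone
  `ps -p #{pid} -o args=`.include?('mcollectived')  # guard against pid reuse
rescue Errno::ENOENT, Errno::ESRCH
  false
end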

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

This bug is the root cause of randomly failing BVT tests; it is marked as Critical for MOS 8.0.

no longer affects: fuel/mitaka
Changed in fuel:
milestone: 8.0 → 9.0
Revision history for this message
Nastya Urlapova (aurlapova) wrote :

@Timur, could you attach the BVT job example which you mentioned?

Revision history for this message
Aleksey Zvyagintsev (azvyagintsev) wrote :
Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

I'm against fixing it in the week before HCF. Let's not backport it.

tags: added: area-python
removed: area-ruby
Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

And I don't think that creating one more workaround on top of another workaround is a good idea. The workflow needs to be redesigned. The mcollective config is updated by at least three tools; obviously there is room for improvement here.

Revision history for this message
Aleksey Zvyagintsev (azvyagintsev) wrote :

@dpyzhov:
It's a critical issue, and it should be fixed in 8.0 as well - otherwise we will always be catching random problems in tests.

Revision history for this message
Michael Polenchuk (mpolenchuk) wrote :

Use a check of the following kind in nailgun-agent:

def get_mco_status
  # Upstart's 'status mcollective' output ends with the pid; signal 0
  # only probes that pid, and Process.kill returns the number of
  # signalled processes (1) when the daemon is alive.
  Process.kill 0, %x(status mcollective).split.last.to_i
rescue
  0
end

# returns 1 if the process is found and active, otherwise 0

Revision history for this message
Dmitry Bilunov (dbilunov) wrote :

We can check whether mcollective is available, without any additional tools or patches, just by inserting a one-line condition into nailgun-agent.cron:

# timeout 5 ruby -rmcollective -e 'include MCollective::RPC; rpcclient("version", :options => { :config => "/etc/mcollective/server.cfg" }).get_version'
# echo $?
0
(now with an invalid configuration set in /etc/mcollective/server.cfg)
# timeout 5 ruby -rmcollective -e 'include MCollective::RPC; rpcclient("version", :options => { :config => "/etc/mcollective/server.cfg" }).get_version'
Could not create RPC client: Could not connect to RabbitMQ Server: SIGTERM
# echo $?
124
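
For reference: exit status 0 above means the RPC handshake with the broker succeeded, while 124 is the exit code timeout(1) uses when it has to kill the command. A guard in nailgun-agent.cron could therefore simply require a zero exit status from this probe before starting the agent.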

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

@Nastya, it looks like it was my mistake; the BVT on #484 failed because of another issue: https://bugs.launchpad.net/fuel/+bug/1533082

Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Guys, the nailgun agent creates a record in nailgun, obtains the node's id, then updates the mcollective config and restarts mcollective. Your extra check will break this workflow. Please update the tests instead: add a wait until 'mco ping' returns all nodes.
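
For what it's worth, a sketch of that kind of wait (Ruby for consistency with the other snippets in this thread, although fuel-qa itself is Python; every name below is illustrative):

# Illustrative only: poll 'mco ping' until every expected node answers.
def wait_for_mco(expected, timeout = 300, interval = 10)
  deadline = Time.now + timeout
  loop do
    # 'mco ping' prints one "<identity>   time=NN.NN ms" line per node
    seen = `mco ping 2>/dev/null`.scan(/^(\S+)\s+time=/).flatten
    missing = expected - seen
    return if missing.empty?
    raise "nodes #{missing.inspect} still not available via mcollective" if Time.now > deadline
    sleep interval
  end
end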

Revision history for this message
slava valyavskiy (slava-val-al) wrote :

@Dmitry, and what about users? Should we add this ....issue to the release notes?

Revision history for this message
Aleksey Zvyagintsev (azvyagintsev) wrote :

@Dmitry, a fix in fuel-qa doesn't fix the deployment process
(https://bugs.launchpad.net/fuel/+bug/1538055/comments/4),
which I assume is actually the root cause of other problems (like the ntpd granular timeout).

Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

@slava, @Aleksey, could you provide a use case that affects users? The only case I see is when a user wants to deploy a cluster automatically as soon as all nodes are available. I guess we can live with a workaround for that case until we redesign our flow.

Changed in fuel:
milestone: 9.0 → 10.0
Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Technical debt; it doesn't affect real users. Removing from the Mitaka release.

Dmitry Pyzhov (dpyzhov)
Changed in fuel:
assignee: Fuel Python (Deprecated) (fuel-python) → Fuel Sustaining (fuel-sustaining-team)