Wrong node-discover accessibility check logic

Bug #1538055 reported by Aleksey Zvyagintsev
This bug affects 5 people
Affects               Status      Importance  Assigned to               Milestone
Fuel for OpenStack    Confirmed   High        Fuel Sustaining
  8.0.x               Won't Fix   High        Fuel Python (Deprecated)
  Mitaka              Won't Fix   High        Fuel Python (Deprecated)

Bug Description

Currently there is no way to check the mcollective status before reporting 'Node discovered and on-line' to the Fuel master.

This raises errors in the following flow:

1) Bootstrap node is started
2) Nailgun-agent reports that the node is discovered [1]
3) Fuel-qa/user adds the nodes to a cluster and presses 'Deploy'
4) Mcollective is still not running [0] for some reason (long start-up, etc.)
As a result, an error is raised:
[Method verify_networks. Network verification not avaliable because nodes ["1", "2", "3", "4", "5"] not avaliable via mcollective. ]

I propose to fix the issue in the simplest way:
add a lock/check for the mcollective status to nailgun-agent.cron before running nailgun-agent (see the sketch after the reference links below).

Another way: improve the agent to check mcollective itself
(this solution is more complex and may not be necessary - mcollective may be deprecated in a future Fuel release).

[0]
https://github.com/openstack/fuel-agent/blob/master/contrib/fuel_bootstrap/files/trusty/usr/bin/fix-configs-on-startup#L65

[1]
https://github.com/openstack/fuel-nailgun-agent/blob/master/nailgun-agent.cron
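
A minimal sketch of what such a cron guard could look like, assuming the Upstart 'status mcollective' command available on the trusty bootstrap image (the script itself, its exit-code convention, and all paths are illustrative, not existing code):

#!/usr/bin/env ruby
# Hypothetical guard for nailgun-agent.cron: exit 0 when the mcollective
# daemon is up, non-zero otherwise, so the cron entry can skip the agent
# run until the next tick.
out = `status mcollective 2>/dev/null`   # Upstart prints "... process <pid>"
pid = out[/process (\d+)/, 1]
exit 1 if pid.nil?
begin
  Process.kill(0, pid.to_i)              # signal 0 only probes the pid
rescue Errno::ESRCH, Errno::EPERM
  exit 1
end
exit 0

nailgun-agent.cron would then start the agent only when the guard exits 0, e.g. "/usr/bin/mco_guard && agent" (both paths here are hypothetical).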

tags: added: bootstrap module-nailgun-agent tech-debt
Changed in fuel:
status: New → Confirmed
milestone: none → 8.0
importance: Undecided → High
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Agree with additional checks in nailgun-agent.cron.

Revision history for this message
Aleksey Zvyagintsev (azvyagintsev) wrote :

Also, the problem should be investigated for already deployed systems - it looks like they are also affected.

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

They can't be affected, because their status has changed from 'discovered' to something else.

tags: added: area-astute area-python
Revision history for this message
Aleksey Zvyagintsev (azvyagintsev) wrote :

Nope, they can:
Node discovered => provisioned (not deployed) => nailgun-agent has run, but mcollective is still starting up.

Revision history for this message
slava valyavskiy (slava-val-al) wrote :

From my point of view, this piece of logic should be moved into the nailgun agent code.
Btw, we already have mcollective-related code in the nailgun agent.

Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

This bug cannot be tech-debt and critical. It must be one or the other.

tags: added: area-ruby
removed: area-astute area-python
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

@Slava, can you propose a solution in the nailgun?

Revision history for this message
slava valyavskiy (slava-val-al) wrote :

@Maksim, check the status of the mcollective service more correctly than https://github.com/openstack/fuel-nailgun-agent/blob/master/agent#L120 does.
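
For illustration, a slightly more careful check could tie the pid file to a live process instead of trusting a process listing alone (the helper name and the pid-file path below are assumptions, not the agent's current code):

# Illustrative helper: trust the pid file only if it points at a
# live mcollectived process.
def mco_running?
  pid = File.read('/var/run/mcollectived.pid').to_i
  Process.kill(0, pid)  # raises Errno::ESRCH if the pid is gone
  `ps -p #{pid} -o args=`.include?('mcollectived')  # guard against pid reuse
rescue Errno::ENOENT, Errno::ESRCH
  false
end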

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

This bug is the root cause of randomly failing BVT tests; it is marked as Critical for MOS 8.0.

no longer affects: fuel/mitaka
Changed in fuel:
milestone: 8.0 → 9.0
Revision history for this message
Nastya Urlapova (aurlapova) wrote :

@Timur, could you attach the BVT job example which you mentioned?

Revision history for this message
Aleksey Zvyagintsev (azvyagintsev) wrote :
Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

I'm against fixing it in the week before HCF. Let's not backport it.

tags: added: area-python
removed: area-ruby
Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

And I don't think that creating one more workaround on top of another workaround is a good idea. The workflow needs to be redesigned. The mcollective config is updated by at least three tools; obviously there is room for improvement here.

Revision history for this message
Aleksey Zvyagintsev (azvyagintsev) wrote :

@dpyzhov:
It's a critical issue, and it should be fixed in 8.0 as well - otherwise we will always be catching random problems in tests.

Revision history for this message
Michael Polenchuk (mpolenchuk) wrote :

Use a check of the following kind in nailgun-agent:

def get_mco_status
  # Upstart's 'status mcollective' output ends with the pid; signal 0
  # only probes that pid, and Process.kill returns the number of
  # signalled processes (1) when the daemon is alive.
  Process.kill 0, %x(status mcollective).split.last.to_i
rescue
  0
end

# returns 1 if the process is found and active, otherwise 0

Revision history for this message
Dmitry Bilunov (dbilunov) wrote :

We can check whether mcollective is available, without any additional tools or patches, just by inserting a one-line condition into nailgun-agent.cron:

# timeout 5 ruby -rmcollective -e 'include MCollective::RPC; rpcclient("version", :options => { :config => "/etc/mcollective/server.cfg" }).get_version'
# echo $?
0
(now with an invalid configuration set in /etc/mcollective/server.cfg)
# timeout 5 ruby -rmcollective -e 'include MCollective::RPC; rpcclient("version", :options => { :config => "/etc/mcollective/server.cfg" }).get_version'
Could not create RPC client: Could not connect to RabbitMQ Server: SIGTERM
# echo $?
124
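
For reference: exit status 0 above means the RPC handshake with the broker succeeded, while 124 is the exit code timeout(1) uses when it has to kill the command. A guard in nailgun-agent.cron could therefore simply require a zero exit status from this probe before starting the agent.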

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

@Nastya, it looks like it was my mistake; the BVT on #484 failed because of another issue: https://bugs.launchpad.net/fuel/+bug/1533082

Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Guys, the nailgun agent creates a record in nailgun, obtains the node's id, then updates the mcollective config and restarts mcollective. Your extra check will break this workflow. Please update the tests instead: add a wait until 'mco ping' returns all nodes.
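
For what it's worth, a sketch of that kind of wait (Ruby for consistency with the other snippets in this thread, although fuel-qa itself is Python; every name below is illustrative):

# Illustrative only: poll 'mco ping' until every expected node answers.
def wait_for_mco(expected, timeout = 300, interval = 10)
  deadline = Time.now + timeout
  loop do
    # 'mco ping' prints one "<identity>   time=NN.NN ms" line per node
    seen = `mco ping 2>/dev/null`.scan(/^(\S+)\s+time=/).flatten
    missing = expected - seen
    return if missing.empty?
    raise "nodes #{missing.inspect} still not available via mcollective" if Time.now > deadline
    sleep interval
  end
end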

Revision history for this message
slava valyavskiy (slava-val-al) wrote :

@Dmitry, and what about users? Should we add this ....issue to the release notes?

Revision history for this message
Aleksey Zvyagintsev (azvyagintsev) wrote :

@Dmitry, a fix in fuel-qa doesn't fix the deployment process
(https://bugs.launchpad.net/fuel/+bug/1538055/comments/4),
which I assume is actually the root cause of other problems (like the ntpd granular timeout).

Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

@slava, @Aleksey, could you provide a use case that affects users? The only case I see is when a user wants to deploy a cluster automatically as soon as all nodes are available. I guess we can live with a workaround for that case until we redesign our flow.

Changed in fuel:
milestone: 9.0 → 10.0
Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Technical debt; it doesn't affect real users. Removing from the Mitaka release.

Dmitry Pyzhov (dpyzhov)
Changed in fuel:
assignee: Fuel Python (Deprecated) (fuel-python) → Fuel Sustaining (fuel-sustaining-team)