Fuel for OpenStack

nailgun-agent reports status back while node is deleting

Bug #1371225 reported by Tatyana Dubyk on 2014-09-18

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Fuel for OpenStack	Fix Committed	Medium	Nikolay Markov	Fuel for OpenStack 6.1

Bug Description

On Ubuntu in simple mode on a vCenter machine, when an OpenStack deployment has been stopped on one of the nodes, that node is bootstrapped but provisioning does not start.

==============vcenter settings===========================
export VCENTER_IP='172.16.0.254'
export <email address hidden>'
export VCENTER_PASSWORD='Qwer!1234'
export VCENTER_CLUSTERS='Cluster1,Cluster2'
=====================================================
Configuration:
===================================================
Steps to reproduce:
1. Set up a lab on the vCenter machine from the 5.1-11 (RC5) ISO.
2. Create an environment:
   OS: Ubuntu (simple mode)
   Nodes with roles: 1st - controller,
                     2nd - cinder (vmdk)
3. Start the deployment.
4. Check that OpenStack has been installed on the node with the controller role
   and that its status is 'ready'.
6. While the OpenStack deployment is still running on the node with the cinder role,
   stop the deployment by clicking the button in the Fuel UI.
7. Wait until the node with the cinder role is in 'offline' status.
8. Then wait until the node with the cinder role is in 'pending addition' status.
9. Start the re-deployment on this node.
10. Check that the node has been bootstrapped but provisioning has not started:
    the Fuel UI shows that nothing is happening on this node,
    but on the master node the 'fuel nodes list' command reports that provisioning has started on it.
11. The node hangs in this state.

Expected result: the OpenStack deployment finishes successfully on every node.
Actual result: the OpenStack deployment on the node with the cinder role hangs after bootstrapping.
--------------------------Logs------------------------------------
---------------------fuel-version---------------------------------
[root@nailgun ~]# fuel nodes list
id | status       | name             | cluster | ip            | mac               | roles      | pending_roles | online
---|--------------|------------------|---------|---------------|-------------------|------------|---------------|-------
6  | ready        | Untitled (c5:cf) | 4       | 10.108.10.3   | 64:8f:b2:8e:c5:cf | controller |               | True
7  | discover     | Untitled (49:62) | None    | 10.108.10.130 | 64:ed:3f:72:49:62 |            |               | True
8  | discover     | Untitled (08:a4) | None    | 10.108.10.227 | 64:38:d7:88:08:a4 |            |               | True
10 | discover     | Untitled (87:a4) | None    | 10.108.10.191 | 64:ba:f4:e0:87:a4 |            |               | True
9  | provisioning | Untitled (b7:e2) | 4       | 10.108.10.197 | 64:47:93:c5:b7:e2 | cinder     |               | True

The reason:

This happened because node-9 sent its status to Nailgun after MCagent had started rebooting the node but before the actual reboot occurred. nailgun-agent checks for the '/var/run/nodiscover' file and then sleeps for a rand(30) interval. In this case, Astute created the /var/run/nodiscover file after nailgun-agent had checked for it but before the sleep finished.

How to fix:
We should just move this sleep so that it happens before the /var/run/nodiscover check.
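
The ordering problem can be replayed deterministically. The following is a minimal Python sketch, not the actual nailgun-agent code: the function names, the fake clock and the timings (flag created at 5 s, sleep of 15 s) are all made up for illustration. It only models the two orderings described above.

```python
class FakeClock:
    """Simulated time, so the race can be replayed deterministically."""

    def __init__(self):
        self.now = 0.0

    def sleep(self, seconds):
        self.now += seconds


def report_check_then_sleep(flag_exists, sleep, delay):
    """Order described in the bug: check the flag, sleep, then report."""
    if flag_exists():
        return False
    sleep(delay)
    return True  # reports even if the flag appeared during the sleep


def report_sleep_then_check(flag_exists, sleep, delay):
    """Proposed order: sleep first, check the flag right before reporting."""
    sleep(delay)
    return not flag_exists()


def simulate(agent, flag_created_at, delay):
    """Run one agent cycle; the flag 'appears' at flag_created_at seconds."""
    clock = FakeClock()
    return agent(lambda: clock.now >= flag_created_at, clock.sleep, delay)


# Hypothetical timings: the flag is created 5 s into a 15 s sleep.
print(simulate(report_check_then_sleep, flag_created_at=5, delay=15))  # True
print(simulate(report_sleep_then_check, flag_created_at=5, delay=15))  # False
```

With the sleep moved first, the check and the report are adjacent, so the window in which the flag can appear unnoticed shrinks from the whole sleep interval to the time between the check and the report.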

Tags:

Tatyana Dubyk (tdubyk) wrote on 2014-09-18:

Diagnostic snapshot (10.9 MiB, application/x-tar)

description:

updated

Bogdan Dobrelya (bogdando) on 2014-09-18

Changed in fuel:
assignee:	nobody → Fuel Python Team (fuel-python)

Ihor Kalnytskyi (ikalnytskyi) on 2014-09-19

Changed in fuel:
milestone:	none → 6.0
importance:	Undecided → Medium

Irina Povolotskaya (ipovolotskaya) wrote on 2014-09-19:

Should it be put into Release notes?

Evgeniya Shumakher (eshumakher) wrote on 2014-09-19:

Irina, no, it's a Medium bug.

Dmitry Pyzhov (dpyzhov) on 2014-09-23

Changed in fuel:
assignee:	Fuel Python Team (fuel-python) → Andrey Danin (gcon-monolake)

Evgeniya Shumakher (eshumakher) on 2014-09-23

Changed in fuel:
assignee:	Andrey Danin (gcon-monolake) → Fuel Partner Integration Team (fuel-partner)

Andrey Danin (gcon-monolake) wrote on 2014-10-23:

For some reason Cobbler didn't reboot node-9, but Astute then started provisioning anyway.

Saving system node-9
2014-09-18T16:42:15 debug: [442] Cobbler syncing
2014-09-18T16:42:17 debug: [442] Trying to reboot node: node-9
2014-09-18T16:42:18 debug: [442] Cobbler syncing
2014-09-18T16:42:19 debug: [442] Waiting for reboot to be complete: nodes: ["node-9"]
2014-09-18T16:42:19 debug: [442] Reboot task status: node: node-9 status: [1411054938.23411, "Power management (reboot)", "running", []]
2014-09-18T16:43:04 debug: [442] Reboot task status: node: node-9 status: [1411054938.23411, "Power management (reboot)", "complete", []]
2014-09-18T16:43:04 debug: [442] Successfully rebooted: node-9
2014-09-18T16:43:09 debug: [442] Run shell command ' if [ -r /etc/nailgun_systemtype ]; then
          NODE_TYPE=$(cat /etc/nailgun_systemtype)
        else
          NODE_TYPE="provisioning"
        fi

        # Check what was mounted to '/': drive (provisioned node)
        # or init ramdisk (bootsrapped/provisioning node)
        if grep -Eq 'root=[^[:blank:]]+' /proc/cmdline; then
          echo "Run node rebooting command using 'SB' to sysrq-trigger"
          echo "1" > /proc/sys/kernel/panic_on_oops
          echo "10" > /proc/sys/kernel/panic
          echo "b" > /proc/sysrq-trigger
        else
          echo "Do not reboot $NODE_TYPE node using shell"
        fi
' using ssh
2014-09-18T16:43:09 debug: [442] Run shell command using ssh. Retry 0
2014-09-18T16:43:09 debug: [442] Affected nodes: ["10.108.10.4"]
2014-09-18T16:43:12 debug: [442] Retry result: success nodes: [], error nodes: [], inaccessible nodes: ["10.108.10.4"]
2014-09-18T16:43:42 warning: [442] 42f1294c-a31b-4a0c-aa38-842e344c3c5f: Running shell command on nodes ["9"] finished with errors. Nodes [{"uid"=>"9"}] are inaccessible
2014-09-18T16:43:42 info: [442] 42f1294c-a31b-4a0c-aa38-842e344c3c5f: Finished running shell command: ["9"]
2014-09-18T16:43:42 info: [442] Starting OS provisioning for nodes: 9
2014-09-18T16:43:52 debug: [442] 42f1294c-a31b-4a0c-aa38-842e344c3c5f: MC agent 'systemtype', method 'get_type', results: {:sender=>"9", :statuscode=>0, :statusmsg=>"OK", :data=>{:node_type=>"bootstrap\n"}}
2014-09-18T16:43:52 debug: [442] Got node types: uid=9 type=bootstrap
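
The shell command in the log above reboots a node via sysrq only when the kernel command line contains a root= argument (a node booted from disk), and skips nodes running from the init ramdisk. A rough Python equivalent of just that test is sketched below. The function name and the sample command lines are hypothetical, and the character class is a hand translation of grep's [^[:blank:]] (anything except space and tab).

```python
import re

# Same pattern as the grep in the log: "root=" followed by at least one
# character that is neither a space nor a tab.
ROOT_ARG = re.compile(r"root=[^ \t]+")


def booted_from_disk(cmdline: str) -> bool:
    """Return True if the kernel command line names a root device."""
    return ROOT_ARG.search(cmdline) is not None


# Made-up command lines, for illustration only.
print(booted_from_disk("BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro"))  # True
print(booted_from_disk("BOOT_IMAGE=/vmlinuz console=tty0"))       # False
```

On a real node the input would be the contents of /proc/cmdline.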

Andrey Danin (gcon-monolake) wrote on 2014-10-23:

Node-9 (MAC 64:47:93:C5:B7:E2) had the IP address 10.108.10.4 before it was rebooted into bootstrap. When it came back as a bootstrap node, Nailgun still had the old IP, although node-9's address had become 10.108.10.197. So the wrong IP was used to register the node in Cobbler, which is why Cobbler couldn't reboot it via ssh. One question still bothers me: why did node-9 come online without its IP address being updated in the DB?
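
A mismatch like this can be spotted by comparing the stored IP with the currently observed one, keyed by MAC address. The sketch below is purely illustrative: find_stale_ips and both dictionaries are hypothetical and do not correspond to any Nailgun or Cobbler API. Only the MAC and the two IP addresses come from this bug.

```python
def find_stale_ips(db_ips, observed_ips):
    """Return {mac: (db_ip, observed_ip)} for nodes whose stored IP is outdated.

    Both arguments map MAC address -> IP; MACs are compared case-insensitively.
    """
    observed = {mac.lower(): ip for mac, ip in observed_ips.items()}
    stale = {}
    for mac, db_ip in db_ips.items():
        seen_ip = observed.get(mac.lower())
        if seen_ip is not None and seen_ip != db_ip:
            stale[mac.lower()] = (db_ip, seen_ip)
    return stale


db_ips = {"64:47:93:C5:B7:E2": "10.108.10.4"}          # what the DB still holds
observed_ips = {"64:47:93:c5:b7:e2": "10.108.10.197"}  # what the node uses now
print(find_stale_ips(db_ips, observed_ips))
```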

Changed in fuel:
status:	New → Confirmed
assignee:	Fuel Partner Integration Team (fuel-partner) → nobody

Andrey Danin (gcon-monolake) wrote on 2014-10-23:

Okay, I finally found the issue.
node-9 sent its status to Nailgun after MCagent had started rebooting it but before the actual reboot occurred. This happens because nailgun-agent checks for the '/var/run/nodiscover' file and then sleeps for a rand(30) interval. In this case, Astute created the /var/run/nodiscover file after nailgun-agent had checked for it but before the sleep finished. We should just move this sleep so that it happens before the /var/run/nodiscover check.

Changed in fuel:
status:	Confirmed → Triaged
status:	Triaged → Confirmed
summary:	- On Ubuntu in simple mode on vcenter's machine, when deploy of openstack
	- has already been stopped on one of nodes then this node is bootstrapped,
	- but provisioning is not started.
	+ nailgun-agent reports status back while node is deleting

Matthew Mosesohn (raytrac3r) on 2014-10-28

Changed in fuel:
assignee:	nobody → Fuel Python Team (fuel-python)

Andrey Danin (gcon-monolake) on 2014-10-29

tags:

removed: vcenter

Roman Prykhodchenko (romcheg) on 2014-11-19

Changed in fuel:
milestone:	6.0 → 6.1

Andrey Danin (gcon-monolake) wrote on 2014-12-17:

Won't fix for 6.0 because we are past the Soft Code Freeze and the bug priority is Medium.

Andrey Danin (gcon-monolake) wrote on 2014-12-17:

Moved to 6.1.

description:

updated

Andrey Danin (gcon-monolake) on 2014-12-17

Changed in fuel:
status:	Confirmed → Won't Fix
no longer affects:	fuel/6.0.x

Dmitry Pyzhov (dpyzhov) on 2015-02-19

no longer affects:	fuel/6.1.x
Changed in fuel:
status:	Won't Fix → Confirmed
milestone:	6.0 → 6.1

Dmitry Pyzhov (dpyzhov) on 2015-03-27

tags:

added: module-nailgun-agent

Nikolay Markov (nmarkov) on 2015-04-02

Changed in fuel:
assignee:	Fuel Python Team (fuel-python) → Nikolay Markov (nmarkov)

OpenStack Infra (hudson-openstack) wrote on 2015-04-03: Related fix proposed to fuel-web (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/170514

Nikolay Markov (nmarkov) on 2015-04-03

Changed in fuel:
status:	Confirmed → In Progress

OpenStack Infra (hudson-openstack) wrote on 2015-04-08: Related fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/170514
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=6e3ca9592f6638d5bc8473ee7110efbdb1babe81
Submitter: Jenkins
Branch: master

commit 6e3ca9592f6638d5bc8473ee7110efbdb1babe81
Author: Nikolay Markov <email address hidden>
Date: Fri Apr 3 17:18:08 2015 +0300

    Moved sleep in Nailgun agent

    Node should not wait after checking '/var/run/nodiscover' file
    when reboot is started, it should give Astute time to react
    (and to create this file) and only then check it and reboot.

    Change-Id: I2c6464262e100244bc525f8e8b456eb26d35e3a1
    Related-Bug: #1371225

Nikolay Markov (nmarkov) on 2015-04-08