nailgun-agent reports status back while node is deleting

Bug #1371225 reported by Tatyana Dubyk
This bug affects 1 person
Affects: Fuel for OpenStack
Status: Fix Committed
Importance: Medium
Assigned to: Nikolay Markov

Bug Description

On Ubuntu in simple mode on a vCenter machine, after the OpenStack deployment has been stopped on one of the nodes, that node is bootstrapped but provisioning does not start.

==============vcenter settings===========================
export VCENTER_IP='172.16.0.254'
export <email address hidden>'
export VCENTER_PASSWORD='Qwer!1234'
export VCENTER_CLUSTERS='Cluster1,Cluster2'
=====================================================
Configuration:
===================================================
Steps to reproduce:
1. Set up a lab on the vCenter machine from the 5.1-11 (RC5) ISO.
2. Create an environment for deployment:
   OS: Ubuntu (simple mode)
   Nodes with roles: 1st - controller,
                     2nd - cinder (vmdk)
3. Start the deployment.
4. Check that OpenStack has been installed on the node with the controller role
   and that its status is 'ready'.
6. While the deployment is still in progress on the node with the cinder role,
   stop it by clicking the button in the Fuel UI.
7. Wait until the node with the cinder role is in 'offline' status.
8. Then wait until the node with the cinder role is in 'pending addition' status.
9. Start the re-deployment on this node.
10. Check that the node has been bootstrapped but provisioning has not started:
    the Fuel UI shows that nothing has happened on this node,
    while the 'fuel nodes list' command on the master node reports that provisioning has started on it.
11. The node hangs.

Expected result: The OpenStack deployment finishes successfully on every node.
Actual result: The OpenStack deployment hangs on the node with the cinder role after bootstrapping.
--------------------------Logs------------------------------------
---------------------fuel-version---------------------------------
[root@nailgun ~]# fuel nodes list
id | status | name | cluster | ip | mac | roles | pending_roles | online
 ---|--------------|------------------|---------|---------------|-------------------|------------|---------------|-------
6 | ready | Untitled (c5:cf) | 4 | 10.108.10.3 | 64:8f:b2:8e:c5:cf | controller | | True
7 | discover | Untitled (49:62) | None | 10.108.10.130 | 64:ed:3f:72:49:62 | | | True
8 | discover | Untitled (08:a4) | None | 10.108.10.227 | 64:38:d7:88:08:a4 | | | True
10 | discover | Untitled (87:a4) | None | 10.108.10.191 | 64:ba:f4:e0:87:a4 | | | True
9 | provisioning | Untitled (b7:e2) | 4 | 10.108.10.197 | 64:47:93:c5:b7:e2 | cinder | | True

The reason:

Node-9 sent its status to Nailgun after MCagent had started rebooting the node but before the actual reboot occurred. This happened because nailgun-agent checks for the '/var/run/nodiscover' file and then sleeps for a rand(30) interval. In this case, Astute created the /var/run/nodiscover file after nailgun-agent had checked for it but before the sleep finished.

How to fix:
Move this sleep before the check for the /var/run/nodiscover file.
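
The ordering problem can be illustrated with a small timeline model. This is only a sketch in Python, not nailgun-agent's actual code: the function name `agent_reports` and its parameters are hypothetical, and it assumes (as the description above implies) that the agent skips reporting when it sees the /var/run/nodiscover file.

```python
def agent_reports(check_first: bool, sleep_s: float, flag_created_at: float) -> bool:
    """Return True if the agent would report its status to Nailgun.

    Time 0 is the start of the agent run. `flag_created_at` is the moment
    (in seconds) at which Astute creates /var/run/nodiscover.
    All names here are hypothetical and exist only for illustration.
    """
    if check_first:
        # Old order: check the flag file, then sleep, then report.
        check_time = 0.0
    else:
        # Fixed order: sleep first, then check the flag file, then report.
        check_time = sleep_s
    flag_seen = flag_created_at <= check_time
    return not flag_seen


# Astute creates the flag 5 s into a 20 s sleep.
print(agent_reports(check_first=True, sleep_s=20, flag_created_at=5))   # True: stale report is sent
print(agent_reports(check_first=False, sleep_s=20, flag_created_at=5))  # False: report is suppressed
```

With the check placed after the sleep, a flag file created at any point during the sleep is noticed. A flag created after the check would still be missed, so the change shrinks the window rather than removing it entirely.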

Tatyana Dubyk (tdubyk) wrote :
description: updated
Changed in fuel:
assignee: nobody → Fuel Python Team (fuel-python)
Changed in fuel:
milestone: none → 6.0
importance: Undecided → Medium
Irina Povolotskaya (ipovolotskaya) wrote :

Should it be put into Release notes?

Evgeniya Shumakher (eshumakher) wrote :

Irina, no, it's a Medium bug.

Dmitry Pyzhov (dpyzhov)
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Andrey Danin (gcon-monolake)
Changed in fuel:
assignee: Andrey Danin (gcon-monolake) → Fuel Partner Integration Team (fuel-partner)
Andrey Danin (gcon-monolake) wrote :

Cobbler didn't reboot node-9 for some reason, but Astute then started provisioning anyway.

Saving system node-9
2014-09-18T16:42:15 debug: [442] Cobbler syncing
2014-09-18T16:42:17 debug: [442] Trying to reboot node: node-9
2014-09-18T16:42:18 debug: [442] Cobbler syncing
2014-09-18T16:42:19 debug: [442] Waiting for reboot to be complete: nodes: ["node-9"]
2014-09-18T16:42:19 debug: [442] Reboot task status: node: node-9 status: [1411054938.23411, "Power management (reboot)", "running", []]
2014-09-18T16:43:04 debug: [442] Reboot task status: node: node-9 status: [1411054938.23411, "Power management (reboot)", "complete", []]
2014-09-18T16:43:04 debug: [442] Successfully rebooted: node-9
2014-09-18T16:43:09 debug: [442] Run shell command ' if [ -r /etc/nailgun_systemtype ]; then
          NODE_TYPE=$(cat /etc/nailgun_systemtype)
        else
          NODE_TYPE="provisioning"
        fi

        # Check what was mounted to '/': drive (provisioned node)
        # or init ramdisk (bootsrapped/provisioning node)
        if grep -Eq 'root=[^[:blank:]]+' /proc/cmdline; then
          echo "Run node rebooting command using 'SB' to sysrq-trigger"
          echo "1" > /proc/sys/kernel/panic_on_oops
          echo "10" > /proc/sys/kernel/panic
          echo "b" > /proc/sysrq-trigger
        else
          echo "Do not reboot $NODE_TYPE node using shell"
        fi
' using ssh
2014-09-18T16:43:09 debug: [442] Run shell command using ssh. Retry 0
2014-09-18T16:43:09 debug: [442] Affected nodes: ["10.108.10.4"]
2014-09-18T16:43:12 debug: [442] Retry result: success nodes: [], error nodes: [], inaccessible nodes: ["10.108.10.4"]
2014-09-18T16:43:42 warning: [442] 42f1294c-a31b-4a0c-aa38-842e344c3c5f: Running shell command on nodes ["9"] finished with errors. Nodes [{"uid"=>"9"}] are inaccessible
2014-09-18T16:43:42 info: [442] 42f1294c-a31b-4a0c-aa38-842e344c3c5f: Finished running shell command: ["9"]
2014-09-18T16:43:42 info: [442] Starting OS provisioning for nodes: 9
2014-09-18T16:43:52 debug: [442] 42f1294c-a31b-4a0c-aa38-842e344c3c5f: MC agent 'systemtype', method 'get_type', results: {:sender=>"9", :statuscode=>0, :statusmsg=>"OK", :data=>{:node_type=>"bootstrap\n"}}
2014-09-18T16:43:52 debug: [442] Got node types: uid=9 type=bootstrap

Andrey Danin (gcon-monolake) wrote :

Node-9 (MAC 64:47:93:C5:B7:E2) had the IP address 10.108.10.4 before it was rebooted into Bootstrap. After it came back as a Bootstrap node, Nailgun still had the old IP, but node-9 had become 10.108.10.197. So the wrong IP was used to register the node in Cobbler. That's why Cobbler didn't reboot it via ssh. One question still bothers me: why did node-9 come online without its IP address being updated in the DB?
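
A mismatch like this can be spotted by comparing the address used for the ssh reboot with the address listed for the same MAC in `fuel nodes list`. The following Python sketch is only an illustration: the helper names `parse_nodes`, `ip_for_mac` and `is_stale` are hypothetical, and the sample text is an abridged copy of the table from the bug description.

```python
from typing import Dict, List, Optional

# Abridged sample copied from the `fuel nodes list` output in the bug description.
NODES_LIST = """\
id | status | name | cluster | ip | mac | roles | pending_roles | online
 ---|--------------|------------------|---------|---------------|-------------------|------------|---------------|-------
6 | ready | Untitled (c5:cf) | 4 | 10.108.10.3 | 64:8f:b2:8e:c5:cf | controller | | True
9 | provisioning | Untitled (b7:e2) | 4 | 10.108.10.197 | 64:47:93:c5:b7:e2 | cinder | | True
"""


def parse_nodes(text: str) -> List[Dict[str, str]]:
    """Parse the pipe-delimited table into one dict per node (hypothetical helper)."""
    rows = [
        line for line in text.splitlines()
        # Skip blank lines and the separator line made of '-' and '|'.
        if line.strip() and not set(line.strip()) <= set("-|")
    ]
    header = [cell.strip() for cell in rows[0].split("|")]
    return [
        dict(zip(header, [cell.strip() for cell in row.split("|")]))
        for row in rows[1:]
    ]


def ip_for_mac(nodes: List[Dict[str, str]], mac: str) -> Optional[str]:
    """Return the listed IP for a MAC address, compared case-insensitively."""
    for node in nodes:
        if node["mac"].lower() == mac.lower():
            return node["ip"]
    return None


def is_stale(ssh_target: str, nodes: List[Dict[str, str]], mac: str) -> bool:
    """True if the ssh target differs from the IP currently listed for this MAC."""
    current = ip_for_mac(nodes, mac)
    return current is not None and current != ssh_target


nodes = parse_nodes(NODES_LIST)
# 10.108.10.4 is the address from the "Affected nodes" line of the Astute log.
print(is_stale("10.108.10.4", nodes, "64:47:93:C5:B7:E2"))
```

Matching on the MAC rather than the node id keeps the check valid even when the node re-registers with a different address.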

Changed in fuel:
status: New → Confirmed
assignee: Fuel Partner Integration Team (fuel-partner) → nobody
Andrey Danin (gcon-monolake) wrote :

Okay. I have finally found the issue.
Node-9 sent its status to Nailgun after MCagent had started rebooting it but before the actual reboot occurred. This happened because nailgun-agent checks for the '/var/run/nodiscover' file and then sleeps for a rand(30) interval. In this case, Astute created the /var/run/nodiscover file after nailgun-agent had checked for it but before the sleep finished. We should just move this sleep before the check for the /var/run/nodiscover file.

Changed in fuel:
status: Confirmed → Triaged
status: Triaged → Confirmed
summary: - On Ubuntu in simple mode on vcenter's machine, when deploy of openstack
- has already been stopped on one of nodes then this node is bootstrapped,
- but provisioning is not started.
+ nailgun-agent reports status back while node is deleting
Changed in fuel:
assignee: nobody → Fuel Python Team (fuel-python)
tags: removed: vcenter
Changed in fuel:
milestone: 6.0 → 6.1
Andrey Danin (gcon-monolake) wrote :

Won't fix for 6.0 because we are beyond the Soft Code Freeze and the bug priority is Medium.

Andrey Danin (gcon-monolake) wrote :

Moved to 6.1.

description: updated
Changed in fuel:
status: Confirmed → Won't Fix
no longer affects: fuel/6.0.x
Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/6.1.x
Changed in fuel:
status: Won't Fix → Confirmed
milestone: 6.0 → 6.1
Dmitry Pyzhov (dpyzhov)
tags: added: module-nailgun-agent
Nikolay Markov (nmarkov)
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Nikolay Markov (nmarkov)
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-web (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/170514

Nikolay Markov (nmarkov)
Changed in fuel:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/170514
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=6e3ca9592f6638d5bc8473ee7110efbdb1babe81
Submitter: Jenkins
Branch: master

commit 6e3ca9592f6638d5bc8473ee7110efbdb1babe81
Author: Nikolay Markov <email address hidden>
Date: Fri Apr 3 17:18:08 2015 +0300

    Moved sleep in Nailgun agent

    Node should not wait after checking '/var/run/nodiscover' file
    when reboot is started, it should give Astute time to react
    (and to create this file) and only then check it and reboot.

    Change-Id: I2c6464262e100244bc525f8e8b456eb26d35e3a1
    Related-Bug: #1371225

Nikolay Markov (nmarkov)
Changed in fuel:
status: In Progress → Fix Committed