astute didn't reboot node in the middle of provisioning

Bug #1463035 reported by Leontii Istomin
This bug affects 1 person
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: Matthew Mosesohn
Milestone: 6.1

Bug Description

Node node-2 hasn't come back from provisioning; it is still in bootstrap.
The mcollective configuration is probably wrong on node-2 (10.20.0.9):
[root@bootstrap ~]# cat /etc/mcollective/server.cfg:
http://paste.openstack.org/show/274412/

For example, compare node-1, which was provisioned successfully and reported that to Fuel:
root@node-1:~# cat /etc/mcollective/server.cfg
http://paste.openstack.org/show/274413/

node-1 also has a server.cfg.old file:
root@node-1:~# cat /etc/mcollective/server.cfg.old:
http://paste.openstack.org/show/274414/
This file is missing on the broken node-2.

Configuration:
Baremetal, CentOS, IBP, Ubuntu-vlan, Ceph-all, Nova-debug, nova-quotas, 6.1_521
Controllers: 3, Computes: 3

api: '1.0'
astute_sha: 7766818f079881e2dbeedb34e1f67e517ed7d479
auth_required: true
build_id: 2015-06-08_06-13-27
build_number: '521'
feature_groups:
- mirantis
fuel-library_sha: f43c2ae1af3b493ee0e7810eab7bb7b50c986c7d
fuel-ostf_sha: 7c938648a246e0311d05e2372ff43ef1eb2e2761
fuelmain_sha: bcc909ffc5dd5156ba54cae348b6a07c1b607b24
nailgun_sha: 4340d55c19029394cd5610b0e0f56d6cb8cb661b
openstack_version: 2014.2.2-6.1
production: docker
python-fuelclient_sha: 4fc55db0265bbf39c369df398b9dc7d6469ba13b
release: '6.1'

Diagnostic Snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-06-08_13-15-09.tar.xz

Tags: scale
Revision history for this message
Leontii Istomin (listomin) wrote :

I've restarted mcollective on node-2 around 2015-06-08 13:09. The issue still exists.
mco ping works well:
[root@fuel ~]# mco ping
2 time=36.25 ms
master time=37.18 ms
6 time=39.78 ms
5 time=40.76 ms
4 time=41.94 ms
1 time=43.39 ms
3 time=44.67 ms

---- ping statistics ----
7 replies max: 44.67 min: 36.25 avg: 40.57

Dina Belova (dbelova)
Changed in fuel:
milestone: none → 6.1
status: New → Confirmed
importance: Undecided → High
Changed in fuel:
assignee: nobody → Fuel Astute Team (fuel-astute)
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

I've checked the env. The problem is not in mcollective, because it answered via mco ping and mco rpc. Also, mcollective on the problem node has been restarted, which could have cancelled the provision operation because provisioning runs via mcollective.

I think we should examine the fuel-agent logs and try to find the root cause.

Changed in fuel:
assignee: Fuel Astute Team (fuel-astute) → Fuel provisioning team (fuel-provisioning)
summary: - mcollective configuration hasn't been updated for some reason
+ Node hasn't come back from provisioning status.
Revision history for this message
Alexander Gordeev (a-gordeev) wrote : Re: Node hasn't come back from provisioning status.

Provisioning scripts finished without any flaws. No errors, no traces. Everything was as usual.

But then astute failed to reboot the node.

It recognized the reboot as 'successful', but in fact the node wasn't rebooted.

The node type was still 'image', and it was reported even faster than the types from the other nodes.

Related piece of astute logs attached.

summary: - Node hasn't come back from provisioning status.
+ astute didn't reboot node in the middle of provisioning
Revision history for this message
Alexander Gordeev (a-gordeev) wrote :

Looks like something is wrong in lib/astute/cobbler_manager.rb.

Assigning to the fuel-astute team.

Changed in fuel:
assignee: Fuel provisioning team (fuel-provisioning) → Fuel Astute Team (fuel-astute)
Revision history for this message
Alexander Gordeev (a-gordeev) wrote :

I think for IBP it will be enough to check the node type in order to detect a false-positive reboot.

If the node type is still 'image', astute should make another reboot attempt. After a few unsuccessful retries it should give up.
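
A minimal sketch of that retry logic (not astute's actual code): reboot_via_cobbler and node_type below are hypothetical helpers standing in for astute's cobbler power-management call and its mcollective 'get_type' query.

MAX_REBOOT_RETRIES = 3
REBOOT_WAIT = 60 # seconds to wait before re-checking the node type

def ensure_node_rebooted(node)
  MAX_REBOOT_RETRIES.times do |attempt|
    reboot_via_cobbler(node)   # ask cobbler to power-cycle the node
    sleep(REBOOT_WAIT)         # let the node go down and (hopefully) come back up
    # A node that really rebooted into the provisioned OS no longer reports 'image';
    # if it still does, the reboot was a false positive and we retry.
    return true unless node_type(node) == 'image'
    warn "node #{node}: still 'image' after reboot attempt #{attempt + 1}"
  end
  false # give up after a few unsuccessful retries and report the node as failed
end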

Revision history for this message
Leontii Istomin (listomin) wrote :

The provisioning step completed successfully. I just reset the environment and clicked the "Deploy changes" button. At the moment the deployment step is running.

Revision history for this message
Vladimir Kozhukalov (kozhukalov) wrote :

Something strange happened. It looks like the nodes had wrong IPs. For example, node-2 was supposed to have 10.20.0.3, but there were logs from node-4 with 10.20.0.7 in /var/log/remote/node-2.domain.tld. Astute didn't even try to reboot the node with 10.20.0.9.

Leontii, please try to reproduce this issue and describe the steps to reproduce in more detail. Please also double-check that the slave nodes aren't doing anything unusual while the master node is re-deployed, and that they are rebooted with the new master node installation. If you are able to reproduce, please don't reset the env and ping me immediately so I can take a careful look at the lab.

Changed in fuel:
assignee: Fuel Astute Team (fuel-astute) → Fuel provisioning team (fuel-provisioning)
Revision history for this message
Vladimir Kozhukalov (kozhukalov) wrote :

By the way, taking into account the strange behaviour around IP address assignment, it is likely that the root cause of the issue is dnsmasq's inability to deal with large networks. If so, we need to consider substituting dnsmasq with something more suitable, like isc-dhcp-server (cobbler does support it).
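
For reference, a rough sketch of what the cobbler side of such a switch would look like (illustrative only, not a tested Fuel change): cobbler picks its DHCP backend in /etc/cobbler/modules.conf and only rewrites the DHCP server's config on sync when manage_dhcp is enabled in /etc/cobbler/settings.

# /etc/cobbler/modules.conf -- switch the DHCP backend from dnsmasq to ISC dhcpd
[dhcp]
module = manage_isc

# /etc/cobbler/settings -- let cobbler manage the DHCP server and regenerate its
# config from the dhcp template on every sync
manage_dhcp: 1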

Andrey Maximov (maximov)
Changed in fuel:
assignee: Fuel provisioning team (fuel-provisioning) → Vladimir Sharshov (vsharshov)
Andrey Maximov (maximov)
Changed in fuel:
assignee: Vladimir Sharshov (vsharshov) → Matthew Mosesohn (raytrac3r)
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

We are encountering some issues with DNS availability while deploying large environments, because dnsmasq restarts on cobbler sync when nodes in classic provisioning report deployment complete. That isn't related to this bug, though: the fence_ssh command that cobbler calls is completely independent of any sync operation and of dnsmasq, so changing the DNS/DHCP provider won't affect this bug.

Here's what I can tell about node-2:
It did get fuel-agent to apply the image and reached the end.
Astute did tell Cobbler to reboot the node (and reported success).
Cobbler did run fence_ssh on node-2.
node-2's logs show that the ssh connection happened.
It looks like node-2 started a reboot.
The IP address 10.20.0.3 matches node-2's logs, along with cobbler's and astute's.
Mcollective's config is just fine on node-2; there's nothing wrong there.

I find it really hard to believe these logs are right. node-2 shows mcollective, rsyslog, sshd all shutting down for a poweroff/reboot. I can't see that you ran mco ping or anything after noticing that node-2 was misbehaving.
My best guess is there was an issue with the reboot command after imaging.

Revision history for this message
Alexander Gordeev (a-gordeev) wrote :

@Matthew Mosesohn (raytrac3r):
> My best guess is there was an issue with the reboot command after imaging.

The 'image' node type in /etc/nailgun_systemtype indicates that the reboot never happened.

Per https://review.openstack.org/#/c/160891/, astute places the 'image' type on a node booted into 'bootstrap'. Since changes to the rootfs of a 'bootstrap' node do not survive a reboot, this proves that the node wasn't rebooted.

Also notice that 'get_type' from node-2 took 3 minutes less than for the rest of the nodes.
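
For illustration, a tiny sketch of that check (again, not astute's actual code), assuming the string read from /etc/nailgun_systemtype is available; 'target' is the type a correctly provisioned and rebooted node is expected to report.

# Classify the type returned by the mcollective 'get_type' call, which reads
# /etc/nailgun_systemtype on the node.
def rebooted_into_provisioned_os?(reported_type)
  case reported_type.strip
  when 'target'             then true   # node booted into the freshly installed OS
  when 'image', 'bootstrap' then false  # node is still running the bootstrap ramdisk
  else false
  end
end

rebooted_into_provisioned_os?('image')  # => false, node-2's case: the reboot never happened
rebooted_into_provisioned_os?('target') # => true, what a correctly rebooted node reports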

Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

I believe we can't fix this or get more information without console access to a bootstrapped node that failed to reboot after imaging. Moving to incomplete.

Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
Dina Belova (dbelova) wrote :

We have tried to reproduce this several times with no success. It looks like it was a sporadic issue with no clear cause. Moving to Invalid now.

Changed in fuel:
status: Incomplete → Invalid