Mcollective fails to start in bootstrap of baremetal servers

Bug #1656150 reported by Luca Cervigni
This bug affects 4 people
Affects              Status        Importance  Assigned to      Milestone
Fuel for OpenStack   In Progress   High        Fuel Sustaining
  (Nominated for Ocata by Oleksiy Molchanov)
Mitaka               Invalid       High        Fuel Sustaining
Newton               In Progress   High        Fuel Sustaining

Bug Description

Fuel 10.0
Mixed environment of VMs and bare metal servers.
After the bare metal servers boot into the bootstrap image, mcollective fails to start, making deployment impossible.
On some servers it is possible to start mcollective manually with "service mcollective start".
On other servers this fails, saying that the service is masked; unmasking fails as well, and the only way to fix it is to reboot several times until the problem resolves itself.

Nailgun does not seem to be receiving the id from mcollective (thanks to Georgy Kibardin, who helped me investigate the bug).

from nailgun logs:

D, [2017-01-12T07:39:26.098120 #2426] DEBUG -- : Found block device: ram15
D, [2017-01-12T07:39:26.098141 #2426] DEBUG -- : Block device info: {:size=>"131072", :removable=>"0"}
D, [2017-01-12T07:39:26.098190 #2426] DEBUG -- : Detailed meta disks: [{:name=>"sda", :model=>"ServeRAID M1015", :size=>298999349248, :disk=>"disk/by-path/pci-0000:04:00.0-scsi-0:2:0:0", :extra=>["disk/by-id/wwn-0x600605b007f4ebf01bb775a21c663bfc", "disk/by-id/scsi-SIBM_ServeRAID_M1015_00fc3b661ca275b71bf0ebf407b00506", "disk/by-id/scsi-3600605b007f4ebf01bb775a21c663bfc"], :removable=>"0", :paths=>nil, :opt_io=>512}]
D, [2017-01-12T07:39:26.510170 #2426] DEBUG -- : Response: status: 200 body: {"id": 13}
I, [2017-01-12T07:39:26.514777 #2426] INFO -- : Identity in mcollective server.cfg has not been found. Setting to '13'
Sent message type=method_call sender=n/a destination=org.freedesktop.systemd1 object=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=GetUnit cookie=1 reply_cookie=0 error=n/a
Got message type=method_return sender=n/a destination=n/a object=n/a interface=n/a member=n/a cookie=1 reply_cookie=1 error=n/a
Sent message type=method_call sender=n/a destination=org.freedesktop.systemd1 object=/org/freedesktop/systemd1/unit/multi_2duser_2etarget interface=org.freedesktop.DBus.Properties member=Get cookie=2 reply_cookie=0 error=n/a
Got message type=method_return sender=n/a destination=n/a object=n/a interface=n/a member=n/a cookie=2 reply_cookie=2 error=n/a
Calling manager for RestartUnit on mcollective.service, replace
Sent message type=method_call sender=n/a destination=org.freedesktop.systemd1 object=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=RestartUnit cookie=1 reply_cookie=0 error=n/a
Got message type=error sender=n/a destination=n/a object=n/a interface=n/a member=n/a cookie=1 reply_cookie=1 error=Unit mcollective.service is masked.
Failed to restart mcollective.service: Unit mcollective.service is masked.

This bug prevents the deployment of an environment on bare metal servers.
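
A quick way to check the masked state by hand on an affected node (a diagnostic sketch, assuming the standard systemd tooling is available in the bootstrap image):

# prints "masked" if the unit is masked
systemctl is-enabled mcollective.service
# a masked unit shows up as a symlink to /dev/null
ls -l /etc/systemd/system/mcollective.service
# manual recovery attempt, as described above
systemctl unmask mcollective.service
service mcollective start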

Changed in fuel:
milestone: none → 10.1
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
milestone: 10.1 → 11.0
importance: Undecided → High
Changed in fuel:
status: New → In Progress
Revision history for this message
Luca Cervigni (cervigni) wrote :

Here is the Fuel snapshot:

https://drive.google.com/file/d/0B67I6XmqaMeRM1BIUkhrdE91c0k/view?usp=sharing

I tried to start mcollective manually; it worked only on 6 servers out of 10.
On the others the service is masked, and the only way to make them work seems to be rebooting and trying to start it manually until the problem is solved.

Revision history for this message
Georgy Kibardin (gkibardin) wrote :

From what I see in the logs, rc.local started but, for some reason, the mcollective unmasking part wasn't run. So it looks like it hung somewhere in the middle. To understand where that could have happened I need two more files which are not part of the snapshot: /etc/ntp.conf and /etc/send2syslog.conf.
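
A rough way to see how far rc.local got, based on the behaviour described in this thread (a sketch; it assumes the config-fixing step writes /etc/send2syslog.conf and rewrites /etc/ntp.conf before the unmask step):

# did the config-fixing part run?
ls -l /etc/send2syslog.conf
grep '^server ' /etc/ntp.conf              # should point at the Fuel master if it ran
# did the unmask step run? "masked" means it never got there
systemctl is-enabled mcollective.service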

Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Georgy Kibardin (gkibardin)
Revision history for this message
Luca Cervigni (cervigni) wrote :

[root@fuel ~]# cat /etc/ntp.conf
# ntp.conf: Managed by puppet.
#
# Enable next tinker options:
# panic - keep ntpd from panicking in the event of a large clock skew
# when a VM guest is suspended and resumed;
# stepout - allow ntpd change offset faster
tinker panic 0 stepout 5

disable monitor

# Permit time synchronization with our time source, but do not
# permit the source to query or modify the service on this system.
restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery
restrict 127.0.0.1
restrict -6 ::1

# Set up servers for ntpd with next options:
# server - IP address or DNS name of upstream NTP server
# iburst - allow send sync packages faster if upstream unavailable
# prefer - select preferrable server
# minpoll - set minimal update frequency
# maxpoll - set maximal update frequency
server 192.65.130.2 iburst minpoll 3 maxpoll 9
server 192.65.130.1 iburst minpoll 3 maxpoll 9

# Driftfile.
driftfile /var/lib/ntp/drift

[root@fuel ~]# cat /etc/send2syslog.conf
cat: /etc/send2syslog.conf: No such file or directory

Revision history for this message
Georgy Kibardin (gkibardin) wrote :

Hm... It looks like it hasn't been executed at all. BTW, what is in /etc/rc.local on failed nodes?
P.S. What ISO are you using - I want to try to reproduce the whole thing.

Revision history for this message
Luca Cervigni (cervigni) wrote : Re: [Bug 1656150] Re: Mcollective fails to start in bootstrap of baremetal servers

[root@fuel ~]# ssh 146.118.52.62
root@bootstrap:~# cat /etc/rc.local
#!/bin/sh -e

# Perform fuel bootstrap configuration
fix-configs-on-startup || true

# Unlocking nailgun-agent for discovering
rm -f /etc/nailgun-agent/nodiscover
-- (that is after a reboot)

I'm using the FUEL 10 ISO downloaded from the fuel-infra website.

I reset the environment, and the problem again appears on only 4 of 5
nodes of a specific hardware model. It reproduced exactly as in the
previous deployment.

On 18/01/17 14:54, Georgy Kibardin wrote:
> Hm... It looks like it hasn't been executed at all. BTW, what is in /etc/rc.local on failed nodes?
> P.S. What ISO are you using - I want to try to reproduce the whole thing.
>

Revision history for this message
Georgy Kibardin (gkibardin) wrote :

Does running /etc/rc.local manually afterwards fix the problem?
It should fix the mcollective config, unmask mcollective.service, and then within ~30 seconds the nailgun agent should successfully start mcollective. /etc/send2syslog.conf should also appear.
To understand where it fails I think it would be useful to run it as bash -x /etc/rc.local.
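
A verification sequence along those lines might look like this (a sketch; the ~30 seconds is only the approximate nailgun-agent interval mentioned above):

bash -x /etc/rc.local                      # trace where the script stops
systemctl is-enabled mcollective.service   # should no longer report "masked"
sleep 30                                   # give the nailgun agent time to act
service mcollective status
ls -l /etc/send2syslog.conf                # should now exist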

Revision history for this message
Luca Cervigni (cervigni) wrote :

This is the output of "bash -x /etc/rc.local"

root@bootstrap:~# bash -x /etc/rc.local
+ fix-configs-on-startup
Failed to retrieve unit: Connection timed out
Failed to execute operation: Connection timed out
+ true
+ rm -f /etc/nailgun-agent/nodiscover

I waited more than 5 minutes but nailgun did not restart mcollective.

root@bootstrap:~# service mcollective status
● mcollective.service - The Marionette Collective
    Loaded: loaded (/lib/systemd/system/mcollective.service; enabled;
vendor preset: enabled)
    Active: inactive (dead)

Jan 18 08:05:35 bootstrap systemd[1]: mcollective.service: Trying to
enqueue job mcollective.service/restart/replace
Jan 20 00:35:43 bootstrap systemd[1]: mcollective.service: Trying to
enqueue job mcollective.service/start/replace
Jan 20 00:38:20 bootstrap systemd[1]: mcollective.service: Failed to
send unit change signal for mcollective.service: Transport endpoint is
not connected
Jan 20 00:39:04 bootstrap systemd[1]: mcollective.service: Failed to
send unit change signal for mcollective.service: Transport endpoint is
not connected

Running /etc/rc.local, though, enabled me to start mcollective manually,
which helps my deployment a great deal for the moment.

root@bootstrap:~# service mcollective start
root@bootstrap:~#

[root@fuel ~]# mco ping
7 time=61.62 ms
10 time=62.61 ms
11 time=63.48 ms
12 time=64.41 ms
13 time=65.31 ms
9 time=66.21 ms
8 time=67.11 ms
6 time=67.91 ms
2 time=68.84 ms
master time=69.70 ms
1 time=74.22 ms

On 18/01/17 18:45, Georgy Kibardin wrote:
> Does running /etc/rc.local manually afterwards fixes the problem?
> It must fix mcollective config, unmask mcollective.service and then in ~30 sec. nailgun agent must successfully start mcollective. Also /etc/send2syslog.conf must appear.
> To understand where it fails I think it would be usefule to run it as bash -x /etc/rc.local.
>

Revision history for this message
Georgy Kibardin (gkibardin) wrote :

Did /etc/send2syslog.conf at least appear?
Anyway, let's figure out what those error messages are:

bash -x /etc/send2syslog.conf

should give us some clue.

Revision history for this message
Luca Cervigni (cervigni) wrote :

It did.

I am in the middle of provisioning, though; I'll paste it anyway:

root@bootstrap:~# cat /etc/send2syslog.conf
{
"watchlist": [
     {"servers": [ {"host": "146.118.52.2"} ],
         "watchfiles": [
             {"tag": "bootstrap/kern.log", "files": ["/var/log/kern.log"]},
             {"tag": "bootstrap/udev", "files": ["/var/log/udev"]},
             {"tag": "bootstrap/dmesg", "files": ["/var/log/dmesg"]},
             {"tag": "bootstrap/secure", "files": ["/var/log/secure"]},
             {"tag": "bootstrap/messages", "files": ["/var/log/messages"]},
             {"tag": "bootstrap/fuel-agent", "files":
["/var/log/fuel-agent.log"]},
             {"tag": "bootstrap/syslog", "files": ["/var/log/syslog"]},
             {"tag": "bootstrap/auth", "files": ["/var/log/auth.log"]},
             {"tag": "bootstrap/mcollective", "log_type": "ruby",
                 "files": ["/var/log/mcollective.log"]},
             {"tag": "bootstrap/agent", "log_type": "ruby",
                 "files": ["/var/log/nailgun-agent.log"]},
             {"tag": "bootstrap/netprobe_sender", "log_type": "netprobe",
                 "files": ["/var/log/netprobe_sender.log"]},
             {"tag": "bootstrap/netprobe_listener", "log_type": "netprobe",
                 "files": ["/var/log/netprobe_listener.log"]}
         ]
     }
]
}
root@bootstrap:~#
root@bootstrap:~# bash -x /etc/send2syslog.conf
+ watchlist: '['
/etc/send2syslog.conf: line 2: watchlist:: command not found
+ '{servers:' '[' '{host:' '146.118.52.2}' '],'
/etc/send2syslog.conf: line 3: {servers:: command not found
+ watchfiles: '['
/etc/send2syslog.conf: line 4: watchfiles:: command not found
+ '{tag:' bootstrap/kern.log, files: '[/var/log/kern.log]},'
/etc/send2syslog.conf: line 5: {tag:: command not found
+ '{tag:' bootstrap/udev, files: '[/var/log/udev]},'
/etc/send2syslog.conf: line 6: {tag:: command not found
+ '{tag:' bootstrap/dmesg, files: '[/var/log/dmesg]},'
/etc/send2syslog.conf: line 7: {tag:: command not found
+ '{tag:' bootstrap/secure, files: '[/var/log/secure]},'
/etc/send2syslog.conf: line 8: {tag:: command not found
+ '{tag:' bootstrap/messages, files: '[/var/log/messages]},'
/etc/send2syslog.conf: line 9: {tag:: command not found
+ '{tag:' bootstrap/fuel-agent, files: '[/var/log/fuel-agent.log]},'
/etc/send2syslog.conf: line 10: {tag:: command not found
+ '{tag:' bootstrap/syslog, files: '[/var/log/syslog]},'
/etc/send2syslog.conf: line 11: {tag:: command not found
+ '{tag:' bootstrap/auth, files: '[/var/log/auth.log]},'
/etc/send2syslog.conf: line 12: {tag:: command not found
+ '{tag:' bootstrap/mcollective, log_type: ruby,
/etc/send2syslog.conf: line 13: {tag:: command not found
+ files: '[/var/log/mcollective.log]},'
/etc/send2syslog.conf: line 14: files:: command not found
+ '{tag:' bootstrap/agent, log_type: ruby,
/etc/send2syslog.conf: line 15: {tag:: command not found
+ files: '[/var/log/nailgun-agent.log]},'
/etc/send2syslog.conf: line 16: files:: command not found
+ '{tag:' bootstrap/netprobe_sender, log_type: netprobe,
/etc/send2syslog.conf: line 17: {tag:: command not found
+ files: '[/var/log/netprobe_sender.log]},'
/etc/send2syslog.conf: line 18: files:: command...


Revision history for this message
Georgy Kibardin (gkibardin) wrote :

Sorry, I meant bash -x fix-configs-on-startup. But before that:
1. Is mcollective unmasked now?
2. What is in /etc/mcollective/server.cfg?
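
Both checks can be done along these lines (a sketch; the identity line is the interesting part of server.cfg, since nailgun sets it to the node id, as in the "Setting to '13'" log line in the description):

systemctl is-enabled mcollective.service       # "masked" or not
grep '^identity' /etc/mcollective/server.cfg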

Revision history for this message
Luca Cervigni (cervigni) wrote :

In /etc/mcollective/server.cfg:

[root@fuel ~]# cat /etc/mcollective/server.cfg
classesfile = /var/lib/puppet/state/classes.txt
collectives = mcollective
connector = rabbitmq
daemonize = 1
direct_addressing = 1
factsource = yaml
identity = master
libdir = /usr/local/libexec/mcollective:/usr/libexec/mcollective
logfile = /var/log/mcollective.log
loglevel = debug
main_collective = mcollective
plugin.actionpolicy.allow_unconfigured = 1
plugin.psk = unset
plugin.rabbitmq.heartbeat_interval = 30
plugin.rabbitmq.max_hbrlck_fails = 0
plugin.rabbitmq.pool.1.host = 146.118.52.2
plugin.rabbitmq.pool.1.password = XozR2WWv4yHltFIYZF7N3Nem
plugin.rabbitmq.pool.1.port = 61613
plugin.rabbitmq.pool.1.user = mcollective
plugin.rabbitmq.pool.size = 1
plugin.rabbitmq.randomize = true
plugin.rabbitmq.vhost = mcollective
plugin.rpcaudit.logfile = /var/log/mcollective-audit.log
plugin.yaml = /etc/mcollective/facts.yaml
rpcaudit = 1
rpcauditprovider = logfile
rpcauthorization = 1
rpcauthprovider = action_policy
securityprovider = psk
ttl = 4294957

The script, in the "problematic boxes" is not "linked" properly and I
have to execute it with the absolute path
/usr/bin/fix-configs-on-startup , while in the boxes in which the
problem was not present, it was enough executing it. A bit weird. Anyway
here the output.

root@bootstrap:~# bash -x /usr/bin/fix-configs-on-startup
+ set -e
++ sed -rn 's/^.*url=http:\/\/(([0-9]{1,3}\.){3}[0-9]{1,3}).*$/\1/ p'
/proc/cmdline
+ masternode_ip=146.118.52.2
++ sed 's/\ /\n/g' /proc/cmdline
++ grep mco_user
++ awk -F= '{print $2}'
+ mco_user=mcollective
++ sed 's/\ /\n/g' /proc/cmdline
++ grep mco_pass
++ awk -F= '{print $2}'
+ mco_pass=XozR2WWv4yHltFIYZF7N3Nem
+ '[' -z mcollective ']'
+ '[' -z XozR2WWv4yHltFIYZF7N3Nem ']'
+ cat
+ /usr/bin/send2syslog.py -i
+ sed -i '/^\s*tinker panic/ d' /etc/ntp.conf
+ sed -i '1 i tinker panic 0' /etc/ntp.conf
+ mkdir -p /var/lib/ntp
+ chown ntp: /var/lib/ntp
+ echo 0
+ chown ntp: /var/lib/ntp/ntp.drift
+ sed -i '/^\s*server\b/ d' /etc/ntp.conf
+ echo 'server 146.118.52.2 burst iburst'
+ service ntp restart
+ sed -i
's/^plugin.rabbitmq.pool.1.host\b.*$/plugin.rabbitmq.pool.1.host =
146.118.52.2/' /etc/mcollective/server.cfg
+ sed -i
's/^plugin.rabbitmq.pool.1.user\b.*$/plugin.rabbitmq.pool.1.user =
mcollective/' /etc/mcollective/server.cfg
+ sed -i
's/^plugin.rabbitmq.pool.1.password\b.*$/plugin.rabbitmq.pool.1.password=
XozR2WWv4yHltFIYZF7N3Nem/' /etc/mcollective/server.cfg
+ systemctl unmask mcollective
Failed to execute operation: Connection timed out

About the mcollective unmasking: for one server only, I needed to do the
procedure two or three times after the reboot because it was not
working... though for three of them it worked on the first try.

On 20/01/17 16:21, Georgy Kibardin wrote:
> sorry, I meant bash -x fix-configs-on-startup. But before that:
> 1. Is mcollective unmasked now?
> 2. What is in cat /etc/mcollective/server.cfg ?
>

Revision history for this message
Georgy Kibardin (gkibardin) wrote :

> The script, in the "problematic boxes" is not "linked" properly and I have to execute it with the absolute path /usr/bin/fix-configs-on-startup
That puzzles me a bit - if /usr/bin is not in PATH, something must be broken badly. I'm figuring out what could have affected that.

> + systemctl unmask mcollective
> Failed to execute operation: Connection timed out
Systemd also seems to not function properly.
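
Two things that might help narrow this down (a sketch; it assumes /etc/rc.local is run by the standard rc-local.service unit on this image):

# rough view of the environment systemd gives the unit / the manager
systemctl show rc-local.service -p Environment
systemctl show-environment
# general systemd / D-Bus health, given the "Connection timed out" errors
systemctl is-system-running
busctl status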

Revision history for this message
Luca Cervigni (cervigni) wrote :

Let me know if you need more logs for this issue; I kept some servers
aside in bootstrap just to investigate this problem.

On 23/01/17 17:58, Georgy Kibardin wrote:
>> The script, in the "problematic boxes" is not "linked" properly and I have to execute it with the absolute path /usr/bin/fix-configs-on-startup
> That puzzles me a bit - if /usr/bin is not in PATH, something must be broken badly. Figuring what could have affected that.
>
>> + systemctl unmask mcollective
>> Failed to execute operation: Connection timed out
> Systemd also seems to not function properly.
>

Revision history for this message
Luca Cervigni (cervigni) wrote :

I actually hit the problem even on nodes from the hardware models that
had always seemed fine.

root@bootstrap:~# bash -x fix-configs-on-startup
+ set -e
++ sed -rn 's/^.*url=http:\/\/(([0-9]{1,3}\.){3}[0-9]{1,3}).*$/\1/ p'
/proc/cmdline
+ masternode_ip=146.118.52.2
++ sed 's/\ /\n/g' /proc/cmdline
++ grep mco_user
++ awk -F= '{print $2}'
+ mco_user=mcollective
++ sed 's/\ /\n/g' /proc/cmdline
++ grep mco_pass
++ awk -F= '{print $2}'
+ mco_pass=XozR2WWv4yHltFIYZF7N3Nem
+ '[' -z mcollective ']'
+ '[' -z XozR2WWv4yHltFIYZF7N3Nem ']'
+ cat
+ /usr/bin/send2syslog.py -i
+ sed -i '/^\s*tinker panic/ d' /etc/ntp.conf
+ sed -i '1 i tinker panic 0' /etc/ntp.conf
+ mkdir -p /var/lib/ntp
+ chown ntp: /var/lib/ntp
+ echo 0
+ chown ntp: /var/lib/ntp/ntp.drift
+ sed -i '/^\s*server\b/ d' /etc/ntp.conf
+ echo 'server 146.118.52.2 burst iburst'
+ service ntp restart
+ sed -i
's/^plugin.rabbitmq.pool.1.host\b.*$/plugin.rabbitmq.pool.1.host =
146.118.52.2/' /etc/mcollective/server.cfg
+ sed -i
's/^plugin.rabbitmq.pool.1.user\b.*$/plugin.rabbitmq.pool.1.user =
mcollective/' /etc/mcollective/server.cfg
+ sed -i
's/^plugin.rabbitmq.pool.1.password\b.*$/plugin.rabbitmq.pool.1.password=
XozR2WWv4yHltFIYZF7N3Nem/' /etc/mcollective/server.cfg
+ systemctl unmask mcollective
Failed to execute operation: Connection timed out

On 23/01/17 17:58, Georgy Kibardin wrote:
>> The script, in the "problematic boxes" is not "linked" properly and I have to execute it with the absolute path /usr/bin/fix-configs-on-startup
> That puzzles me a bit - if /usr/bin is not in PATH, something must be broken badly. Figuring what could have affected that.
>
>> + systemctl unmask mcollective
>> Failed to execute operation: Connection timed out
> Systemd also seems to not function properly.
>

Revision history for this message
Luca Cervigni (cervigni) wrote :

Is there any news on this, or a workaround I can use to make the deployment work?
If I run /etc/rc.local manually it can sometimes fix the problem, but after it has been run the deployment fails on each node where it was run.

Having this issue is very damaging because I never know whether the deployment is failing due to configuration issues or due to this bug, and each time I provision/deploy the servers it takes more than an hour to find out whether it is actually failing.

I can provide more logs or diagnostics if needed. Thanks!

Revision history for this message
Georgy Kibardin (gkibardin) wrote :

Sorry I'm a bit busy - we are releasing 9.2. I believe I'll be back to the issue tomorrow.

Revision history for this message
Luca Cervigni (cervigni) wrote :

Many thanks Georgy

On 30/01/17 12:31, Georgy Kibardin wrote:
> Sorry I'm a bit busy - we are releasing 9.2. I believe I'll be back to
> the issue tomorrow.
>

Revision history for this message
Georgy Kibardin (gkibardin) wrote :

Ok, let's start with the PATH issue first. On the master node edit /usr/share/fuel_bootstrap_cli/files/xenial/etc/rc.local and add

export PATH=$PATH:/bin:/usr/bin

before the fix-configs-on-startup line. Then rebuild the bootstrap:

fuel-bootstrap build --activate

Then reboot the problem nodes and check whether mcollective is started. If not, let's at least check that its config is correct.
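
For clarity, the edited rc.local would look roughly like this (a sketch; the body is the one posted earlier in the thread, plus the suggested export):

#!/bin/sh -e

# make sure the helper scripts are found even if systemd starts
# rc.local without a proper PATH
export PATH=$PATH:/bin:/usr/bin

# Perform fuel bootstrap configuration
fix-configs-on-startup || true

# Unlocking nailgun-agent for discovering
rm -f /etc/nailgun-agent/nodiscover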

Revision history for this message
Luca Cervigni (cervigni) wrote :

So, after rebuilding the bootstrap, the export is correctly in
/etc/rc.local, but mcollective still does not start.

1) in the first box (same hardware):

root@bootstrap:~# service mcollective status
● mcollective.service - The Marionette Collective
    Loaded: loaded (/lib/systemd/system/mcollective.service; enabled;
vendor preset: enabled)
    Active: inactive (dead)

root@bootstrap:~# echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

root@bootstrap:~# service mcollective start
root@bootstrap:~# service mcollective status
● mcollective.service - The Marionette Collective
    Loaded: loaded (/lib/systemd/system/mcollective.service; enabled;
vendor preset: enabled)
    Active: active (running) since Wed 2017-02-01 01:26:48 UTC; 4s ago
   Process: 7521 ExecStart=/usr/sbin/mcollectived
--config=/etc/mcollective/server.cfg --pidfile=/var/run/mcollective.pid
(code=exited, status=0/SUCCESS)
  Main PID: 7527 (ruby)
    CGroup: /system.slice/mcollective.service
            └─7527 ruby /usr/sbin/mcollectived
--config=/etc/mcollective/server.cfg --pidfile=/var/run/mcollective.pid

Feb 01 01:26:48 bootstrap systemd[7521]: mcollective.service: Executing:
/usr/sbin/mcollectived --config=/etc/mcollective/server.cfg
--pidfile=/var/run/mcollective.pid
Feb 01 01:26:48 bootstrap systemd[1]: mcollective.service: Child 7521
belongs to mcollective.service
Feb 01 01:26:48 bootstrap systemd[1]: mcollective.service: Control
process exited, code=exited status=0
Feb 01 01:26:48 bootstrap systemd[1]: mcollective.service: Got final
SIGCHLD for state start.
Feb 01 01:26:48 bootstrap systemd[1]: mcollective.service: Main PID
loaded: 7527
Feb 01 01:26:48 bootstrap systemd[1]: mcollective.service: Supervising
process 7527 which is not our child. We'll most likely not notice when
it exits.
Feb 01 01:26:48 bootstrap systemd[1]: mcollective.service: Changed start
-> running
Feb 01 01:26:48 bootstrap systemd[1]: mcollective.service: Job
mcollective.service/start finished, result=done
Feb 01 01:26:48 bootstrap systemd[1]: Started The Marionette Collective.
Feb 01 01:26:48 bootstrap systemd[1]: mcollective.service: Child 7524
belongs to mcollective.service

-- AND THE SERVICE STARTED CORRECTLY (pingable via mco ping)

2) the second box:

root@bootstrap:~# service mcollective status
● mcollective.service
    Loaded: masked (/dev/null; bad)
    Active: inactive (dead)

Feb 01 01:18:34 bootstrap systemd[1]: mcollective.service: Trying to
enqueue job mcollective.service/restart/replace
Warning: mcollective.service changed on disk. Run 'systemctl
daemon-reload' to reload units.

root@bootstrap:~# echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

root@bootstrap:~# service mcollective start
Failed to start mcollective.service: Unit mcollective.service is masked.

--- but if I run the fix-configs script manually, mcollective can then be started manually.

root@bootstrap:~# fix-configs-on-startup
Removed symlink /etc/systemd/system/mcollective.service.
Failed to execute operation: Connection timed out
root@bootstrap:~# sleep 2m; service mcollective start
root@bootstrap:~# service mcollective status
● mcollective.service - The Marionette Coll...


Revision history for this message
Georgy Kibardin (gkibardin) wrote :

Aha, so it seems that, depending on timings (which differ between VMs and the real hardware), rc.local may be launched without a proper PATH variable set. I suspect that something similar happens on one of our test labs.
Now thinking about the hard part - the systemd "Connection timed out".
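
One way to check the timing hypothesis on both VMs and bare metal would be systemd's own boot-timing tools (a sketch, assuming the bootstrap image ships systemd-analyze and runs /etc/rc.local via rc-local.service):

systemd-analyze blame | head -20                  # slowest units during boot
systemd-analyze critical-chain rc-local.service   # when rc.local actually ran
journalctl -b -u rc-local.service                 # its output for this boot, if any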

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-agent (master)

Fix proposed to branch: master
Review: https://review.openstack.org/430138

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-agent (master)

Reviewed: https://review.openstack.org/430138
Committed: https://git.openstack.org/cgit/openstack/fuel-agent/commit/?id=5c612875a6881b1c93c4dac04759f9e806a3ee1d
Submitter: Jenkins
Branch: master

commit 5c612875a6881b1c93c4dac04759f9e806a3ee1d
Author: Georgy Kibardin <email address hidden>
Date: Tue Feb 7 12:12:25 2017 +0300

    Use explicit path in Xenial

    Nobody promised PATH to be set during rc.local execution by systemd. And
    it happens that this is the case on some systems, i.e. the PATH is not
    set by the time of rc.local execution.

    Change-Id: I4806df3217dd09430ad45b13cda49fab0925cb88
    Partial-Bug: #1656150

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-agent (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/431414

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-agent (stable/newton)

Reviewed: https://review.openstack.org/431414
Committed: https://git.openstack.org/cgit/openstack/fuel-agent/commit/?id=9526f477d81eff85f45e3279bc09778bfe3df930
Submitter: Jenkins
Branch: stable/newton

commit 9526f477d81eff85f45e3279bc09778bfe3df930
Author: Georgy Kibardin <email address hidden>
Date: Tue Feb 7 12:12:25 2017 +0300

    Use explicit path in Xenial

    Nobody promised PATH to be set during rc.local execution by systemd. And
    it happens that this is the case on some systems, i.e. the PATH is not
    set by the time of rc.local execution.

    Change-Id: I4806df3217dd09430ad45b13cda49fab0925cb88
    Partial-Bug: #1656150
    (cherry picked from commit 5c612875a6881b1c93c4dac04759f9e806a3ee1d)

tags: added: in-stable-newton
Revision history for this message
Luca Cervigni (cervigni) wrote :

Georgy,

Excuse my ignorance, but has this fix already been pushed to the nightly builds of the Fuel 10.0 community edition, or do I have to wait for the 10.1 release? And if so, when should that be?

Revision history for this message
Georgy Kibardin (gkibardin) wrote :

Luca,

Of course you may try the community nightly build. The latest one is here: http://seed.fuel-infra.org/fuelweb-iso/fuel-10.0-community-1316-2017-02-15_23-14-00.iso.torrent, and it contains the first fix.
However, we are still in the process of fixing a race condition in the bootstrap build, and the systemd part of your problem is not resolved yet.

Revision history for this message
Burkhard Linke (blinke) wrote :

I'm having a similar problem with fuel-community-10.0.

Some nodes do not start mcollective. The workaround described above (running /etc/rc.local manually) unmasks the mcollective service, but it does not start it.

systemctl status mcollective prints:

# systemctl status mcollective
● mcollective.service - The Marionette Collective
   Loaded: loaded (/lib/systemd/system/mcollective.service; enabled; vendor preset: enabled)
   Active: inactive (dead)

Mar 08 09:12:33 bootstrap systemd[1]: mcollective.service: Failed to send unit change signal for mcollective.service: Transport endpoint is not connected
Mar 08 09:31:26 bootstrap systemd[1]: mcollective.service: Failed to send unit change signal for mcollective.service: Transport endpoint is not connected
Mar 08 09:37:20 bootstrap systemd[1]: mcollective.service: Failed to send unit change signal for mcollective.service: Transport endpoint is not connected

The "Transport endpoint is not connected" message is printed for each every time /etc/rc.local was executed.

Manually starting mcollective via systemctl start works, but some hosts are reported as 'bootstrap':
[root@fuel etc]# mco ping
bootstrap time=34.90 ms
17 time=37.08 ms
master time=39.54 ms
21 time=43.94 ms
25 time=45.40 ms
bootstrap time=46.36 ms
22 time=47.25 ms
23 time=63.95 ms

---- ping statistics ----
8 replies max: 63.95 min: 34.90 avg: 44.80

They are also not recognized during the network connectivity check, which states that nodes 11 and 20 are not available via mcollective.
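
The 'bootstrap' entries presumably mean mcollective came up before the nailgun agent wrote the node id into server.cfg, so it fell back to the hostname (compare the "Setting to '13'" line in the description). A quick check on such a node might be (a sketch):

grep '^identity' /etc/mcollective/server.cfg   # missing or 'bootstrap' -> id was never set
# once the agent has written the id, a restart should make the node answer with its number
service mcollective restart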

Revision history for this message
Burkhard Linke (blinke) wrote :

I also notice a very high CPU load on the hosts after running /etc/rc.local:

top - 11:00:15 up 6 min, 1 user, load average: 9.88, 4.09, 1.62
Tasks: 369 total, 9 running, 354 sleeping, 0 stopped, 6 zombie
%Cpu(s): 0.1 us, 5.7 sy, 0.0 ni, 94.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 13188960+total, 13057571+free, 746472 used, 567416 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 13023952+avail Mem

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
  884 root 20 0 52032 21652 17148 R 100.0 0.0 2:06.06 systemd-journal
 2235 root 20 0 0 0 0 S 25.9 0.0 0:29.72 kworker/1:2
  270 root 20 0 0 0 0 S 24.3 0.0 0:28.07 kworker/17:1
  272 root 20 0 0 0 0 R 23.9 0.0 0:27.08 kworker/4:1
  273 root 20 0 0 0 0 S 23.9 0.0 0:28.75 kworker/5:1
  267 root 20 0 0 0 0 S 23.3 0.0 0:27.92 kworker/3:1
   45 root 20 0 0 0 0 S 22.9 0.0 0:27.52 kworker/7:0
  235 root 20 0 0 0 0 S 22.3 0.0 0:25.20 kworker/0:2
  266 root 20 0 0 0 0 S 22.3 0.0 0:26.04 kworker/2:1
  274 root 20 0 0 0 0 S 22.3 0.0 0:26.84 kworker/21:1
  260 root 20 0 0 0 0 S 21.6 0.0 0:26.14 kworker/6:1
  276 root 20 0 0 0 0 S 21.3 0.0 0:23.96 kworker/20:1
  314 root 20 0 0 0 0 S 21.3 0.0 0:25.15 kworker/19:1
  316 root 20 0 0 0 0 S 20.6 0.0 0:24.92 kworker/23:1 ...


Revision history for this message
Georgy Kibardin (gkibardin) wrote :

Burkhard, could you please try the latest ISO from here https://ci.fuel-infra.org/view/BVT/job/10.0-community.all/1401

Revision history for this message
Burkhard Linke (blinke) wrote :

There's no downloadable ISO on that site, so I used the corresponding git checkout to build the image myself (the second chunk of the yumdownloader patch does not apply, and the isolinux paths need to be adapted for Ubuntu 16.04, but that's a different story...).

- zapped fuel host
- reinstalled fuel host from new iso
- rebooted all discovered nodes
- checked mcollective:
...---- ping statistics ----
19 replies max: 80.50 min: 45.96 avg: 61.81

 18 nodes + master -> ok after first reboot

- rebooted a node
- mcollective is started correctly on node after reboot

Rebooting and starting used to take several minutes with the official ISO. That time is now also reduced significantly.

Revision history for this message
Georgy Kibardin (gkibardin) wrote :
Changed in fuel:
assignee: Georgy Kibardin (gkibardin) → Fuel Sustaining (fuel-sustaining-team)