Mcollective fails to start in bootstrap of baremetal servers

Bug #1656150 reported by Luca Cervigni
This bug affects 4 people
Affects              Status        Importance  Assigned to      Milestone
Fuel for OpenStack   In Progress   High        Fuel Sustaining
  (Nominated for Ocata by Oleksiy Molchanov)
Mitaka               Invalid       High        Fuel Sustaining
Newton               In Progress   High        Fuel Sustaining

Bug Description

Fuel 10.0
Mixed environment of VMs and bare metal servers.
After the bare metal servers boot into the bootstrap image, mcollective fails to start, making deployment impossible.
On some servers it is possible to start mcollective manually with "service mcollective start".
On other servers this fails, saying that the service is masked; unmasking fails as well, and the only way to fix it is to reboot several times until the problem resolves itself.

Nailgun does not seem to be receiving the id from mcollective (thanks to Georgy Kibardin, who helped me investigate the bug).

from nailgun logs:

D, [2017-01-12T07:39:26.098120 #2426] DEBUG -- : Found block device: ram15
D, [2017-01-12T07:39:26.098141 #2426] DEBUG -- : Block device info: {:size=>"131072", :removable=>"0"}
D, [2017-01-12T07:39:26.098190 #2426] DEBUG -- : Detailed meta disks: [{:name=>"sda", :model=>"ServeRAID M1015", :size=>298999349248, :disk=>"disk/by-path/pci-0000:04:00.0-scsi-0:2:0:0", :extra=>["disk/by-id/wwn-0x600605b007f4ebf01bb775a21c663bfc", "disk/by-id/scsi-SIBM_ServeRAID_M1015_00fc3b661ca275b71bf0ebf407b00506", "disk/by-id/scsi-3600605b007f4ebf01bb775a21c663bfc"], :removable=>"0", :paths=>nil, :opt_io=>512}]
D, [2017-01-12T07:39:26.510170 #2426] DEBUG -- : Response: status: 200 body: {"id": 13}
I, [2017-01-12T07:39:26.514777 #2426] INFO -- : Identity in mcollective server.cfg has not been found. Setting to '13'
Sent message type=method_call sender=n/a destination=org.freedesktop.systemd1 object=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=GetUnit cookie=1 reply_cookie=0 error=n/a
Got message type=method_return sender=n/a destination=n/a object=n/a interface=n/a member=n/a cookie=1 reply_cookie=1 error=n/a
Sent message type=method_call sender=n/a destination=org.freedesktop.systemd1 object=/org/freedesktop/systemd1/unit/multi_2duser_2etarget interface=org.freedesktop.DBus.Properties member=Get cookie=2 reply_cookie=0 error=n/a
Got message type=method_return sender=n/a destination=n/a object=n/a interface=n/a member=n/a cookie=2 reply_cookie=2 error=n/a
Calling manager for RestartUnit on mcollective.service, replace
Sent message type=method_call sender=n/a destination=org.freedesktop.systemd1 object=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=RestartUnit cookie=1 reply_cookie=0 error=n/a
Got message type=error sender=n/a destination=n/a object=n/a interface=n/a member=n/a cookie=1 reply_cookie=1 error=Unit mcollective.service is masked.
Failed to restart mcollective.service: Unit mcollective.service is masked.

This bug prevents the deployment of an environment on bare metal servers.
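
A quick way to check the masked state by hand on an affected node (a diagnostic sketch, assuming the standard systemd tooling is available in the bootstrap image):

# prints "masked" if the unit is masked
systemctl is-enabled mcollective.service
# a masked unit shows up as a symlink to /dev/null
ls -l /etc/systemd/system/mcollective.service
# manual recovery attempt, as described above
systemctl unmask mcollective.service
service mcollective start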

Changed in fuel:
milestone: none → 10.1
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
milestone: 10.1 → 11.0
importance: Undecided → High
Changed in fuel:
status: New → In Progress
Revision history for this message
Luca Cervigni (cervigni) wrote :

Here is the Fuel snapshot:

https://drive.google.com/file/d/0B67I6XmqaMeRM1BIUkhrdE91c0k/view?usp=sharing

I tried to start mcollective manually; it worked only on 6 servers out of 10.
On the others the service is masked, and the only way to make them work seems to be rebooting and trying to start it manually until the problem is solved.

Revision history for this message
Georgy Kibardin (gkibardin) wrote :

From what I see in the logs, rc.local started but, for some reason, the mcollective unmasking part wasn't run. So it looks like it hung somewhere in the middle. To understand where that could have happened I need two more files which are not part of the snapshot: /etc/ntp.conf and /etc/send2syslog.conf.
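
A rough way to see how far rc.local got, based on the behaviour described in this thread (a sketch; it assumes the config-fixing step writes /etc/send2syslog.conf and rewrites /etc/ntp.conf before the unmask step):

# did the config-fixing part run?
ls -l /etc/send2syslog.conf
grep '^server ' /etc/ntp.conf              # should point at the Fuel master if it ran
# did the unmask step run? "masked" means it never got there
systemctl is-enabled mcollective.service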

Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Georgy Kibardin (gkibardin)
Revision history for this message
Luca Cervigni (cervigni) wrote :

[root@fuel ~]# cat /etc/ntp.conf
# ntp.conf: Managed by puppet.
#
# Enable next tinker options:
# panic - keep ntpd from panicking in the event of a large clock skew
# when a VM guest is suspended and resumed;
# stepout - allow ntpd change offset faster
tinker panic 0 stepout 5

disable monitor

# Permit time synchronization with our time source, but do not
# permit the source to query or modify the service on this system.
restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery
restrict 127.0.0.1
restrict -6 ::1

# Set up servers for ntpd with next options:
# server - IP address or DNS name of upstream NTP server
# iburst - allow send sync packages faster if upstream unavailable
# prefer - select preferrable server
# minpoll - set minimal update frequency
# maxpoll - set maximal update frequency
server 192.65.130.2 iburst minpoll 3 maxpoll 9
server 192.65.130.1 iburst minpoll 3 maxpoll 9

# Driftfile.
driftfile /var/lib/ntp/drift

[root@fuel ~]# cat /etc/send2syslog.conf
cat: /etc/send2syslog.conf: No such file or directory

Revision history for this message
Georgy Kibardin (gkibardin) wrote :

Hm... It looks like it hasn't been executed at all. BTW, what is in /etc/rc.local on failed nodes?
P.S. What ISO are you using - I want to try to reproduce the whole thing.

Revision history for this message
Luca Cervigni (cervigni) wrote : Re: [Bug 1656150] Re: Mcollective fails to start in bootstrap of baremetal servers

[root@fuel ~]# ssh 146.118.52.62
root@bootstrap:~# cat /etc/rc.local
#!/bin/sh -e

# Perform fuel bootstrap configuration
fix-configs-on-startup || true

# Unlocking nailgun-agent for discovering
rm -f /etc/nailgun-agent/nodiscover
-- (that is after a reboot)

I'm using the FUEL 10 ISO downloaded from the fuel-infra website.

I reset the environment, and the problem again appears on only 4 of 5
nodes of a specific hardware model. It reproduced exactly as in the
previous deployment.

On 18/01/17 14:54, Georgy Kibardin wrote:
> Hm... It looks like it hasn't been executed at all. BTW, what is in /etc/rc.local on failed nodes?
> P.S. What ISO are you using - I want to try to reproduce the whole thing.
>

Revision history for this message
Georgy Kibardin (gkibardin) wrote :

Does running /etc/rc.local manually afterwards fix the problem?
It should fix the mcollective config, unmask mcollective.service, and then within ~30 seconds the nailgun agent should successfully start mcollective. /etc/send2syslog.conf should also appear.
To understand where it fails I think it would be useful to run it as bash -x /etc/rc.local.
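
A verification sequence along those lines might look like this (a sketch; the ~30 seconds is only the approximate nailgun-agent interval mentioned above):

bash -x /etc/rc.local                      # trace where the script stops
systemctl is-enabled mcollective.service   # should no longer report "masked"
sleep 30                                   # give the nailgun agent time to act
service mcollective status
ls -l /etc/send2syslog.conf                # should now exist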

Revision history for this message
Luca Cervigni (cervigni) wrote :

This is the output of "bash -x /etc/rc.local"

root@bootstrap:~# bash -x /etc/rc.local
+ fix-configs-on-startup
Failed to retrieve unit: Connection timed out
Failed to execute operation: Connection timed out
+ true
+ rm -f /etc/nailgun-agent/nodiscover

I waited more than 5 minutes but nailgun did not restart mcollective.

root@bootstrap:~# service mcollective status
● mcollective.service - The Marionette Collective
    Loaded: loaded (/lib/systemd/system/mcollective.service; enabled;
vendor preset: enabled)
    Active: inactive (dead)

Jan 18 08:05:35 bootstrap systemd[1]: mcollective.service: Trying to
enqueue job mcollective.service/restart/replace
Jan 20 00:35:43 bootstrap systemd[1]: mcollective.service: Trying to
enqueue job mcollective.service/start/replace
Jan 20 00:38:20 bootstrap systemd[1]: mcollective.service: Failed to
send unit change signal for mcollective.service: Transport endpoint is
not connected
Jan 20 00:39:04 bootstrap systemd[1]: mcollective.service: Failed to
send unit change signal for mcollective.service: Transport endpoint is
not connected

Running /etc/rc.local, though, enabled me to start mcollective manually,
which helps my deployment a great deal for the moment.

root@bootstrap:~# service mcollective start
root@bootstrap:~#

[root@fuel ~]# mco ping
7 time=61.62 ms
10 time=62.61 ms
11 time=63.48 ms
12 time=64.41 ms
13 time=65.31 ms
9 time=66.21 ms
8 time=67.11 ms
6 time=67.91 ms
2 time=68.84 ms
master time=69.70 ms
1 time=74.22 ms

On 18/01/17 18:45, Georgy Kibardin wrote:
> Does running /etc/rc.local manually afterwards fixes the problem?
> It must fix mcollective config, unmask mcollective.service and then in ~30 sec. nailgun agent must successfully start mcollective. Also /etc/send2syslog.conf must appear.
> To understand where it fails I think it would be usefule to run it as bash -x /etc/rc.local.
>

Revision history for this message
Georgy Kibardin (gkibardin) wrote :

Did /etc/send2syslog.conf at least appear?
Anyway, let's figure out what those error messages are:

bash -x /etc/send2syslog.conf

should give us some clue.

Revision history for this message
Luca Cervigni (cervigni) wrote :

It did.

I am in the middle of provisioning, though; I'll paste it anyway:

root@bootstrap:~# cat /etc/send2syslog.conf
{
"watchlist": [
     {"servers": [ {"host": "146.118.52.2"} ],
         "watchfiles": [
             {"tag": "bootstrap/kern.log", "files": ["/var/log/kern.log"]},
             {"tag": "bootstrap/udev", "files": ["/var/log/udev"]},
             {"tag": "bootstrap/dmesg", "files": ["/var/log/dmesg"]},
             {"tag": "bootstrap/secure", "files": ["/var/log/secure"]},
             {"tag": "bootstrap/messages", "files": ["/var/log/messages"]},
             {"tag": "bootstrap/fuel-agent", "files":
["/var/log/fuel-agent.log"]},
             {"tag": "bootstrap/syslog", "files": ["/var/log/syslog"]},
             {"tag": "bootstrap/auth", "files": ["/var/log/auth.log"]},
             {"tag": "bootstrap/mcollective", "log_type": "ruby",
                 "files": ["/var/log/mcollective.log"]},
             {"tag": "bootstrap/agent", "log_type": "ruby",
                 "files": ["/var/log/nailgun-agent.log"]},
             {"tag": "bootstrap/netprobe_sender", "log_type": "netprobe",
                 "files": ["/var/log/netprobe_sender.log"]},
             {"tag": "bootstrap/netprobe_listener", "log_type": "netprobe",
                 "files": ["/var/log/netprobe_listener.log"]}
         ]
     }
]
}
root@bootstrap:~#
root@bootstrap:~# bash -x /etc/send2syslog.conf
+ watchlist: '['
/etc/send2syslog.conf: line 2: watchlist:: command not found
+ '{servers:' '[' '{host:' '146.118.52.2}' '],'
/etc/send2syslog.conf: line 3: {servers:: command not found
+ watchfiles: '['
/etc/send2syslog.conf: line 4: watchfiles:: command not found
+ '{tag:' bootstrap/kern.log, files: '[/var/log/kern.log]},'
/etc/send2syslog.conf: line 5: {tag:: command not found
+ '{tag:' bootstrap/udev, files: '[/var/log/udev]},'
/etc/send2syslog.conf: line 6: {tag:: command not found
+ '{tag:' bootstrap/dmesg, files: '[/var/log/dmesg]},'
/etc/send2syslog.conf: line 7: {tag:: command not found
+ '{tag:' bootstrap/secure, files: '[/var/log/secure]},'
/etc/send2syslog.conf: line 8: {tag:: command not found
+ '{tag:' bootstrap/messages, files: '[/var/log/messages]},'
/etc/send2syslog.conf: line 9: {tag:: command not found
+ '{tag:' bootstrap/fuel-agent, files: '[/var/log/fuel-agent.log]},'
/etc/send2syslog.conf: line 10: {tag:: command not found
+ '{tag:' bootstrap/syslog, files: '[/var/log/syslog]},'
/etc/send2syslog.conf: line 11: {tag:: command not found
+ '{tag:' bootstrap/auth, files: '[/var/log/auth.log]},'
/etc/send2syslog.conf: line 12: {tag:: command not found
+ '{tag:' bootstrap/mcollective, log_type: ruby,
/etc/send2syslog.conf: line 13: {tag:: command not found
+ files: '[/var/log/mcollective.log]},'
/etc/send2syslog.conf: line 14: files:: command not found
+ '{tag:' bootstrap/agent, log_type: ruby,
/etc/send2syslog.conf: line 15: {tag:: command not found
+ files: '[/var/log/nailgun-agent.log]},'
/etc/send2syslog.conf: line 16: files:: command not found
+ '{tag:' bootstrap/netprobe_sender, log_type: netprobe,
/etc/send2syslog.conf: line 17: {tag:: command not found
+ files: '[/var/log/netprobe_sender.log]},'
/etc/send2syslog.conf: line 18: files:: command...


Revision history for this message
Georgy Kibardin (gkibardin) wrote :

Sorry, I meant bash -x fix-configs-on-startup. But before that:
1. Is mcollective unmasked now?
2. What is in /etc/mcollective/server.cfg?
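
Both checks can be done along these lines (a sketch; the identity line is the interesting part of server.cfg, since nailgun sets it to the node id, as in the "Setting to '13'" log line in the description):

systemctl is-enabled mcollective.service       # "masked" or not
grep '^identity' /etc/mcollective/server.cfg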

Revision history for this message
Luca Cervigni (cervigni) wrote :

In /etc/mcollective/server.cfg:

[root@fuel ~]# cat /etc/mcollective/server.cfg
classesfile = /var/lib/puppet/state/classes.txt
collectives = mcollective
connector = rabbitmq
daemonize = 1
direct_addressing = 1
factsource = yaml
identity = master
libdir = /usr/local/libexec/mcollective:/usr/libexec/mcollective
logfile = /var/log/mcollective.log
loglevel = debug
main_collective = mcollective
plugin.actionpolicy.allow_unconfigured = 1
plugin.psk = unset
plugin.rabbitmq.heartbeat_interval = 30
plugin.rabbitmq.max_hbrlck_fails = 0
plugin.rabbitmq.pool.1.host = 146.118.52.2
plugin.rabbitmq.pool.1.password = XozR2WWv4yHltFIYZF7N3Nem
plugin.rabbitmq.pool.1.port = 61613
plugin.rabbitmq.pool.1.user = mcollective
plugin.rabbitmq.pool.size = 1
plugin.rabbitmq.randomize = true
plugin.rabbitmq.vhost = mcollective
plugin.rpcaudit.logfile = /var/log/mcollective-audit.log
plugin.yaml = /etc/mcollective/facts.yaml
rpcaudit = 1
rpcauditprovider = logfile
rpcauthorization = 1
rpcauthprovider = action_policy
securityprovider = psk
ttl = 4294957

The script, in the "problematic boxes" is not "linked" properly and I
have to execute it with the absolute path
/usr/bin/fix-configs-on-startup , while in the boxes in which the
problem was not present, it was enough executing it. A bit weird. Anyway
here the output.

root@bootstrap:~# bash -x /usr/bin/fix-configs-on-startup
+ set -e
++ sed -rn 's/^.*url=http:\/\/(([0-9]{1,3}\.){3}[0-9]{1,3}).*$/\1/ p'
/proc/cmdline
+ masternode_ip=146.118.52.2
++ sed 's/\ /\n/g' /proc/cmdline
++ grep mco_user
++ awk -F= '{print $2}'
+ mco_user=mcollective
++ sed 's/\ /\n/g' /proc/cmdline
++ grep mco_pass
++ awk -F= '{print $2}'
+ mco_pass=XozR2WWv4yHltFIYZF7N3Nem
+ '[' -z mcollective ']'
+ '[' -z XozR2WWv4yHltFIYZF7N3Nem ']'
+ cat
+ /usr/bin/send2syslog.py -i
+ sed -i '/^\s*tinker panic/ d' /etc/ntp.conf
+ sed -i '1 i tinker panic 0' /etc/ntp.conf
+ mkdir -p /var/lib/ntp
+ chown ntp: /var/lib/ntp
+ echo 0
+ chown ntp: /var/lib/ntp/ntp.drift
+ sed -i '/^\s*server\b/ d' /etc/ntp.conf
+ echo 'server 146.118.52.2 burst iburst'
+ service ntp restart
+ sed -i
's/^plugin.rabbitmq.pool.1.host\b.*$/plugin.rabbitmq.pool.1.host =
146.118.52.2/' /etc/mcollective/server.cfg
+ sed -i
's/^plugin.rabbitmq.pool.1.user\b.*$/plugin.rabbitmq.pool.1.user =
mcollective/' /etc/mcollective/server.cfg
+ sed -i
's/^plugin.rabbitmq.pool.1.password\b.*$/plugin.rabbitmq.pool.1.password=
XozR2WWv4yHltFIYZF7N3Nem/' /etc/mcollective/server.cfg
+ systemctl unmask mcollective
Failed to execute operation: Connection timed out

About the mcollective unmasking: for one server only, I needed to do the
procedure two or three times after the reboot because it was not
working... though for three of them it worked on the first try.

On 20/01/17 16:21, Georgy Kibardin wrote:
> sorry, I meant bash -x fix-configs-on-startup. But before that:
> 1. Is mcollective unmasked now?
> 2. What is in cat /etc/mcollective/server.cfg ?
>

Revision history for this message
Georgy Kibardin (gkibardin) wrote :

> The script, in the "problematic boxes" is not "linked" properly and I have to execute it with the absolute path /usr/bin/fix-configs-on-startup
That puzzles me a bit - if /usr/bin is not in PATH, something must be broken badly. I'm figuring out what could have affected that.

> + systemctl unmask mcollective
> Failed to execute operation: Connection timed out
Systemd also seems to not function properly.
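
Two things that might help narrow this down (a sketch; it assumes /etc/rc.local is run by the standard rc-local.service unit on this image):

# rough view of the environment systemd gives the unit / the manager
systemctl show rc-local.service -p Environment
systemctl show-environment
# general systemd / D-Bus health, given the "Connection timed out" errors
systemctl is-system-running
busctl status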

Revision history for this message
Luca Cervigni (cervigni) wrote :

Let me know if you need more logs for this issue; I kept some servers
aside in bootstrap just to investigate this problem.

On 23/01/17 17:58, Georgy Kibardin wrote:
>> The script, in the "problematic boxes" is not "linked" properly and I have to execute it with the absolute path /usr/bin/fix-configs-on-startup
> That puzzles me a bit - if /usr/bin is not in PATH, something must be broken badly. Figuring what could have affected that.
>
>> + systemctl unmask mcollective
>> Failed to execute operation: Connection timed out
> Systemd also seems to not function properly.
>

Revision history for this message
Luca Cervigni (cervigni) wrote :

I actually hit the problem even on nodes from the hardware models that
had always seemed fine.

root@bootstrap:~# bash -x fix-configs-on-startup
+ set -e
++ sed -rn 's/^.*url=http:\/\/(([0-9]{1,3}\.){3}[0-9]{1,3}).*$/\1/ p'
/proc/cmdline
+ masternode_ip=146.118.52.2
++ sed 's/\ /\n/g' /proc/cmdline
++ grep mco_user
++ awk -F= '{print $2}'
+ mco_user=mcollective
++ sed 's/\ /\n/g' /proc/cmdline
++ grep mco_pass
++ awk -F= '{print $2}'
+ mco_pass=XozR2WWv4yHltFIYZF7N3Nem
+ '[' -z mcollective ']'
+ '[' -z XozR2WWv4yHltFIYZF7N3Nem ']'
+ cat
+ /usr/bin/send2syslog.py -i
+ sed -i '/^\s*tinker panic/ d' /etc/ntp.conf
+ sed -i '1 i tinker panic 0' /etc/ntp.conf
+ mkdir -p /var/lib/ntp
+ chown ntp: /var/lib/ntp
+ echo 0
+ chown ntp: /var/lib/ntp/ntp.drift
+ sed -i '/^\s*server\b/ d' /etc/ntp.conf
+ echo 'server 146.118.52.2 burst iburst'
+ service ntp restart
+ sed -i
's/^plugin.rabbitmq.pool.1.host\b.*$/plugin.rabbitmq.pool.1.host =
146.118.52.2/' /etc/mcollective/server.cfg
+ sed -i
's/^plugin.rabbitmq.pool.1.user\b.*$/plugin.rabbitmq.pool.1.user =
mcollective/' /etc/mcollective/server.cfg
+ sed -i
's/^plugin.rabbitmq.pool.1.password\b.*$/plugin.rabbitmq.pool.1.password=
XozR2WWv4yHltFIYZF7N3Nem/' /etc/mcollective/server.cfg
+ systemctl unmask mcollective
Failed to execute operation: Connection timed out

On 23/01/17 17:58, Georgy Kibardin wrote:
>> The script, in the "problematic boxes" is not "linked" properly and I have to execute it with the absolute path /usr/bin/fix-configs-on-startup
> That puzzles me a bit - if /usr/bin is not in PATH, something must be broken badly. Figuring what could have affected that.
>
>> + systemctl unmask mcollective
>> Failed to execute operation: Connection timed out
> Systemd also seems to not function properly.
>

Revision history for this message
Luca Cervigni (cervigni) wrote :

Is there any news on this, or a workaround I can use to make the deployment work?
If I run /etc/rc.local manually it can sometimes fix the problem, but after it has been run the deployment fails on each node where it was run.

Having this issue is very damaging because I never know whether the deployment is failing due to configuration issues or due to this bug, and each time I provision/deploy the servers it takes more than an hour to find out whether it is actually failing.

I can provide more logs or diagnostics if needed. Thanks!

Revision history for this message
Georgy Kibardin (gkibardin) wrote :

Sorry I'm a bit busy - we are releasing 9.2. I believe I'll be back to the issue tomorrow.

Revision history for this message
Luca Cervigni (cervigni) wrote :

Many thanks Georgy

On 30/01/17 12:31, Georgy Kibardin wrote:
> Sorry I'm a bit busy - we are releasing 9.2. I believe I'll be back to
> the issue tomorrow.
>

Revision history for this message
Georgy Kibardin (gkibardin) wrote :

Ok, let's start with the PATH issue first. On the master node edit /usr/share/fuel_bootstrap_cli/files/xenial/etc/rc.local and add

export PATH=$PATH:/bin:/usr/bin

before the fix-configs-on-startup line. Then rebuild the bootstrap:

fuel-bootstrap build --activate

Then reboot the problem nodes and check whether mcollective is started. If not, let's at least check that its config is correct.
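
For clarity, the edited rc.local would look roughly like this (a sketch; the body is the one posted earlier in the thread, plus the suggested export):

#!/bin/sh -e

# make sure the helper scripts are found even if systemd starts
# rc.local without a proper PATH
export PATH=$PATH:/bin:/usr/bin

# Perform fuel bootstrap configuration
fix-configs-on-startup || true

# Unlocking nailgun-agent for discovering
rm -f /etc/nailgun-agent/nodiscover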

Revision history for this message
Luca Cervigni (cervigni) wrote :

So, after rebuilding the bootstrap, the export is correctly in
/etc/rc.local, but mcollective still does not start.

1) in the first box (same hardware):

root@bootstrap:~# service mcollective status
● mcollective.service - The Marionette Collective
    Loaded: loaded (/lib/systemd/system/mcollective.service; enabled;
vendor preset: enabled)
    Active: inactive (dead)

root@bootstrap:~# echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

root@bootstrap:~# service mcollective start
root@bootstrap:~# service mcollective status
● mcollective.service - The Marionette Collective
    Loaded: loaded (/lib/systemd/system/mcollective.service; enabled;
vendor preset: enabled)
    Active: active (running) since Wed 2017-02-01 01:26:48 UTC; 4s ago
   Process: 7521 ExecStart=/usr/sbin/mcollectived
--config=/etc/mcollective/server.cfg --pidfile=/var/run/mcollective.pid
(code=exited, status=0/SUCCESS)
  Main PID: 7527 (ruby)
    CGroup: /system.slice/mcollective.service
            └─7527 ruby /usr/sbin/mcollectived
--config=/etc/mcollective/server.cfg --pidfile=/var/run/mcollective.pid

Feb 01 01:26:48 bootstrap systemd[7521]: mcollective.service: Executing:
/usr/sbin/mcollectived --config=/etc/mcollective/server.cfg
--pidfile=/var/run/mcollective.pid
Feb 01 01:26:48 bootstrap systemd[1]: mcollective.service: Child 7521
belongs to mcollective.service
Feb 01 01:26:48 bootstrap systemd[1]: mcollective.service: Control
process exited, code=exited status=0
Feb 01 01:26:48 bootstrap systemd[1]: mcollective.service: Got final
SIGCHLD for state start.
Feb 01 01:26:48 bootstrap systemd[1]: mcollective.service: Main PID
loaded: 7527
Feb 01 01:26:48 bootstrap systemd[1]: mcollective.service: Supervising
process 7527 which is not our child. We'll most likely not notice when
it exits.
Feb 01 01:26:48 bootstrap systemd[1]: mcollective.service: Changed start
-> running
Feb 01 01:26:48 bootstrap systemd[1]: mcollective.service: Job
mcollective.service/start finished, result=done
Feb 01 01:26:48 bootstrap systemd[1]: Started The Marionette Collective.
Feb 01 01:26:48 bootstrap systemd[1]: mcollective.service: Child 7524
belongs to mcollective.service

-- AND THE SERVICE STARTED CORRECTLY (pingable via mco ping)

2) the second box:

root@bootstrap:~# service mcollective status
● mcollective.service
    Loaded: masked (/dev/null; bad)
    Active: inactive (dead)

Feb 01 01:18:34 bootstrap systemd[1]: mcollective.service: Trying to
enqueue job mcollective.service/restart/replace
Warning: mcollective.service changed on disk. Run 'systemctl
daemon-reload' to reload units.

root@bootstrap:~# echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

root@bootstrap:~# service mcollective start
Failed to start mcollective.service: Unit mcollective.service is masked.

--- but if I run the fix-configs script manually, mcollective can then be started manually.

root@bootstrap:~# fix-configs-on-startup
Removed symlink /etc/systemd/system/mcollective.service.
Failed to execute operation: Connection timed out
root@bootstrap:~# sleep 2m; service mcollective start
root@bootstrap:~# service mcollective status
● mcollective.service - The Marionette Coll...


Revision history for this message
Georgy Kibardin (gkibardin) wrote :

Aha, so it seems that, depending on timings (which differ between VMs and the real hardware), rc.local may be launched without a proper PATH variable set. I suspect that something similar happens on one of our test labs.
Now thinking about the hard part - the systemd "Connection timed out".
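
One way to check the timing hypothesis on both VMs and bare metal would be systemd's own boot-timing tools (a sketch, assuming the bootstrap image ships systemd-analyze and runs /etc/rc.local via rc-local.service):

systemd-analyze blame | head -20                  # slowest units during boot
systemd-analyze critical-chain rc-local.service   # when rc.local actually ran
journalctl -b -u rc-local.service                 # its output for this boot, if any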

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-agent (master)

Fix proposed to branch: master
Review: https://review.openstack.org/430138

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-agent (master)

Reviewed: https://review.openstack.org/430138
Committed: https://git.openstack.org/cgit/openstack/fuel-agent/commit/?id=5c612875a6881b1c93c4dac04759f9e806a3ee1d
Submitter: Jenkins
Branch: master

commit 5c612875a6881b1c93c4dac04759f9e806a3ee1d
Author: Georgy Kibardin <email address hidden>
Date: Tue Feb 7 12:12:25 2017 +0300

    Use explicit path in Xenial

    Nobody promised PATH to be set during rc.local execution by systemd. And
    it happens that this is the case on some systems, i.e. the PATH is not
    set by the time of rc.local execution.

    Change-Id: I4806df3217dd09430ad45b13cda49fab0925cb88
    Partial-Bug: #1656150

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-agent (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/431414

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-agent (stable/newton)

Reviewed: https://review.openstack.org/431414
Committed: https://git.openstack.org/cgit/openstack/fuel-agent/commit/?id=9526f477d81eff85f45e3279bc09778bfe3df930
Submitter: Jenkins
Branch: stable/newton

commit 9526f477d81eff85f45e3279bc09778bfe3df930
Author: Georgy Kibardin <email address hidden>
Date: Tue Feb 7 12:12:25 2017 +0300

    Use explicit path in Xenial

    Nobody promised PATH to be set during rc.local execution by systemd. And
    it happens that this is the case on some systems, i.e. the PATH is not
    set by the time of rc.local execution.

    Change-Id: I4806df3217dd09430ad45b13cda49fab0925cb88
    Partial-Bug: #1656150
    (cherry picked from commit 5c612875a6881b1c93c4dac04759f9e806a3ee1d)

tags: added: in-stable-newton
Revision history for this message
Luca Cervigni (cervigni) wrote :

Georgy,

Excuse my ignorance, but has this fix already been pushed to the nightly builds of the Fuel 10.0 community edition, or do I have to wait for the 10.1 release? And if so, when should that be?

Revision history for this message
Georgy Kibardin (gkibardin) wrote :

Luca,

Of course you may try the community nightly build. The latest one is here: http://seed.fuel-infra.org/fuelweb-iso/fuel-10.0-community-1316-2017-02-15_23-14-00.iso.torrent, and it contains the first fix.
However, we are still in the process of fixing a race condition in the bootstrap build, and the systemd part of your problem is not resolved yet.

Revision history for this message
Burkhard Linke (blinke) wrote :

I'm having a similar problem with fuel-community-10.0.

Some nodes do not start mcollective. The workaround described above (running /etc/rc.local manually) unmasks the mcollective service, but it does not start it.

systemctl status mcollective prints:

# systemctl status mcollective
● mcollective.service - The Marionette Collective
   Loaded: loaded (/lib/systemd/system/mcollective.service; enabled; vendor preset: enabled)
   Active: inactive (dead)

Mar 08 09:12:33 bootstrap systemd[1]: mcollective.service: Failed to send unit change signal for mcollective.service: Transport endpoint is not connected
Mar 08 09:31:26 bootstrap systemd[1]: mcollective.service: Failed to send unit change signal for mcollective.service: Transport endpoint is not connected
Mar 08 09:37:20 bootstrap systemd[1]: mcollective.service: Failed to send unit change signal for mcollective.service: Transport endpoint is not connected

The "Transport endpoint is not connected" message is printed for each every time /etc/rc.local was executed.

Manually starting mcollective via systemctl start works, but some hosts are reported as 'bootstrap':
[root@fuel etc]# mco ping
bootstrap time=34.90 ms
17 time=37.08 ms
master time=39.54 ms
21 time=43.94 ms
25 time=45.40 ms
bootstrap time=46.36 ms
22 time=47.25 ms
23 time=63.95 ms

---- ping statistics ----
8 replies max: 63.95 min: 34.90 avg: 44.80

They are also not recognized during the network connectivity check, which states that nodes 11 and 20 are not available via mcollective.
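
The 'bootstrap' entries presumably mean mcollective came up before the nailgun agent wrote the node id into server.cfg, so it fell back to the hostname (compare the "Setting to '13'" line in the description). A quick check on such a node might be (a sketch):

grep '^identity' /etc/mcollective/server.cfg   # missing or 'bootstrap' -> id was never set
# once the agent has written the id, a restart should make the node answer with its number
service mcollective restart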

Revision history for this message
Burkhard Linke (blinke) wrote :

I also notice a very high CPU load on the hosts after running /etc/rc.local:

top - 11:00:15 up 6 min, 1 user, load average: 9.88, 4.09, 1.62
Tasks: 369 total, 9 running, 354 sleeping, 0 stopped, 6 zombie
%Cpu(s): 0.1 us, 5.7 sy, 0.0 ni, 94.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 13188960+total, 13057571+free, 746472 used, 567416 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 13023952+avail Mem

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
  884 root 20 0 52032 21652 17148 R 100.0 0.0 2:06.06 systemd-journal
 2235 root 20 0 0 0 0 S 25.9 0.0 0:29.72 kworker/1:2
  270 root 20 0 0 0 0 S 24.3 0.0 0:28.07 kworker/17:1
  272 root 20 0 0 0 0 R 23.9 0.0 0:27.08 kworker/4:1
  273 root 20 0 0 0 0 S 23.9 0.0 0:28.75 kworker/5:1
  267 root 20 0 0 0 0 S 23.3 0.0 0:27.92 kworker/3:1
   45 root 20 0 0 0 0 S 22.9 0.0 0:27.52 kworker/7:0
  235 root 20 0 0 0 0 S 22.3 0.0 0:25.20 kworker/0:2
  266 root 20 0 0 0 0 S 22.3 0.0 0:26.04 kworker/2:1
  274 root 20 0 0 0 0 S 22.3 0.0 0:26.84 kworker/21:1
  260 root 20 0 0 0 0 S 21.6 0.0 0:26.14 kworker/6:1
  276 root 20 0 0 0 0 S 21.3 0.0 0:23.96 kworker/20:1
  314 root 20 0 0 0 0 S 21.3 0.0 0:25.15 kworker/19:1
  316 root 20 0 0 0 0 S 20.6 0.0 0:24.92 kworker/23:1 ...


Revision history for this message
Georgy Kibardin (gkibardin) wrote :

Burkhard, could you please try the latest ISO from here https://ci.fuel-infra.org/view/BVT/job/10.0-community.all/1401

Revision history for this message
Burkhard Linke (blinke) wrote :

There's no downloadable ISO on that site, so I used the corresponding git checkout to build the image myself (the second chunk of the yumdownloader patch does not apply, and the isolinux paths need to be adapted for Ubuntu 16.04, but that's a different story...).

- zapped fuel host
- reinstalled fuel host from new iso
- rebooted all discovered nodes
- checked mcollective:
...---- ping statistics ----
19 replies max: 80.50 min: 45.96 avg: 61.81

 18 nodes + master -> ok after first reboot

- rebooted a node
- mcollective is started correctly on node after reboot

Rebooting and starting used to take several minutes with the official ISO. That time is now also reduced significantly.

Revision history for this message
Georgy Kibardin (gkibardin) wrote :
Changed in fuel:
assignee: Georgy Kibardin (gkibardin) → Fuel Sustaining (fuel-sustaining-team)