Hi,
I tried to deploy Masakari with kolla-ansible using debian-binary images, and it is currently broken.
Containers (run commands reconstructed with rekcod):
root@controller0:/home/ubuntu# docker run --rm -i -v /var/run/docker.sock:/var/run/docker.sock nexdrew/rekcod masakari_hostmonitor
docker run --name masakari_hostmonitor --runtime runc -v /etc/kolla/masakari-hostmonitor/:/var/lib/kolla/config_files/:ro -v /etc/localtime:/etc/localtime:ro -v /etc/timezone:/etc/timezone:ro -v kolla_logs:/var/log/kolla/:rw --net host --restart unless-stopped --security-opt 'label=disable' -h controller0 -l build-date='20210630' -l kolla_version='12.0.1' -l maintainer='Kolla Project (https://launchpad.net/kolla)' -l name='masakari-monitors' -e 'KOLLA_CONFIG_STRATEGY=COPY_ALWAYS' -e 'KOLLA_SERVICE_NAME=masakari-hostmonitor' -e 'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin' -e 'LANG=en_US.UTF-8' -e 'KOLLA_BASE_DISTRO=debian' -e 'KOLLA_DISTRO_PYTHON_VERSION=3.9' -e 'KOLLA_BASE_ARCH=x86_64' -e 'PS1=$(tput bold)($(printenv KOLLA_SERVICE_NAME))$(tput sgr0)[$(id -un)@$(hostname -s) $(pwd)]$ ' -e 'DEBIAN_FRONTEND=noninteractive' -e 'KOLLA_INSTALL_TYPE=binary' -e 'KOLLA_INSTALL_METATYPE=rdo' -d --entrypoint "dumb-init --single-child --" dockerhub.ultimum.io:443/kolla-dev/debian-binary-masakari-monitors:wallaby 'kolla_start'
root@controller0:/home/ubuntu# docker run --rm -i -v /var/run/docker.sock:/var/run/docker.sock nexdrew/rekcod hacluster_pacemaker
docker run --name hacluster_pacemaker --runtime runc -v /etc/kolla/hacluster-pacemaker/:/var/lib/kolla/config_files/:ro -v /etc/localtime:/etc/localtime:ro -v /etc/timezone:/etc/timezone:ro -v kolla_logs:/var/log/kolla/:rw -v hacluster_pacemaker:/var/lib/pacemaker:rw --net host --restart unless-stopped --security-opt 'label=disable' -h controller0 -l build-date='20210630' -l kolla_version='12.0.1' -l maintainer='Kolla Project (https://launchpad.net/kolla)' -l name='hacluster-pacemaker' -e 'KOLLA_CONFIG_STRATEGY=COPY_ALWAYS' -e 'PCMK_logfile=/var/log/kolla/hacluster/pacemaker.log' -e 'PCMK_debug=off' -e 'KOLLA_SERVICE_NAME=hacluster-pacemaker' -e 'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin' -e 'LANG=en_US.UTF-8' -e 'KOLLA_BASE_DISTRO=debian' -e 'KOLLA_DISTRO_PYTHON_VERSION=3.9' -e 'KOLLA_BASE_ARCH=x86_64' -e 'PS1=$(tput bold)($(printenv KOLLA_SERVICE_NAME))$(tput sgr0)[$(id -un)@$(hostname -s) $(pwd)]$ ' -e 'DEBIAN_FRONTEND=noninteractive' -d --entrypoint "dumb-init --single-child --" dockerhub.ultimum.io:443/kolla-dev/debian-binary-hacluster-pacemaker:wallaby 'kolla_start'
The issue is that crmadmin does not work in the masakari_hostmonitor container, so masakari_hostmonitor cannot do its job. The same command works in the hacluster_pacemaker container:
(hacluster-pacemaker)[root@controller0 /]# crmadmin -V -S controller0
Status of crmd@controller0: S_NOT_DC (ok)
(hacluster-pacemaker)[root@controller0 /]# ^C
(hacluster-pacemaker)[root@controller0 /]#
exit
root@controller0:/home/ubuntu# docker exec -itu root masakari_hostmonitor bash
(masakari-hostmonitor)[root@controller0 /]# crmadmin -S controller0
error: Could not connect to controller: Transport endpoint is not connected
(masakari-hostmonitor)[root@controller0 /]#
The problem is that the packages in the Debian images create the hacluster user and the haclient group with dynamically allocated uid/gid:
1. hacluster_pacemaker is built FROM the base image and installs pacemaker-cli-utils, which depends on pacemaker-common, which creates the user and group:
addgroup --system haclient
adduser --system hacluster --ingroup haclient --home /var/lib/pacemaker --no-create-home
2. masakari-hostmonitor is built FROM the openstack_base image and needs the crmadmin command from pacemaker-cli-utils, which depends on pacemaker-common, which creates the user and group in the same way as above.
So the main problem is that the two images end up with different uid/gid values for hacluster/haclient. Pacemaker works with crmadmin because it has the proper permissions inside its own container, but masakari-hostmonitor does not, because the uid/gid is generated dynamically (CentOS hardcodes it to 189).
I found this by comparing straces of both cases: crmadmin tries to open files under /dev/shm/*, and those files are owned by different uids in the two containers.
Sample strace from the masakari_hostmonitor container:
openat(AT_FDCWD, "/dev/shm/qb-22-0-14-6PCL3S/qb-response-crmd-header", O_RDWR) = 4
ftruncate(4, 8248) = 0
fallocate(4, 0, 0, 8248) = 0
mmap(NULL, 8248, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0) = 0x7f6c28756000
openat(AT_FDCWD, "/dev/shm/qb-22-0-14-6PCL3S/qb-response-crmd-data", O_RDWR) = 5
ftruncate(5, 135168) = 0
fallocate(5, 0, 0, 135168) = 0
mmap(NULL, 270336, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6c25674000
mmap(0x7f6c25674000, 135168, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 5, 0) = 0x7f6c25674000
mmap(0x7f6c25695000, 135168, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 5, 0) = 0x7f6c25695000
close(5) = 0
close(4) = 0
openat(AT_FDCWD, "/dev/shm/qb-22-0-14-6PCL3S/qb-event-crmd-header", O_RDWR) = 4
ftruncate(4, 8248) = 0
fallocate(4, 0, 0, 8248) = 0
mmap(NULL, 8248, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0) = 0x7f6c25671000
openat(AT_FDCWD, "/dev/shm/qb-22-0-14-6PCL3S/qb-event-crmd-data", O_RDWR) = 5
ftruncate(5, 135168) = 0
fallocate(5, 0, 0, 135168) = 0
mmap(NULL, 270336, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6c2562f000
mmap(0x7f6c2562f000, 135168, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 5, 0) = 0x7f6c2562f000
mmap(0x7f6c25650000, 135168, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 5, 0) = 0x7f6c25650000
close(5) = 0
close(4) = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4
connect(4, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
close(4) = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4
connect(4, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
close(4) = 0
openat(AT_FDCWD, "/etc/nsswitch.conf", O_RDONLY|O_CLOEXEC) = 4
fstat(4, {st_dev=makedev(0, 0x1be), st_ino=9206909, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=494, st_atime=1565953052 /* 2019-08-16T10:57:32+0000 */, st_atime_nsec=0, st_mtime=1565953052 /* 2019-08-16T10:57:32+0000 */, st_mtime_nsec=0, st_ctime=1624734299 /* 2021-06-26T19:04:59.661304860+0000 */, st_ctime_nsec=661304860}) = 0
read(4, "# /etc/nsswitch.conf\n#\n# Example"..., 4096) = 494
read(4, "", 4096) = 0
close(4) = 0
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 4
fstat(4, {st_dev=makedev(0, 0x1be), st_ino=5662353, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=48, st_size=21470, st_atime=1625232254 /* 2021-07-02T13:24:14.327942765+0000 */, st_atime_nsec=327942765, st_mtime=1625232253 /* 2021-07-02T13:24:13.759943609+0000 */, st_mtime_nsec=759943609, st_ctime=1625232253 /* 2021-07-02T13:24:13.851943473+0000 */, st_ctime_nsec=851943473}) = 0
mmap(NULL, 21470, PROT_READ, MAP_PRIVATE, 4, 0) = 0x7f6c25629000
close(4) = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libnss_files.so.2", O_RDONLY|O_CLOEXEC) = 4
read(4, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0003\0\0\0\0\0\0"..., 832) = 832
fstat(4, {st_dev=makedev(0, 0x1be), st_ino=9207117, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=104, st_size=51696, st_atime=1619902566 /* 2021-05-01T20:56:06+0000 */, st_atime_nsec=0, st_mtime=1619902566 /* 2021-05-01T20:56:06+0000 */, st_mtime_nsec=0, st_ctime=1624734300 /* 2021-06-26T19:05:00.025304882+0000 */, st_ctime_nsec=25304882}) = 0
mmap(NULL, 79672, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 4, 0) = 0x7f6c25615000
mmap(0x7f6c25618000, 28672, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 4, 0x3000) = 0x7f6c25618000
mmap(0x7f6c2561f000, 8192, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 4, 0xa000) = 0x7f6c2561f000
mmap(0x7f6c25621000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 4, 0xb000) = 0x7f6c25621000
mmap(0x7f6c25623000, 22328, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f6c25623000
close(4) = 0
mprotect(0x7f6c25621000, 4096, PROT_READ) = 0
munmap(0x7f6c25629000, 21470) = 0
openat(AT_FDCWD, "/etc/passwd", O_RDONLY|O_CLOEXEC) = 4
lseek(4, 0, SEEK_CUR) = 0
fstat(4, {st_dev=makedev(0, 0x1be), st_ino=5662384, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=16, st_size=5347, st_atime=1625232120 /* 2021-07-02T13:22:00.484148736+0000 */, st_atime_nsec=484148736, st_mtime=1625232120 /* 2021-07-02T13:22:00.352148947+0000 */, st_mtime_nsec=352148947, st_ctime=1625232120 /* 2021-07-02T13:22:00.388148890+0000 */, st_ctime_nsec=388148890}) = 0
read(4, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 4096
lseek(4, 0, SEEK_CUR) = 4096
read(4, "home/qemu:/usr/sbin/nologin\nrabb"..., 4096) = 1251
close(4) = 0
getsockopt(3, SOL_SOCKET, SO_PEERCRED, {pid=0, uid=101, gid=105}, [12]) = 0
poll([{fd=3, events=POLLIN}], 1, 0) = 0 (Timeout)
shutdown(3, SHUT_RDWR) = 0
close(3) = 0
munmap(0x7f6c256b6000, 270336) = 0
munmap(0x7f6c28759000, 8248) = 0
munmap(0x7f6c25674000, 270336) = 0
munmap(0x7f6c28756000, 8248) = 0
munmap(0x7f6c2562f000, 270336) = 0
munmap(0x7f6c25671000, 8248) = 0
munmap(0x7f6c256f8000, 135168) = 0
write(2, "error: Could not connect to cont"..., 76error: Could not connect to controller: Transport endpoint is not connected
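The mismatch can also be checked without strace by asking each container which uid/gid it assigns to hacluster. This is only a rough sketch: passwd_ids and container_ids are helper names made up for illustration, and it assumes getent is available inside both containers.

```shell
# Extract "uid:gid" from a passwd(5)-format line read on stdin.
passwd_ids() {
    cut -d: -f3,4
}

# Print the uid:gid a container assigns to a user (default: hacluster).
# Assumes docker is usable on the host and getent exists in the container.
container_ids() {
    docker exec "$1" getent passwd "${2:-hacluster}" | passwd_ids
}

# Usage on the controller host; differing output confirms the mismatch:
#   container_ids hacluster_pacemaker
#   container_ids masakari_hostmonitor
```

If the two calls print different values, the /dev/shm/qb-* files created by pacemaker are owned by an id that means something else (or nothing) inside masakari_hostmonitor.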
For comparison, on CentOS the package scriptlet pins both ids to 189:
[root@centos8 centos]# rpm -q --scripts pacemaker-libs
preinstall scriptlet (using /bin/sh):
getent group haclient >/dev/null || groupadd -r haclient -g 189
getent passwd hacluster >/dev/null || useradd -r -g haclient -u 189 -s /sbin/nologin -c "cluster user" hacluster
exit 0
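One possible way to get the same behaviour in the Debian images would be to create the group and user with a fixed id before the pacemaker packages are installed, in both images. The sketch below is an assumption, not a tested fix: ensure_hacluster_ids and the RUN dry-run switch are hypothetical, 189 is simply borrowed from the CentOS scriptlet and is assumed to be free, and it relies on the Debian maintainer script tolerating an already existing user and group.

```shell
# Sketch only: pre-create haclient/hacluster with a fixed id.
# Set RUN=echo to print the commands instead of running them.
ensure_hacluster_ids() {
    fixed_id="${1:-189}"   # 189 mirrors the CentOS scriptlet (assumption)
    # Create the group only if it does not exist yet.
    getent group haclient >/dev/null 2>&1 ||
        ${RUN:-} addgroup --system --gid "$fixed_id" haclient
    # Create the user only if it does not exist yet.
    getent passwd hacluster >/dev/null 2>&1 ||
        ${RUN:-} adduser --system --uid "$fixed_id" --ingroup haclient \
            --home /var/lib/pacemaker --no-create-home hacluster
}

# Dry run:  RUN=echo; ensure_hacluster_ids 189
```

It would have to run in both image builds before pacemaker-common is pulled in, so that both containers agree on the owner of the /dev/shm/qb-* files.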