Galera server fails to start on CentOS

Bug #1697531 reported by Marc Gariépy on 2017-06-12
This bug affects 4 people
Affects: openstack-ansible
Importance: High
Assigned to: Major Hayden

Bug Description

Galera is unable to start on CentOS

Jun 12 18:40:10 localhost systemd: Starting MariaDB database server...
Jun 12 18:40:10 localhost systemd: Failed at step NAMESPACE spawning /bin/sh: Invalid argument
Jun 12 18:40:10 localhost systemd: mariadb.service: control process exited, code=exited status=226
Jun 12 18:40:10 localhost systemd: Failed to start MariaDB database server.
Jun 12 18:40:10 localhost systemd: Unit mariadb.service entered failed state.
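When several services are affected, the same signature can be picked out of a log automatically. A minimal sketch, assuming syslog-style lines like the ones above arrive on standard input; the function name `units_failed_namespace` is made up for illustration:

```shell
# Read syslog/journal text on stdin and print each unit whose control
# process exited with status 226, the code seen in the log lines above.
# units_failed_namespace is a hypothetical helper name.
units_failed_namespace() {
    sed -n 's/.*: \([^ :]*\.service\): control process exited, code=exited status=226.*/\1/p' | sort -u
}
```

It could be fed from a saved log, for example `units_failed_namespace < /var/log/messages`, where the log path is an assumption about the host's syslog configuration.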

[root@infra1 log]# yum list installed|grep "mysql\|galera\|maria" -i
MariaDB-client.x86_64 10.1.24-1.el7.centos @MariaDB
MariaDB-common.x86_64 10.1.24-1.el7.centos @MariaDB
MariaDB-devel.x86_64 10.1.24-1.el7.centos @MariaDB
MariaDB-server.x86_64 10.1.24-1.el7.centos @MariaDB
MariaDB-shared.x86_64 10.1.24-1.el7.centos @MariaDB
galera.x86_64 25.3.16-1.el7 @openstack-ocata
perl-DBD-MySQL.x86_64 4.023-5.el7 @base

I've installed a system using a manual method and it works: https://mariadb.com/kb/en/mariadb/yum/

Comparing the systemd unit files between them, the results are the same. I've also confirmed that the user is set up with the same shell, so it's not that.

Our configs are definitely different, but I haven't isolated a cause just yet.

From our deployed node:

[root@container1 ~]# systemctl status mysql
● mariadb.service - MariaDB database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/mariadb.service.d
           └─limits.conf, migrated-from-my.cnf-settings.conf, timeout.conf
   Active: failed (Result: exit-code) since Tue 2017-06-13 14:52:42 UTC; 1min 3s ago
  Process: 9435 ExecStartPre=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=226/NAMESPACE)

Jun 13 14:52:42 container1 systemd[1]: Starting MariaDB database server...
Jun 13 14:52:42 container1 systemd[9435]: Failed at step NAMESPACE spawning /bin/sh: Invalid argument
Jun 13 14:52:42 container1 systemd[1]: mariadb.service: control process exited, code=exited status=226
Jun 13 14:52:42 container1 systemd[1]: Failed to start MariaDB database server.
Jun 13 14:52:42 container1 systemd[1]: Unit mariadb.service entered failed state.
Jun 13 14:52:42 container1 systemd[1]: mariadb.service failed.

-- Subject: Unit mariadb.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit mariadb.service has begun starting up.
Jun 13 14:52:42 container1 systemd[9435]: Failed at step NAMESPACE spawning /bin/sh: Invalid argument
-- Subject: Process /bin/sh could not be executed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- The process /bin/sh could not be executed and failed.
--
-- The error number returned by this process is 22.
Jun 13 14:52:42 container1 systemd[1]: mariadb.service: control process exited, code=exited status=226
Jun 13 14:52:42 container1 systemd[1]: Failed to start MariaDB database server.
-- Subject: Unit mariadb.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit mariadb.service has failed.
--
-- The result is failed.

Here's the workaround for now. In the containers, create the drop-in file named in the comment line below with the following contents:

#filename: /etc/systemd/system/mariadb.service.d/without-caps.conf
[Service]
PrivateDevices=false

Then execute 'systemctl daemon-reload' and mariadb will start with 'systemctl start mysql'.
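The workaround above can be scripted. A minimal sketch, assuming it runs as root inside the affected container; the function name `write_privatedevices_dropin` and its directory argument are illustrative and not part of any OpenStack-Ansible role:

```shell
# Write the PrivateDevices override into the drop-in directory given as $1.
# write_privatedevices_dropin is a hypothetical helper name.
write_privatedevices_dropin() {
    dropin_dir="$1"
    mkdir -p "$dropin_dir" || return 1
    # A drop-in only needs the section header and the overridden setting.
    printf '[Service]\nPrivateDevices=false\n' > "$dropin_dir/without-caps.conf"
}
```

Inside the container it would be called as `write_privatedevices_dropin /etc/systemd/system/mariadb.service.d`, followed by the 'systemctl daemon-reload' and 'systemctl start mysql' steps described above.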

Changed in openstack-ansible:
assignee: nobody → Marc Gariépy (mgariepy)
status: New → In Progress
Changed in openstack-ansible:
assignee: Marc Gariépy (mgariepy) → Jesse Pretorius (jesse-pretorius)
Major Hayden (rackerhacker) wrote :

I can verify that this workaround fixes the issue for me. The patch has some issues with checking for is_metal, though.

OK, my previous report of it working in a container I created was nonsense. It was working on a raw host. That rules out the packages from MariaDB as the culprit.

The other suspects are:
- a change in systemd (unlikely)
- a change in LXC (we use the COPR, so possible)
- a change in the LXC image downloaded (updated daily, so possible)

The periodic deploy for centos/ocata worked on 8 June and failed on 9 June.

working: http://logs.openstack.org/periodic/periodic-openstack-ansible-deploy-aio-ocata-centos-7/c25496c/console.html.gz

failed: http://logs.openstack.org/periodic/periodic-openstack-ansible-deploy-aio-ocata-centos-7/52eb5a5/console.html.gz

As it turns out, the COPR updated packages on 8 June:
https://copr-be.cloud.fedoraproject.org/results/thm/lxc2.0/epel-7-x86_64/

There are plenty of changes involving caps between 2.0.6 and 2.0.8:
https://github.com/lxc/lxc/compare/lxc-2.0.6...lxc-2.0.8

This one looks like a pretty serious about-turn, although perhaps context is needed:
https://github.com/lxc/lxc/commit/4645c74c8a4a7b2dc3df0c49a7ab8add891dcaad

Tested using Ubuntu Xenial host & guest with the LXC PPA to get v2.0.8 and it appears to be fine.

Here're my steps in case I missed something:

### on the host

apt-get update
apt-get install -y software-properties-common
add-apt-repository ppa:ubuntu-lxc/stable
apt-get update
apt-get install -y
lxc-create --template download --name 208
lxc-start --name 208
lxc-attach --name 208

### in the guest (208)

apt-get update
apt-key adv --recv-keys --keyserver hkp://keyserver.ubuntu.com:80 0xcbcb082a1bb943db

echo 'deb http://ftp.osuosl.org/pub/mariadb/repo/10.1/ubuntu xenial main' > /etc/apt/sources.list.d/mariadb.list

apt-get update
apt-get install -y --allow-unauthenticated mariadb-server
systemctl status mariadb

Major Hayden (rackerhacker) wrote :

We can rule out MariaDB since it has had the PrivateDevices setting in place since the summer of 2016. Also, we should be able to rule out systemd since the same systemd version works with the working LXC (2.0.6) and fails with the bad LXC (2.0.8) version.

It looks like systemd is trying to mount some things in /dev and is having a tough time doing it: https://gist.github.com/major/6a097c0deb809cebbea90c97b93e45b0
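To see whether a given set of unit files leaves PrivateDevices enabled, the files can be inspected directly. A rough sketch, assuming the main unit file is passed first and any drop-ins after it, and approximating systemd's behaviour by taking the last assignment found; the function name `last_privatedevices` is hypothetical. On a live system, `systemctl show -p PrivateDevices mariadb.service` reports the value systemd actually applies.

```shell
# Print the last PrivateDevices= value found in the given unit files,
# read in the order given (main unit first, then drop-ins).
# Prints nothing if no file sets PrivateDevices.
# last_privatedevices is a hypothetical helper name.
last_privatedevices() {
    cat "$@" 2>/dev/null | sed -n 's/^[[:space:]]*PrivateDevices=//p' | tail -n 1
}
```

For the container in this report, it might be called with `/usr/lib/systemd/system/mariadb.service` followed by the files under `/etc/systemd/system/mariadb.service.d/`.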

Changed in openstack-ansible:
importance: Undecided → High
Changed in openstack-ansible:
assignee: Jesse Pretorius (jesse-pretorius) → Major Hayden (rackerhacker)

Reviewed: https://review.openstack.org/473914
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible-lxc_hosts/commit/?id=5b171d9800830c6bbb4abcfe86f7dde50375a736
Submitter: Jenkins
Branch: master

commit 5b171d9800830c6bbb4abcfe86f7dde50375a736
Author: Jesse Pretorius <email address hidden>
Date: Tue Jun 13 18:07:19 2017 +0100

    Use LXC v2.0.6 on CentOS

    Both MariaDB and MemcacheD fail to start with
    the newer versions available in the COPR, so
    we're pinning to the older one to allow
    development to continue until the solution is
    identified.

    Related-Bug: 1697531
    Change-Id: If34240bf30f7f5bdc85d667d2ee79cca5b04dc45

Reviewed: https://review.openstack.org/473937
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible-lxc_hosts/commit/?id=1350cf28fdfeaae07a2e570d0bf91843841a1350
Submitter: Jenkins
Branch: stable/ocata

commit 1350cf28fdfeaae07a2e570d0bf91843841a1350
Author: Jesse Pretorius <email address hidden>
Date: Tue Jun 13 18:07:19 2017 +0100

    Use LXC v2.0.6 on CentOS

    Both MariaDB and MemcacheD fail to start with
    the newer versions available in the COPR, so
    we're pinning to the older one to allow
    development to continue until the solution is
    identified.

    Related-Bug: 1697531
    Change-Id: If34240bf30f7f5bdc85d667d2ee79cca5b04dc45
    (cherry picked from commit 5b171d9800830c6bbb4abcfe86f7dde50375a736)

tags: added: in-stable-ocata
Major Hayden (rackerhacker) wrote :

This commit is the problematic one:

https://github.com/lxc/lxc/commit/a32f7894989090dfbd38037cc220b99af4942bfe

It is included in LXC 2.0.8.

Major Hayden (rackerhacker) wrote :

Opened an issue with upstream LXC: https://github.com/lxc/lxc/issues/1623

It appears that https://review.openstack.org/473914 is not working properly - I'm looking into it. I'm also building an environment to give feedback in the upstream bug.

Changed in openstack-ansible:
assignee: Major Hayden (rackerhacker) → Jesse Pretorius (jesse-pretorius)

Change abandoned by Major Hayden (<email address hidden>) on branch: stable/ocata
Review: https://review.openstack.org/473953
Reason:

Changed in openstack-ansible:
assignee: Jesse Pretorius (jesse-pretorius) → Major Hayden (rackerhacker)
Major Hayden (rackerhacker) wrote :

Opened a CentOS bug in the hopes of getting the systemd fix backported: https://bugs.centos.org/view.php?id=13419

Reviewed: https://review.openstack.org/473879
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible-galera_server/commit/?id=f46e1525069299297eec1df704607848dd7a7b2f
Submitter: Jenkins
Branch: master

commit f46e1525069299297eec1df704607848dd7a7b2f
Author: Major Hayden <email address hidden>
Date: Wed Jun 14 09:02:03 2017 -0500

    Disable PrivateDevices for Galera on CentOS 7

    This patch adds the `galera_disable_privatedevices` variable that
    allows deployers to disable PrivateDevices in the systemd unit file
    shipped with MariaDB 10.1+ on CentOS 7 systems.

    This is a workaround to fix the systemd/LXC issues with bind
    mounting an already bind mounted `/dev/ptmx` inside the LXC
    container.

    See Launchpad bug, lxc/lxc#1623, or systemd/systemd#6121 for more
    details.

    Co-Authored-By: Major Hayden <email address hidden>
    Closes-bug: 1697531
    Change-Id: I8a74113bd16a768a4754fb1f6ee04caf1ac82920

Changed in openstack-ansible:
status: In Progress → Fix Released

Change abandoned by Major Hayden (<email address hidden>) on branch: stable/ocata
Review: https://review.openstack.org/474251
Reason: Oops, Jesse got this already.

Reviewed: https://review.openstack.org/474251
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible-galera_server/commit/?id=26d74094a771ae86122f5815810a59c13998b2b5
Submitter: Jenkins
Branch: stable/ocata

commit 26d74094a771ae86122f5815810a59c13998b2b5
Author: Major Hayden <email address hidden>
Date: Wed Jun 14 09:02:03 2017 -0500

    Disable PrivateDevices for Galera on CentOS 7

    This patch adds the `galera_disable_privatedevices` variable that
    allows deployers to disable PrivateDevices in the systemd unit file
    shipped with MariaDB 10.1+ on CentOS 7 systems.

    This is a workaround to fix the systemd/LXC issues with bind
    mounting an already bind mounted `/dev/ptmx` inside the LXC
    container.

    See Launchpad bug, lxc/lxc#1623, or systemd/systemd#6121 for more
    details.

    Co-Authored-By: Major Hayden <email address hidden>
    Closes-bug: 1697531
    Change-Id: I8a74113bd16a768a4754fb1f6ee04caf1ac82920
    (cherry picked from commit f46e1525069299297eec1df704607848dd7a7b2f)

Change abandoned by Major Hayden (<email address hidden>) on branch: stable/ocata
Review: https://review.openstack.org/474314
Reason: Put this in the wrong branch. :/

Reviewed: https://review.openstack.org/474538
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible-memcached_server/commit/?id=a9acd22e8287d7ebeccae74cbff1645648291085
Submitter: Jenkins
Branch: master

commit a9acd22e8287d7ebeccae74cbff1645648291085
Author: Jesse Pretorius <email address hidden>
Date: Thu Jun 15 11:50:44 2017 +0100

    Disable PrivateDevices for MemcacheD on CentOS 7

    This patch adds the `memcached_disable_privatedevices` variable that
    allows deployers to disable PrivateDevices in the systemd unit file.

    This is a workaround to fix the systemd/LXC issues with bind
    mounting an already bind mounted `/dev/ptmx` inside the LXC
    container.

    See Launchpad bug, lxc/lxc#1623, or systemd/systemd#6121 for more
    details.

    The is_metal variable is removed as it is unused.

    Related-bug: 1697531
    Change-Id: Id7c148bf901354a3dfc2f189ec659f2b92fc7985

Reviewed: https://review.openstack.org/474561
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible-memcached_server/commit/?id=5363432f58334823f7e6c6c88617bb908ca48359
Submitter: Jenkins
Branch: stable/ocata

commit 5363432f58334823f7e6c6c88617bb908ca48359
Author: Jesse Pretorius <email address hidden>
Date: Thu Jun 15 11:50:44 2017 +0100

    Disable PrivateDevices for MemcacheD on CentOS 7

    This patch adds the `memcached_disable_privatedevices` variable that
    allows deployers to disable PrivateDevices in the systemd unit file.

    This is a workaround to fix the systemd/LXC issues with bind
    mounting an already bind mounted `/dev/ptmx` inside the LXC
    container.

    See Launchpad bug, lxc/lxc#1623, or systemd/systemd#6121 for more
    details.

    The is_metal variable is removed as it is unused.

    Related-bug: 1697531
    Change-Id: Id7c148bf901354a3dfc2f189ec659f2b92fc7985
    (cherry picked from commit a9acd22e8287d7ebeccae74cbff1645648291085)

Reviewed: https://review.openstack.org/474316
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=d10f52bb186b5476e2be8f3b3be7c226a7841c4d
Submitter: Jenkins
Branch: master

commit d10f52bb186b5476e2be8f3b3be7c226a7841c4d
Author: Major Hayden <email address hidden>
Date: Wed Jun 14 13:52:10 2017 -0500

    Set PrivateDevices=false for Galera

    This patch sets the `galera_disable_privatedevices` variable in the
    galera_server role. If galera is deployed with a container, the
    PrivateDevices configuration will be disabled in MariaDB's systemd
    unit file.

    Related-Bug: 1697531
    Change-Id: I3dce66a5fa94d8a1a27280244622ca68036e6ad1

Reviewed: https://review.openstack.org/474583
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=14ae2dd53421971a273f7a3213fe1c3726fe12dd
Submitter: Jenkins
Branch: master

commit 14ae2dd53421971a273f7a3213fe1c3726fe12dd
Author: Jesse Pretorius <email address hidden>
Date: Thu Jun 15 14:50:44 2017 +0100

    Set PrivateDevices=false for MemcacheD

    This patch sets the `memcached_disable_privatedevices` variable in the
    memcached_server role. If memcached is deployed with a container, the
    PrivateDevices configuration will be disabled in the systemd unit file.

    Change-Id: Idc153c45f5da2ee44b49dbd5ef4577f749550556
    Related-Bug: 1697531

Reviewed: https://review.openstack.org/474760
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible-galera_server/commit/?id=588b9cead7637cc73cc8607e2460fea24cc6e2b0
Submitter: Jenkins
Branch: master

commit 588b9cead7637cc73cc8607e2460fea24cc6e2b0
Author: Marc Gariepy <email address hidden>
Date: Thu Jun 15 15:55:23 2017 -0400

    Move PrivateDevices before mysql password set

    Move the PrivateDevices before we try to start the service the first time

    Related-Bug: 1697531
    Change-Id: I67ef7ba02ee652e9855b9cf4ba7a44a361844a83

Reviewed: https://review.openstack.org/475025
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible-galera_server/commit/?id=c6d36689655586971fbd4be8a6cba2f54b5f44b3
Submitter: Jenkins
Branch: stable/ocata

commit c6d36689655586971fbd4be8a6cba2f54b5f44b3
Author: Marc Gariepy <email address hidden>
Date: Thu Jun 15 15:55:23 2017 -0400

    Move PrivateDevices before mysql password set

    Move the PrivateDevices before we try to start the service the first time

    Related-Bug: 1697531
    Change-Id: I67ef7ba02ee652e9855b9cf4ba7a44a361844a83
    (cherry picked from commit 588b9cead7637cc73cc8607e2460fea24cc6e2b0)

Reviewed: https://review.openstack.org/474765
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible-tests/commit/?id=7938542000caeb20583033b7bed4970488565891
Submitter: Jenkins
Branch: master

commit 7938542000caeb20583033b7bed4970488565891
Author: Marc Gariepy <email address hidden>
Date: Thu Jun 15 16:04:03 2017 -0400

    Disable PrivateDevices for galera and Memcached

    Change-Id: I93318a73c43ff7c6ae423271bd8c252ad94b0149
    Depends-On: I67ef7ba02ee652e9855b9cf4ba7a44a361844a83
    Related-Bug: 1697531

Reviewed: https://review.openstack.org/475093
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible-tests/commit/?id=7162f3d86c2f51f5b98275c07d7508e4eaada15f
Submitter: Jenkins
Branch: stable/ocata

commit 7162f3d86c2f51f5b98275c07d7508e4eaada15f
Author: Marc Gariepy <email address hidden>
Date: Thu Jun 15 16:04:03 2017 -0400

    Disable PrivateDevices for galera and Memcached

    Change-Id: I93318a73c43ff7c6ae423271bd8c252ad94b0149
    Depends-On: I67ef7ba02ee652e9855b9cf4ba7a44a361844a83
    Related-Bug: 1697531
    (cherry picked from commit 7938542000caeb20583033b7bed4970488565891)

Reviewed: https://review.openstack.org/475101
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=c66391662cae01fd0ca716164f2adeea1e851ce6
Submitter: Jenkins
Branch: stable/ocata

commit c66391662cae01fd0ca716164f2adeea1e851ce6
Author: Jesse Pretorius <email address hidden>
Date: Thu Jun 15 14:50:44 2017 +0100

    Set PrivateDevices=false for MemcacheD

    This patch sets the `memcached_disable_privatedevices` variable in the
    memcached_server role. If memcached is deployed with a container, the
    PrivateDevices configuration will be disabled in the systemd unit file.

    Change-Id: Idc153c45f5da2ee44b49dbd5ef4577f749550556
    Related-Bug: 1697531
    (cherry picked from commit 14ae2dd53421971a273f7a3213fe1c3726fe12dd)

Reviewed: https://review.openstack.org/474314
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=500c25d530142dde273813e88059e1a9d449c31a
Submitter: Jenkins
Branch: stable/ocata

commit 500c25d530142dde273813e88059e1a9d449c31a
Author: Major Hayden <email address hidden>
Date: Wed Jun 14 13:52:10 2017 -0500

    Set PrivateDevices=false for Galera

    This patch sets the `galera_disable_privatedevices` variable in the
    galera_server role. If galera is deployed with a container, the
    PrivateDevices configuration will be disabled in MariaDB's systemd
    unit file.

    Related-Bug: 1697531
    Change-Id: I3dce66a5fa94d8a1a27280244622ca68036e6ad1
    (cherry picked from commit d10f52bb186b5476e2be8f3b3be7c226a7841c4d)

This issue was fixed in the openstack/openstack-ansible-galera_server 15.1.6 release.

This issue was fixed in the openstack/openstack-ansible-galera_server 16.0.0.0b3 development milestone.
